Abstract
The ability to convey relevant and faithful information is critical for many tasks in conditional generation and yet remains elusive for neural seq-to-seq models whose outputs often reveal hallucinations and fail to correctly cover important details. In this work, we advocate planning as a useful intermediate representation for rendering conditional generation less opaque and more grounded. We propose a new conceptualization of text plans as a sequence of question-answer (QA) pairs and enhance existing datasets (e.g., for summarization) with a QA blueprint operating as a proxy for content selection (i.e., what to say) and planning (i.e., in what order). We obtain blueprints automatically by exploiting state-of-the-art question generation technology and convert input-output pairs into input-blueprint-output tuples. We develop Transformer-based models, each varying in how they incorporate the blueprint in the generated output (e.g., as a global plan or iteratively). Evaluation across metrics and datasets demonstrates that blueprint models are more factual than alternatives which do not resort to planning and allow tighter control of the generation output.
1 Introduction
Neural generation models are often prone to hallucination (Song et al., 2018; Maynez et al., 2020; Kryscinski et al., 2020; Gabriel et al., 2021), repetition and redundancy (Li et al., 2018; Suzuki and Nagata, 2017), and struggle to identify which content units are salient (Tan et al., 2017a). These phenomena are amplified when generating long-form text, i.e., documents with multiple paragraphs (Wiseman et al., 2017), when dealing with non-linguistic data (e.g., database tables), or very long input—which is common when summarizing multiple documents (Liu and Lapata, 2019; Perez-Beltrachini et al., 2019), books (Kryściński et al., 2021), or dialogue (Chen et al., 2022; Zhong et al., 2021). An additional challenge concerns the blackbox nature of deep learning systems, which hides the inherent complexity of modeling multiple interconnected linguistic phenomena in text generation, and makes it difficult to examine model decisions and attribute errors to specific components. The lack of modularity further affects controllability as these systems cannot be easily tailored to individual needs.
Attempts to remedy some of these issues focus on changing the way entities are represented (Puduppully et al., 2019b; Iso et al., 2019), allowing the decoder to skip low-confidence tokens to enhance faithful generation (Tian et al., 2019), modeling graph connections between document elements to better capture salience (Tan et al., 2017b; Liu and Lapata, 2019), encoding documents hierarchically (Celikyilmaz et al., 2018; Liu and Lapata, 2019; Rohde et al., 2021), learning latent alignments between the input and the target text (Xu et al., 2021), adopting sparse attention mechanisms (Child et al., 2019; Beltagy et al., 2020), and introducing content selection (Gehrmann et al., 2018; Dou et al., 2021) and planning components (Puduppully et al., 2019a; Moryossef et al., 2019b; Narayan et al., 2021; Wiseman et al., 2018).
In this paper we also aim to render conditional generation more modular via an intermediate, plan-based representation. While autoregressive models of language predict one token at a time, there is evidence that in humans some degree of planning occurs at a higher level than individual words (Levelt, 1993; Guhe, 2007). A long tradition in natural language generation views planning as a central component to identifying important content and structuring it appropriately (Reiter and Dale, 2000), however, there is less agreement on how plans should be represented. Common examples include discourse trees (Mellish et al., 1998), entity transitions (Kibble and Power, 2004; Barzilay and Lapata, 2008), sequences of propositions (Karamanis, 2004), and schemas (McKeown, 1985).
Our work proposes a new conceptualization of text plans as a sequence of question-answer pairs. Specifically, we draw inspiration from the “Questions under Discussion” (QUD) theory of discourse structure, which posits that one way of articulating the structure of a text is to identify the questions and sub-questions that are raised and answered by subsequent spans of text (Carlson, 1983; Ginzburg, 1994; Van Kuppevelt, 1995; Larson, 2002; Roberts, 2012; Riester, 2019). Theoretical models of QUD assume that discourse contains implicit questions for each of the assertions made, which are thereby turned into answers. These questions and answers can be understood in terms of their use in moving a discourse forward to achieve communicative goals. We propose to make QUDs explicit by exploiting state-of-the-art question generation technology (Alberti et al., 2019; Lu and Lu, 2021) and use them as an intermediate representation layer for conditional generation, i.e., a question-answering (QA) blueprint operating as a proxy for both content selection (i.e., what to say) and planning (i.e., in what order).
Table 1 illustrates a plan for generating a Wikipedia abstract from the AQuaMuSe dataset (Kulkarni et al., 2020). We enhance existing datasets (e.g., for summarization) with similar blueprints which we obtain automatically. We then convert input-output pairs into input-blueprint-output tuples and propose to learn encoder-decoder models from these augmented annotations. We develop three models that vary in how they integrate blueprints in the generation process and their ability to handle long outputs. Aside from generating blueprints and their corresponding text in one go, we propose a new architecture that iteratively plans and generates a sentence at a time, conditioning on the input and the output sentences generated so far. We do not generate a global blueprint, rather, our planning process is incremental and informed by generation, which we argue affords greater control over the output and its fluency. Moreover, the model is better equipped for long-form generation, since it does not have to (autoregressively) decode the blueprint and its summary in one go, avoiding the risk of exceeding the maximum decoder length.
We instantiate our models with a Transformer (Vaswani et al., 2017) encoder-decoder architecture and perform experiments on summarization datasets representing different information seeking tasks, application domains, and user requirements.1 In all cases, we empirically demonstrate that blueprint models are more factual than alternatives which do not resort to planning; we also observe that QA blueprints are a better representation compared to plans based on entity chains (Narayan et al., 2021), allowing tighter control of the output, and providing a comprehensive explanation for model predictions (if the plan is erroneous, then the summary will be too).
2 Related Work
Questions under Discussion
The QUD-based approach to discourse structure assumes an open-ended inventory of possible questions and sub-questions (Van Kuppevelt, 1995). Recent efforts (De Kuthy et al., 2018; Westera et al., 2020; Riester, 2019) have nevertheless shown that it is possible to manually annotate documents with QUDs, i.e., to formulate a question for every assertion expressed in a text. De Kuthy et al. (2020) even go as far as to partially automate QUD annotation in German by automatically generating all potentially relevant questions for a given sentence. Related work (Ko et al., 2020) focuses on the generation of inquisitive questions that reflect general text understanding and free-form open-ended questions (Ko et al., 2021). Our work builds upon QUD and related discourse structure theories, although we do not directly implement any of them in particular. We adopt question answering as a good way of spelling out the connection between the information structure of a sentence and the discourse in which the sentence can function.
QA Pairs as a Proxy for Annotation Labels
Question-answer pairs have been previously used as a proxy for expressing semantic content. QA-SRL (He et al., 2015) is a representation based on QA pairs that has been shown to capture the vast majority of arguments and modifiers in PropBank (Palmer et al., 2005) and NomBank (Meyers et al., 2004). Instead of using a pre-defined role lexicon, QA-SRL labels semantic roles with questions whose answers denote the argument bearing the role. Follow-on work uses QA pairs to represent discourse relations (Pyatkin et al., 2020) and to capture overlap or redundancy at the propositional level (Brook Weiss et al., 2021). We also employ QA pairs as an abstraction of propositional content, however, we do not target specific relation types, or make any linguistic assumptions about them (e.g., discourse relations vs semantic roles).
Question-Answering in Summarization
QA pairs have been used for evaluating summaries (Deutsch and Roth, 2021b; Eyal et al., 2019; Durmus et al., 2020; Wang et al., 2020), specifically as a means of estimating the information overlap between a reference summary and a system-generated one. QA-based signals have also been incorporated in the training of summarization models, using reinforcement learning (Arumae and Liu, 2018, 2019; Scialom et al., 2019) or as a way of identifying salient content in the input document (Deutsch and Roth, 2021a). Cao and Wang (2022) introduce the task of hierarchical question-summary generation, where a source document is condensed into multiple summaries, each answering a different question. Questions are organized hierarchically into broad questions and more specific sub-questions that are learned from manual annotations. Our model outputs a QA-based plan and a single summary for a given document, although it is possible to generate different summaries from different plans for the same document. Our QA pairs are obtained automatically and they are not structured.
Planning in Encoder-Decoder Models
Various recent efforts have developed planning modules in the context of data-to-text generation. In most cases, the plans are specific to the input, which varies from tables and records to RDF tuples. For instance, Puduppully et al. (2019a) learn a plan corresponding to a sequence of records, and generate a summary conditioned on it. Narayan et al. (2020) treat content selection as a task similar to extractive summarization; they first extract sentence plans and then verbalize them one-by-one. Moryossef et al. (2019a, b) propose a symbolic planning stage followed by a neural realization stage. Other work (Puduppully and Lapata, 2021; Puduppully et al., 2022) advocates macro planning, where document content is organized into a sequence of paragraph plans which are verbalizations of tabular input. Our work is closest to Narayan et al. (2021), who also target summarization applications and learn an intermediate plan to guide generation. We adopt a more elaborate plan representation based on QA blueprints, and interface decoding with plan generation similarly to Narayan et al. (2020).
3 Text Generation with Blueprints
3.1 Problem Formulation
Let d denote the input to the model which could be a document (or multiple documents), a dialogue history, or even database tables. The model will learn to generate blueprint b for output s (e.g., a summary) and the output itself. The blueprint b is an ordered set of question-answer pairs {(q1, a1),(q2, a2),…,(qm, am)}. Unsurprisingly, such blueprints are not naturally occurring in existing datasets that typically consist of (d, s) pairs. In the following we explain how we automatically augment training examples (d, s) into tuples (d, b, s) with blueprints (Section 3.2) and then describe how we devise blueprint models based on them (Section 3.3).
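A minimal sketch of how the augmented tuples (d, b, s) could be represented is given below; the class and field names are illustrative rather than drawn from our implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Blueprint:
    """An ordered sequence of (question, answer) pairs acting as a text plan."""
    qa_pairs: List[Tuple[str, str]]  # [(q1, a1), ..., (qm, am)]


@dataclass
class Example:
    """A training tuple (d, b, s): input, blueprint, and output text."""
    document: str         # d: concatenated input document(s), dialogue, or table text
    blueprint: Blueprint  # b: automatically derived QA plan (Section 3.2)
    output: str           # s: reference summary / long-form answer
```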
3.2 Blueprint Annotation
We first explain how question-answer pairs are automatically (over-)generated for output s, and subsequently filtered to create blueprint b. We illustrate the different filtering stages via the example in Table 2.
Table 2: Question-answer pairs overgenerated for the reference summary (bottom), together with the filtering stages Round-Trip consistency (RT), Rheme (RH), and Coverage (CO).

| Overgenerated Question-Answer Pairs | Answer | RT | RH | CO |
|---|---|---|---|---|
| Q1: What is a high performance variant of the Ford Mustang? | A1: The Shelby Mustang | ✓ | ✗ | |
| Q2: What is the high performance variant of the Ford Mustang called? | A2: Shelby | ✓ | ✗ | |
| Q3: What is a high performance variant of the Ford Mustang? | A3: Shelby Mustang | ✓ | ✗ | |
| Q4: What is a Shelby Mustang? | A4: a high performance variant | ✓ | ✗ | |
| Q5: The Shelby Mustang is a high performance variant of what? | A5: the Ford Mustang | ✓ | ✓ | ✗ |
| Q6: The Shelby Mustang is a high performance variant of what? | A6: Ford Mustang | ✓ | ✗ | |
| Q7: The Shelby Mustang is a high performance variant of what Ford model? | A7: Mustang | ✓ | ✗ | |
| Q8: Who built the Shelby Mustang from 1965 to 1968? | A8: Shelby American | ✓ | ✓ | ✗ |
| Q9: During what years was the Shelby Mustang built by Shelby American? | A9: 1965 to 1968 | ✓ | ✓ | ✓ |
| Q10: In what year did Ford take over production of the Shelby Mustang? | A10: 1969 | ✓ | ✗ | |
| Q11: What was the final year that Shelby American built the Mustang? | A11: 1970 | ✗ | | |
| Q12: Who built the Shelby Mustang from 1969 to 1970? | A12: Ford | ✓ | ✓ | ✓ |
| Q13: What event in 2005 led to the revival of the Shelby Mustang? | A13: the introduction | ✗ | | |
| Q14: What generation of Mustang was introduced in 2005? | A14: the fifth generation | ✓ | ✗ | |
| Q15: What generation of Mustang was introduced in 2005? | A15: fifth | ✓ | ✗ | |
| Q16: In what year was the fifth generation of the Ford Mustang introduced? | A16: 2005 | ✓ | ✓ | ✓ |
| Q17: What name was brought back for the 2005 Ford Mustang? | A17: the Shelby nameplate | ✓ | ✗ | |
| Q18: What was the Shelby Mustang revived as? | A18: a new high-performance model | ✓ | ✓ | ✓ |

[The Shelby Mustang is a high performance variant of the Ford Mustang] which [was built by Shelby American] [from 1965 to 1968,] and [from 1969 to 1970 by Ford.] [Following the introduction of the fifth generation Ford Mustang in 2005,] [the Shelby nameplate was revived as a new high-performance model, this time designed and built by Ford.]
Question-Answer Generation
We generate QA pairs following an approach similar to Honovich et al. (2021, 2022). We convert the SQuAD reading comprehension dataset (Rajpurkar et al., 2018b) to a question generation dataset by concatenating the answer and context (with separators) and fine-tuning a sequence-to-sequence transformer model to predict the question. Specifically, we fine-tune the T5-11B checkpoint from Raffel et al. (2020); questions are decoded with a beam size of 4. During training, answer candidates are the answers provided in the SQuAD annotation. At inference time, answer candidates (i.e., base noun phrases and named entities) are identified in the output s using SpaCy, and questions are generated with the SQuAD-trained system. This procedure yields a large list of QA pairs (see Table 2 for the questions generated for the summary at the bottom), which we reduce using the filtering explained below.
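The sketch below illustrates the candidate-extraction and input-formatting step; the spaCy pipeline name and the separator format are illustrative placeholders, not the exact ones we used.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # any spaCy English pipeline with a parser and NER


def answer_candidates(summary: str):
    """Base noun phrases and named entities in output s, deduplicated."""
    doc = nlp(summary)
    spans = [chunk.text for chunk in doc.noun_chunks] + [ent.text for ent in doc.ents]
    seen, unique = set(), []
    for span in spans:
        if span.lower() not in seen:
            seen.add(span.lower())
            unique.append(span)
    return unique


def qg_input(answer: str, context: str) -> str:
    """Concatenate answer and context with a separator; the fine-tuned
    seq-to-seq model predicts the corresponding question."""
    return f"answer: {answer} context: {context}"


# One question-generation input per candidate answer in a summary sentence.
summary = "The Shelby Mustang is a high performance variant of the Ford Mustang."
qg_inputs = [qg_input(a, summary) for a in answer_candidates(summary)]
```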
Question-Answer Blueprints
Initially, we apply a Round-trip Consistency check (Alberti et al., 2019), which discards questions if they yield answers different from those used to generate them. In Table 2, Q11 is discarded as the answer it is paired with is wrong (1968 was the final year that Shelby American built the Mustang, not 1970). The same is the case for Q13, where the answer to the question ought to have been the introduction of the fifth generation Ford Mustang.
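A minimal sketch of this filter is shown below; answer_fn stands in for any extractive QA system (e.g., a SQuAD-trained reader), and the exact-match criterion is a simplification of the original check.

```python
def round_trip_filter(qa_pairs, text, answer_fn,
                      match=lambda a, b: a.strip().lower() == b.strip().lower()):
    """Keep only (question, answer) pairs whose question, when answered against
    `text` by a QA model (`answer_fn`), yields the answer used to generate it."""
    kept = []
    for question, answer in qa_pairs:
        predicted = answer_fn(question, text)
        if predicted is not None and match(predicted, answer):
            kept.append((question, answer))
    return kept
```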
To decrease the number of QA pairs further, we chunk the text (bottom block in Table 2) into propositions—a proposition is a sub-sentential unit which represents a single claim or fact (Stanovsky et al., 2018; Ernst et al., 2022). We use propositions instead of sentences since the latter can be too long and contain multiple facts. We split text into propositions based on punctuation (period, comma, and semicolon), coordination (e.g., and, but), relative pronouns (e.g., that, who), and prepositions (e.g., at, by). Following this simple approach, the summary in Table 2 is split into six propositions, shown within square brackets. We next match each proposition to a single QA pair heuristically, following a two-stage approach.
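The following sketch approximates this proposition splitter; the cue inventory is partial and purely illustrative.

```python
import re

# Lexical cues that start a new proposition (partial, illustrative inventory).
_CUES = r"\s+(?=(?:and|but|that|who|which|at|by|from|following)\b)"


def split_propositions(text: str):
    """Split a summary into rough propositions using punctuation and lexical cues."""
    propositions = []
    for piece in re.split(r"[.;,]", text):        # sentence-internal punctuation
        for part in re.split(_CUES, piece):       # split *before* each cue word
            part = part.strip()
            if part and part.lower() not in {"and", "but", "that", "who", "which"}:
                propositions.append(part)
    return propositions
```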
We first find the question whose answer is at the rightmost position within a proposition. If there are multiple such questions, we select the one with the longest answer. This first stage, which we call Rheme, is motivated by the theme-rheme structure (Vallduví and Vilkuna, 1998) of natural language sentences: Already known information (i.e., the theme) is usually placed first while new information (i.e., the rheme) is placed later in a sentence or phrase (Kruijff-Korbayová and Steedman, 2003). Following this idea, Rheme selection prioritizes new-information seeking questions. As can be seen in Table 2, it eliminates several questions (e.g., Q1–Q4) as their answers are not the rightmost element in the obtained propositions. Questions Q5 and Q6 are identical; however, we retain Q5 as it yields the longer answer.
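A sketch of the Rheme stage, assuming answer spans can be located in propositions by simple string matching:

```python
def rheme_select(propositions, qa_pairs):
    """For each proposition, keep the QA pair whose answer ends right-most inside it,
    breaking ties in favour of the longest answer."""
    selected = []
    for prop in propositions:
        best, best_key = None, None
        for question, answer in qa_pairs:
            pos = prop.rfind(answer)
            if pos < 0:
                continue  # answer span does not occur in this proposition
            key = (pos + len(answer), len(answer))  # right-most end, then longest answer
            if best_key is None or key > best_key:
                best, best_key = (question, answer), key
        if best is not None:
            selected.append(best)
    return selected
```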
The second stage, which we call Coverage, prioritizes the selection of informative QA pairs by selecting non-overlapping ones. Specifically, we first convert s to a bag of tokens and select the QA pair with the highest lexical overlap. We then remove the overlapping tokens from s, and repeat this greedy selection process until the bag is empty or the overlap is zero. Table 2 shows how Coverage further eliminates QA pairs Q5 and Q8. The remaining four QA pairs constitute the final blueprint b. Rather than defaulting to a random order, we sort these based on the location of the answer spans in s (see the final order in Table 1).
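A sketch of the Coverage stage follows; measuring overlap over the concatenated question and answer tokens is a simplifying assumption.

```python
from collections import Counter


def coverage_select(summary: str, qa_pairs):
    """Greedily pick the QA pair with the highest token overlap against the remaining
    bag of summary tokens, remove the covered tokens, and repeat; finally sort the
    surviving pairs by where their answers occur in the summary."""
    bag = Counter(summary.lower().split())

    def overlap(pair):
        question, answer = pair
        qa_tokens = Counter((question + " " + answer).lower().split())
        return sum(min(count, bag[token]) for token, count in qa_tokens.items())

    remaining, selected = list(qa_pairs), []
    while remaining and bag:
        best = max(remaining, key=overlap)
        if overlap(best) == 0:
            break
        selected.append(best)
        remaining.remove(best)
        for token in (best[0] + " " + best[1]).lower().split():
            if bag[token] > 0:
                bag[token] -= 1
                if bag[token] == 0:
                    del bag[token]
    # Order the blueprint by the position of each answer span in the summary.
    return sorted(selected, key=lambda qa: summary.lower().find(qa[1].lower()))
```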
3.3 Blueprint Models
We devised three seq-to-seq models, which differ in the way the output and its blueprint are generated.
End-to-End Model
A straightforward approach would be to take d as input and learn to first predict blueprint b as p(b|d), and then generate output s as p(s|b). However, this approach crucially relies on the blueprint being accurate and capturing all required information, which might be overly optimistic, given that blueprints (for training) are generated automatically. Moreover, pipeline architectures are known to suffer from error propagation, which in our case would undoubtedly affect generation performance, the final stage of the pipeline.
Rather than modeling the blueprint and output generation stages separately, we train an encoder-decoder model to encode d and generate b;s (i.e., the concatenation of the blueprint and output sequence) in one go. Essentially, the decoder first predicts blueprint b and then continues to generate output s, using both b and d. We prefix b and s with special markers “Plan:” and “Summary:”, respectively. In particular, we predict b as a1;q1;…;am;qm, namely, a (concatenated) sequence of answer-question pairs.3 The model is trained with the standard maximum-likelihood objective to generate the augmented target b;s. Interestingly, in this end-to-end model the blueprint functions as a macro-plan, i.e., a global sketch of the content and organization of the output.
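A sketch of this target linearization, assuming simple textual separators (the actual separator strings may differ):

```python
def linearize_e2e_target(blueprint, summary: str) -> str:
    """Build the augmented target b;s for the end-to-end model: the blueprint is
    rendered as answer-question pairs (answers first) under a 'Plan:' prefix,
    followed by the output under a 'Summary:' prefix."""
    plan = " ".join(f"{answer} | {question}" for question, answer in blueprint)
    return f"Plan: {plan} Summary: {summary}"
```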
Multi-task Model
It is generally challenging for encoder-decoder models to generate long output sequences (Ko and Li, 2020; Tan et al., 2021). The end-to-end model sketched above further amplifies this problem because it ultimately aims to generate sequence b;s rather than just s, increasing the sequence length by 220% (see Table 3).
Table 3: Dataset statistics for AQuaMuSe (AQuM), WikiCatSum (WCSum), and SummScreen-FD (SS-FD).

| | AQuM | WCSum | SS-FD |
|---|---|---|---|
| # queries | 8,162 | — | — |
| # examples | | | |
| train | 6,599 | 165,000 | 3,673 |
| dev | 714 | 8,723 | 338 |
| test | 849 | 9,166 | 337 |
| source | | | |
| # docs | 6.46 | 135.56 | 1.00 |
| # words | 12,986.88 | 7,455.75 | 8,051.74 |
| # sentences | 339.62 | 307.80 | 804.01 |
| # words/doc | 2,008.38 | 52.02 | 8,051.74 |
| target (original) | | | |
| # words | 114.07 | 115.61 | 126.73 |
| # sentences | 3.65 | 4.74 | 5.26 |
| novel unigrams | 0.02 | 0.13 | 0.17 |
| novel bigrams | 0.13 | 0.54 | 0.66 |
| novel trigrams | 0.24 | 0.78 | 0.92 |
| novel 4-grams | 0.31 | 0.86 | 0.98 |
| target (+blueprint) | | | |
| # QA-Pairs | 8.16 | 9.56 | 28.10 |
| # words | 272.68 | 291.28 | 597.90 |
To mitigate this problem, we propose a multi-task model optimized to perform two separate tasks. Let a and q denote an ordered sequence of answers (a1,…, am) and corresponding questions (q1,…, qm), in blueprint b. The model is trained to generate (a) the answer plan concatenated with output sequence a;s, and (b) the answer plan concatenated with questions a;q. In particular, we train a single encoder-decoder model to encode input d, while the decoder first predicts answer plan a (as p(a|d)) and then continues to generate output s (as p(s|a, d)) or corresponding questions q (as p(q|a, d)), depending on the task. We prefix a, q, and s with special markers “Plan:”, “Questions:”, and “Summary:”, respectively. We further prefix input d with “Generate Summary:” or “Generate Questions:” to instruct our model to generate output s or questions q, respectively. We sample data points from these two tasks with equal probability and train the model with the standard maximum-likelihood objective.
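The construction of the two training examples might look as follows; the semicolon separators are illustrative.

```python
def multitask_examples(document: str, blueprint, summary: str):
    """Build the two (input, target) pairs used by the multi-task model,
    with the task prefixes and plan/summary/question markers described above."""
    answers = "; ".join(a for _, a in blueprint)    # ordered answer plan a
    questions = "; ".join(q for q, _ in blueprint)  # corresponding questions q
    summary_task = (f"Generate Summary: {document}",
                    f"Plan: {answers} Summary: {summary}")
    question_task = (f"Generate Questions: {document}",
                     f"Plan: {answers} Questions: {questions}")
    return [summary_task, question_task]
```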
During inference, we use a two-step process to generate output s′ and its blueprint b′ for input d. We first prefix d with “Generate Summary:” and generate a′;s′, i.e., answer plan a′ followed by output sequence s′. We then prefix d with “Generate Questions:”, prompt our decoder with the predicted answer plan a′ and generate corresponding questions q′ for blueprint b′. The multi-task model alleviates the length issue discussed above by learning to generate a;s instead of b;s. However, this comes at the expense of generation quality, since the model now conditions on the answers only, not question-answer pairs. As such, it can be viewed as an extension of FROST (Narayan et al., 2021) with the plan being a sequence of answer spans rather than entity chains. This model also creates a macro-plan of the output, however, less detailed compared to the end-to-end model.
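A sketch of this two-step inference procedure, where generate(input_text, decoder_prompt="") stands in for any seq-to-seq decoding call and is purely illustrative:

```python
def multitask_inference(document: str, generate):
    """Two-step decoding: predict the answer plan and summary, then recover the
    questions by prompting the decoder with the predicted answer plan."""
    # Step 1: predict answer plan a' followed by output s'.
    plan_and_summary = generate(f"Generate Summary: {document}")
    answer_plan, _, summary = plan_and_summary.partition("Summary:")
    # Step 2: prompt the decoder with a' to generate the corresponding questions q'.
    # (Prefix handling is simplified; answer_plan still carries the 'Plan:' marker.)
    questions = generate(f"Generate Questions: {document}", decoder_prompt=answer_plan)
    return answer_plan.strip(), questions.strip(), summary.strip()
```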
Iterative Model
Rather than predicting a global plan (i.e., answer plan a or blueprint b) prior to generating output s, we employ an incremental approach that interleaves planning with text generation. Let output s consist of n sentences {s1, s2,…, sn}; then, the corresponding blueprint b can be represented as {b1, b2,…, bn}, where bi consists of ki question-answer pairs for sentence si. We train our model to iteratively plan and generate one sentence at a time, conditioning on the input and the output sentences generated so far. In particular, we train an encoder-decoder model where the encoder first encodes input d, while the decoder takes summary {s1,…, si} generated so far as a prompt and generates blueprint bi+1 for the next sentence si+1, followed by sentence si+1 itself.
The iterative model is trained on quadruples {(d, ϕ, b1, s1),…,(d, s1, i, bi +1, si +1),…,(d, s1, n−1, bn, sn),(d, s, bend, send)}, where ϕ is an empty context placeholder used to predict the first blueprint b1 and corresponding first sentence s1, (n + 1) is the blueprint length, and s1, i = {s1,…, si} are the output sentences generated so far; bend and send are special tokens marking the end of the output prediction. We prefix s1, i, bi, and si with special markers “Context:”, “Plan:”, and “Next Sentence:”, respectively. We train the model with the standard maximum-likelihood objective to predict s1, i;bi;si, however, we do not compute the loss for predicting context s1, i to avoid over-optimizing for sentences that appear at the beginning of the output.
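A sketch of how the training quadruples could be assembled; the end-of-summary markers are placeholders, and the loss masking over the context tokens is omitted here.

```python
def iterative_examples(document: str, sentences, sentence_blueprints,
                       end_plan="[END-PLAN]", end_sentence="[END]"):
    """Build (input, target) pairs for the iterative model; sentence_blueprints[i]
    holds the (question, answer) pairs aligned to sentences[i]."""
    examples = []
    for i in range(len(sentences)):
        context = " ".join(sentences[:i])  # empty for the first sentence
        plan = " ".join(f"{a} | {q}" for q, a in sentence_blueprints[i])
        target = f"Context: {context} Plan: {plan} Next Sentence: {sentences[i]}"
        examples.append((document, target))
    # Final example teaches the model when to stop generating.
    full_context = " ".join(sentences)
    examples.append(
        (document, f"Context: {full_context} Plan: {end_plan} Next Sentence: {end_sentence}"))
    return examples
```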
The iterative approach does not create a global macro plan. Rather, it learns micro content plans and verbalizes them one-by-one, conditioning on previously generated sentences but not on previously generated QA pairs. Although it does not have a global document view like the end-to-end model, the iterative decoder cannot exceed the maximum output sequence length as it plans and predicts one sentence at a time as bi;si, instead of generating b;s in one go. And unlike the multi-task model, each sentence si is generated by conditioning on the full blueprint bi (consisting of questions and answers).
4 Experimental Setup
4.1 Datasets
We evaluated our model on benchmarks representative of long-form question answering and summarization. Our datasets vary in terms of the input given to the generation model (e.g., multiple documents or one, web pages, or dialogue transcripts), the user’s information need (e.g., answering a question or aggregating information), and summary style (e.g., genuinely abstractive vs extractive). Common features among them are very long inputs and multi-sentence output summaries. We summarize various dataset statistics in Table 3.
AQuaMuSe
(Kulkarni et al., 2020, 2021) is a query-focused multi-document summarization dataset; it was created with the intent of simulating how a search engine might synthesize documents of high relevance to a user query. It consists of Google Natural Questions (Kwiatkowski et al., 2019) paired with web documents extracted from Common Crawl and long-form answers from Wikipedia. We approach this task as a generative QA problem where we take the query and associated web documents and generate a long-form answer to the query. We work on the split from Kulkarni et al. (2021); on average, each instance has 6.46 web documents (2,008 tokens per document), leading to very long input (12,987 tokens).
WikiCatSum
(Perez-Beltrachini et al., 2019) is a topic-focused multi-document summarization dataset where the goal is to generate Wikipedia abstracts (i.e., lead article sections) from a large set of webpages related to an entity or a topic. It focuses on three entities, namely, Films (59,973 instances), Companies (62,545 instances), and Animals (60,816 instances). In experiments, we collate the different data subsets into one, which we refer to collectively as WikiCatSum. The input webpages are truncated to the first 800 tokens.
SummScreen-FD
(Chen et al., 2022) is a recently released dialogue summarization dataset. It contains transcripts of TV episodes (e.g., Game of Thrones, CSI Las Vegas) and corresponding (community authored) summaries. The original dataset is divided into two complementary subsets; we use the ForeverDreaming (FD) subset released as part of the Scrolls benchmark (Shaham et al., 2022), which incorporates episodes from 88 different shows. SummScreen-FD is a challenging testbed for several reasons. Plot details are often expressed indirectly in conversations between characters and are scattered across the entire transcript. The summarization task is highly compressive: a transcript the size of a book (on average 8,000 tokens; see Table 3) is condensed into a few sentences, and the evaluation of such summaries comes with its own challenges (e.g., it is not realistic to expect humans to read the transcript to be able to assess their quality).
We further analyze the characteristics of these datasets in Table 3. Long-form answers in AQuaMuSe are mostly extractive with only 2%, 13%, 24%, and 31% novel unigrams, bigrams, trigrams, and 4-grams, respectively. In comparison, summaries in WikiCatSum and SummScreen-FD are more abstractive; WikiCatSum abstracts have 13% novel unigrams, 54% bigrams, 78% trigrams, and 86% 4-grams, whereas in SummScreen-FD summaries 17% of unigrams, 66% of bigrams, 92% of trigrams, and 98% of 4-grams were not seen in training. Interestingly, SummScreen-FD summaries have far more propositions than AQuaMuSe or WikiCatSum targets, leading to a much higher number of QA pairs in their blueprints (28.10 vs 8.16 or 9.56). This in turn makes the generation task for end-to-end models very challenging. The average summary length together with the blueprint annotations (i.e., b;s) for SummScreen-FD is almost twice the size of WikiCatSum and AQuaMuSe (597.90 vs 291.28 and 272.68). The majority of questions in AQuaMuSe and WikiCatSum are what questions (76.0% and 74.2%, respectively), followed by who, where, when, and how questions. For SummScreen-FD, what and who questions are most popular (50.1% and 42.9%, respectively).
4.2 Comparison Systems
All our experiments used LongT5 (Guo et al., 2021), an extension of the original T5 encoder (Raffel et al., 2020) with global-local attention sparsity patterns to handle long inputs. We compared a vanilla LongT5 model (xl, 3B parameters) fine-tuned on our datasets (with a maximum input sequence length of 4,096 tokens and a maximum output length of 512 tokens) against several blueprint variants. These include an end-to-end LongT5 model (E2E) which first decodes blueprint b and then continues to decode output s; a LongT5 multitask model (Multitask) which jointly learns to predict the answer plan followed by either the output s or the questions in b; and a LongT5 iterative model (Iterative) which plans and generates one sentence at a time.
In addition, we implemented a two-stage model (2-Stage), which first creates blueprint b given input d and then generates output s given b and d as input. Finally, we also fine-tuned T5 (xl, 3B parameters) on our datasets with a maximum input sequence length of 1,024 tokens and a maximum output length of 256 tokens, as a baseline. We present these comparisons in Table 4 together with the performance of various state-of-the-art systems.
We fine-tuned all our models with a learning rate of 0.001 and a batch size of 128, for 50K steps. We select the best checkpoints using average Rouge performance on the validation sets. During inference, we use beam search with size 5 and alpha 0.8.
5 Automatic Evaluation
In this section we present experimental results using automatic evaluation metrics that assess overall summary (and blueprint) quality. Moreover, we quantify the extent to which automatically generated output is grounded to the blueprint and faithful to the input document/s.
5.1 Metrics
Summary and Blueprint Quality
We evaluate summary quality automatically using (summary-level) Rouge F1 (Lin and Hovy, 2003). We report only RougeLSum in Table 4 for the sake of brevity. We also use RougeLSum to evaluate the quality of the automatically generated blueprint, i.e., the QA pairs and their order against the reference blueprint.
Informativeness and Grounding
We evaluate informativeness using QA-based metrics. Specifically, following the reading comprehension literature (Rajpurkar et al., 2016, 2018b), we quantify the extent to which the generated text can answer all questions from its reference (Informativeness) and predicted blueprint (Grounding). Following Stelmakh et al. (2022), we use a RoBERTa model (Liu et al., 2019) fine-tuned on SQuAD-V2 for question-answering in both cases.6 Given generated text s′ and question-answer pair (qi, ai) from the (reference or predicted) blueprint, we apply our question-answering model to s′ to predict answer ai′ to question qi. We then compute the token-level F1 score between predicted answer ai′ and ground truth answer ai, and report the average.
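A sketch of this QA-based score, with a standard SQuAD-style token F1 and answer_fn standing in for the fine-tuned reader:

```python
from collections import Counter


def token_f1(predicted: str, gold: str) -> float:
    """SQuAD-style token-level F1 between a predicted and a reference answer."""
    pred_tokens, gold_tokens = predicted.lower().split(), gold.lower().split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    num_same = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def qa_score(summary: str, blueprint, answer_fn) -> float:
    """Average token F1 over all blueprint questions answered against the summary."""
    scores = [token_f1(answer_fn(q, summary) or "", a) for q, a in blueprint]
    return sum(scores) / len(scores) if scores else 0.0
```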
Faithfulness
Hallucinations are a widely known issue with neural abstractive summarization (Song et al., 2018; Maynez et al., 2020; Kryscinski et al., 2020; Gabriel et al., 2021), especially when a sentence combines content from multiple sources (Lebanoff et al., 2019).
Following previous work (Maynez et al., 2020; Falke et al., 2019; Narayan et al., 2022; Honovich et al., 2022; Dušek and Kasner, 2020), we quantify the extent to which generated summaries are faithful to their input using textual entailment. We resort to textual entailment for two reasons: firstly, it is a relatively intuitive metric, as all information in a summary should be entailed by the source or at least not conflict with it; secondly, recent studies (Maynez et al., 2020; Fischer et al., 2022) have shown that it correlates with human judgments of faithfulness across summarization datasets and tasks.
Following Honovich et al. (2022), we trained an entailment model by fine-tuning T5-11B (Raffel et al., 2020) on the Adversarial NLI dataset (ANLI; Nie et al., 2020). For each sentence (hypothesis) in the summary, we compute its entailment probability given the input (premise) and report the average across all sentences to obtain an overall score (Maynez et al., 2020).
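The aggregation step amounts to the following sketch, where entail_prob stands in for the ANLI-fine-tuned classifier and is assumed to return the probability of the entailment label.

```python
def faithfulness_score(summary_sentences, document: str, entail_prob) -> float:
    """Average per-sentence entailment probability of the summary given the input."""
    if not summary_sentences:
        return 0.0
    probs = [entail_prob(document, sentence) for sentence in summary_sentences]
    return sum(probs) / len(probs)
```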
We further validated our ANLI entailment scores against human judgments of faithfulness elicited as part of SummEval (Fabbri et al., 2021), a recently released dataset for assessing automated summarization metrics. Our entailment predictions correlate well with human ratings, achieving a Spearman’s rank correlation of ρ = 0.774.
5.2 Results
Why LongT5 for Blueprint Models
All the tasks we are dealing with require modeling input of highly complex nature, which is often very long (see Table 3). Our results in Table 4 (see Rouge/summary column) demonstrate that T5 models always fall behind LongT5, underscoring the importance of sparse attention mechanisms for modeling long inputs. In fact, LongT5 sets a new state of the art on AQuaMuSe and SummScreen-FD. On WikiCatSum, it is slightly worse than Reflect (Song et al., 2022), an extract-then-abstract model which has a dedicated content selection module. Similar content selection techniques could also benefit LongT5, however, we leave this to future work. We henceforth use LongT5 as a base model for fine-tuning our blueprint models.
Blueprint Models and Rouge
Compared to LongT5, blueprint variants slightly underperform on AQuaMuSe, but score better on WikiCatSum and SummScreen-FD (see Multitask model). All differences between LongT5 and blueprint models are statistically significant (using paired bootstrap resampling; p < 0.05). For a fair comparison, we always use a maximum decoder length of 512 tokens. With the exception of AQuaMuSe, E2E is inferior to other blueprint models, which is not surprising since it has to generate much longer text (recall it predicts b;s rather than simply s). Overall, Multitask is significantly better than other blueprint models on WikiCatSum but on par with Iterative on SummScreen-FD.
Similar patterns emerge when evaluating the predicted blueprints against reference QA pairs, with Iterative significantly outperforming the other two variants on SummScreen-FD. This could be due to the fact that SummScreen-FD summaries have far more propositions than AQuaMuSe or WikiCatSum targets; it is better to predict them one sentence at a time, rather than all together. With regard to WikiCatSum, the difference between Multitask and Iterative is not significant (although Multitask has a slight numerical advantage) and both systems are significantly better than E2E. On AQuaMuSe, Multitask is significantly better than E2E and Iterative.
Note that all 2-Stage models are significantly worse in comparison to blueprint variants, when evaluating either their blueprints or summaries (in terms of Rouge). While our models learn to optimize blueprints and summaries together, 2-Stage models are faced with the harder task of predicting the blueprint solely based on the input (text-to-data). Since the blueprints learned by the first stage are of poor quality, the summaries generated in the second stage are also inferior.
Blueprint Models and Informativeness
Our blueprint annotation of reference summaries naturally provides a more principled alternative to Rouge. We can now use QA pairs in reference blueprints to evaluate the informativeness of predicted summaries. Results follow a pattern overall similar to Rouge, however, this approach reveals the complexity of the different generation tasks better than Rouge. While we were able to achieve reasonably high Rouge across datasets, we are far from generating informative summaries. On SummScreen-FD, in particular, we achieve a maximum Rouge score of 31.88, but are able to answer correctly only 7.59% of reference questions using the predicted summaries.
Across datasets, LongT5 performs on par with Multitask, the difference between the two models is not statistically significant, and the same is true of Iterative on SummScreen.
Blueprint Models and Grounding
The E2E and Iterative variants are significantly better than Multitask in generating texts grounded to their predicted blueprints (see ground. column in Table 4). This is because both models generate text conditioned on their blueprints; E2E first predicts blueprint b and then continues to generate output s using both b and the input, whereas Iterative plans and generates one sentence at a time as bi;si. This is not the case with Multitask, which generates s conditioned on answer spans only. E2E performs slightly better than Iterative on AQuaMuSe and WikiCatSum (differences are not statistically significant) but struggles on SummScreen-FD, where summaries are longer with more facts/propositions, requiring inference over long-range dependencies, and common sense reasoning. Iterative seems the best option for grounded generation without sacrificing informativeness (Iterative is most informative amongst blueprint models on SummScreen-FD, second best on AQuaMuSe, and third best on WikiCatSum).
Iterative Is Most Faithful Model
As far as faithfulness is concerned, Iterative performs consistently better than E2E and Multitask, as well as T5 and LongT5 models where text is generated from scratch without any planning (pairwise differences between Iterative and comparison systems are all significant with the exception of E2E on AQuaMuSe). On SummScreen-FD, Iterative brings large gains on faithfulness without sacrificing informativeness (both in terms of Rouge and QA-F1). The ANLI score for Iterative is 20.84, whereas it is below 10 for E2E and Multitask. E2E outperforms LongT5 on AQuaMuSe and WikiCatSum, but gains are smaller compared to Iterative.
We show examples of system output in Table 5, highlighting propositions that are not grounded to the input. E2E summaries are shorter, which is somewhat expected; the model has to decode both the plan and the summary, and in cases where the blueprint is large (e.g., in SummScreen-FD), there is no more room to decode the summary. Multitask is more verbose, however, the plan (a sequence of answer spans) is less detailed and as a result the summary is less accurate (Jackpot’s pretzels is a restaurant, not a killer). Iterative contains many details in the summary, more than the reference, which are not hallucinations. Both r2t-Bart and LongT5 are rather loose with the facts and generate multiple hallucinations.
Blueprint Models are Controllable
Our conceptualization of text plans as QA pairs brings inherent controllability to the generation process. By changing the blueprint, we can control content selection (i.e., what to say) and planning (i.e., in what order) without retraining the model or introducing additional control mechanisms. We provide an example in Table 6 where the plan predicted by the E2E model has been edited to render it more coherent and factual. As can be seen, the model is able to change its output according to the modified plan. Another example is shown in Table 7, where the output is rendered shorter by removing QA pairs from the predicted plan.
We are also able to control the faithfulness of predicted summaries as follows. We take the predicted plan and remove question-answer pairs (E2E, Iterative) or answer spans (Multitask) that cannot be answered based on the input. We then prompt our decoder with the modified plan and generate a new summary (or sentence for Iterative). In Table 8, we quantitatively evaluate +drop variants, which are controlled for faithfulness against vanilla blueprint models. We observe improvements in entailment scores across the board (see column entail. in the table), with the Iterative+drop performing best. Improvements on abstractive datasets (WikiCatSum and SummScreen-FD) are larger compared to AQuaMuSe which is mostly extractive (see Table 3). The minor drop in Rouge and informativeness is somewhat expected as the models now zoom in on information they can reliably talk about, improving the consistency of the output.
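A sketch of the +drop filtering step; answer_fn and the matching criterion are illustrative stand-ins rather than the exact procedure we used.

```python
def drop_unanswerable(blueprint, document: str, answer_fn):
    """Remove QA pairs that cannot be answered from the input before re-prompting
    the decoder with the edited plan. `answer_fn(question, text)` stands in for an
    extractive QA system returning None when unanswerable; the case-insensitive
    substring match below is a simplification."""
    kept = []
    for question, answer in blueprint:
        predicted = answer_fn(question, document)
        if predicted and (answer.lower() in predicted.lower()
                          or predicted.lower() in answer.lower()):
            kept.append((question, answer))
    return kept
```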
Finally, we also experiment with creating simple summaries, by forcing the Iterative model to generate from a single question-answer pair on each iteration (see +Q1 variant in Table 8). In the example shown in Table 9, Iterative+Q1 produces simple summary sentences, each focusing on a single information element. Interestingly, as far as the Iterative model is concerned, +Q1 variants are as faithful as +drop ones even if they do not explicitly control for faithfulness (across datasets the differences between the two models are not statistically significant). This suggests that controlling for simplicity might be sufficient to reduce hallucinations, however, at the expense of informativeness (Rouge scores for +Q1 variants tend to be significantly worse compared to +drop counterparts).
Most of the controllability cases we illustrate here are fully automatic and could be conceptualized as system flags that users select according to requirements (e.g., low tolerance for hallucinations, shorter summaries for small screen displays). Another potential use case would be to generate summaries for a set of questions provided by the user. Their input might be articles retrieved as an answer to a query, or in an educational context several chapters on a topic (e.g., cell biology). However, we leave this to future work.
5.3 Ablation Studies
As described in Section 3.2, we construct blueprint annotations using the Rheme- and Coverage-based selection strategies. Table 10 presents various ablations that provide rationales for these annotation choices. For the sake of brevity, we report experiments with the E2E model trained (for 50,000 steps) on AQuaMuSe. We observe very similar trends on the other two datasets. As can be seen, it is empirically better to form blueprints from answer-question pairs rather than predicting the questions first and then their answers, which is more natural (at least to humans). We further assessed whether sorting the QA pairs based on how they appear in the summary matters by defaulting to a random ordering (see −Sorted in the table). Removing either Rheme or Coverage has a small negative impact on the summaries but not their blueprints, while removing both is detrimental to summary quality; the absence of Sorting mostly affects the quality of the blueprint. It is not surprising that sorting matters most for generating a blueprint with correctly ordered propositions.
Table 10: Blueprint annotation ablations with the E2E model on AQuaMuSe (RougeLSum for summaries, blueprints, and both).

| E2E | summary | blueprint | both |
|---|---|---|---|
| QA Plan, Rheme, Covg, Sorted | 48.75 | 39.06 | 44.31 |
| AQ Plan, Rheme, Covg, Sorted | 50.86 | 39.95 | 45.60 |
| −Sorted, Random | 50.79 | 36.08 | 43.43 |
| −Rheme | 47.16 | 40.70 | 44.19 |
| −Coverage | 47.02 | 41.37 | 44.79 |
| −Rheme, −Coverage | 18.05 | 42.54 | 40.90 |
6 Human-based Evaluation
In addition to automatic evaluation, we conducted three human-based studies assessing different dimensions of output quality. Wishing to avoid well-documented issues with automated bots on Amazon Mechanical Turk and crowdworkers running through HITs as quickly as possible without paying attention to the tasks, we used a few trained annotators. They were given task-specific instructions and went through several pilots to iron out disagreements on edge cases.
6.1 Summary Quality
Our first study assessed overall summary quality. Specifically, we asked our annotators to select the best among three system summaries taking into account how much they deviated from the reference in terms of informativeness (are the summaries on topic or emphasize irrelevant details?) and overall fluency. We adapted the definition of fluency provided in Howcroft et al. (2020): Does the text ‘flow well’ or is it a sequence of unconnected parts?
We conducted our annotation study on 100 instances, each randomly sampled from AQuaMuSe, WikiCatSum, and SummScreen. We collected ratings from three annotators (after two rounds of pilot studies to improve agreement) for the output of seven systems. Overall, we obtained 100 (instances) x 3 (datasets) x 6 (systems) x 3 (annotators) = 5,400 annotations. Annotator agreement was 97.11%. Our results are presented in Table 11. We report the percentage of times each system was ranked best.
In general, we observe that LongT5 and blueprint models based on it are perceived as significantly better than previous state-of-the-art models (i.e., SiBERT and r2t-Bart). On AQuaMuSe, LongT5 is rated overall best, followed by E2E and Multitask (however, differences between them are not statistically significant). On WikiCatSum, E2E is rated best but is not significantly different compared to the other models. On SummScreen, our Iterative variant is rated best followed by LongT5. These results mirror the difficulty of the task (see Table 3): the longer the input/output, the better Iterative performs.
6.2 Blueprint Quality
We further evaluated the predicted plans more directly. Participants were shown QA blueprints and asked to assess whether they tell a coherent story (are they all relevant and ordered comprehensively?) using a 3-point scale (where 3 is best and 1 is worst). They were also asked to evaluate whether the plans have redundant QA pairs; a QA pair is redundant if it does not add new information to the plan. We collected judgments for the same instances used in our summary quality evaluation from three annotators whose overall agreement was 97.87% and obtained a total of 100 (instances) x 3 (datasets) x 5 (systems) x 3 (raters) = 4,500 annotations.
Table 12 shows the results of this study. We report mean scores per dataset for all blueprint models. As an upper bound, we further elicited annotations for blueprints automatically created from gold standard reference summaries (see row Gold in the table). E2E generates the most coherent blueprints: Differences between E2E and all comparison systems are statistically significant with the exception of the gold standard. This is not surprising, since all QA pairs in E2E are generated together, whereas in Multitask the spans and their corresponding questions are generated separately. Iterative only generates QA pairs for a sentence at a time and thus we would not expect it to be more coherent than models which generate a global document plan. With regard to redundancy, Iterative blueprints are generally most redundant, which is again down to not having a global view of previously generated QA pairs. Iterative further underscores issues with our question generation technology, which is far from perfect: for example, several QA pairs are different on the surface but actually semantically equivalent, and we have no means of detecting this without robust coreference resolution.
6.3 Blueprint Grounded Generation
We next examine whether model summaries are grounded to their blueprints. Specifically, we asked our annotators to decide whether each QA pair in the blueprint is mentioned in the summary, and report the number of times it isn’t. Ideally, we would like the summary to follow the blueprint as closely as possible. For QA pairs mentioned in the summary, we further asked our annotators to highlight whether the intent of the question was preserved or contradicted (we report the number of contradictions). Finally, we also asked participants to decide whether the summary has additional information which cannot be found in its blueprint, using a 3-point scale (where 3 is for summaries with lots of new information and 1 is for summaries with no new information). We elicited annotations for blueprint models, and, as an upper bound, for gold summaries and blueprints extrapolated from them. We obtained 100 (instances) x 3 (datasets) x 5 (systems) x 3 (raters) = 4,500 judgments.
The results of our grounding experiments are summarized in Table 13. Across datasets, we observe that Iterative summaries are most grounded. Iterative blueprints have the least number of questions that are absent from or contradict their generated texts. Iterative summaries also display the least amount of new information in relation to their blueprints. Iterative+drop is slightly less grounded compared to Iterative, however, this is not entirely surprising since we prompt the Iterative model with externally modified blueprints (see Iterative+drop in Table 13). Note that Iterative+drop summaries are deemed more faithful than Iterative summaries in automatic evaluation. The entailment scores improve for all three datasets (see Table 4).
7 Conclusion
In this work we proposed a novel plan-based approach to conditional generation. We conceptualized text plans as a sequence of QA pairs operating as a proxy for what to say and in what order. We developed Transformer-based models that generate by conditioning on a global QA blueprint plan (E2E, Multitask) or iteratively by planning and generating one sentence at a time (Iterative). Experimental results across three challenging datasets demonstrate that blueprint models are inherently more informative than vanilla sequence-to-sequence approaches without a planning component. Among the three presented here (E2E, Multitask, Iterative), we find that Iterative is the best choice for grounded generation and suggests a promising direction for long-form generation.
Blueprint models offer several advantages compared to blackbox generation. Model predictions can be examined, and errors can be traced back to the blueprint, which in turn can reveal whether the output is informative and faithful to its input. The formulation of the blueprint plan as question-answer pairs makes it intuitive and user-friendly. We have discussed how blueprint models might be used in a human-in-the-loop setting, where users interact with and influence model predictions directly, e.g., by editing the blueprint length and content (as different blueprints lead to different outputs). In the future, we would like to use blueprints more directly to advance methods for training language models using reward learning (Sutton and Barto, 2018), e.g., based on whether the output answers the blueprint questions. Rather than eliciting expensive human feedback (Stiennon et al., 2020), blueprints could provide a cheaper automatic alternative. Finally, although we focused primarily on the generation problem in this work, we believe blueprints might also be useful as a general-purpose approach to retrieving and organizing important content, especially when faced with many and very long inputs.
Acknowledgments
We thank the action editor and our reviewers for their valuable feedback. The human rating process was managed by Muqthar Mohammad, Kiranmai Chennuru, Ashwin Kakarla and their team; without them this work would not have been possible. Thanks for invaluable support from Sheila de Guia and Suneet Dhingra.
Notes
Our models, training data and predictions are available at https://github.com/google-research/google-research/tree/master/text_blueprint.
Predicting b as q1; a1; …; qm; am is more natural, but it led to inferior performance. See the ablation experiments in Section 5.3.
We used the publicly released checkpoints from https://github.com/google-research/longt5.
RougeLSum is very similar to ROUGE-L; while the latter is calculated on the summary as a whole, RougeLSum interprets newlines as sentence boundaries.
This is a high performing model reaching 86.8% exact-match accuracy and 89.8% F1 on SQuAD.
We release our instructions and annotation templates together with our data and models.