Conditional Generation with a Question-Answering Blueprint

Abstract The ability to convey relevant and faithful information is critical for many tasks in conditional generation and yet remains elusive for neural seq-to-seq models whose outputs often reveal hallucinations and fail to correctly cover important details. In this work, we advocate planning as a useful intermediate representation for rendering conditional generation less opaque and more grounded. We propose a new conceptualization of text plans as a sequence of question-answer (QA) pairs and enhance existing datasets (e.g., for summarization) with a QA blueprint operating as a proxy for content selection (i.e., what to say) and planning (i.e., in what order). We obtain blueprints automatically by exploiting state-of-the-art question generation technology and convert input-output pairs into input-blueprint-output tuples. We develop Transformer-based models, each varying in how they incorporate the blueprint in the generated output (e.g., as a global plan or iteratively). Evaluation across metrics and datasets demonstrates that blueprint models are more factual than alternatives which do not resort to planning and allow tighter control of the generation output.


Introduction
Neural generation models are often prone to hallucination (Song et al., 2018; Maynez et al., 2020; Kryscinski et al., 2020; Gabriel et al., 2021), repetition and redundancy (Li et al., 2018; Suzuki and Nagata, 2017), and struggle to identify which content units are salient (Tan et al., 2017a). These phenomena are amplified when generating long-form text, i.e., documents with multiple paragraphs (Wiseman et al., 2017), when dealing with non-linguistic data (e.g., database tables), or very long input, which is common when summarizing multiple documents (Liu and Lapata, 2019; Perez-Beltrachini et al., 2019), books (Kryściński et al., 2021), or dialogue (Chen et al., 2022; Zhong et al., 2021). An additional challenge concerns the black-box nature of deep learning systems, which hides the inherent complexity of modeling multiple interconnected linguistic phenomena in text generation and makes it difficult to examine model decisions and attribute errors to specific components. The lack of modularity further affects controllability, as these systems cannot be easily tailored to individual needs.
In this paper we aim to render conditional generation more modular via an intermediate, plan-based representation. While autoregressive models of language predict one token at a time, there is evidence that in humans some degree of planning occurs at a higher level than individual words (Levelt, 1993; Guhe, 2007). A long tradition in natural language generation views planning as a central component to identifying important content and structuring it appropriately (Reiter and Dale, 2000); however, there is less agreement on how plans should be represented.

Table 1: Question-answering (QA) blueprint for an AQuaMuSe summary. QA pairs were obtained from a state-of-the-art question generation and answer identification system (Alberti et al., 2019).
Our work proposes a new conceptualization of text plans as a sequence of question-answer pairs. Specifically, we draw inspiration from the "Questions under Discussion" (QUD) theory of discourse structure, which posits that one way of articulating the structure of a text is to identify the questions and sub-questions that are raised and answered by subsequent spans of text (Carlson, 1983; Ginzburg, 1994; Van Kuppevelt, 1995; Larson, 2002; Roberts, 2012; Riester, 2019). Theoretical models of Questions under Discussion assume that discourse contains implicit questions for each of the assertions made, which are thereby turned into answers. These questions and answers can be understood in terms of their use in moving a discourse forward to achieve communicative goals. We propose to make QUDs explicit by exploiting state-of-the-art question generation technology (Alberti et al., 2019; Lu and Lu, 2021) and use them as an intermediate representation layer for conditional generation, i.e., a question-answering (QA) blueprint operating as a proxy for both content selection (i.e., what to say) and planning (i.e., in what order).
Table 1 illustrates a plan for generating a Wikipedia abstract from the AQuaMuSe dataset (Kulkarni et al., 2020).
We enhance existing datasets (e.g., for summarization) with similar blueprints, which we obtain automatically. We then convert input-output pairs into input-blueprint-output tuples and propose to learn encoder-decoder models from these augmented annotations. We develop three models which vary in how they integrate blueprints in the generation process and in their ability to handle long outputs. Aside from generating blueprints and their corresponding text in one go, we propose a new architecture which iteratively plans and generates a sentence at a time, conditioning on the input and the output sentences generated so far. We do not generate a global blueprint; rather, our planning process is incremental and informed by generation, which we argue affords greater control over the output and its fluency. Moreover, the model is better equipped for long-form generation, since it does not have to (autoregressively) decode the blueprint and its summary in one go, avoiding the risk of exceeding the maximum decoder length.
We instantiate our models with a Transformer (Vaswani et al., 2017) encoder-decoder architecture and perform experiments on summarization datasets representing different information seeking tasks, application domains, and user requirements. In all cases, we empirically demonstrate that blueprint models are more factual than alternatives which do not resort to planning; we also observe that QA blueprints are a better representation compared to plans based on entity chains (Narayan et al., 2021), allowing tighter control of the output and providing a comprehensive explanation for model predictions (if the plan is erroneous, then the summary will be too).

Related Work
Questions under Discussion The Question Under Discussion (QUD)-based approach to discourse structure assumes an open-ended inventory of possible questions and sub-questions (Van Kuppevelt, 1995). Recent efforts (De Kuthy et al., 2018; Westera et al., 2020; Riester, 2019) have nevertheless shown it is possible to manually annotate documents with QUDs, i.e., to formulate a question for every assertion expressed in a text. De Kuthy et al. (2020) even go as far as to partially automate QUD annotation in German by automatically generating all potentially relevant questions for a given sentence. Related work (Ko et al., 2020) focuses on the generation of inquisitive questions that reflect general text understanding, and on free-form open-ended questions (Ko et al., 2021). Our work builds upon QUD and related discourse structure theories; however, we do not directly implement any of them in particular. We adopt question answering as a good way of spelling out the connection between the information structure of a sentence and the discourse in which the sentence can function.
QA Pairs as a Proxy for Annotation Labels Question-answer pairs have been previously used as a proxy for expressing semantic content. QA-SRL (He et al., 2015) is a representation based on QA pairs which has been shown to capture the vast majority of arguments and modifiers in PropBank (Palmer et al., 2005) and NomBank (Meyers et al., 2004). Instead of using a pre-defined role lexicon, QA-SRL labels semantic roles with questions, whose answers denote the argument bearing the role. Follow-on work uses QA pairs to represent discourse relations (Pyatkin et al., 2020) and to capture overlap or redundancy at the propositional level (Brook Weiss et al., 2021). We also employ QA pairs as an abstraction of propositional content; however, we do not target specific relation types or make any linguistic assumptions about them (e.g., discourse relations vs semantic roles).
Question-Answering in Summarization QA pairs have been used for evaluating summaries (Deutsch and Roth, 2021b; Eyal et al., 2019; Durmus et al., 2020; Wang et al., 2020), specifically as a means of estimating the information overlap between a reference summary and a system-generated one. QA-based signals have also been incorporated in the training of summarization models, using reinforcement learning (Arumae and Liu, 2018, 2019; Scialom et al., 2019) or as a way of identifying salient content in the input document (Deutsch and Roth, 2021a). Cao and Wang (2022) introduce the task of hierarchical question-summary generation, where a source document is condensed into multiple summaries, each answering a different question. Questions are organized hierarchically into broad questions and more specific sub-questions, which are learned from manual annotations. Our model outputs a QA-based plan and a single summary for a given document, although it is possible to generate different summaries from different plans for the same document. Our QA pairs are obtained automatically and they are not structured.
Planning in Encoder-Decoder Models Various recent efforts have developed planning modules in the context of data-to-text generation. In most cases, the plans are specific to the input, which varies from tables and records to RDF tuples. For instance, Puduppully et al. (2019a) learn a plan corresponding to a sequence of records and generate a summary conditioned on it. Narayan et al. (2020) treat content selection as a task similar to extractive summarization: they first extract sentence plans and then verbalize them one-by-one. Moryossef et al. (2019a,b) propose a symbolic planning stage followed by a neural realization stage. Other work (Puduppully and Lapata, 2021; Puduppully et al., 2022) advocates macro planning, where document content is organized into a sequence of paragraph plans which are verbalizations of tabular input. Our work is closest to Narayan et al. (2021), who also target summarization applications and learn an intermediate plan to guide generation. We adopt a more elaborate plan representation based on QA blueprints and interface decoding with plan generation similarly to Narayan et al. (2020).

Problem Formulation
Let d denote the input to the model, which could be a document (or multiple documents), a dialogue history, or even database tables. The model will learn to generate blueprint b for output s (e.g., a summary) and the output itself. The blueprint b is an ordered set of question-answer pairs {(q_1, a_1), (q_2, a_2), ..., (q_m, a_m)}. Unsurprisingly, such blueprints are not naturally occurring in existing datasets, which typically consist of (d, s) pairs. In the following, we explain how we automatically augment training examples (d, s) into tuples (d, b, s) with blueprints (Section 3.2) and then describe how we devise blueprint models based on them (Section 3.3).

Blueprint Annotation
We first explain how question-answer pairs are automatically (over-)generated for output s, and subsequently filtered to obtain blueprint b (see Table 2).
Question-Answer Generation We generate QA pairs following an approach similar to Honovich et al. (2021, 2022). We convert the SQuAD reading comprehension dataset (Rajpurkar et al., 2018b) to a question generation dataset by concatenating the answer and context (with separators) and fine-tuning a sequence-to-sequence Transformer model to predict the question. Specifically, we fine-tune the T5-11B checkpoint from Raffel et al. (2020); questions are decoded with a beam size of 4.
During training, answer candidates are the answers provided in the SQuAD annotation. At inference time, answer candidates (i.e., base noun phrases and named entities) are identified in the output s using spaCy, and questions are generated with the SQuAD-trained system. This procedure yields a large list of QA pairs (see in Table 2 the questions generated for the summary at the bottom), which we reduce using the filtering explained below.
Question-Answer Blueprints Initially, we apply a round-trip consistency check (Alberti et al., 2019), which discards questions if they yield answers different from those used to generate them. In Table 2, Q11 is discarded as the answer it is paired with is wrong (1968 was the final year that Shelby American built the Mustang, not 1970).
The same is the case for Q13, where the answer to the question ought to have been the introduction of the fifth generation Ford Mustang.
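The round-trip consistency check can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `qa_model` callable stands in for the actual QA system, and the toy model below is a hypothetical stub.

```python
def round_trip_filter(qa_pairs, context, qa_model):
    """Keep only QA pairs whose question, when re-answered against the
    context by a QA model, yields the answer originally used to generate it."""
    kept = []
    for question, answer in qa_pairs:
        predicted = qa_model(question, context)
        if predicted.strip().lower() == answer.strip().lower():
            kept.append((question, answer))
    return kept

# Toy stand-in for a real QA model, keyed on the question text only.
answers = {"When did production end?": "1968",
           "Who built the car?": "Shelby American"}
toy_qa = lambda question, context: answers.get(question, "")

kept = round_trip_filter(
    [("When did production end?", "1970"),        # inconsistent, dropped
     ("Who built the car?", "Shelby American")],  # consistent, kept
    context="Shelby American built the car until 1968.",
    qa_model=toy_qa,
)
```

In practice the comparison would use the same answer normalization as the QA model's evaluation, rather than plain lowercasing.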
To decrease the number of QA pairs further, we chunk the text (bottom block in Table 2) into propositions; a proposition is a sub-sentential unit which represents a single claim or fact (Stanovsky et al., 2018; Ernst et al., 2022). We use propositions instead of sentences since the latter can be too long and contain multiple facts.
We split text into propositions based on punctuation (period, comma, and semicolon), coordination (e.g., and, but), relative pronouns (e.g., that, who), and prepositions (e.g., at, by). Following this simple approach, the summary in Table 2 is split into six propositions, shown within square brackets. We next match each proposition to a single QA pair heuristically, following a two-stage approach.
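The cue-based proposition splitter just described might be sketched as below; the cue inventory here is illustrative, not the paper's exact list.

```python
import re

# Boundary cues: punctuation, coordinating conjunctions, relative
# pronouns, and a few prepositions (an illustrative subset).
_CUES = r"(?:,|;|\.|\band\b|\bbut\b|\bthat\b|\bwho\b|\bwhich\b|\bat\b|\bby\b)"

def split_propositions(sentence):
    """Split a sentence into rough sub-sentential propositions
    at every occurrence of a boundary cue."""
    parts = re.split(_CUES, sentence)
    return [p.strip() for p in parts if p.strip()]

props = split_propositions(
    "The car was built by Shelby American, and production ended in 1968.")
```

Note the non-capturing group: `re.split` drops the cue tokens themselves, so each returned chunk contains only proposition content.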
We first find the question whose answer is at the rightmost position within a proposition. If there are multiple such questions, we select the one with the longest answer. This first stage, which we call Rheme, is motivated by the theme-rheme structure (Vallduví and Vilkuna, 1998) of natural language sentences: already known information (i.e., the theme) is usually placed first, while new information (i.e., the rheme) is placed later in a sentence or phrase (Kruijff-Korbayová and Steedman, 2003). Following this idea, Rheme selection prioritizes new-information-seeking questions. As can be seen in Table 2, it eliminates several questions (e.g., Q1-Q4) as their answers are not the rightmost element in the obtained propositions. Questions Q5 and Q6 are identical; however, we retain Q5 as it yields the longest answer.
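The Rheme stage (rightmost answer, ties broken by answer length) can be sketched as:

```python
def rheme_select(proposition, qa_pairs):
    """Return the QA pair whose answer occurs rightmost in the
    proposition; ties are broken in favour of the longest answer."""
    scored = []
    for question, answer in qa_pairs:
        position = proposition.rfind(answer)
        if position >= 0:  # answer occurs in this proposition
            scored.append((position, len(answer), question, answer))
    if not scored:
        return None
    # Tuple comparison: rightmost position first, then longest answer.
    _, _, question, answer = max(scored)
    return question, answer

best = rheme_select(
    "production of the Shelby Mustang ended in 1968",
    [("What ended in 1968?", "production"),
     ("When did production end?", "1968")],
)
```

This is a sketch under the assumption that answers are matched by exact string position; the authors' matching heuristic may differ in detail.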
The second stage, which we call Coverage, prioritizes the selection of informative QA pairs by selecting non-overlapping ones. Specifically, we first convert s to a bag of tokens and select the QA pair with the highest lexical overlap. We then remove the overlapping tokens from s, and repeat this greedy selection process until the bag is empty or the overlap is zero. Table 2 shows how Coverage further eliminates QA pairs Q5 and Q8. The remaining four QA pairs constitute the final blueprint b. Rather than defaulting to a random order, we sort these based on the location of the answer spans in s (see the final order in Table 1).
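The greedy Coverage selection can be sketched as follows; whitespace tokenization here is an assumption made for illustration.

```python
from collections import Counter

def coverage_select(summary, qa_pairs):
    """Greedily select QA pairs with the highest token overlap against the
    summary's bag of tokens, removing covered tokens after each pick."""
    bag = Counter(summary.lower().split())
    remaining = list(qa_pairs)
    selected = []

    def overlap(pair):
        pair_tokens = Counter((pair[0] + " " + pair[1]).lower().split())
        return sum((bag & pair_tokens).values())  # multiset intersection

    while bag and remaining:
        best = max(remaining, key=overlap)
        if overlap(best) == 0:  # nothing left to cover
            break
        selected.append(best)
        remaining.remove(best)
        # Remove covered tokens from the bag (negative counts are dropped).
        bag -= Counter((best[0] + " " + best[1]).lower().split())
    return selected

picked = coverage_select(
    "shelby american built the car",
    [("Who built the car?", "Shelby American"),
     ("What is the capital of France?", "Paris")],
)
```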

Blueprint Models
We devised three seq-to-seq models, which differ in the way the output and its blueprint are generated.
End-to-End Model A straightforward approach would be to take d as input and learn to first predict blueprint b as p(b|d), and then generate output s as p(s|b). However, this approach crucially relies on the blueprint being accurate and capturing all required information, which might be overly optimistic, given that blueprints (for training) are generated automatically. Moreover, pipeline architectures are known to suffer from error propagation, which in our case would undoubtedly affect generation performance, the final stage of the pipeline.
Rather than modeling the blueprint and output generation stages separately, we train an encoder-decoder model to encode d and generate b; s (i.e., the concatenation of the blueprint and output sequence) in one go. Essentially, the decoder first predicts blueprint b and then continues to generate output s, using both b and d. We prefix b and s with special markers "Plan:" and "Summary:", respectively. In particular, we predict b as a_1; q_1; ...; a_m; q_m, namely a (concatenated) sequence of answer-question pairs. The model is trained with the standard maximum-likelihood objective to generate the augmented target b; s. Interestingly, in this end-to-end model the blueprint functions as a macro-plan, i.e., a global sketch of the content and organization of the output.
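The linearization of the augmented target can be illustrated with a small formatting sketch; the "Plan:"/"Summary:" markers follow the text above, while the exact separator characters are an assumption.

```python
def e2e_target(blueprint, summary):
    """Linearize blueprint b as 'a1; q1; ...; am; qm' and concatenate
    it with the summary, using the 'Plan:'/'Summary:' prefixes."""
    # Each pair is emitted answer-first, matching a_1; q_1; ...; a_m; q_m.
    plan = "; ".join(f"{answer}; {question}" for question, answer in blueprint)
    return f"Plan: {plan} Summary: {summary}"

target = e2e_target(
    [("Who built the car?", "Shelby American")],
    "Shelby American built the car.")
```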
Multi-task Model It is generally challenging for encoder-decoder models to generate long output sequences (Ko and Li, 2020; Tan et al., 2021). The end-to-end model sketched above further amplifies this problem because it ultimately aims to generate sequence b; s rather than just s, increasing the sequence length by 220% (see Table 3).
To mitigate this problem, we propose a multi-task model optimized to perform two separate tasks. Let a and q denote an ordered sequence of answers (a_1, ..., a_m) and corresponding questions (q_1, ..., q_m) in blueprint b. The model is trained to generate (a) the answer plan concatenated with the output sequence, a; s, and (b) the answer plan concatenated with the questions, a; q. In particular, we train a single encoder-decoder model to encode input d, while the decoder first predicts answer plan a (as p(a|d)) and then continues to generate output s (as p(s|a, d)) or corresponding questions q (as p(q|a, d)), depending on the task. We prefix a, q, and s with special markers "Plan:", "Questions:", and "Summary:", respectively. We further prefix input d with "Generate Summary:" or "Generate Questions:" to instruct our model to generate output s or questions q, respectively. We sample data points from these two tasks with equal probability and train the model with the standard maximum-likelihood objective.
During inference, we use a two-step process to generate output s′ and its blueprint b′ for input d. We first prefix d with "Generate Summary:" and generate a′; s′, i.e., answer plan a′ followed by output sequence s′. We then prefix d with "Generate Questions:", prompt our decoder with the predicted answer plan a′, and generate corresponding questions q′ for blueprint b′. The multi-task model alleviates the length issue discussed above by learning to generate a; s instead of b; s. However, this comes at the expense of generation quality, since the model now conditions on the answers only, not question-answer pairs. As such, it can be viewed as an extension of FROST (Narayan et al., 2021), with the plan being a sequence of answer spans rather than entity chains. This model also creates a macro-plan of the output, albeit less detailed compared to the end-to-end model.
Iterative Model Rather than predicting a global plan (i.e., answer plan a or blueprint b) prior to generating output s, we employ an incremental approach which interleaves planning with text generation. Let output s consist of n sentences {s_1, s_2, ..., s_n}; then, the corresponding blueprint b can be represented as {b_1, b_2, ..., b_n}, where b_i = {(a^i_{j+1}, q^i_{j+1}), ..., (a^i_{j+k}, q^i_{j+k})} consists of k question-answer pairs for sentence s_i. We train our model to iteratively plan and generate one sentence at a time, conditioning on the input and the output sentences generated so far. In particular, we train an encoder-decoder model where the encoder first encodes input d, while the decoder takes summary {s_1, ..., s_i} generated so far as a prompt and generates blueprint b_{i+1} for next sentence s_{i+1}, followed by sentence s_{i+1} itself.
The iterative model is trained on quadruples (d, s_{1,i}, b_{i+1}, s_{i+1}) for 0 ≤ i ≤ n, where φ (= s_{1,0}) is an empty context placeholder used to predict the first blueprint b_1 and corresponding first sentence s_1, (n + 1) is the blueprint length, and s_{1,i} = {s_1, ..., s_i} are the output sentences generated so far; b_end and s_end are special tokens marking the end of the output prediction. We prefix s_{1,i}, b_i, and s_i with special markers "Context:", "Plan:", and "Next Sentence:", respectively. We train the model with the standard maximum-likelihood objective to predict s_{1,i}; b_i; s_i; however, we do not compute the loss for predicting context s_{1,i}, to avoid over-optimizing for sentences that appear at the beginning of the output.
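The assembly of training quadruples from a blueprint-annotated example can be sketched as below. The end-marker strings are placeholders for the special tokens, not the paper's actual vocabulary.

```python
def iterative_examples(d, sentence_blueprints, sentences,
                       b_end="[B_END]", s_end="[S_END]"):
    """Build (input, context, plan, next-sentence) training quadruples:
    the context is the output generated so far (empty for the first step),
    and a final example teaches the model to stop."""
    examples = []
    for i, (b_i, s_i) in enumerate(zip(sentence_blueprints, sentences)):
        context = " ".join(sentences[:i])  # phi (empty string) when i == 0
        examples.append((d, context, b_i, s_i))
    # Final step: full context, end-of-output markers as plan and sentence.
    examples.append((d, " ".join(sentences), b_end, s_end))
    return examples

exs = iterative_examples("doc", [["qa1"], ["qa2"]], ["S1.", "S2."])
```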
The iterative approach does not create a global macro plan. Rather, it learns micro content plans and verbalizes them one-by-one, conditioning on previously generated sentences but not on previously generated QA pairs. Although it does not have a global document view like the end-to-end model, the iterative decoder cannot exceed the output sequence length, as it plans and predicts one sentence at a time as b_i; s_i, instead of generating b; s in one go. And unlike the multi-task model, each sentence s_i is generated by conditioning on the full blueprint b_i (consisting of questions and answers).

Datasets
We evaluated our model on benchmarks representative of long-form question answering and summarization. Our datasets vary in terms of the input given to the generation model (e.g., multiple documents or one, web pages, or dialogue transcripts), the user's information need (e.g., answering a question or aggregating information), and summary style (e.g., genuinely abstractive vs extractive). Common features among them are very long inputs and multi-sentence output summaries. We summarize various dataset statistics in Table 3: we quantify the abstractiveness of the target by measuring the proportion of n-grams unseen in the source, and also report statistics on the target length augmented with the blueprint (number of QA pairs and words in total).
AQuaMuSe (Kulkarni et al., 2020, 2021) is a query-focused multi-document summarization dataset; it was created with the intent of simulating how a search engine might synthesize documents of high relevance to a user query. It consists of Google Natural Questions (Kwiatkowski et al., 2019) paired with web documents extracted from Common Crawl and long-form answers from Wikipedia. We approach this task as a generative QA problem where we take the query and associated web documents and generate a long-form answer to the query.

For SummScreen-FD, we work on the split from Shaham et al. (2022), which incorporates episodes from 88 different shows. SummScreen-FD is a challenging testbed for several reasons. Plot details are often expressed indirectly in conversations between characters and are scattered across the entire transcript. The summarization task is highly compressive: a transcript the size of a book (on average 8,000 tokens; see Table 3) is condensed into a few sentences, and the evaluation of such summaries comes with its own challenges (e.g., it is not realistic to expect humans to read the transcript to be able to assess their quality).
We further analyze the characteristics of these datasets in Table 3. Long-form answers in AQuaMuSe are mostly extractive, with only 2%, 13%, 24%, and 31% novel unigrams, bigrams, trigrams, and 4-grams, respectively. In comparison, summaries in WikiCatSum and SummScreen-FD are more abstractive; WikiCatSum abstracts have 13% novel unigrams, 54% bigrams, 78% trigrams, and 86% 4-grams, whereas in SummScreen-FD summaries 17% of unigrams, 66% of bigrams, 92% of trigrams, and 98% of 4-grams were not seen in the training data. Interestingly, SummScreen-FD summaries have far more propositions than AQuaMuSe or WikiCatSum targets, leading to a much higher number of QA pairs in their blueprints (28.10 vs 8.16 or 9.56). This in turn makes the generation task for end-to-end models very challenging. The average summary length together with the blueprint annotations (i.e., b; s) for SummScreen-FD is almost twice the size of WikiCatSum and AQuaMuSe (597.90 vs 291.28 and 272.68). The majority of questions in AQuaMuSe and WikiCatSum are what questions (76.0% and 74.2%, respectively), followed by who, where, when, and how questions. For SummScreen-FD, what and who questions are most popular (50.1% and 42.9%, respectively).

Comparison Systems
All our experiments used LONGT5 (Guo et al., 2021), an extension of the original T5 encoder (Raffel et al., 2020) with global-local attention sparsity patterns to handle long inputs. We compared a vanilla LONGT5 model (xl, 3B parameters) fine-tuned on our datasets (with a maximum input sequence length of 4,096 tokens and a maximum output length of 512 tokens) against several blueprint variants. These include an end-to-end LONGT5 model (E2E), which first decodes blueprint b and then continues to decode output s; a LONGT5 multi-task model (MULTITASK), which jointly learns to predict the answer plan followed by either the output s or the questions in b; and a LONGT5 iterative model (ITERATIVE), which plans and generates one sentence at a time.
In addition, we implemented a two-stage model (2-STAGE), which first creates blueprint b given input d and then generates output s given b and d as input. Finally, we also fine-tuned T5 (xl, 3B parameters) on our datasets, with a maximum input sequence length of 1,024 tokens and a maximum output length of 256 tokens, as a baseline. We present these comparisons in Table 4, together with the performance of various state-of-the-art systems.
We fine-tuned all our models with a learning rate of 0.001 and a batch size of 128, for 50K steps. We selected the best checkpoints using average Rouge performance on the validation sets. During inference, we use beam search with size 5 and alpha 0.8.

Automatic Evaluation
In this section we present experimental results using automatic evaluation metrics which assess overall summary (and blueprint) quality. Moreover, we quantify the extent to which automatically generated output is grounded in the blueprint and faithful to the input document(s).

Metrics
Summary and Blueprint Quality We evaluate summary quality automatically using (summary-level) Rouge F1 (Lin and Hovy, 2003). We report only RougeLSum in Table 4 for the sake of brevity. We also use RougeLSum to evaluate the quality of the automatically generated blueprint, i.e., the QA pairs and their order, against the reference blueprint.
Informativeness and Grounding We evaluate informativeness using QA-based metrics. Specifically, following the reading comprehension literature (Rajpurkar et al., 2016, 2018a), we quantify the extent to which the generated text can answer all questions from its reference blueprint (Informativeness) and predicted blueprint (Grounding). Following Stelmakh et al. (2022), we use a RoBERTa model (Liu et al., 2019) fine-tuned on SQuAD-V2 for question answering in both cases. Given generated text s′ and question-answer pair (q_i, a_i) from the (reference or predicted) blueprint, we apply our question-answering model to s′ to predict answer a′_i to question q_i. We then compute the token-level F1 score between predicted answer a′_i and ground truth answer a_i, and report the average.
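The token-level F1 used here is the standard SQuAD answer-overlap measure; a minimal sketch, omitting SQuAD's article and punctuation normalization:

```python
from collections import Counter

def token_f1(predicted, gold):
    """SQuAD-style token-level F1 between a predicted and a gold answer."""
    pred_tokens = predicted.lower().split()
    gold_tokens = gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, predicting "the fifth generation" against the gold answer "fifth generation" gives precision 2/3 and recall 1, hence F1 = 0.8.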
Following previous work (Maynez et al., 2020; Falke et al., 2019; Narayan et al., 2022; Honovich et al., 2022; Dušek and Kasner, 2020), we quantify the extent to which generated summaries are faithful to their input using textual entailment. We resort to textual entailment for two reasons: firstly, it is a relatively intuitive metric, as all information in a summary should be entailed by the source, or at least not conflict with it; secondly, recent studies (Maynez et al., 2020; Fischer et al., 2022) have shown that it correlates with human judgments of faithfulness across summarization datasets and tasks.
Following Honovich et al. (2022), we trained an entailment model by fine-tuning T5-11B (Raffel et al., 2020) on the Adversarial NLI dataset (ANLI; Nie et al., 2020). For each sentence (hypothesis) in the summary, we compute its entailment probability given the input (premise) and report the average across all sentences to obtain an overall score (Maynez et al., 2020).
More formally, let E denote a textual entailment model that predicts E(a, b), namely the probability that text b is entailed by text a. The faithfulness score F of summary s containing sentences s_1, ..., s_n with respect to input D is computed as:

F(s, D) = (1/n) Σ_{i=1}^{n} E(D, s_i),

where n is the number of sentences in the summary. If the input is longer than the T5 maximum encode length, we split it, calculate the entailment probability per split, and take the maximum. We convert probabilities to binary labels using a threshold (1 if > 0.5, and 0 otherwise). We further validated our ANLI entailment scores against human judgments of faithfulness elicited as part of SummEval (Fabbri et al., 2021), a recently released dataset for assessing automated summarization metrics. Our entailment predictions correlate well with human ratings, achieving a Spearman's rank correlation of ρ = 0.774.
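Putting the pieces together, the faithfulness computation (max over input splits, binarization, per-sentence average) can be sketched as follows, with a toy entailment scorer standing in for the ANLI-finetuned T5:

```python
def faithfulness(entail_prob, splits, sentences, threshold=0.5):
    """Average binarized entailment over summary sentences; for long
    inputs, take the max probability across input splits."""
    labels = []
    for sentence in sentences:
        p = max(entail_prob(split, sentence) for split in splits)
        labels.append(1 if p > threshold else 0)
    return sum(labels) / len(labels)

# Toy entailment model: "entailed" iff the sentence occurs in the premise.
toy = lambda premise, hypothesis: 0.9 if hypothesis in premise else 0.1
score = faithfulness(toy, ["A. B."], ["A.", "C."])
```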

Results
Why LONGT5 for Blueprint Models All the tasks we are dealing with require modeling input of a highly complex nature, which is often very long (see Table 3). Our results in Table 4 (see Rouge/summary column) demonstrate that T5 models always fall behind LONGT5, underscoring the importance of sparse attention mechanisms for modeling long inputs. In fact, LONGT5 sets a new state of the art on AQuaMuSe and SummScreen-FD. On WikiCatSum, it is slightly worse than REFLECT (Song et al., 2022), an extract-then-abstract model which has a dedicated content selection module. Similar content selection techniques could also benefit LONGT5; however, we leave this to future work. We henceforth use LONGT5 as a base model for fine-tuning our blueprint models.

Table 4: Results on AQuaMuSe are taken from Kulkarni et al. (2021). BART and REFLECT (extract-then-abstract) results are taken from Song et al. (2022). Hybrid R2T-BART (content selection + generation) results are taken from Chen et al. (2022). Best results for each task are boldfaced. Scores that are not significantly different (using paired bootstrap resampling; p < 0.05) from the best score in each column are marked with a dagger (†).
Blueprint Models and Rouge Compared to LONGT5, blueprint variants slightly underperform on AQuaMuSe, but score better on WikiCatSum and SummScreen-FD (see the MULTITASK model). All differences between LONGT5 and blueprint models are statistically significant (using paired bootstrap resampling; p < 0.05). For a fair comparison, we always use a maximum decoder length of 512 tokens. With the exception of AQuaMuSe, E2E is inferior to the other blueprint models, which is not surprising since it has to generate much longer text (recall it predicts b; s rather than simply s). Overall, MULTITASK is significantly better than the other blueprint models on WikiCatSum but on par with ITERATIVE on SummScreen-FD.
Similar patterns emerge when evaluating the predicted blueprints against reference QA pairs, with ITERATIVE significantly outperforming the other two variants on SummScreen-FD. This could be due to the fact that SummScreen-FD summaries have far more propositions than AQuaMuSe or WikiCatSum targets; it is better to predict them one sentence at a time, rather than all together. With regard to WikiCatSum, the difference between MULTITASK and ITERATIVE is not significant (although MULTITASK has a slight numerical advantage), and both systems are significantly better than E2E. On AQuaMuSe, MULTITASK is significantly better than E2E and ITERATIVE.
Note that all 2-STAGE models are significantly worse in comparison to blueprint variants, when evaluating either their blueprints or summaries (in terms of Rouge). While our models learn to optimize blueprints and summaries together, 2-STAGE models are faced with the harder task of predicting the blueprint solely based on the input (text-to-data). Since the blueprints learned by the first stage are of poor quality, the summaries generated in the second stage are also inferior.
Blueprint Models and Informativeness Our blueprint annotation of reference summaries naturally provides a more principled alternative to Rouge: we can now use QA pairs in reference blueprints to evaluate the informativeness of predicted summaries. Results follow a pattern overall similar to Rouge; however, this approach reveals the complexity of the different generation tasks better than Rouge. While we were able to achieve reasonably high Rouge across datasets, we are far from generating informative summaries. On SummScreen-FD, in particular, we achieve a maximum Rouge score of 31.88, but are able to answer correctly only 7.59% of reference questions using the predicted summaries.
Across datasets, LONGT5 performs on par with MULTITASK (the difference between the two models is not statistically significant), and the same is true of ITERATIVE on SummScreen-FD.
Reference Summary Grissom and Catherine investigate the death of a man found in a dumpster, who is found to have had a severe eating disorder. Meanwhile, Nick and Sara investigate the death of a couple at the brink of a bitter divorce, in which evidence seems to point to the couple's dog.

R2T-BART
Catherine, Sara and Grissom investigate the death of a man who was found dead in a garbage bin. The victim's wife, Lori Tinsley, was a poker player. The man's brother, Jesse Menyel, was also found dead. The case is complicated by the fact that the victim had a gun in his apartment. Meanwhile, Nick and Warrick investigate the murder of a woman who died in a car crash. The woman's husband, Greg Colletti, is the suspect.

LONGT5
Grissom and Catherine investigate when a man is found dead in a dumpster. They soon discover a lot more went on in the kitchen than cooking. Meanwhile Nick and Sara are called to the scene of a double homicide. The victims are a husband and his wife who were both in the process of selling off their rare records. Suspicion quickly falls on the wife's ex-boyfriend, but the evidence increasingly points to the husband.

E2E
Grissom, Nick and Catherine investigate when a man is found dead in a dumpster. Meanwhile a couple are found dead in their home. Sara and Nick investigate a case involving a record collection.

MULTITASK
Grissom and Catherine investigate when a man is found dead in the garbage. Their investigation leads to Jackpot's pretzels as the killer. Meanwhile Sara and Nick are called to a double murder when a husband and his wife are found dead in their home. When the husband and wife were found dead, their extensive record collection was also missing.

ITERATIVE
Grissom, Catherine and David investigate when a man is found dead in a dumpster. The man had been eating at a restaurant called Aunt Jackpot's Pretzels. They discover that he ate himself to death. Meanwhile Nick and Sara look into the disappearance of a husband and wife who are found dead in their home. Also missing is a record collection that the husband had been collecting. They discover that the wife's neck was slashed in the attack. CSIs track down Missy Halter, a woman who helped them find the records.
Table 5: System output and reference summary for SummScreen-FD (CSI S6.E9, "Dog Eat Dog"). Propositions which are not grounded in the input are in red. Generated questions from blueprint models are not shown due to space constraints.
Blueprint Models and Grounding The E2E and ITERATIVE variants are significantly better than MULTITASK at generating texts grounded in their predicted blueprints (see the ground. column in Table 4). This is because both models generate text conditioned on their blueprints: E2E first predicts blueprint b and then continues to generate output s using both b and the input, whereas ITERATIVE plans and generates one sentence at a time as b_i; s_i. This is not the case with MULTITASK, which generates s conditioned on answer spans only. E2E performs slightly better than ITERATIVE on AQuaMuSe and WikiCatSum (differences are not statistically significant) but struggles on SummScreen-FD, where summaries are longer with more facts/propositions, requiring inference over long-range dependencies and common-sense reasoning. ITERATIVE seems the best option for grounded generation without sacrificing informativeness (ITERATIVE is the most informative blueprint model on SummScreen-FD, second best on AQuaMuSe, and third best on WikiCatSum).
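The interleaved b_i; s_i decoding of ITERATIVE can be sketched as the loop below. The two decode callables are assumptions standing in for a single shared seq2seq model prompted in two modes; the stopping condition and delimiters are likewise illustrative, not the paper's exact interface.

```python
def iterative_generate(decode_blueprint, decode_sentence, source, max_sentences=10):
    """Sketch of ITERATIVE decoding: alternately predict a sentence-level
    blueprint b_i and the sentence s_i it licenses, each conditioned on the
    source and all previously generated (b_j; s_j) pairs."""
    history, sentences = [], []
    for _ in range(max_sentences):
        b_i = decode_blueprint(source, history)    # e.g. "Q: ...? A: ..."
        if b_i is None:                            # model signals end of plan
            break
        s_i = decode_sentence(source, history + [b_i])
        history.extend([b_i, s_i])                 # plan and text both feed back in
        sentences.append(s_i)
    return " ".join(sentences)
```

Because each sentence is decoded immediately after its own mini-blueprint, the conditioning context for s_i always contains b_i, which is why this variant tends to stay grounded in its plan.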

ITERATIVE Is the Most Faithful Model
As far as faithfulness is concerned, ITERATIVE performs consistently better than E2E and MULTITASK, as well as the T5 and LONGT5 models, where text is generated from scratch without any planning (pairwise differences between ITERATIVE and comparison systems are all significant, with the exception of E2E on AQuaMuSe). On SummScreen-FD, ITERATIVE brings large gains in faithfulness without sacrificing informativeness (both in terms of Rouge and QA-F1). The ANLI score for ITERATIVE is 20.84, whereas it is below 10 for E2E and MULTITASK. E2E outperforms LONGT5 on AQuaMuSe and WikiCatSum, but its gains are smaller compared to ITERATIVE.
We show examples of system output in Table 5, highlighting propositions which are not grounded in the input in red. E2E summaries are shorter, which is somewhat expected; the model has to decode both the plan and the summary, and in cases where the blueprint is large (e.g., in SummScreen-FD), there is no room left to decode the summary. MULTITASK is more verbose; however, its plan (a sequence of answer spans) is less detailed and as a result the summary is less accurate (Jackpot's pretzels is a restaurant, not a killer). ITERATIVE includes many details in the summary, more than the reference, which are nevertheless not hallucinations. Both R2T-BART and LONGT5 are rather loose with the facts and generate multiple hallucinations.
Blueprint Models are Controllable Our conceptualization of text plans as QA pairs brings inherent controllability to the generation process. By changing the blueprint, we can control content selection (i.e., what to say) and planning (i.e., in what order) without retraining the model or introducing additional control mechanisms. We provide an example in Table 6, where the plan predicted by the E2E model has been edited to render it more coherent and factual. As can be seen, the model changes its output according to the modified plan. Another example is shown in Table 7, where the output is rendered shorter by removing QA pairs from the predicted plan.
Q1: What breed existed but is no longer extinct? [edited: What breed existed but is now extinct?]
A1: Old English Bulldogs
Q2: Along with the Old English Bulldog and Toy Bulldog, what breed was considered extinct at the end of the 19th century? [edited: What was the Old English Bulldog bred for?]
A2: Bullenbeisser [edited: Fighting in public arenas]

Original summary: Old English Bulldogs refers to a breed of dog that once existed but is no extinct. At the end of the 19th century, three breeds -- the Old English Bulldog, Toy Bulldog, and Bullenbeisser -- were considered extinct.

Updated summary: Old English Bulldogs refers to a breed of dog that existed but is now extinct. It was bred for fighting in public arenas.
Table 6: Example of plan/summary generated by our E2E blueprint model as an answer to the question "What is the difference between an Old English Bulldog and an English Bulldog?" (AQuaMuSe test set); user edits to the plan and the updated summary are shown in red.
We are also able to control the faithfulness of predicted summaries as follows. We take the predicted plan and remove question-answer pairs (E2E, ITERATIVE) or answer spans (MULTITASK) that cannot be answered based on the input. We then prompt our decoder with the modified plan and generate a new summary (or sentence, for ITERATIVE). In Table 8, we quantitatively evaluate +drop variants, which are controlled for faithfulness, against vanilla blueprint models. We observe improvements in entailment scores across the board (see column entail. in the table), with ITERATIVE+drop performing best. Improvements on abstractive datasets (WikiCatSum and SummScreen-FD) are larger compared to AQuaMuSe, which is mostly extractive (see Table 3). The minor drop in Rouge and informativeness is somewhat expected, as the models now zoom in on information they can reliably talk about, improving the consistency of the output.
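The +drop control described above amounts to filtering the plan before re-prompting the decoder. A minimal sketch follows; `is_answerable` stands in for the answerability check (the paper uses a QA model over the input), and the lexical `in_source` proxy shown here is an assumption, not the authors' method.

```python
def drop_unanswerable(blueprint, source, is_answerable):
    """+drop control: keep only QA pairs whose answer is supported by the
    input, then re-prompt the decoder with the pruned plan."""
    return [(q, a) for q, a in blueprint if is_answerable(q, a, source)]

def in_source(question, answer, source):
    """Crude lexical proxy for answerability: the answer string occurs
    verbatim in the source."""
    return answer.lower() in source.lower()

source = "Old English Bulldogs were bred for fighting in public arenas."
blueprint = [
    ("What was the breed bred for?", "fighting in public arenas"),
    ("Who owned one?", "Queen Victoria"),          # unsupported: dropped
]
pruned = drop_unanswerable(blueprint, source, in_source)
```

The pruned plan is then fed back to the decoder as a prompt, so the model can only verbalize content the input actually supports.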
Finally, we also experiment with creating simple summaries by forcing the ITERATIVE model to generate from a single question-answer pair on each iteration (see the +Q1 variant in Table 8). In the example shown in Table 9, ITERATIVE+Q1 produces simple summary sentences, each focusing on a single information element. Interestingly, as far as the ITERATIVE model is concerned, +Q1 variants are as faithful as +drop ones even though they do not explicitly control for faithfulness (across datasets the differences between the two models are not statistically significant). This suggests that controlling for simplicity might be sufficient to reduce hallucinations, albeit at the expense of informativeness (Rouge scores for +Q1 variants tend to be significantly worse compared to their +drop counterparts).

A7: diverse roots
Q8: What started to develop between 500 BCE and 300 CE?
A8: This Hindu synthesis
Q9: When did the Vedic period end?
A9: 500 BCE

Hinduism is an Indian religion and dharma, or a way of life, widely practiced in the Indian subcontinent. Hinduism has been called the oldest religion in the world, and some practitioners and scholars refer to it as Sanātana Dharma, "the eternal tradition", or the "eternal way", beyond human history. Scholars regard Hinduism as a fusion or synthesis of various Indian cultures and traditions, with diverse roots and no founder. This "Hindu synthesis" started to develop between 500 BCE and 300 CE, following the Vedic period (1500 BCE to 500 BCE).

Hinduism is an Indian religion and dharma, or a way of life, widely practiced in the Indian subcontinent.

Table 7: Example of plan/summary generated by the E2E blueprint model as an answer to the question "What section of the world or country is Hinduism usually found in?" (AQuaMuSe test set); the part of the plan removed by the user is highlighted; the shorter summary generated from the elided plan is shown in red.
Most of the controllability cases we illustrate here are fully automatic and could be conceptualized as system flags which users select according to their requirements (e.g., low tolerance for hallucinations, shorter summaries for small-screen displays). Another potential use case would be to generate summaries for a set of questions provided by the user; their input might be articles retrieved as an answer to a query, or, in an educational context, several chapters on a topic (e.g., cell biology). We leave this to future work.

Ablation Studies
As described in Section 3.2, we construct blueprint annotations using the Rheme- and Coverage-based selection strategies. Table 10 presents various ablations which provide rationales for these annotation choices (for brevity, we report experiments with the E2E model trained for 50,000 steps on AQuaMuSe; we observe very similar trends on the other two datasets). As can be seen, it is empirically better to form blueprints from answer-question pairs rather than predicting the questions first and then their answers, which is more natural (at least to humans). We further assessed whether sorting the QA pairs according to their order of appearance in the summary matters, by defaulting to a random ordering (see −Sorted in the table). Removing either Rheme or Coverage has a small negative impact on the summaries but not their blueprints, while removing both is detrimental to summary quality; the absence of Sorting mostly affects the quality of the blueprint. It is not surprising that sorting is most important for generating a blueprint with correctly ordered propositions.
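The Sorted heuristic can be sketched as ordering QA pairs by where their answer span first occurs in the reference summary. This is an illustrative simplification (the paper sorts using the annotated span positions, not string search):

```python
def sort_blueprint(qa_pairs, summary):
    """Order blueprint QA pairs by the first occurrence of their answer
    span in the summary, so the plan mirrors the narrative order of the
    target. Pairs whose answer is not found are pushed to the end."""
    def position(pair):
        idx = summary.lower().find(pair[1].lower())
        return idx if idx >= 0 else len(summary)
    return sorted(qa_pairs, key=position)

summary = "Grissom investigates a dumpster death. Nick and Sara handle a divorce."
pairs = [
    ("Who handles the divorce?", "Nick and Sara"),
    ("Who investigates?", "Grissom"),
]
ordered = sort_blueprint(pairs, summary)
```

With a random ordering instead (the −Sorted ablation), the decoder must learn to reorder content itself, which the ablation shows mostly hurts blueprint quality.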

Human-based Evaluation
In addition to automatic evaluation, we conducted three human-based studies which assessed different dimensions of output quality. Wishing to avoid well-documented issues with automated bots on Amazon Mechanical Turk and crowdworkers running through HITs as quickly as possible without paying attention to the tasks, we used a few trained annotators. They were given task-specific instructions and went through several pilots to iron out disagreements on edge cases.

Summary Quality
Our first study assessed overall summary quality. Specifically, we asked our annotators to select the best among three system summaries, taking into account how much they deviated from the reference in terms of informativeness (are the summaries on topic or do they emphasize irrelevant details?) and overall fluency. We adapted the definition of fluency provided in Howcroft et al. (2020): does the text 'flow well' or is it a sequence of unconnected parts?
We conducted our annotation study on 100 instances, each randomly sampled from AQuaMuSe, WikiCatSum, and SummScreen-FD. We collected ratings from three annotators (after two rounds of pilot studies to improve agreement) for the output of six systems. Overall, we obtained 100 (instances) x 3 (datasets) x 6 (systems) x 3 (annotators) = 5,400 annotations. Annotator agreement was 97.11%. Our results are presented in Table 11, where we report the percentage of times each system was ranked best.
In general, we observe that LONGT5 and the blueprint models based on it are perceived as significantly better than previous state-of-the-art models (i.e., SIBERT and R2T-BART). On AQuaMuSe, LONGT5 is rated best overall, followed by E2E and MULTITASK (however, differences between them are not statistically significant). On WikiCatSum, E2E is rated best but is not significantly different from the other models. On SummScreen-FD, our ITERATIVE variant is rated best, followed by LONGT5. These results mirror the difficulty of the task (see Table 3): the longer the input/output, the better ITERATIVE performs.

Blueprint Quality
We further evaluated the predicted plans more directly. Participants were shown QA blueprints and asked to assess whether they tell a coherent story (are all QA pairs relevant and ordered coherently?) using a 3-point scale (where 3 is best and 1 is worst). They were also asked to evaluate whether the plans have redundant QA pairs; a QA pair is redundant if it does not add new information to the plan. We collected judgments for the same instances used in our summary quality evaluation from three annotators, whose overall agreement was 97.87%. We obtained a total of 100 (instances) x 3 (datasets) x 5 (systems) x 3 (raters) = 4,500 annotations.
Table 12 shows the results of this study. We report mean scores per dataset for all blueprint models. As an upper bound, we further elicited annotations for blueprints automatically created from gold-standard reference summaries (see row Gold in the table). E2E generates the most coherent blueprints: differences between E2E and all comparison systems are statistically significant, with the exception of the gold standard. This is not surprising, since all QA pairs in E2E are generated together, whereas in MULTITASK the spans and their corresponding questions are generated separately. ITERATIVE only generates QA pairs for one sentence at a time, and thus we would not expect it to be more coherent than models which generate a global document plan. With regard to redundancy, ITERATIVE blueprints are generally the most redundant, which is again down to not having a global view of previously generated QA pairs. ITERATIVE further underscores issues with our question generation technology, which is far from perfect: several QA pairs differ on the surface but are actually semantically equivalent, and we have no means of detecting this without robust coreference resolution.
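As a point of reference, a purely surface-level redundancy filter could be sketched as below. This is a naive heuristic of our own, not part of the paper's pipeline, and it illustrates exactly the limitation noted above: token overlap cannot catch semantically equivalent QA pairs that share few words.

```python
def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity between two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def dedup_blueprint(qa_pairs, threshold=0.6):
    """Drop a QA pair if its question is highly similar (token Jaccard)
    to one already kept. A surface heuristic only; paraphrases that
    share few tokens slip through."""
    kept = []
    for q, a in qa_pairs:
        if all(jaccard(q, q2) < threshold for q2, _ in kept):
            kept.append((q, a))
    return kept

pairs = [
    ("Who found the body?", "Grissom"),
    ("Who found the body?", "Catherine"),   # exact duplicate question: dropped
    ("Where was it found?", "a dumpster"),
]
deduped = dedup_blueprint(pairs)
```

Detecting the remaining semantic duplicates would require entailment or coreference machinery, as the text notes.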

Blueprint Grounded Generation
We next examine whether model summaries are grounded in their blueprints. Specifically, we asked our annotators to decide whether each QA pair in the blueprint is mentioned in the summary, and report the number of times it is not. Ideally, we would like the summary to follow the blueprint as closely as possible. For QA pairs mentioned in the summary, we further asked our annotators to indicate whether the intent of the question was preserved or contradicted (we report the number of contradictions). Finally, we also asked participants to decide whether the summary contains additional information which cannot be found in its blueprint, using a 3-point scale (where 3 is for summaries with lots of new information and 1 is for summaries with no new information). We elicited annotations for the blueprint models and, as an upper bound, for gold summaries and blueprints extrapolated from them. We obtained 100 (instances) x 3 (datasets) x 5 (systems) x 3 (raters) = 4,500 judgments.
The results of our grounding experiments are summarized in Table 13. Across datasets, we observe that ITERATIVE summaries are the most grounded: ITERATIVE blueprints have the fewest questions that are absent from or contradicted by their generated texts, and ITERATIVE summaries also display the least amount of new information relative to their blueprints. ITERATIVE+drop is slightly less grounded than ITERATIVE; however, this is not entirely surprising, since we prompt the ITERATIVE model with externally modified blueprints (see ITERATIVE+drop in Table 13). Note that ITERATIVE+drop summaries are deemed more faithful than ITERATIVE summaries in automatic evaluation, with entailment scores improving on all three datasets (see Table 4).

Conclusion
In this work we proposed a novel plan-based approach to conditional generation. We conceptualized text plans as sequences of QA pairs operating as a proxy for what to say and in what order. We developed Transformer-based models which generate output by conditioning on a global QA blueprint plan (E2E, MULTITASK) or iteratively, by planning and generating one sentence at a time (ITERATIVE). Experimental results across three challenging datasets demonstrate that blueprint models are inherently more informative than vanilla sequence-to-sequence approaches without a planning component. Among the three models presented here, we find that ITERATIVE is the best choice for grounded generation and a promising direction for long-form generation.
Blueprint models offer several advantages compared to blackbox generation. Model predictions can be examined, and errors can be traced back to the blueprint, which in turn can reveal whether the output is informative and faithful to its input. The formulation of the blueprint plan as question-answer pairs makes it intuitive and user-friendly. We have discussed how blueprint models might be used in a human-in-the-loop setting, where users interact with and influence model predictions directly, e.g., by editing the blueprint's length and content (as different blueprints lead to different outputs). In the future, we would like to use blueprints more directly to advance methods for training language models with reward learning (Sutton and Barto, 2018), e.g., based on whether the output answers the blueprint questions. Rather than eliciting expensive human feedback (Stiennon et al., 2020), blueprints could provide a cheaper automatic alternative. Finally, although we focused primarily on the generation problem in this work, we believe blueprints might also be useful as a general-purpose approach to retrieving and organizing important content, especially when faced with many and very long inputs.

Q1: Hinduism is an Indian religion and what else? A1: dharma
Q2: Hinduism is a way of what? A2: life
Q3: Hinduism has been called what in the world? A3: the oldest religion
Q4: Who call Hinduism Sanatana Dharma? A4: some practitioners
Q5: What does Sanatana Dharma mean? A5: the eternal tradition
Q6: Scholars regard Hinduism as a fusion of various Indian cultures and what else? A6: traditions
Q7: Hinduism has no founder and what else?

P3 and [from 1969 to 1970 by Ford.] P4 [Following the introduction of the fifth generation Ford Mustang in 2005,] P5 [the Shelby nameplate was revived as a new high-performance model, this time designed and built by Ford.] P6

Table 2: Generation of QA pairs for the summary in Figure 1 and its blueprint annotation. We split the summary into propositions P and select no more than one QA pair per proposition. RT, RH, and CO are shorthands for Round Trip, Rheme, and Coverage. Questions that pass/fail each filter are marked with ✓/✗.

Table 3 :
Summary statistics for the datasets used in this work (AQuM, WCSum, and SS-FD are shorthands for AQuaMuSe, WikiCatSum, and SummScreen-FD, respectively). We report the number of queries, the size of the training, development, and test sets, and the average source and target length (in terms of documents, words, sentences, and words per document).

Table 4 :
Results on the AQuaMuSe, WikiCatSum, and SummScreen-FD test sets. Baseline and earlier SOTA models are presented in the top block and all blueprint models are shown in the bottom block. Models marked with * generate extractive summaries. HIBERT, TextRank, and SIBERT results on AQuaMuSe are taken from

Table 8 :
Controllability results on the AQuaMuSe, WikiCatSum, and SummScreen-FD test sets. Lighter blue color means more control. Best results for each metric are boldfaced. Scores that are not significantly different (using paired bootstrap resampling; p < 0.05) from the best score for each column are marked with a dagger (†).

Table 9 :
System output from ITERATIVE and ITERATIVE+Q1.

Table 10 :
E2E model trained on AQuaMuSe with different selection and sorting strategies (validation set).

Table 13 :
Human evaluation results for blueprint-grounded generation on the AQuaMuSe, WikiCatSum, and SummScreen-FD test sets. Proportion of QA pairs not mentioned in the summary (Absent; lower is better); proportion of QA pairs with information contradictory to the summary (Contra; lower is better); and mean scores for new information present in the summary (NewInfo; lower is better). The best results for each task are boldfaced. Systems in each column are marked with † when they are not significantly different from the best system; unmarked pairwise differences from the best system are significant (p < 0.01; Friedman's ANOVA with post-hoc Wilcoxon signed-rank tests, Bonferroni corrected for multiple comparisons).