Data-to-text Generation with Variational Sequential Planning

Abstract We consider the task of data-to-text generation, which aims to create textual output from non-linguistic input. We focus on generating long-form text, that is, documents with multiple paragraphs, and propose a neural model enhanced with a planning component responsible for organizing high-level information in a coherent and meaningful way. We infer latent plans sequentially with a structured variational model, while interleaving the steps of planning and generation. Text is generated by conditioning on previous variational decisions and previously generated text. Experiments on two data-to-text benchmarks (RotoWire and MLB) show that our model outperforms strong baselines and is sample-efficient in the face of limited training data (e.g., a few hundred instances).


Introduction
Data-to-text generation refers to the task of generating textual output from non-linguistic input such as database tables, spreadsheets, or simulations of physical systems (Reiter and Dale, 1997, 2000; Gatt and Krahmer, 2018). Recent progress in this area (Mei et al., 2016; Lebret et al., 2016; Wiseman et al., 2017) has been greatly facilitated by the very successful encoder-decoder neural architecture (Sutskever et al., 2014) and the development of large-scale datasets. ROTOWIRE (Wiseman et al., 2017) and MLB (Puduppully et al., 2019b) constitute such examples. They both focus on the sports domain, which has historically drawn attention in the generation community (Barzilay and Lapata, 2005; Tanaka-Ishii et al., 1998; Robin, 1994), and consider the problem of generating long target texts from database records.
Figure 1 (reproduced from Puduppully and Lapata, 2021) provides a sample from the MLB dataset, which pairs human-written summaries (Table C) with major league baseball game statistics. These are mostly scores (collectively referred to as box score) which summarize the performance of teams and players, e.g., batters, pitchers, or fielders (Table A), and a play-by-play description of the most important events in the game (Table B). Game summaries in MLB are relatively long (540 tokens on average) with multiple paragraphs (15 on average). The complexity of the input and the length of the game summaries pose various challenges to neural models which, despite producing fluent output, are often imprecise, prone to hallucinations, and display poor content selection (Wiseman et al., 2017). Attempts to address these issues have seen the development of special-purpose modules which keep track of salient entities (Iso et al., 2019; Puduppully et al., 2019b), determine which records (see the rows in Tables A and B) should be mentioned in a sentence and in which order (Puduppully et al., 2019a; Narayan et al., 2020), and reconceptualize the input in terms of paragraph plans (Puduppully and Lapata, 2021) to facilitate document-level planning (see Table D in Figure 1). Specifically, Puduppully and Lapata (2021) advocate the use of macro plans for improving the organization of document content and structure. A macro plan is a sequence of paragraph plans, and each paragraph plan corresponds to a document paragraph. A macro plan is shown in Table E (Figure 1). Examples of paragraph plans are given in Table D, where <V(entity)> verbalizes records pertaining to entities and <V(inning-T/B)> verbalizes records for the Top/Bottom side of an inning. Verbalizations are sequences of record types followed by their values. Document paragraphs are shown in Table C and have the same color as their corresponding plans in Table E. During training, Puduppully and Lapata (2021) learn to predict a macro plan from a
pool of paragraph plans, and produce a game summary based on it. Continuing with our example in Figure 1, plan (E) is predicted from paragraph plans (D), to give rise to game summary (C).

[Figure 1: Box-score (A) and play-by-play (B) statistics for an MLB game, the corresponding game summary (C), candidate paragraph plans (D), and the macro plan (E). Paragraph plans in the first column of Table D correspond to a single entity or event; those in the second column describe combinations of entities or events. Paragraph plans correspond to paragraphs in Table C; plan-document correspondences are highlighted using the same color.]
The intermediate macro plan renders generation more interpretable (differences in the output can be explained by differences in macro planning). It also makes modeling easier: the input is no longer a complicated table but a sequence of paragraph plans, which in turn allows us to treat data-to-text generation as a sequence-to-sequence learning problem. Nevertheless, decoding to a long document remains challenging for at least two reasons. Firstly, the macro plan may be encoded as a sequence, but a very long one (more than 3,000 tokens) which the decoder has to attend to at each time step in order to generate a summary token-by-token. Secondly, the prediction of the macro plan is conditioned solely on the input (i.e., the pool of paragraph plans (D) in Figure 1) and does not make use of information present in the summaries. We hypothesize that planning would be more accurate were it to consider information available in the table (and corresponding paragraph plans) and the generated summary, more so because the plans are coarse-grained and there is a one-to-many relationship between a paragraph plan and its realization. For example, we can see that the plan <V(B.Keller)> results in two very different realizations in the summary in Figure 1 (see the first and third paragraphs).
In this work, we present a model which interleaves macro planning with text generation (see Figure 2 for a sketch of the approach). We begin by selecting a plan from a pool of paragraph plans (see Table D in Figure 1), and generate the first paragraph by conditioning on it. We select the next plan by conditioning on the previous plan and the previously generated paragraph. We generate the next paragraph by conditioning on the currently selected plan, the previously predicted plan, and the previously generated paragraph. We repeat this process until the final paragraph plan is predicted. We model the selection of paragraph plans as a sequential latent variable process, which we argue is intuitive since content planning is inherently latent. Contrary to Puduppully and Lapata (2021), we do not a priori decide on a global macro plan. Rather, our planning process is incremental and as a result less rigid. Planning is informed by generation and vice versa, which we argue should be mutually beneficial (they are conditioned on each other).
During training, the sequential latent model can better leverage the summary to render paragraph plan selection more accurate and take previous decisions into account. We hypothesize that the interdependence between planning and generation allows the model to cope with diversity. In general, there can be many ways in which the input table can be described in the output summary, i.e., different plans give rise to equally valid game summaries. The summary in Figure 1 (Table C) focuses on the performance of Brad Keller, who is a high scoring pitcher (first three paragraphs). An equally plausible summary might have discussed a high scoring batter first (e.g., Ryan O'Hearn). Also notice that the summary describes innings in chronological order. However, another ordering might have been equally plausible, for example, describing innings where the highest runs are scored first, or innings which are important in flipping the outcome of the match. In the face of such diversity, there may never be enough data to learn an accurate global plan. It is easier to select a paragraph plan from the pool once some of the summary is known, and different plans can be predicted for the same input. In addition, the proposed model is end-to-end differentiable, and gradients for summary prediction also inform plan prediction.
Our contributions can be summarized as follows: (1) we decompose data-to-text generation into sequential plan selection and paragraph generation. The two processes are interleaved and generation proceeds incrementally. We look at what has already been generated, make a plan on what to discuss next, realize the plan, and repeat; (2) in contrast to previous models (Puduppully et al., 2019a; Puduppully and Lapata, 2021) where content plans are monolithic and determined in advance, our approach is more flexible; it simplifies modeling (we do not need to learn alignments between paragraph plans and summary paragraphs) and leads to sample efficiency in low-resource scenarios; (3) our approach scales better for tasks involving the generation of long multi-paragraph texts, as we do not need to specify the document plan in advance; (4) experimental results on English and German ROTOWIRE (Wiseman et al., 2017; Hayashi et al., 2019) and MLB (Puduppully et al., 2019b) show that our model is well-suited to long-form generation and generates more factual, coherent, and less repetitive output compared to strong baselines.
We share our code and models in the hope that they will be useful for other tasks (e.g., story generation, summarization).1

Related Work
A long tradition in natural language generation views content planning as a central component to identifying important content and structuring it appropriately (Reiter and Dale, 2000). Earlier work has primarily made use of hand-crafted content plans, with some exceptions which pioneered learning-based approaches. For instance, Duboue and McKeown (2001) learn ordering constraints on the content plan, while Kan and McKeown (2002) learn content planners from semantically annotated corpora, and Konstas and Lapata (2013) predict content plans using grammar rules whose probabilities are learnt from training data.
More recently, there have been attempts to equip encoder-decoder models (Bahdanau et al., 2015; Wiseman et al., 2017) with content planning modules. Puduppully et al. (2019a) introduce micro planning: they first learn a content plan corresponding to a sequence of records, and then generate a summary conditioned on it. Narayan et al. (2020) treat content selection as a task similar to extractive summarization. Specifically, they post-process Puduppully et al.'s (2019a) micro plans with special tokens identifying the beginning and end of a sentence. Their model first extracts sentence plans and then verbalizes them one-by-one by conditioning on previously generated sentences. Moryossef et al. (2019b,a) propose a two-stage approach which first predicts a document plan and then generates text based on it. The input to their model is a set of RDF Subject, Object, Predicate tuples. Their document plan is a sequence of sentence plans, where each sentence plan contains a subset of tuples in a specific order. Text generation is implemented using a sequence-to-sequence model enhanced with attention and copy mechanisms (Bahdanau et al., 2015). They evaluate their model on the WebNLG dataset (Gardent et al., 2017), where the outputs are relatively short (24 tokens on average).
Our approach is closest to Puduppully and Lapata (2021), who advocate macro planning as a way of organizing high-level document content. Their model operates over paragraph plans which are verbalizations of the tabular input and predicts a document plan as a sequence of paragraph plans. In a second stage, the summary is generated from the predicted plan making use of attention enriched with a copy mechanism. We follow their formulation of content planning as paragraph plan prediction. Our model thus operates over larger content units compared to related work (Puduppully et al., 2019a; Narayan et al., 2020) and performs the tasks of micro- and macro-planning in one go. In contrast to Puduppully and Lapata (2021), we predict paragraph plans and their corresponding paragraphs jointly in an incremental fashion. Our approach is reminiscent of psycholinguistic models of speech production (Levelt, 1993; Taylor and Taylor, 1990; Guhe, 2020) which postulate that different levels of processing (or modules) are responsible for language generation; these modules are incremental, each producing output as soon as the information it needs is available, and the output is processed immediately by the next module.
We assume plans form a sequence of paragraphs which we treat as a latent variable and learn with a structured variational model. Sequential latent variables (Chung et al., 2015; Fraccaro et al., 2016; Goyal et al., 2017) have previously found application in modeling attention in sequence-to-sequence networks (Shankar and Sarawagi, 2019), document summarization (Li et al., 2017), controllable generation (Li and Rush, 2020; Fu et al., 2020), and knowledge-grounded dialogue (Kim et al., 2020). In the context of data-to-text generation, latent variable models have been primarily used to inject diversity in the output. Shao et al. (2019) generate a sequence of groups (essentially a subset of the input) which specifies the content of the sentence to be generated. Their plans receive no feedback from text generation, they cover a small set of input items, and give rise to relatively short documents (approximately 100 tokens long). Ye et al. (2020) use latent variables to disentangle the content from the structure (operationalized as templates) of the output text. Their approach generates diverse output by sampling from the template-specific sample space. They apply their model to single-sentence generation tasks (Lebret et al., 2016; Reed et al., 2018).

Model
Following Puduppully and Lapata (2021), we assume that at training time our model has access to a pool of paragraph plans E (see Table D in Figure 1) which represent a clustering of records. We explain how paragraph plans are created from tabular input in Section 4. Given E, we aim to generate a sequence of paragraphs y = [y^1, ..., y^T] that describe the data following a sequence of chosen plans z = [z^1, ..., z^T]. Let y^t denote a paragraph, which can consist of multiple sentences, and T the number of paragraphs in a summary. With a slight abuse of notation, superscripts denote indices rather than exponentiation; so, y_i^t refers to the i-th word in the t-th paragraph. A plan z = [z^1, ..., z^T] is a list of discrete variables where z^t = j means that we choose the j-th item from pool E of candidate plans to guide the generation of paragraph y^t.

Generation with Latent Plans
The core technique of our model is learning the sequence of latent plans that guides long document generation. We consider a conditional generation setting where the input E is a set of paragraph plans and the output y^{1:T} are textual paragraphs verbalizing the selected sequence z = z^{1:T}. Our goal is to induce variables z that indicate which paragraphs are being talked about and in which order. Similar to previous work (Li and Rush, 2020; Fu et al., 2020), we model this process as a conditional generative model that produces both y and z and factorizes as:

p_θ(y, z | E) = ∏_{t=1}^{T} p_θ(z^t | z^{<t}, y^{<t}, E) · p_θ(y^t | z^{≤t}, y^{<t}, E)    (1)

where θ denotes the model parameters and <t denotes all indices smaller than t. We believe this formulation is intuitive, simulating incremental document generation: inspect y^{<t} (what has already been said), make a plan z^t about what to say next, realize this plan by generating a new paragraph y^t, and so on.
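The incremental generate-as-you-plan process above can be sketched as a simple loop. This is a schematic only; `select_plan` and `generate_paragraph` are hypothetical stand-ins for the model's plan-selection and paragraph-decoding components, and their names are ours:

```python
def generate_document(pool, select_plan, generate_paragraph, max_steps=20):
    """Interleave plan selection and paragraph generation.

    pool: candidate paragraph plans E.
    select_plan(pool, plans, paragraphs) -> index into pool, or None to stop.
    generate_paragraph(plan, plans, paragraphs) -> text of the next paragraph.
    """
    plans, paragraphs = [], []
    for _ in range(max_steps):
        # inspect what has been said so far and plan what to say next
        z_t = select_plan(pool, plans, paragraphs)
        if z_t is None:  # end-of-plan predicted
            break
        # realize the plan as a new paragraph y_t
        y_t = generate_paragraph(pool[z_t], plans, paragraphs)
        plans.append(z_t)
        paragraphs.append(y_t)
    return plans, paragraphs
```

Both callbacks see the full history of previous plans and paragraphs, mirroring the conditioning structure of the factorization above.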
Inference Model We are interested in the posterior distribution p_θ(z|y, E), i.e., the probability over plan sequences z for a known text y and input E. This distribution is intractable to compute in general, as the summation over all possible plan sequences z is exponentially complex:

p_θ(z | y, E) = p_θ(y, z | E) / ∑_{z'} p_θ(y, z' | E)    (2)

We use variational inference (Kingma and Welling, 2014; Rezende et al., 2014) to approximate the posterior with a parametrized distribution q_φ(z|y, E) from which we sample values of z that are likely to produce y (see Doersch 2016 for a tutorial on this topic). Specifically, we employ an autoregressive inference model factorized as:

q_φ(z | y, E) = ∏_{t=1}^{T} q_φ(z^t | z^{<t}, y^{≤t}, E)    (3)

Note that a major difference between q above and p in Equation (1) is that p generates y^t under the guidance of z^t (conceptually z^t → y^t), while q infers z^t given the observed y^t (conceptually y^t → z^t).
Neural Parametrization At step t, we start with the encoding of previous paragraphs y^{<t} and plans z^{<t} (see Figure 3 left). Following Yang et al. (2016), we use a bi-directional LSTM (BiLSTM) with a self-attention layer to encode paragraph y^t as a vector r_y^t at step t:

r_y^t = Attn(q_text, BiLSTM(y^t))    (4)

where q_text is a trainable query vector, which is randomly initialized and learnt along with the rest of the parameters. Attn(·) returns the attention probability and output vector over the BiLSTM representation of y^t with query vector q_text.2 Our model uses the output vector. Next, we encode r_y^{<t} with LSTM_text as:

h_y^{t-1} = LSTM_text(r_y^{<t})    (5)

We encode candidate plans in pool E = [e_1, ..., e_N] with a BiLSTM, similar to the paragraph encoding shown in Equation (4), and select one of them at each step. Let r_z^t denote a plan embedding at step t. We encode r_z^{<t} using LSTM_plan as:

h_z^{t-1} = LSTM_plan(r_z^{<t})    (6)

The currently selected plan is parametrized as:

h^{t-1} = FF_plan([h_y^{t-1}; h_z^{t-1}])    (7)
p_θ(z^t | z^{<t}, y^{<t}, E) = Attn(h^{t-1}, [e_1, ..., e_N])    (8)

where h^{t-1} summarizes information in y^{<t} and z^{<t}, FF_plan(·) denotes a feed-forward layer, and Attn(·) returns the attention probability (and output vector) of choosing a plan from E with current state h^{t-1}. Here, we use the attention distribution, which serves essentially as a copy mechanism. Then, a plan z^t is sampled from p (we use greedy decoding in our experiments), and its representation r_z^t is used to update LSTM_plan (Figure 3 right):

h_z^t = LSTM_plan(r_z^t, h_z^{t-1})    (9)

We guide the generation of y^t with the current plan z^t and decode each word y_i^t sequentially with an LSTM_gen decoder which makes use of beam search. Let s_i denote the i-th decoder state (initialized with the plan encoding). We update it as:

s_i = LSTM_gen([emb(y_{i-1}^t); h_y^{t-1}], s_{i-1})    (10)

Note that we feed h_y^{t-1}, representing the context of previous paragraphs, as additional input, similar to Serban et al.
(2017). Let r_{z,1}^t, ..., r_{z,l}^t denote the encoding of the tokens of the current plan, where r_{z,k}^t is the output of the BiLSTM plan encoder and l the length of the chosen plan. We generate the next word as:

c_i = Attn(s_i, [r_{z,1}^t, ..., r_{z,l}^t])    (11)
p(y_i^t | y_{<i}^t, z^{≤t}, y^{<t}, E) = softmax(FF_gen([s_i; c_i]))    (12)

where c_i denotes the context vector. In Equation (11), we use the output vector from Attn(·). FF_gen(·) represents a feed-forward layer. In addition, we equip the decoder with copy attention (See et al., 2017) to enable copying tokens from z^t. As part of this, we learn a probability for copying based on s_i (Gehrmann et al., 2018). Once paragraph y^t has been generated, we obtain its encoding r_y^t with Equation (4), and update LSTM_text (Figure 3 middle):

h_y^t = LSTM_text(r_y^t, h_y^{t-1})    (13)

We parametrize the variational model so that it shares the LSTMs for encoding y and E with the generative model:

h^t = FF_v([h_y^t; h_z^{t-1}])    (14)
q_φ(z^t | z^{<t}, y^{≤t}, E) = Attn(h^t, [e_1, ..., e_N])    (15)

where FF_v(·) represents a feed-forward layer. Note that Equation (14) differs from Equation (7) in that it uses the updated h_y^t instead of the previous h_y^{t-1}, because now y^t is observed. The variational distribution is again parametrized by the attention probability. Essentially, p and q are strongly tied to each other with the shared LSTM encoders.
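To make the selection step concrete, here is a toy numpy sketch of how an attention distribution over candidate-plan encodings can double as the selection (copy) distribution described above. The dot-product scoring function and all dimensions are illustrative assumptions, not the model's exact parametrization:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def select_plan(plan_encodings, state):
    """Attention over candidate plans; the attention distribution itself
    serves as the probability of copying each plan from the pool."""
    scores = plan_encodings @ state      # dot-product attention (an assumption)
    probs = softmax(scores)
    return probs, int(np.argmax(probs))  # greedy decoding, as in the experiments

rng = np.random.default_rng(0)
pool = rng.standard_normal((4, 8))   # N=4 candidate plan encodings, d=8
state = rng.standard_normal(8)       # summary of previous text and plans
probs, z_t = select_plan(pool, state)
```

Because the distribution is defined directly over pool items rather than over a vocabulary, "generating" a plan amounts to pointing at one of the candidates.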
Although we primarily focus on the inference, and how the latent plan can improve the generation of long documents, we note that the model sketched above could be parametrized differently, e.g., by replacing the encoder and decoder with pretrained language models like BART (Lewis et al., 2020). However, we leave this to future work.
Training We optimize the standard evidence lower bound (ELBO):

L_ELBO = log p_θ(y | E) − D_KL(q_φ(z | y, E) ‖ p_θ(z | y, E))    (16)

where log p_θ(y|E) is the log-evidence from the data, and D_KL(q_φ(z|y, E) ‖ p_θ(z|y, E)) is the Kullback-Leibler divergence between q_φ and the true posterior p_θ. The objective eventually decomposes into a summation of the reconstruction probability p_θ(y^t|·) and the ratio between p_θ(z^t|·) and q_φ(z^t|·) at each step.
Advantageously, we can exploit oracle plans (see Table E in Figure 1 and the description in Section 4 for how these were created) to obtain weak labels z* which we use as distant supervision for the inference model:

L_plan = −∑_{t=1}^{T} log q_φ(z^t = z*^t | z^{<t}, y^{≤t}, E)    (17)
L = −L_ELBO + λ L_plan    (18)

Such distant supervision is essential for stabilizing training (it would be extremely challenging to optimize the model in a fully unsupervised way) and for mitigating posterior collapse. We use Gumbel-Softmax (Maddison et al., 2017; Jang et al., 2017) for differentiable sampling (reparameterization) from q. The model is trained with scheduled sampling (Bengio et al., 2015), following a curriculum learning strategy with linear decay scheduling.
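For reference, a minimal numpy version of Gumbel-Softmax sampling, i.e., a relaxed, differentiable approximation to drawing a discrete plan from q (the small clipping constant is an implementation detail of ours, and the temperature 0.1 mirrors the setting reported later):

```python
import numpy as np

def gumbel_softmax(logits, tau=0.1, rng=None):
    """Sample a relaxed one-hot vector from a categorical distribution.
    Lower temperatures tau push the sample closer to a hard one-hot."""
    rng = rng or np.random.default_rng()
    u = rng.uniform(1e-10, 1.0, size=logits.shape)
    g = -np.log(-np.log(u))              # Gumbel(0, 1) noise
    y = (logits + g) / tau
    y = np.exp(y - y.max())              # numerically stable softmax
    return y / y.sum()

sample = gumbel_softmax(np.log(np.array([0.7, 0.2, 0.1])),
                        tau=0.1, rng=np.random.default_rng(1))
```

In a deep learning framework the same trick lets gradients flow through the plan-selection step, since the sampled vector is a smooth function of the logits.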
During earlier stages of training predicted plans are less accurate, and we thus sample from oracle plans at a rate which decays linearly with training:

ε_k = max(0, 1 − c · k)    (19)

where c is the slope of the decay and k the training step.
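As a sketch, this schedule reduces to a one-line helper (the floor at zero is our assumption about behaviour once the decay bottoms out):

```python
def oracle_rate(k, c):
    """Probability of sampling the oracle plan at training step k,
    decaying linearly with slope c and floored at zero."""
    return max(0.0, 1.0 - c * k)
```

With c = 1/50000, for instance, the rate starts at 1 and reaches 0 after 50,000 steps.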

Experimental Setup
Data We performed experiments on the ROTOWIRE (Wiseman et al., 2017), MLB (Puduppully et al., 2019b), and German ROTOWIRE (Hayashi et al., 2019) datasets. All three datasets were preprocessed following the method of Puduppully and Lapata (2021). A paragraph plan for an entity is constructed by verbalizing its records as a fixed sequence of record types, each followed by its value. For example, pitcher B.Keller from Figure 1 would be verbalized as <PLAYER> B.Keller <H/V> V <W> 7 <L> 5 <IP> 8 <PH> 4 . . . . We denote this using the shorthand <V(B.Keller)>. The paragraph plan for an event is the verbalization of the players in the event followed by the verbalization of play-by-plays. Candidate paragraph plans E are obtained by enumerating entities and events and their combinations (see Table D in Figure 1). Oracle macro plans are obtained by matching the mentions of entities and events in the gold summary with the input table. We make use of these oracle macro plans during training. The versions of MLB and ROTOWIRE released by Puduppully and Lapata (2021) contain paragraph delimiters for gold summaries; we preprocessed German ROTOWIRE in a similar fashion.
Table 1 also shows the average length of the macro plan in terms of the number of paragraph plans it contains: 10.6 for ROTOWIRE, 15.1 for MLB, and 9.5 for German ROTOWIRE.
Training Configuration We train our model with the AdaGrad optimizer (Duchi et al., 2011) and tune parameters on the development set. We use a learning rate of 0.15. We learn a joint subword vocabulary (Sennrich et al., 2016) for paragraph plans and summaries with 6K merge operations for ROTOWIRE, 16K merge operations for MLB, and 2K merge operations for German ROTOWIRE. The model is implemented on a fork of OpenNMT-py (Klein et al., 2017). For efficiency, we batch using summaries instead of individual paragraphs. Batch sizes for MLB, ROTOWIRE, and German ROTOWIRE are 8, 5, and 1, respectively. We set λ to 2 in Equation (18). In Equation (19), c is 1/100000 for MLB, 1/50000 for ROTOWIRE, and 1/30000 for German ROTOWIRE. We set the temperature of Gumbel-Softmax to 0.1.
During inference on MLB, similar to Puduppully and Lapata (2021), we block the repetition of paragraph plan bigrams (i.e., we disallow the repetition of (z^t, z^{t+1})) and select the paragraph plan with the next highest probability in Equation (8). In addition, we block consecutive repetitions, and more than two repetitions, of a unigram. During training we observed high variance in the length of paragraphs y^t, since the same plan can result in a shorter or longer paragraph. For example, <V(B.Keller)> corresponds to two paragraphs (the first and third) with different lengths in Figure 1. We found that this encourages the model to be conservative and generate relatively short output. We control the paragraph length (Fan et al., 2018) by creating discrete bins, each containing approximately an equal number of paragraphs.
During training, we prepend the embedding of the bin to the current plan r_z^t (see Equation (11)). For inference, bins are tuned on the validation set.
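One way to construct such equal-frequency length bins is via quantile boundaries; this is a sketch under our own assumptions, as the exact binning procedure is not specified:

```python
import numpy as np

def make_bins(lengths, n_bins):
    """Return bin edges such that each bin holds roughly the same
    number of paragraph lengths."""
    qs = np.linspace(0, 1, n_bins + 1)[1:-1]   # interior quantiles
    return np.quantile(lengths, qs)

lengths = [12, 15, 20, 22, 30, 31, 45, 60]     # toy paragraph lengths (tokens)
edges = make_bins(lengths, 4)
bin_ids = np.searchsorted(edges, lengths)      # bin id per paragraph
```

Each paragraph is then tagged with its bin id, whose embedding conditions the decoder on the intended output length.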
We run inference for 15 paragraphs on ROTOWIRE and German ROTOWIRE, and for 20 paragraphs on MLB; we stop when the model predicts the end-of-paragraph-plan token EOP. Unlike previous work (Wiseman et al., 2017; Puduppully et al., 2019a,b, inter alia), we do not make use of truncated Back Propagation Through Time (BPTT; Williams and Peng, 1990), as we incrementally generate paragraphs instead of long documents.
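The inference-time repetition constraints described earlier (no repeated plan bigrams, no consecutive repeats, at most two uses of any single plan) can be sketched as a filter over candidate plan indices; this is our own formulation of the rules, not the authors' code:

```python
def allowed(candidate, history):
    """Return True if selecting `candidate` after the `history` of plan
    indices violates none of the repetition constraints."""
    if history and candidate == history[-1]:
        return False                       # consecutive repetition
    if history.count(candidate) >= 2:
        return False                       # more than two uses of a unigram
    if history and (history[-1], candidate) in set(zip(history, history[1:])):
        return False                       # repeated plan bigram
    return True
```

At each step, the highest-probability candidate that passes this filter is selected.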
System Comparisons We compared our model with: (1) a Template-based generator which creates a document consisting of template sentences. We used Wiseman et al.'s (2017) system on ROTOWIRE and Puduppully et al.'s (2019b) system on MLB. They are both similar in that they describe team scores followed by player-specific statistics and a concluding statement. In MLB, the template additionally describes play-by-play details. We also created a template system for German ROTOWIRE following a similar approach.
(2) ED+CC, the best performing model of Wiseman et al. (2017). It consists of an encoder-decoder model equipped with attention and copy mechanisms. (3) NCP+CC, the micro planning model of Puduppully et al. (2019a). It first creates a content plan by pointing to input records through the use of Pointer Networks (Vinyals et al., 2015). The content plan is then encoded with a BiLSTM and decoded using another LSTM with an attention and copy mechanism. (4) ENT, the entity model of Puduppully et al. (2019b). It creates entity-specific representations which are updated dynamically. At each time step during decoding, their model makes use of hierarchical attention by attending over entity representations and the records corresponding to these. (5) MACRO, the two-stage planning model of Puduppully and Lapata (2021), which first makes use of Pointer Networks (Vinyals et al., 2015) to predict a macro plan from a set of candidate paragraph plans. The second stage takes the predicted plan as input and generates the game summary with a sequence-to-sequence model enhanced with attention and copy mechanisms. In addition, we compare with a variant of Macro enhanced with length control (+Bin).

Results
Our experiments were designed to explore how the proposed model compares to related approaches which are either not enhanced with planning modules or are non-incremental. We also investigated the sample efficiency of these models and the quality of the predicted plans when these are available. The majority of our results focus on automatic evaluation metrics. We also follow previous work (Wiseman et al., 2017; Puduppully et al., 2019a,b; Puduppully and Lapata, 2021) in eliciting judgments to evaluate system output.

Automatic Evaluation
We evaluate model output using BLEU (Papineni et al., 2002) with the gold summary as a reference. We also report model performance against the Information Extraction (IE) metrics of Wiseman et al. (2017), which are defined based on the output of an IE model that extracts entity (team and player names) and value (numbers) pairs from the summary and predicts the type of relation between them.
Let ŷ be the gold summary and y the model output. Relation Generation (RG) measures the precision and count of relations obtained from y that are found in the input table. Content Selection (CS) measures the precision, recall, and F-measure of relations extracted from y that are also found in ŷ. And Content Ordering (CO) measures the complement of the normalized Damerau-Levenshtein distance between the relations extracted from y and ŷ. Higher values are better for RG Precision, CS F-measure, CO, and BLEU. We reuse the IE model from Puduppully et al. (2019a) for ROTOWIRE, Puduppully and Lapata (2021) for MLB, and Hayashi et al. (2019) for German ROTOWIRE. Our computation of IE metrics for all systems includes duplicate records (Puduppully and Lapata, 2021).
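For illustration, the CO metric can be computed as the complement of the length-normalized Damerau-Levenshtein distance between the two record sequences. The sketch below uses the restricted (optimal string alignment) variant; the exact normalization used by the original evaluation scripts may differ:

```python
def dl_distance(a, b):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

def content_ordering(pred, gold):
    """Complement of the normalized Damerau-Levenshtein distance."""
    if not pred and not gold:
        return 1.0
    return 1.0 - dl_distance(pred, gold) / max(len(pred), len(gold))
```

A score of 1.0 means the two relation sequences are identical; swapping two adjacent relations costs a single transposition.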
In addition to IE-based metrics, we report the number of errors made by systems in terms of Number (incorrect number in digits, number spelled in words, etc.), Name (incorrect names of teams, players, days of week, etc.), and Word (errors in the usage of words), following the classification of Thomson and Reiter (2020). We detect such errors automatically using the system of Kasner et al. (2021).

To examine the importance of latent sequential planning, we also present a variant of our model which uniformly samples a plan from the pool E instead of using Equation (8) (see row w(ith) Uniform in Table 2). This version obtains lower values compared to SeqPlan across all metrics, underscoring the importance of sequential planning. We also present two variants of SeqPlan: (a) one which makes use of oracle (instead of predicted) plans during training to generate y^t; essentially, it replaces z^t with z* in Equation (12) (row w(ith) Oracle in Table 2); and (b) a two-stage model which trains the planner (Equation (15)) and generator (Equation (12)) separately (row 2-stage in Table 2); in this case, we use greedy decoding to sample z^t from Equation (15) instead of Gumbel-Softmax and replace z^t with z* in Equation (12). Both variants are comparable to SeqPlan in terms of RG P but worse in terms of CS F, CO, and BLEU.
Furthermore, we evaluate the accuracy of the inferred plans by comparing them against oracle plans, using the CS and CO metrics (computed over the entities and events in the plan).4 Table 4 shows that SeqPlan achieves higher CS F and CO scores than Macro. Again, this indicates planning is beneficial, particularly when taking the table and the generated summary into account.
English and German ROTOWIRE Results on ROTOWIRE are presented in Table 3 (top). In addition to Templ, ED+CC, NCP+CC, and ENT, we compare with the models of Wiseman et al. (2017) (WS-2017) and Rebuffel et al. (2020) (RBF-2020). WS-2017 is the best performing model of Wiseman et al. (2017). Note that ED+CC is an improved re-implementation of WS-2017. RBF-2020 represents the current state of the art on ROTOWIRE and comprises a Transformer encoder-decoder architecture (Vaswani et al., 2017) with hierarchical attention on entities and their records. The models of Saleh et al. (2019), Iso et al. (2019), and Gong et al. (2019) are not comparable as they make use of information additional to the table, such as previous/next games or the author of the game summary. The model of Narayan et al. (2020) is also not comparable as it relies on a pretrained language model (Rothe et al., 2020) to generate the summary sentences.
Table 3 (bottom) shows our results on German ROTOWIRE. We compare against NCP+CC's en-  and slightly higher in terms of BLEU.
On German, our model is best across metrics, achieving an RG P of 91.8%, which is higher by 42% (absolute) compared to Macro. In fact, the RG P of SeqPlan is superior to that of Saleh et al. (2019), whose model is pretrained with additional data and is considered state of the art (Hayashi et al., 2019). RG# is lower mainly because of a bug in the German IE model which excludes number records. RG# for NCP+CC and Macro is too high because the summaries contain a lot of repetition: the same record will repeat at least once with NCP+CC and three times with Macro, whereas only 7% of the records are repeated with SeqPlan.
Table 4 evaluates the quality of the plans inferred by our model on the ROTOWIRE dataset. As can be seen, SeqPlan is slightly worse than Macro in terms of CS F and CO. We believe this is because summaries in ROTOWIRE are somewhat formulaic, with a plan similar to Templ: an opening statement is followed by a description of the top-scoring players and a conclusion describing the next match. Such plans can be learnt well by Macro without access to the summary. MLB texts show a lot more diversity in terms of length and the sequencing of entities and events. The learning problem is also more challenging, as evidenced by the fact that the template system does not do very well in this domain (i.e., it is worse in BLEU, CS F, and CO compared to ROTOWIRE). On German ROTOWIRE, SeqPlan's plans achieve higher CS F and CO than Macro's.
Table 5 reports complementary automatic metrics on English ROTOWIRE aiming to assess the factuality of generated output. We find that Templ has the fewest Number, Name, and double-double errors. This is expected, as it simply reproduces facts from the table. SeqPlan and Macro have similar Number errors, and both are significantly better than the other neural models. SeqPlan has significantly more Name errors than Macro, and significantly fewer than the other neural models. Inspection of Name errors revealed that these are mostly due to incorrect information about next games. Such information is not part of the input, and models are prone to hallucinate it. SeqPlan fares worse as it attempts to discuss next games for both teams, while Macro focuses on one team only. In terms of double-double errors, SeqPlan is comparable to Macro, ENT, and NCP+CC, and significantly better than WS-2017, ED+CC, and RBF-2020.

Sample Efficiency
We also evaluated whether SeqPlan is more sample-efficient than Macro by examining how RG P varies with (training) data size. As shown in Figure 4, the difference between SeqPlan and Macro is more pronounced when relatively little data is available; as the amount of training data increases, the difference in RG P decreases. The slope of increase in RG P for Macro is steeper for ROTOWIRE than for MLB. We hypothesize this is because MLB has longer summaries with more paragraphs, making it more difficult for Macro to learn alignments between paragraph plans and text paragraphs in the game summary.

Human Evaluation
We used the Amazon Mechanical Turk (AMT) crowdsourcing platform for our judgment elicitation study. To ensure consistent ratings (van der Lee et al., 2019), we required that raters had completed at least 1,000 tasks and had an approval rate of at least 98%. Participants were restricted to English-speaking countries (USA, UK, Canada, Australia, Ireland, or New Zealand) and were allowed to provide feedback or ask questions. Raters were paid an average of $0.35 per task, ensuring that the remuneration was higher than the US minimum hourly wage. We compared SeqPlan with Gold, Templ, ED+CC, and Macro; we did not compare against ENT, as previous work (Puduppully and Lapata, 2021) has shown that it performs poorly against Macro. For ROTOWIRE, we additionally compared against RBF-2020.
Supported and Contradicted Facts Our first elicitation study provided raters with box scores (and play-by-plays in the case of MLB), along with sentences randomly extracted from game summaries. We asked them to count supported and contradicting facts (ignoring hallucinations). Participants were given a cheatsheet to help them understand box score and play-by-play statistics, as well as examples of sentences with the correct count of supported and contradicting facts. This evaluation was conducted on 40 summaries (20 for each dataset), with four sentences per summary, each rated by three participants. For MLB, this resulted in 300 tasks (5 systems × 20 summaries × 3 raters) and for ROTOWIRE in 360 (6 systems × 20 summaries × 3 raters). Altogether, we had 177 participants. Inter-rater agreement, measured with Krippendorff's α, was 0.43 for supported and contradicting facts. Templ achieves the lowest count of contradicting facts and the highest count of supported facts for both datasets. This is no surprise, as it essentially regurgitates facts (i.e., records) from the table. On MLB, all systems display a comparable count of supported facts (differences are not statistically significant), with the exception of Templ, which contains significantly more. In terms of contradicting facts, SeqPlan performs on par with Macro, Gold, and Templ, and is significantly better than ED+CC. On ROTOWIRE, in terms of supported facts, SeqPlan performs on par with the other neural models, significantly higher than Gold, and significantly lower than Templ. In terms of contradicting facts, SeqPlan performs on par with Macro, Gold, and Templ, and significantly better than ED+CC and RBF-2020.

Coherence, Grammaticality, and Conciseness
In our second study, raters were asked to choose the better summary from a pair of summaries based on Coherence (is the summary well structured and well organized, and does it have a natural ordering of the facts?), Conciseness (does the summary avoid unnecessary repetition, including whole sentences, facts, or phrases?), and Grammaticality (is the summary written in well-formed English?). For this study, we required that raters be able to comfortably comprehend summaries of NBA/MLB games. We obtained ratings using Best-Worst scaling (Louviere and Woodworth, 1991; Louviere et al., 2015), an elicitation paradigm shown to be more accurate than Likert scales. The score for a system is the number of times it is rated best minus the number of times it is rated worst (Orme, 2009). Scores range from −100 (absolutely worst) to +100 (absolutely best); higher is better. We assessed 40 summaries from the test set (20 for each dataset). Each summary pair was rated by three participants. For MLB, we created 1,800 tasks (10 system pairs × 20 summaries × 3 raters × 3 dimensions) and 2,700 for ROTOWIRE (15 system pairs × 20 summaries × 3 raters × 3 dimensions). Altogether, 377 raters participated in this task. Inter-rater agreement, measured with Krippendorff's α, was 0.49.
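The best-worst score just described, normalized so that it ranges from −100 to +100, can be sketched as follows; the `judgments` tuple layout is an illustrative assumption, not our actual annotation format:

```python
from collections import defaultdict

def best_worst_scores(judgments):
    """judgments: list of (best_system, worst_system, shown_systems) tuples,
    one per pairwise comparison. Returns per-system scores in [-100, 100]:
    100 * (times rated best - times rated worst) / times shown."""
    best = defaultdict(int)
    worst = defaultdict(int)
    shown = defaultdict(int)
    for b, w, systems in judgments:
        best[b] += 1
        worst[w] += 1
        for s in systems:
            shown[s] += 1
    return {s: 100.0 * (best[s] - worst[s]) / shown[s] for s in shown}
```

A system rated best in every comparison it appears in scores +100, and one rated worst in every comparison scores −100.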
On MLB, SeqPlan is significantly more coherent than ED+CC and Templ, and comparable with Gold and Macro. A similar picture emerges for grammaticality. SeqPlan is as concise as Gold, Macro, and Templ, and significantly better than ED+CC. On ROTOWIRE, SeqPlan is significantly more coherent than Templ and ED+CC, but on par with Macro, RBF-2020, and Gold. In terms of conciseness, SeqPlan is comparable with Gold, Macro, RBF-2020, and ED+CC, and significantly better than Templ. In terms of grammaticality, SeqPlan is comparable with Macro, RBF-2020, and ED+CC, significantly better than Templ, and significantly worse than Gold.

Discussion
In this work, we proposed a novel sequential latent variable model for joint macro planning and generation. Key in our approach is the creation of a latent plan in a sequential manner, while interleaving the prediction of plans and the generation of corresponding paragraphs. We proposed to deconstruct monolithic long-document generation into smaller units (paragraphs in our case), which affords flexibility and better communication between planning and generation. Taken together, the results of automatic and human evaluation suggest that SeqPlan performs best in terms of factuality and coherence, generates diverse and overall fluent summaries, and is less data-hungry compared to strong systems like Macro and NCP+CC. As SeqPlan does not have to learn alignments between the macro plan and the output text, it is better suited for long-form generation. Potential applications include summarizing books (Kryściński et al., 2021), where the output can be longer than 1,000 tokens, or generating financial reports (Kogan et al., 2009; Händschke et al., 2018), where the output exceeds 9,000 tokens. Existing approaches for long-form generation summarize individual paragraphs independently (Kryściński et al., 2021) or adopt a hierarchical approach (Wu et al., 2021), where summaries of paragraphs form the basis of chapter summaries, which in turn are composed into a book summary.

Table 7 gives an example of SeqPlan output:

<P> The Cardinals have won four of five, and have won four in a row. <P> The Cardinals have won four of five, including a three-game sweep by the Brewers. <P> Brian Barton led off the ninth with a pinch-hit double off Derrick Turnbow (0-1) and moved to third on Cesar Izturis' sacrifice bunt. Schumaker drove in Barton with a double down the left-field line. <P> Ryan Braun, who had two hits, led off the eighth with a double off Ryan Franklin (1-1). Braun went to third on a wild pitch and scored on Corey Hart's triple into the right-field corner. <P> Albert Pujols was intentionally walked to load the bases with one out in the eighth, and Guillermo Ankiel flied out. Troy Glaus walked to load the bases for Kennedy, who hit a sacrifice fly off Guillermo Mota. <P> Ryan Franklin (1-1) got the win despite giving up a run in the eighth. Ryan Braun led off with a double and scored on Corey Hart's one-out triple. <P> Jason Isringhausen pitched a perfect ninth for his seventh save in nine chances. He has converted his last six save opportunities and has n't allowed a run in his last three appearances. <P> The Brewers lost for the seventh time in eight games. <P> Wainwright allowed two runs and four hits in seven innings. He walked four and struck out six. <P> Brewers manager Ron Roenicke was ejected by home plate umpire Bill Miller for arguing a called third strike. <P> The Cardinals took a 2-0 lead in the third. Albert Pujols walked with two outs and Rick Ankiel walked. Glaus then lined a two-run double into the left-field corner. <P> The Brewers tied it in the third. Jason Kendall led off with a double and scored on Rickie Weeks' double. Ryan Braun's RBI single tied it at 2. <P> Villanueva allowed two runs and three hits in seven innings. He walked four and struck out one.

Table 7: Predicted macro plan (top) and generated output from our model. Transitions between paragraph plans are shown using →. Paragraphs are separated with <P> delimiters. Entities and events in the summary corresponding to the macro plan are boldfaced.

We see that the game summary follows the macro plan closely. In addition, the paragraph plans and the paragraphs exhibit coherent ordering. Manual inspection of SeqPlan summaries reveals that a major source of errors in MLB relates to attention diffusing over long paragraph plans. As an example, consider the following paragraph produced by SeqPlan: "Casey Kotchman had three hits and three RBIs, including a two-run double in the second inning that put the Angels up 2-0. Torii Hunter had three hits and drove in a run." In reality, Torii Hunter had two hits, but the model incorrectly carries over the hit count from Casey Kotchman. The corresponding paragraph plan is 360 tokens long, and attention fails to discern important tokens. A more sophisticated encoder, e.g., based on Transformers (Vaswani et al., 2017), could make attention more focused. In ROTOWIRE, the majority of errors involve numbers (e.g., team attributes) and numerical comparisons. Incorporating pre-executed operations such as min and max (Nie et al., 2018) could help alleviate these errors.
Finally, it is worth mentioning that although the template models achieve the highest RG precision for both MLB and ROTOWIRE (Tables 2 and 3), this is mainly because they repeat facts from the table. Template models score low on the CS F, CO, and BLEU metrics. In addition, they obtain the lowest scores for Grammaticality and Coherence (Table 6), which indicates that they are poor at selecting records from the table and ordering them correctly in a fluent manner.

Figure 2: Conceptual sequence of interleaved planning and generation steps. The paragraph plan and its corresponding paragraph have the same color.

Figure 3: Model workflow. Solid arrows show dependencies between random variables. Dashed arrows show the computation graph, whose backbone consists of an LSTM_text and an LSTM_plan. Note that the variational model and the generative model are tied closely through the shared LSTM. To generate long documents, the model observes what has already been generated, decides on a plan about what to discuss next, uses this plan to guide next-stage generation, and repeats until the end.
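The interleaved loop in Figure 3 can be sketched as a few lines of control flow; `choose_plan` and `generate_paragraph` below are toy stand-ins for the variational plan module and the paragraph decoder, not the actual model:

```python
# Toy skeleton of the interleaved planning-generation loop.

def choose_plan(paragraph_plans, history):
    """Stand-in: pick the first plan not yet used. The real model instead
    samples a latent plan conditioned on previously generated text."""
    for plan in paragraph_plans:
        if plan not in history:
            return plan
    return None  # end of document

def generate_paragraph(plan):
    """Stand-in for the paragraph decoder conditioned on the chosen plan."""
    return f"<paragraph for {plan}>"

def generate_document(paragraph_plans):
    """Interleave planning and generation: plan, generate, repeat."""
    history, paragraphs = [], []
    while True:
        plan = choose_plan(paragraph_plans, history)
        if plan is None:
            break
        history.append(plan)                       # condition next plan on past decisions
        paragraphs.append(generate_paragraph(plan))  # condition text on current plan
    return " <P> ".join(paragraphs)
```

The point of the skeleton is the dependency structure: each planning step sees the history of previous decisions, and each paragraph is generated from the plan chosen at that step.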

Figure 4: Sample efficiency for (a) MLB and (b) ROTOWIRE datasets. SeqPlan and Macro are trained on different portions (%) of the training dataset and performance is measured with RG P (%).

Table 5: Number, Name, and double-double (Word) errors per example. Systems significantly different from SeqPlan are marked with an asterisk (*) (using one-way ANOVA with posthoc Tukey HSD tests; p ≤ 0.05).

Table 6: Average number of supported (#Supp) and contradicting (#Contra) facts in game summaries, and best-worst scaling evaluation for Coherence (Coher), Conciseness (Concis), and Grammaticality (Gram). Lower is better for contradicting facts; higher is better for Coherence, Conciseness, and Grammaticality. Systems significantly different from SeqPlan are marked with an asterisk (*) (using one-way ANOVA with posthoc Tukey HSD tests; p ≤ 0.05).
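The significance test used in Tables 5 and 6 is a one-way ANOVA followed by posthoc Tukey HSD. A minimal sketch of the ANOVA F statistic is below (the Tukey HSD step is omitted; in practice one would use a statistics package such as statsmodels):

```python
def one_way_anova_f(groups):
    """F statistic for a one-way ANOVA, where `groups` is a list of
    per-system score lists (e.g., per-summary error counts)."""
    k = len(groups)                       # number of systems
    n = sum(len(g) for g in groups)       # total observations
    grand_mean = sum(sum(g) for g in groups) / n
    # between-group sum of squares (variation of group means around grand mean)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # within-group sum of squares (variation inside each group)
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    df_between, df_within = k - 1, n - k
    return (ss_between / df_between) / (ss_within / df_within)
```

The F value is then compared against the F distribution with (k−1, n−k) degrees of freedom to obtain a p-value; groups with identical means yield F = 0.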

Table 6 (columns #Supp and #Contra) presents our results. Lower is better for contradicting facts. In the case of supported facts, the count should neither be too high nor too low: a high count of supported facts indicates poor content selection, whereas a low count of supported facts together with a high count of contradicting facts indicates low generation accuracy.