Abstract
We consider the task of data-to-text generation, which aims to create textual output from non-linguistic input. We focus on generating long-form text, that is, documents with multiple paragraphs, and propose a neural model enhanced with a planning component responsible for organizing high-level information in a coherent and meaningful way. We infer latent plans sequentially with a structured variational model, while interleaving the steps of planning and generation. Text is generated by conditioning on previous variational decisions and previously generated text. Experiments on two data-to-text benchmarks (RotoWire and MLB) show that our model outperforms strong baselines and is sample-efficient in the face of limited training data (e.g., a few hundred instances).
1 Introduction
Data-to-text generation refers to the task of generating textual output from non-linguistic input such as database tables, spreadsheets, or simulations of physical systems (Reiter and Dale, 1997, 2000; Gatt and Krahmer, 2018). Recent progress in this area (Mei et al., 2016; Lebret et al., 2016; Wiseman et al., 2017) has been greatly facilitated by the very successful encoder-decoder neural architecture (Sutskever et al., 2014) and the development of large-scale datasets. RotoWire (Wiseman et al., 2017) and MLB (Puduppully et al., 2019b) constitute such examples. Both focus on the sports domain, which has historically drawn attention in the generation community (Barzilay and Lapata, 2005; Tanaka-Ishii et al., 1998; Robin, 1994), and consider the problem of generating long target texts from database records.
Figure 1 (reproduced from Puduppully and Lapata, 2021) provides a sample from the MLB dataset, which pairs human-written summaries (Table C in Figure 1) with major league baseball game statistics. These are mostly scores (collectively referred to as the box score) which summarize the performance of teams and players, for example, batters, pitchers, or fielders (Table A in Figure 1), and a play-by-play description of the most important events in the game (Table B in Figure 1). Game summaries in MLB are relatively long (540 tokens on average) with multiple paragraphs (15 on average). The complexity of the input and the length of the game summaries pose various challenges to neural models, which, despite producing fluent output, are often imprecise, prone to hallucinations, and display poor content selection (Wiseman et al., 2017). Attempts to address these issues have seen the development of special-purpose modules that keep track of salient entities (Iso et al., 2019; Puduppully et al., 2019b), determine which records (see the rows in Tables A and B in Figure 1) should be mentioned in a sentence and in which order (Puduppully et al., 2019a; Narayan et al., 2020), and reconceptualize the input in terms of paragraph plans (Puduppully and Lapata, 2021) to facilitate document-level planning (see Table D in Figure 1).
Figure 1: Example from the MLB dataset, reproduced from Puduppully and Lapata (2021) with the authors’ permission. Table A is typically referred to as a box score; it summarizes the data of the game per team and player. Table B reports statistics pertaining to innings or play-by-play scores. Table C contains the game summary; paragraphs are separated with blue <P> delimiters. Table D contains paragraph plans obtained from Tables A and B: plans in the first column correspond to a single entity or event, while plans in the second column describe combinations of entities or events. <V(entity)> verbalizes records pertaining to entities and <V(inning-T/B)> verbalizes records for the Top/Bottom side of an inning. Paragraph plans correspond to paragraphs in Table C. Table E contains the macro plan for the document in Table C; a macro plan is a sequence of paragraph plans. Plan-document correspondences are highlighted using the same color.
Specifically, Puduppully and Lapata (2021) advocate the use of macro plans for improving the organization of document content and structure. A macro plan is a sequence of paragraph plans, and each paragraph plan corresponds to a document paragraph. A macro plan is shown in Table E (Figure 1). Examples of paragraph plans are given in Table D, where <V(entity)> verbalizes records pertaining to entities and <V(inning-T/B)> verbalizes records for the Top/Bottom side of an inning. Verbalizations are sequences of record types followed by their values. Document paragraphs are shown in Table C and have the same color as their corresponding plans in Table E. During training, Puduppully and Lapata (2021) learn to predict a macro plan from a pool of paragraph plans, and produce a game summary based on it. Continuing with our example in Figure 1, plan (E) is obtained from paragraph plans (D) and gives rise to game summary (C).
The intermediate macro plan renders generation more interpretable (differences in the output can be explained by differences in macro planning). It also makes modeling easier: the input is no longer a complicated table but a sequence of paragraph plans, which in turn allows us to treat data-to-text generation as a sequence-to-sequence learning problem. Nevertheless, decoding to a long document remains challenging for at least two reasons. Firstly, the macro plan may be encoded as a sequence, but a very long one (more than 3,000 tokens), which the decoder has to attend to at each time step in order to generate a summary token-by-token. Secondly, the prediction of the macro plan is conditioned solely on the input (i.e., the pool of paragraph plans (D) in Figure 1) and does not make use of information present in the summaries. We hypothesize that planning would be more accurate were it to consider information available in the table (and corresponding paragraph plans) and the generated summary, more so because the plans are coarse-grained and there is a one-to-many relationship between a paragraph plan and its realization. For example, we can see that the plan for <V(B.Keller)> results in two very different realizations in the summary in Figure 1 (see the first and third paragraphs).
In this work, we present a model which interleaves macro planning with text generation (see Figure 2 for a sketch of the approach). We begin by selecting a plan from a pool of paragraph plans (see Table D in Figure 1) and generate the first paragraph by conditioning on it. We select the next plan by conditioning on the previous plan and the previously generated paragraph. We generate the next paragraph by conditioning on the currently selected plan, the previously predicted plan, and the previously generated paragraph. We repeat this process until the final paragraph plan is predicted. We model the selection of paragraph plans as a sequential latent variable process, which we argue is intuitive since content planning is inherently latent. Unlike Puduppully and Lapata (2021), we do not a priori decide on a global macro plan. Rather, our planning process is incremental and, as a result, less rigid. Planning is informed by generation and vice versa, which we argue should be mutually beneficial (they are conditioned on each other).
Figure 2: Conceptual sequence of interleaved planning and generation steps. The paragraph plan and its corresponding paragraph have the same color.
During training, the sequential latent model can better leverage the summary to render paragraph plan selection more accurate and take previous decisions into account. We hypothesize that the interdependence between planning and generation allows the model to cope with diversity. In general, there can be many ways in which the input table can be described in the output summary, that is, different plans give rise to equally valid game summaries. The summary in Figure 1 (Table C) focuses on the performance of Brad Keller, who is a high-scoring pitcher (first three paragraphs). An equally plausible summary might have discussed a high-scoring batter first (e.g., Ryan O’Hearn). Also notice that the summary describes innings in chronological order. However, another ordering might have been equally plausible, for example, describing first the innings where the most runs are scored, or the innings that are important in flipping the outcome of the match. In the face of such diversity, there may never be enough data to learn an accurate global plan. It is easier to select a paragraph plan from the pool once some of the summary is known, and different plans can be predicted for the same input. In addition, the proposed model is end-to-end differentiable, and gradients for summary prediction also inform plan prediction.
Our contributions can be summarized as follows: (1) We decompose data-to-text generation into sequential plan selection and paragraph generation. The two processes are interleaved and generation proceeds incrementally. We look at what has already been generated, make a plan on what to discuss next, realize the plan, and repeat; (2) in contrast to previous models (Puduppully et al., 2019a; Puduppully and Lapata, 2021), where content plans are monolithic and determined in advance, our approach is more flexible: it simplifies modeling (we do not need to learn alignments between paragraph plans and summary paragraphs) and leads to sample efficiency in low-resource scenarios; (3) our approach scales better for tasks involving the generation of long multi-paragraph texts, as we do not need to specify the document plan in advance; and (4) experimental results on English and German RotoWire (Wiseman et al., 2017; Hayashi et al., 2019) and MLB (Puduppully et al., 2019b) show that our model is well-suited to long-form generation and generates more factual, coherent, and less repetitive output compared to strong baselines.
We share our code and models in the hope that they will prove useful for other tasks (e.g., story generation, summarization).1
2 Related Work
A long tradition in natural language generation views content planning as a central component to identifying important content and structuring it appropriately (Reiter and Dale, 2000). Earlier work has primarily made use of hand-crafted content plans with some exceptions that pioneered learning-based approaches. For instance, Duboue and McKeown (2001) learn ordering constraints on the content plan, while Kan and McKeown (2002) learn content planners from semantically annotated corpora, and Konstas and Lapata (2013) predict content plans using grammar rules whose probabilities are learned from training data.
More recently, there have been attempts to equip encoder-decoder models (Bahdanau et al., 2015; Wiseman et al., 2017) with content planning modules. Puduppully et al. (2019a) introduce micro planning: They first learn a content plan corresponding to a sequence of records, and then generate a summary conditioned on it. Narayan et al. (2020) treat content selection as a task similar to extractive summarization. Specifically, they post-process Puduppully et al.’s (2019a) micro plans with special tokens identifying the beginning and end of a sentence. Their model first extracts sentence plans and then verbalizes them one-by-one by conditioning on previously generated sentences. Moryossef et al. (2019a, b) propose a two-stage approach that first predicts a document plan and then generates text based on it. The input to their model is a set of RDF ⟨Subject, Object, Predicate⟩ tuples. Their document plan is a sequence of sentence plans, where each sentence plan contains a subset of tuples in a specific order. Text generation is implemented using a sequence-to-sequence model enhanced with attention and copy mechanisms (Bahdanau et al., 2015). They evaluate their model on the WebNLG dataset (Gardent et al., 2017), where the outputs are relatively short (24 tokens on average).
Our approach is closest to Puduppully and Lapata (2021), who advocate macro planning as a way of organizing high-level document content. Their model operates over paragraph plans that are verbalizations of the tabular input and predicts a document plan as a sequence of paragraph plans. In a second stage, the summary is generated from the predicted plan making use of attention enriched with a copy mechanism. We follow their formulation of content planning as paragraph plan prediction. Our model thus operates over larger content units compared to related work (Puduppully et al., 2019a; Narayan et al., 2020) and performs the tasks of micro- and macro-planning in one go. In contrast to Puduppully and Lapata (2021), we predict paragraph plans and their corresponding paragraphs jointly in an incremental fashion. Our approach is reminiscent of psycholinguistic models of speech production (Levelt, 1993; Taylor and Taylor, 1990; Guhe, 2020), which postulate that different levels of processing (or modules) are responsible for language generation; these modules are incremental, each producing output as soon as the information it needs is available, and their output is processed immediately by the next module.
We assume that plans form a sequence (of paragraph plans), which we treat as a latent variable and learn with a structured variational model. Sequential latent variables (Chung et al., 2015; Fraccaro et al., 2016; Goyal et al., 2017) have previously found application in modeling attention in sequence-to-sequence networks (Shankar and Sarawagi, 2019), document summarization (Li et al., 2017), controllable generation (Li and Rush, 2020; Fu et al., 2020), and knowledge-grounded dialogue (Kim et al., 2020). In the context of data-to-text generation, latent variable models have been primarily used to inject diversity into the output. Shao et al. (2019) generate a sequence of groups (each essentially a subset of the input), where a group specifies the content of the sentence to be generated. Their plans receive no feedback from text generation, they cover a small set of input items, and they give rise to relatively short documents (approximately 100 tokens long). Ye et al. (2020) use latent variables to disentangle the content from the structure (operationalized as templates) of the output text. Their approach generates diverse output by sampling from the template-specific sample space. They apply their model to single-sentence generation tasks (Lebret et al., 2016; Reed et al., 2018).
3 Model
Following Puduppully and Lapata (2021), we assume that at training time our model has access to a pool of paragraph plans (see Table D in Figure 1), which represent a clustering of records. We explain how paragraph plans are created from tabular input in Section 4. Given this pool, we aim to generate a sequence of paragraphs y = [y1,…,yT] that describe the data following a sequence of chosen plans z = [z1,…,zT]. Let yt denote a paragraph, which can consist of multiple sentences, and T the number of paragraphs in a summary. With a slight abuse of notation, superscripts denote indices rather than exponentiation, so the i-th word in the t-th paragraph is written y_t^i (subscript for the paragraph, superscript for the word). A plan z = [z1,…,zT] is a list of discrete variables, where zt = j means that we choose the j-th item from the pool of candidate plans to guide the generation of paragraph yt.
Generation with Latent Plans
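As an orienting sketch of this section's model (writing $\mathcal{E}$ for the pool of candidate paragraph plans, a shorthand we introduce here; the precise parametrization is the one referenced later as Equations (8)–(15)), the interleaved process described above implies a sequential factorization of the joint distribution over paragraphs and plan choices:

$$p(y, z \mid \mathcal{E}) \;=\; \prod_{t=1}^{T} p(z_t \mid z_{<t},\, y_{<t},\, \mathcal{E})\; p(y_t \mid z_{\leq t},\, y_{<t},\, \mathcal{E}),$$

that is, each plan choice is conditioned on the planning and generation history, and each paragraph is conditioned on the current plan and that same history.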
Inference Model
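Because the posterior over the discrete plan choices is intractable, training maximizes a variational lower bound. A plausible sketch of its shape (consistent with the λ-weighted objective referenced later as Equation (18), whose exact form may differ) is

$$\mathcal{L} \;=\; \sum_{t=1}^{T} \mathbb{E}_{q}\Big[\log p(y_t \mid z_{\leq t}, y_{<t}, \mathcal{E})\Big] \;-\; \lambda \sum_{t=1}^{T} \mathbb{E}_{q}\Big[\mathrm{KL}\big(q(z_t \mid z_{<t}, y_{\leq t}, \mathcal{E}) \,\big\|\, p(z_t \mid z_{<t}, y_{<t}, \mathcal{E})\big)\Big],$$

where the variational distribution q, unlike the prior, can condition on the current paragraph yt; this is how the summary informs plan selection during training (Section 1).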
Neural Parametrization
Figure 3: Model workflow. Solid arrows show dependencies between random variables. Dashed arrows show the computation graph, whose backbone consists of an LSTM_text and an LSTM_plan. Note that the variational model and the generative model are tied closely through the shared LSTMs. To generate long documents, the model observes what has already been generated, decides on a plan for what to discuss next, uses this plan to guide the next stage of generation, and repeats until the end.
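To make the workflow concrete, here is a minimal inference-time sketch of the interleaved loop; the `model` object and its methods are hypothetical stand-ins for the LSTM_plan/LSTM_text backbone and the attention-based paragraph decoder, not the interface of our released code.

```python
# Minimal sketch of interleaved planning and generation at inference time.
# `model` bundles hypothetical stand-ins for the plan scorer (LSTM_plan),
# the paragraph decoder (LSTM_text), and their shared state updates.
def generate_summary(model, candidate_plans, max_paragraphs=20):
    summary = []
    state = model.initial_state()
    for t in range(max_paragraphs):
        # Choose the next paragraph plan given the history of previous
        # plans and previously generated paragraphs (carried in `state`).
        scores = [model.score_plan(plan, state) for plan in candidate_plans]
        z_t = max(range(len(scores)), key=scores.__getitem__)
        if candidate_plans[z_t] == "<EOP>":   # predicted end of the plan
            break
        # Realize the chosen plan as the next paragraph, conditioned on
        # the plan and on what has been generated so far.
        y_t = model.decode_paragraph(candidate_plans[z_t], state)
        # Planning is informed by generation and vice versa: both the
        # chosen plan and the new paragraph update the shared state.
        state = model.update_state(state, candidate_plans[z_t], y_t)
        summary.append(y_t)
    return " <P> ".join(summary)
```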
Although we primarily focus on inference and on how the latent plan can improve the generation of long documents, we note that the model sketched above could be parametrized differently, for example, by replacing the encoder and decoder with pretrained language models like BART (Lewis et al., 2020). We leave this to future work.
Training
4 Experimental Setup
Data
We performed experiments on the RotoWire (Wiseman et al., 2017) and MLB (Puduppully et al., 2019b) datasets, and on the German RotoWire provided as part of the WNGT 2019 DGT shared task on “Document-Level Generation and Translation” (Hayashi et al., 2019). Statistics on these datasets are shown in Table 1. We used the official train/dev/test splits: 3,398/727/728 for RotoWire, 22,821/1,739/1,744 for MLB, and 242/240/241 for German RotoWire. The latter is considerably smaller than its English counterpart and MLB, and serves to illustrate our model’s sample efficiency when training data is scarce.
Table 1: Dataset statistics for RotoWire (RW), MLB, and German RotoWire (DE-RW): vocabulary size, number of tokens, number of instances (i.e., table-summary pairs), number of paragraphs, number of record types, average number of records, average summary length, and average macro plan length measured in terms of number of paragraphs.
| | RW | MLB | DE-RW |
|---|---|---|---|
| Vocab Size | 11.3K | 38.9K | 9.5K |
| # Tokens | 1.5M | 14.3M | 234K |
| # Instances | 4.9K | 26.3K | 723 |
| # Paragraphs | 47.7K | 399K | 7K |
| # Record Types | 39 | 53 | 39 |
| Avg Records | 628 | 565 | 628 |
| Avg Length | 337.1 | 542.1 | 323.6 |
| Avg Plan Length | 10.6 | 15.1 | 9.5 |
All three datasets were preprocessed following the method of Puduppully and Lapata (2021). A paragraph plan for an entity is constructed by verbalizing its records as a fixed sequence of record types, each followed by its value. For example, pitcher B.Keller from Figure 1 would be verbalized as <PLAYER> B.Keller <H/V> V <W> 7 <L> 5 <IP> 8 <PH> 4 …. We denote this using the shorthand <V(B.Keller)>. The paragraph plan for an event is the verbalization of the players in the event followed by the verbalization of play-by-plays. Candidate paragraph plans are obtained by enumerating entities and events and their combinations (see Table D in Figure 1). Oracle macro plans are obtained by matching the mentions of entities and events in the gold summary with the input table. We make use of these oracle macro plans during training. The versions of MLB and RotoWire released by Puduppully and Lapata (2021) contain paragraph delimiters for gold summaries; we preprocessed the German RotoWire in a similar fashion.
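To make this concrete, a small sketch of how such a verbalization string could be assembled follows; the helper and the record-type list are illustrative, not the dataset's exact preprocessing code.

```python
# Illustrative sketch of entity verbalization: record types are emitted
# in a fixed order, each tag followed by its value. Field names mirror
# the example above but are not the dataset's full record-type inventory.
def verbalize_entity(name, records, type_order):
    tokens = ["<PLAYER>", name]
    for rtype in type_order:                  # fixed record-type order
        if rtype in records:
            tokens += ["<%s>" % rtype, str(records[rtype])]
    return " ".join(tokens)

plan = verbalize_entity(
    "B.Keller",
    {"H/V": "V", "W": 7, "L": 5, "IP": 8, "PH": 4},
    ["H/V", "W", "L", "IP", "PH"])
# -> '<PLAYER> B.Keller <H/V> V <W> 7 <L> 5 <IP> 8 <PH> 4'
```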
Table 1 also shows the average length of the macro plan in terms of the number of paragraph plans it contains. This is 10.6 for RotoWire, 15.1 for MLB, and 9.5 for German RotoWire.
Training Configuration
We train our model with the AdaGrad optimizer (Duchi et al., 2011) and tune parameters on the development set. We use a learning rate of 0.15. We learn a joint subword vocabulary (Sennrich et al., 2016) for paragraph plans and summaries with 6K merge operations for RotoWire, 16K merge operations for MLB, and 2K merge operations for German RotoWire. The model is implemented on a fork of OpenNMT-py (Klein et al., 2017). For efficiency, we batch summaries instead of individual paragraphs. Batch sizes for MLB, RotoWire, and German RotoWire are 8, 5, and 1, respectively. We set λ to 2 in Equation (18). In Equation (19), c is 1/100000 for MLB, 1/50000 for RotoWire, and 1/30000 for German RotoWire. We set the temperature of the Gumbel-Softmax to 0.1.
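For reference, the Gumbel-Softmax relaxation mentioned above can be sketched as follows; this shows the standard technique with the temperature used here (0.1), not our exact implementation.

```python
import torch
import torch.nn.functional as F

# Standard Gumbel-Softmax relaxation (a sketch, not the released code):
# perturb the plan scores with Gumbel noise and take a low-temperature
# softmax, giving a differentiable, near one-hot selection.
def gumbel_softmax_sample(logits, tau=0.1):
    uniform = torch.rand_like(logits)
    gumbel = -torch.log(-torch.log(uniform + 1e-20) + 1e-20)
    return F.softmax((logits + gumbel) / tau, dim=-1)

scores = torch.randn(16)                    # scores over 16 candidate plans
z_relaxed = gumbel_softmax_sample(scores)   # sums to 1, nearly one-hot
# PyTorch's built-in F.gumbel_softmax(scores, tau=0.1) behaves the same way.
```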
During inference on MLB, similar to Puduppully and Lapata (2021), we block the repetition of paragraph plan bigrams (i.e., we disallow the repetition of (zt, zt+1)) and select the paragraph plan with the next highest probability in Equation (8). In addition, we block consecutive repetitions, and more than two repetitions, of a unigram. During training we observed high variance in the length of paragraphs yt, since the same plan can result in a shorter or longer paragraph. For example, <V(B.Keller)> corresponds to two paragraphs (first and third) with different lengths in Figure 1. We found that this encourages the model to be conservative and generate relatively short output. We control paragraph length (Fan et al., 2018) by creating discrete length bins, each containing approximately an equal number of paragraphs. During training, we prepend the embedding of the bin to the current plan (see Equation (11)). For inference, bins are tuned on the validation set.
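The repetition constraints can be sketched as a simple filter over candidate plan indices; the helper below is illustrative, and the released code may implement the checks differently.

```python
from collections import Counter

# Illustrative check for the repetition constraints described above:
# block a candidate plan index if it immediately repeats the previous
# choice, if it has already been chosen twice, or if it would repeat an
# earlier plan bigram (z_{t-1}, z_t).
def is_blocked(candidate, history):
    if history and candidate == history[-1]:        # consecutive repeat
        return True
    if Counter(history)[candidate] >= 2:            # more than two uses
        return True
    seen_bigrams = set(zip(history, history[1:]))
    return bool(history) and (history[-1], candidate) in seen_bigrams

history = [3, 7, 3]
assert is_blocked(3, history)        # third occurrence of plan 3
assert not is_blocked(5, history)    # plan 5 is still allowed
```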
We run inference for up to 15 paragraphs on RotoWire and German RotoWire, and for up to 20 paragraphs on MLB; we stop early when the model predicts the end-of-plan token EOP. Unlike previous work (Wiseman et al., 2017; Puduppully et al., 2019a, b, inter alia), we do not make use of truncated Back Propagation Through Time (Williams and Peng, 1990), as we incrementally generate paragraphs instead of long documents.
System Comparisons
We compared our model with: (1) a Template-based generator which creates a document consisting of template sentences. We used Wiseman et al.’s (2017) system on RotoWire and Puduppully et al.’s (2019b) system on MLB. Both are similar in that they describe team scores followed by player-specific statistics and a concluding statement. On MLB, the template additionally describes play-by-play details. We also created a template system for German RotoWire following a similar approach. (2) ED+CC, the best performing model of Wiseman et al. (2017). It consists of an encoder-decoder model equipped with attention and copy mechanisms. (3) NCP+CC, the micro planning model of Puduppully et al. (2019a). It first creates a content plan by pointing to input records through the use of Pointer Networks (Vinyals et al., 2015). The content plan is then encoded with a BiLSTM and decoded using another LSTM with attention and copy mechanisms. (4) ENT, the entity model of Puduppully et al. (2019b). It creates entity-specific representations which are updated dynamically. At each time step during decoding, the model uses hierarchical attention, attending over entity representations and the records corresponding to them. (5) MACRO, the two-stage planning model of Puduppully and Lapata (2021), which first makes use of Pointer Networks (Vinyals et al., 2015) to predict a macro plan from a set of candidate paragraph plans. The second stage takes the predicted plan as input and generates the game summary with a sequence-to-sequence model enhanced with attention and copy mechanisms. In addition, we compare with a variant of Macro enhanced with length control (+Bin).
5 Results
Our experiments were designed to explore how the proposed model compares to related approaches which are either not enhanced with planning modules or non-incremental. We also investigated the sample efficiency of these models and the quality of the predicted plans when these are available. The majority of our results focus on automatic evaluation metrics. We also follow previous work (Wiseman et al., 2017; Puduppully et al., 2019a, b; Puduppully and Lapata, 2021) in eliciting judgments to evaluate system output.
5.1 Automatic Evaluation
We evaluate model output using BLEU (Papineni et al., 2002) with the gold summary as a reference. We also report model performance against the Information Extraction (IE) metrics of Wiseman et al. (2017), which are defined based on the output of an IE model that extracts entity (team and player names) and value (numbers) pairs from the summary and predicts the type of relation between them.
Let ŷ be the gold summary and y the model output. Relation Generation (RG) measures the precision and count of relations obtained from y that are found in the input table. Content Selection (CS) measures the precision, recall, and F-measure of relations extracted from y that are also found in ŷ. And Content Ordering (CO) measures the complement of the normalized Damerau-Levenshtein distance between relations extracted from y and ŷ. Higher values are better for RG Precision, CS F-measure, CO, and BLEU. We reuse the IE model from Puduppully et al. (2019a) for RotoWire, Puduppully and Lapata (2021) for MLB, and Hayashi et al. (2019) for German RotoWire. Our computation of IE metrics for all systems includes duplicate records (Puduppully and Lapata, 2021).
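For concreteness, the CO metric can be sketched as follows; this uses the restricted (optimal string alignment) variant of the Damerau-Levenshtein distance and normalizes by the longer sequence, which may differ in minor ways from the released evaluation scripts.

```python
# Sketch of CO: complement of the normalized Damerau-Levenshtein distance
# between the record sequences extracted from the model output and from
# the gold summary.
def damerau_levenshtein(a, b):
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

def content_ordering(pred_records, gold_records):
    dist = damerau_levenshtein(pred_records, gold_records)
    denom = max(len(pred_records), len(gold_records), 1)
    return 100.0 * (1.0 - dist / denom)   # reported as DLD%
```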
In addition to IE-based metrics, we report the number of errors made by systems according to Number (incorrect number in digits, number spelled out in words, etc.), Name (incorrect names of teams, players, days of the week, etc.), and Word (errors in word usage), following the classification of Thomson and Reiter (2020). We detect such errors automatically using the system of Kasner et al. (2021), which scored best against gold-standard human annotations of the same type (Thomson and Reiter, 2021). We only report these metrics for English RotoWire, since error annotations (for automatic metric learning) are not available for the other datasets. Moreover, with regard to Word errors, we only report errors for incorrect usage of the word double-double.3 We found such errors to be detected reliably, in contrast to Word errors as a whole, for which the precision of the system of Kasner et al. (2021) is ∼50%. Lower values are better for the Number, Name, and double-double errors. We note that metrics such as RG precision, Number, Name, and double-double errors directly measure the accuracy of the generation model, whereas CS, CO, and BLEU measure how similar model output is to a reference summary. Thus, CS, CO, and BLEU measure generation accuracy indirectly, under the assumption that gold summaries are accurate.
MLB Dataset
Table 2 summarizes our results on MLB. Our sequential planning model (SeqPlan) has the highest RG P among neural models and performs best in terms of CS F, CO, and BLEU. The variant of Macro with length control (+Bin) performs comparably to or worse than Macro.
Table 2: MLB results (test set); relation generation (RG) count (#) and precision (P%), content selection (CS) precision (P%), recall (R%), and F-measure (F%), content ordering (CO) as complement of normalized Damerau-Levenshtein distance (DLD%), and BLEU.
| MLB | RG # | RG P% | CS P% | CS R% | CS F% | CO DLD% | BLEU |
|---|---|---|---|---|---|---|---|
| Templ | 62.3 | 99.9 | 21.6 | 55.2 | 31.0 | 11.0 | 4.12 |
| ED+CC | 32.5 | 91.3 | 27.8 | 40.6 | 33.0 | 17.1 | 9.68 |
| NCP+CC | 19.6 | 81.3 | 44.5 | 44.1 | 44.3 | 21.9 | 9.68 |
| ENT | 23.8 | 81.1 | 40.9 | 49.5 | 44.8 | 20.7 | 11.50 |
| Macro | 30.8 | 94.4 | 40.8 | 54.9 | 46.8 | 21.8 | 12.62 |
| +Bin | 31.2 | 93.7 | 38.3 | 52.4 | 44.2 | 21.6 | 12.32 |
| SeqPlan | 28.9 | 95.9 | 43.3 | 53.5 | 47.8 | 22.7 | 14.29 |
| w Uniform | 18.5 | 90.9 | 36.5 | 30.6 | 33.3 | 14.5 | 10.30 |
| w Oracle | 27.6 | 95.9 | 42.5 | 50.4 | 46.1 | 22.0 | 13.13 |
| 2-stage | 28.6 | 95.9 | 41.4 | 50.8 | 45.6 | 21.3 | 13.96 |
To examine the importance of latent sequential planning, we also present a variant of our model that uniformly samples a plan from the pool instead of using Equation (8) (see row w(ith) Uniform in Table 2). This version obtains lower values than SeqPlan across all metrics, underscoring the importance of sequential planning. We also present two further variants of SeqPlan: (a) one that makes use of oracle (instead of predicted) plans during training to generate yt; essentially, it replaces zt with z* in Equation (12) (row w(ith) Oracle in Table 2); and (b) a two-stage model that trains the planner (Equation (15)) and generator (Equation (12)) separately (row 2-stage in Table 2); in this case, we use greedy decoding to sample zt from Equation (15) instead of Gumbel-Softmax and replace zt with z* in Equation (12). Both variants are comparable to SeqPlan in terms of RG P but worse in terms of CS F, CO, and BLEU.
Furthermore, we evaluate the accuracy of the inferred plans by comparing them against oracle plans, using the CS and CO metrics (computed over the entities and events in the plan).4 Table 4 shows that SeqPlan achieves higher CS F and CO scores than Macro. Again, this indicates that planning is beneficial, particularly when taking the table and the generated summary into account.
English and German RotoWire
Results on RotoWire are presented in Table 3 (top). In addition to Templ, ED+CC, NCP+CC, and ENT, we compare with the models of Wiseman et al. (2017) (WS-2017) and Rebuffel et al. (2020) (RBF-2020). WS-2017 is the best performing model of Wiseman et al. (2017); note that ED+CC is an improved re-implementation of WS-2017. RBF-2020 represents the current state of the art on RotoWire and is composed of a Transformer encoder-decoder architecture (Vaswani et al., 2017) with hierarchical attention on entities and their records. The models of Saleh et al. (2019), Iso et al. (2019), and Gong et al. (2019) are not comparable, as they make use of information additional to the table, such as previous/next games or the author of the game summary. The model of Narayan et al. (2020) is also not comparable, as it relies on a pretrained language model (Rothe et al., 2020) to generate the summary sentences.
Table 3: Evaluation on the RotoWire (RW) and German RotoWire (DE-RW) test sets; relation generation (RG) count (#) and precision (P%), content selection (CS) precision (P%), recall (R%), and F-measure (F%), content ordering (CO) as complement of normalized Damerau-Levenshtein distance (DLD%), and BLEU.
| RW | RG # | RG P% | CS P% | CS R% | CS F% | CO DLD% | BLEU |
|---|---|---|---|---|---|---|---|
| Templ | 54.3 | 99.9 | 27.1 | 57.7 | 36.9 | 13.1 | 8.46 |
| WS-2017 | 34.1 | 75.1 | 20.3 | 36.3 | 26.1 | 12.4 | 14.19 |
| ED+CC | 35.9 | 82.6 | 19.8 | 33.8 | 24.9 | 12.0 | 14.99 |
| NCP+CC | 40.8 | 87.6 | 28.0 | 51.1 | 36.2 | 15.8 | 16.50 |
| ENT | 32.7 | – | 34.7 | 48.5 | 40.5 | 16.6 | 16.12 |
| RBF-2020 | 44.9 | 89.5 | 23.9 | 47.0 | 31.7 | 14.3 | 17.16 |
| Macro | 42.1 | 97.6 | 34.1 | 57.8 | 42.9 | 17.7 | 15.46 |
| +Bin | 61.0 | 97.2 | 26.8 | 66.1 | 38.2 | 15.8 | 16.48 |
| SeqPlan | – | 97.6 | – | – | – | – | – |
| w Uniform | 22.0 | 80.2 | 18.2 | 19.6 | 18.9 | 6.0 | 8.61 |
| w Oracle | 50.4 | 97.2 | 29.0 | 59.1 | 38.9 | 16.8 | 16.32 |
| 2-stage | 53.4 | 97.5 | 28.5 | 61.3 | 38.9 | 16.1 | 16.61 |

| DE-RW | RG # | RG P% | CS P% | CS R% | CS F% | CO DLD% | BLEU |
|---|---|---|---|---|---|---|---|
| Templ | 54.4 | 99.9 | 17.2 | 63.0 | 27.1 | 11.6 | 7.32 |
| ED+CC | – | 59.3 | 6.7 | 18.8 | 9.9 | 6.8 | 5.09 |
| NCP+CC | 17.7 | 52.5 | 11.3 | 15.7 | 9.6 | – | – |
| ENT | 17.4 | 24.0 | 6.52 | – | – | – | – |
| RBF-2020 | 0.2 | 4.0 | 1.1 | 0.4 | 0.6 | 0.3 | 2.29 |
| Macro | 30.2 | 49.7 | 5.1 | 21.0 | 8.3 | 6.1 | 5.15 |
| +Bin | 20.4 | 55.0 | 7.9 | 20.0 | 11.3 | 8.1 | 6.18 |
| SeqPlan | 13.8 | 91.8 | 38.0 | 38.4 | 38.2 | 21.2 | 8.65 |
Table 3 (bottom) shows our results on German RotoWire. We compare against NCP+CC’s entry in the WNGT 2019 shared task5 (Hayashi et al., 2019), and our implementation of Templ, ED+CC, ENT, Macro, and RBF-2020. Saleh et al. (2019) are not comparable, as they pretrain on 32M parallel and 420M monolingual data. Likewise, Puduppully et al. (2019c) make use of a jointly trained multilingual model by combining RotoWire with German RotoWire.
We find that SeqPlan achieves the highest RG P among neural models and performs on par with Macro (it obtains higher BLEU but lower CS F and CO scores). The +Bin variant of Macro performs better on BLEU but worse on the other metrics. As in Table 2, w Uniform struggles across metrics, corroborating our hypothesis that latent sequential planning improves generation performance. The other two variants (w Oracle and 2-stage) are worse than SeqPlan in RG P and CS F, comparable in CO, and slightly higher in terms of BLEU.
On German RotoWire, our model is best across metrics, achieving an RG P of 91.8%, which is higher by 42% (absolute) compared to Macro. In fact, the RG P of SeqPlan is superior to that of Saleh et al. (2019), whose model is pretrained with additional data and is considered state of the art (Hayashi et al., 2019). RG # is lower mainly because of a bug in the German IE model that excludes number records. RG # for NCP+CC and Macro is too high because their summaries contain considerable repetition: the same record repeats at least once with NCP+CC and three times with Macro, whereas only 7% of the records are repeated with SeqPlan.
Table 4 evaluates the quality of the plans inferred by our model on the RotoWire dataset. As can be seen, SeqPlan is slightly worse than Macro in terms of CS F and CO. We believe this is because summaries in RotoWire are somewhat formulaic, with a plan similar to Templ: an opening statement is followed by a description of the top scoring players and a conclusion describing the next match. Such plans can be learned well by Macro without access to the summary. MLB texts show much more diversity in terms of length and the sequencing of entities and events. The learning problem there is also more challenging, as evidenced by the fact that the template system does not do very well in this domain (i.e., it is worse in BLEU, CS F, and CO compared to RotoWire). On German RotoWire, SeqPlan plans achieve higher CS F and CO than Macro.
Table 4: Evaluation of the macro planning stage (test set); content selection (CS) precision (P%), recall (R%), and F-measure (F%), and content ordering (CO) as complement of normalized Damerau-Levenshtein distance (DLD%).
| Dataset | System | CS P% | CS R% | CS F% | CO DLD% |
|---|---|---|---|---|---|
| MLB | Macro | 73.6 | 45.9 | 56.5 | 27.0 |
| MLB | SeqPlan | 74.4 | 51.1 | 60.6 | 27.1 |
| RW | Macro | 81.5 | 62.7 | 70.9 | 36.3 |
| RW | SeqPlan | 79.1 | 61.6 | 69.3 | 35.5 |
| DE-RW | Macro | 86.8 | 34.2 | 49.0 | 30.1 |
| DE-RW | SeqPlan | 73.1 | 60.8 | 66.4 | 31.0 |
Table 5 reports complementary automatic metrics on English RotoWire aiming to assess the factuality of generated output. We find that Templ has the fewest Number, Name, and double-double errors. This is expected, as it simply reproduces facts from the table. SeqPlan and Macro have similar Number errors, and both are significantly better than the other neural models. SeqPlan has significantly more Name errors than Macro, and significantly fewer than the other neural models. Inspection of Name errors revealed that these are mostly due to incorrect information about next games. Such information is not part of the input, and models are prone to hallucinate it. SeqPlan fares worse as it attempts to discuss next games for both teams, while Macro focuses on one team only. In terms of double-double errors, SeqPlan is comparable to Macro, ENT, and NCP+CC, and significantly better than WS-2017, ED+CC, and RBF-2020.
Table 5: Number, Name, and double-double (Word) errors per example. Systems significantly different from SeqPlan are marked with an asterisk * (using a one-way ANOVA with post hoc Tukey HSD tests; p ≤ 0.05).
| | Number | Name | double-double |
|---|---|---|---|
| Templ | 0.08* | 3.05* | 0.00* |
| WS-2017 | 13.01* | 9.66* | 0.36* |
| ED+CC | 8.11* | 8.29* | 0.31* |
| NCP+CC | 7.89* | 7.76* | 0.14 |
| ENT | 5.89* | 7.24* | 0.15 |
| RBF-2020 | 6.20* | 8.39* | 0.41* |
| Macro | 2.57 | 4.60* | 0.18 |
| SeqPlan | 2.70 | 6.56 | 0.20 |
5.2 Sample Efficiency
We also evaluated whether SeqPlan is more sample-efficient than Macro by examining how RG P varies with (training) data size. As shown in Figure 4, the difference between SeqPlan and Macro is more pronounced when relatively little data is available. For example, with 10% of the training data, RG P for SeqPlan is 85.7% on MLB and 92.1% on RotoWire, whereas Macro obtains 57.5% on MLB and 47.1% on RotoWire. As more training data becomes available, the difference in RG P decreases. The slope of the increase in RG P for Macro is steeper for RotoWire than for MLB. We hypothesize that this is because MLB has longer summaries with more paragraphs, making it more difficult for Macro to learn alignments between paragraph plans and text paragraphs in the game summary.
Figure 4: Sample efficiency on the (a) MLB and (b) RotoWire datasets. SeqPlan and Macro are trained on different portions (%) of the training set and performance is measured with RG P%.
5.3 Human Evaluation
We used the Amazon Mechanical Turk crowdsourcing platform for our judgment elicitation study. To ensure consistent ratings (van der Lee et al., 2019), we required that raters had completed at least 1,000 tasks and had at least a 98% approval rate. Participants were restricted to English-speaking countries (USA, UK, Canada, Australia, Ireland, or New Zealand) and were allowed to provide feedback or ask questions. Raters were paid an average of $0.35 per task, ensuring that the remuneration was higher than the hourly minimum wage in the United States. We compared SeqPlan with Gold, Templ, ED+CC, and Macro; we did not compare against ENT, as previous work (Puduppully and Lapata, 2021) has shown that it performs poorly compared to Macro. For RotoWire, we additionally compared against RBF-2020.
Supported and Contradicting Facts
Our first elicitation study provided raters with box scores (and play-by-plays in the case of MLB), along with sentences randomly extracted from game summaries. We asked them to count supported and contradicting facts (ignoring hallucinations). Participants were given a cheatsheet to help them understand box score and play-by-play statistics, as well as example sentences with the correct counts of supported and contradicting facts. This evaluation was conducted on 40 summaries (20 per dataset), with four sentences per summary, each rated by three participants. For MLB, this resulted in 300 tasks (5 systems × 20 summaries × 3 raters) and for RotoWire in 360 (6 systems × 20 summaries × 3 raters). Altogether, we had 177 participants. Inter-rater agreement, measured with Krippendorff’s α over supported and contradicting facts, was 0.43.
Table 6 (columns #Supp and #Contra) presents our results. Lower is better for contradicting facts. In the case of supported facts, the count should neither be too high nor too low: an overly high count of supported facts indicates poor content selection, while a low count of supported facts combined with a high count of contradicting facts indicates low generation accuracy.
Table 6: Average number of supported (#Supp) and contradicting (#Contra) facts in game summaries, and best-worst scaling evaluation for Grammaticality (Gram), Coherence (Coher), and Conciseness (Concis). Lower is better for contradicting facts; higher is better for Grammaticality, Coherence, and Conciseness. Systems significantly different from SeqPlan are marked with an asterisk * (using a one-way ANOVA with post hoc Tukey HSD tests; p ≤ 0.05).
| MLB | #Supp | #Contra | Gram | Coher | Concis |
|---|---|---|---|---|---|
| Gold | 3.59 | 0.14 | 21.67 | 29.17 | 14.17 |
| Templ | 4.21* | 0.04 | −58.33* | −48.33* | 9.17 |
| ED+CC | 3.42 | 0.72* | −32.50* | −18.33* | −48.33* |
| Macro | 3.76 | 0.25 | 37.50 | 15.00 | 22.50 |
| SeqPlan | 3.68 | 0.19 | 31.67 | 22.50 | 2.50 |

| RotoWire | #Supp | #Contra | Gram | Coher | Concis |
|---|---|---|---|---|---|
| Gold | 3.63* | 0.07 | 42.67* | 40.67 | 28.00 |
| Templ | 7.57* | 0.08 | −57.33* | −55.33* | −34.67* |
| ED+CC | 3.92 | 0.91* | 4.00 | −14.67* | −13.33 |
| RBF-2020 | 5.08 | 0.67* | 6.00 | 1.33 | −0.67 |
| Macro | 4.00 | 0.27 | 0.67 | 7.33 | 10.00 |
| SeqPlan | 4.84 | 0.17 | 4.00 | 20.67 | 10.67 |
Templ achieves the lowest count of contradicting facts and the highest count of supported facts on both datasets. This is no surprise, as it essentially regurgitates facts (i.e., records) from the table. On MLB, all systems display a comparable count of supported facts (differences are not statistically significant), with the exception of Templ, which contains significantly more. In terms of contradicting facts, SeqPlan performs on par with Macro, Gold, and Templ, and is significantly better than ED+CC. On RotoWire, in terms of supported facts, SeqPlan performs on par with the other neural models, is significantly higher than Gold, and significantly lower than Templ. In terms of contradicting facts, SeqPlan performs on par with Macro, Gold, and Templ, and significantly better than ED+CC and RBF-2020.
Coherence, Grammaticality, and Conciseness
In our second study, raters were asked to choose the better summary from a pair of summaries based on Coherence (Is the summary well structured and well organized, and does it have a natural ordering of the facts?), Conciseness (Does the summary avoid unnecessary repetition, including whole sentences, facts, or phrases?), and Grammaticality (Is the summary written in well-formed English?). For this study, we required that raters be able to comfortably comprehend summaries of NBA/MLB games. We obtained ratings using Best-Worst scaling (Louviere and Woodworth, 1991; Louviere et al., 2015), an elicitation paradigm shown to be more accurate than Likert scales. The score for a system is the percentage of times it was rated best minus the percentage of times it was rated worst (Orme, 2009). Scores range between −100 (absolutely worst) and +100 (absolutely best); higher is better. We assessed 40 summaries from the test set (20 per dataset). Each summary pair was rated by three participants. For MLB, we created 1,800 tasks (10 system pairs × 20 summaries × 3 raters × 3 dimensions) and 2,700 for RotoWire (15 system pairs × 20 summaries × 3 raters × 3 dimensions). Altogether, 377 raters participated in this task. Inter-rater agreement, measured with Krippendorff’s α, was 0.49.
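As a worked example of this scoring scheme (the counts below are invented for illustration):

```python
# Best-worst scaling score: percentage of comparisons in which a system
# was rated best minus the percentage in which it was rated worst,
# ranging from -100 to +100.
def best_worst_score(times_best, times_worst, total_judgments):
    return 100.0 * (times_best - times_worst) / total_judgments

print(best_worst_score(30, 8, 60))   # -> 36.67 (rated best half the time)
```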
On MLB, SeqPlan is significantly more coherent than ED+CC and Templ, and comparable to Gold and Macro. A similar picture emerges for grammaticality. SeqPlan is as concise as Gold, Macro, and Templ, and significantly better than ED+CC. On RotoWire, SeqPlan is significantly more coherent than Templ and ED+CC, but on par with Macro, RBF-2020, and Gold. In terms of conciseness, SeqPlan is comparable to Gold, Macro, RBF-2020, and ED+CC, and significantly better than Templ. In terms of grammaticality, SeqPlan is comparable to Macro, RBF-2020, and ED+CC, significantly better than Templ, and significantly worse than Gold.
6 Discussion
In this work, we proposed a novel sequential latent variable model for joint macro planning and generation. Key to our approach is the creation of a latent plan in a sequential manner, while interleaving the prediction of plans and the generation of corresponding paragraphs. We proposed to deconstruct monolithic long-document generation into smaller units (paragraphs, in our case), which affords flexibility and better communication between planning and generation. Taken together, the results of automatic and human evaluation suggest that SeqPlan performs best in terms of factuality and coherence, generates diverse and overall fluent summaries, and is less data-hungry than strong systems like Macro and NCP+CC. As SeqPlan does not have to learn alignments between the macro plan and the output text, it is better suited to long-form generation. Potential applications include summarizing books (Kryściński et al., 2021), where the output can be longer than 1,000 tokens, or generating financial reports (Kogan et al., 2009; Händschke et al., 2018), where the output exceeds 9,000 tokens. Existing approaches for long-form generation summarize individual paragraphs independently (Kryściński et al., 2021) or adopt a hierarchical approach (Wu et al., 2021), where summaries of paragraphs form the basis of chapter summaries, which are in turn composed into a book summary.
Table 7 gives an example of SeqPlan output. We see that the game summary follows the macro plan closely. In addition, the paragraph plans and the paragraphs exhibit coherent ordering. Manual inspection of SeqPlan summaries reveals that a major source of errors in MLB relates to attention diffusing over long paragraph plans. As an example, consider the following paragraph produced by SeqPlan: “Casey Kotchman had three hits and three RBIs, including a two-run double in the second inning that put the Angels up 2–0. Torii Hunter had three hits and drove in a run.” In reality, Torii Hunter had two hits, but the model incorrectly generates the hit count of Casey Kotchman. The corresponding paragraph plan is 360 tokens long, and attention fails to discern the important tokens. A more sophisticated encoder, for example, based on Transformers (Vaswani et al., 2017), could make attention more focused. On RotoWire, the majority of errors involve numbers (e.g., team attributes) and numerical comparisons. Incorporating pre-executed operations such as min and max (Nie et al., 2018) could help alleviate these errors.
Table 7: Predicted macro plan (top) and generated output (bottom) from our model. Transitions between paragraph plans are shown using →. Paragraphs are separated with <P> delimiters. Entities and events in the summary corresponding to the macro plan are boldfaced.
<V(Cardinals)> → <V(9-T)> → <V(Cardinals)> → <V(Cardinals)> → <V(Brewers)> → <V(9-T)> → <V(8-B)> → <V(8-T)> → <V(8-B)> → <V(9-B)> → <V(Brewers)> → <V(Adam Wainwright)> → <V(Brewers)> → <V(3-T)> → <V(3-B)> → <V(Carlos Villanueva)>

ST. LOUIS – The St. Louis Cardinals have been waiting for their starting rotation. <P> Skip Schumaker drove in the go-ahead run with a double in the ninth inning, and the Cardinals beat the Milwaukee Brewers 4–3 on Wednesday night to avoid a three-game sweep. <P> The Cardinals have won four of five, and have won four in a row. <P> The Cardinals have won four of five, including a three-game sweep by the Brewers. <P> Brian Barton led off the ninth with a pinch-hit double off Derrick Turnbow (0–1) and moved to third on Cesar Izturis’ sacrifice bunt. Schumaker drove in Barton with a double down the left-field line. <P> Ryan Braun, who had two hits, led off the eighth with a double off Ryan Franklin (1–1). Braun went to third on a wild pitch and scored on Corey Hart’s triple into the right-field corner. <P> Albert Pujols was intentionally walked to load the bases with one out in the eighth, and Guillermo Ankiel flied out. Troy Glaus walked to load the bases for Kennedy, who hit a sacrifice fly off Guillermo Mota. <P> Ryan Franklin (1–1) got the win despite giving up a run in the eighth. Ryan Braun led off with a double and scored on Corey Hart’s one-out triple. <P> Jason Isringhausen pitched a perfect ninth for his seventh save in nine chances. He has converted his last six save opportunities and has n’t allowed a run in his last three appearances. <P> The Brewers lost for the seventh time in eight games. <P> Wainwright allowed two runs and four hits in seven innings. He walked four and struck out six. <P> Brewers manager Ron Roenicke was ejected by home plate umpire Bill Miller for arguing a called third strike. <P> The Cardinals took a 2–0 lead in the third. Albert Pujols walked with two outs and Rick Ankiel walked. Glaus then lined a two-run double into the left-field corner. <P> The Brewers tied it in the third. Jason Kendall led off with a double and scored on Rickie Weeks’ double. Ryan Braun’s RBI single tied it at 2. <P> Villanueva allowed two runs and three hits in seven innings. He walked four and struck out one.
Finally, it is worth mentioning that although the template models achieve the highest RG precision on both MLB and RotoWire (Tables 2 and 3), this is mainly because they repeat facts from the table. Template models score low on the CS F, CO, and BLEU metrics. In addition, they obtain the lowest scores for Grammaticality and Coherence (Table 6), which indicates that they are poor at selecting records from the table and ordering them fluently.
Acknowledgments
We thank the Action Editor, Ehud Reiter, and the anonymous reviewers for their constructive feedback. We also thank Parag Jain for helpful discussions. We acknowledge the financial support of the European Research Council (award number 681760, “Translating Multiple Modalities into Text”).
Notes
2. In our notation, neural network layers are described by math functions.

3. A double-double occurs when a player scores 10 points or more in two of the following record types: points, rebounds, assists, steals, and blocked shots.

4. To compute the accuracy of macro plans, entities and events from the model’s plan need to be compared against entities and events in the oracle macro plan. Puduppully and Lapata (2021) obtained the entities and events for the oracle macro plan by extracting them from reference summaries. We noted that this includes coreferent or repeated mentions of entities and events within a paragraph. We instead extract entities and events directly from the oracle macro plan.

5. We thank Hiroaki Hayashi for providing us with the output of the NCP+CC system.