Large language models (LLMs) have shown promise for automatic summarization, but the reasons behind their success are poorly understood. By conducting a human evaluation of ten LLMs across different pretraining methods, prompts, and model scales, we make two important observations. First, we find that instruction tuning, not model size, is the key to LLMs’ zero-shot summarization capability. Second, existing studies have been limited by low-quality references, leading to underestimates of human performance and lower few-shot and finetuning performance. To better evaluate LLMs, we perform human evaluation over high-quality summaries we collect from freelance writers. Despite major stylistic differences, such as the amount of paraphrasing, we find that LLM summaries are judged to be on par with human-written summaries.

Large language models (LLMs) have shown promising results in zero-/few-shot tasks across a wide range of domains (Chowdhery et al., 2022; Bai et al., 2022; Brown et al., 2020; Zhang et al., 2022) and have raised significant interest for their potential in automatic summarization (Goyal et al., 2022; Liu et al., 2022a). However, the design decisions contributing to their success in summarization remain poorly understood, and while prior work has shown that LLMs outperform the previous state of the art, it remains unclear whether their outputs are comparable to those of human writers. Examining these questions is crucial for advancing future research in automatic summarization.

To answer the first question, we perform a systematic evaluation of ten diverse LLMs with human evaluation on news summarization; our evaluation identifies instruction tuning as the key to zero-shot summarization capability. In contrast, self-supervised learning alone cannot induce strong summarization performance in the zero-shot setting (Figure 1). In fact, even a 350M-parameter instruction-tuned GPT-3 model can perform on par with the 175B-parameter GPT-3.

Figure 1: Selected annotator ratings of summary coherence on a 1 to 5 Likert scale.

To benchmark LLMs, we evaluate on the standard CNN/DM (Hermann et al., 2015) and XSUM (Narayan et al., 2018) datasets but find that the existing reference summaries cause several issues. The reference summaries in these benchmarks were originally created in a different use context and, when evaluated as part of a generic news summarization benchmark, are judged by human annotators to be worse than the outputs of most automatic systems (Figure 1). When computing automatic metrics using these references, their poor quality reduces the correlation between metric results and human judgment. Not only does this make evaluation difficult, it also degrades systems that take supervision from these references, either through finetuning or few-shot prompting, and thereby complicates comparisons.

To address the quality issues of reference summaries and better understand how LLMs compare to human summary writers, we recruit freelance writers from Upwork1 to re-annotate 100 articles from the test sets of CNN/DM and XSUM. Comparing the best-performing LLM, Instruct Davinci, to the freelance writers, we find that the Instruct Davinci summaries are much more extractive. By manually annotating the summarization operations (Jing and McKeown, 2000) used in these summaries, we find that Instruct Davinci paraphrases much less frequently, although it is able to combine copied segments coherently.

Given their stylistic differences, we recruit annotators to compare the Instruct Davinci summaries to those written by freelance writers. On aggregate, we find that Instruct Davinci is rated as comparable to the freelance writers. Examination of the annotations from each individual rater shows that every rater has their own consistent preference for either Instruct Davinci or the freelance writers.

Together, our work makes the following key contributions. First, we identify instruction tuning, rather than model scale, as the key to LLMs’ summarization capability. Second, we show that the reference summaries used in XSUM, which are simply the first sentence of the news article, are judged by humans to be worse than the best LLM-generated summaries. Third, to address these issues with references, we collect better-quality summaries from freelance writers and show that the best LLM is rated as comparable to the Upwork freelance writers. In combination, these results call into question recent claims made about LLM summarization. In particular, summarization progress cannot be measured using reference-based metrics applied to XSUM. Furthermore, whether finetuned, few-shot, or zero-shot models perform best remains an open question due to the poor quality of the training data. To encourage future work on improved evaluations, we release the high-quality summaries written by freelance writers and the evaluation data covering 18 model settings and two datasets as resources.2

2.1 News Summarization

News summarization is the task of producing a concise paragraph that captures the main points of a news article and has been a core problem within the field of automatic summarization (Radev et al., 2002; Nenkova and McKeown, 2011). Early work focused mostly on extractive approaches, using unsupervised data-driven methods that relied on different variants of word frequency to determine salience (e.g., Salton et al., 1997; Hovy and Lin, 1999; Lin and Hovy, 2002; Mani and Bloedorn, 1999; Conroy et al., 2006; Nenkova et al., 2006). Other approaches to extractive summarization relied on aspects of discourse semantics (e.g., lexical chains and rhetorical structure theory) (Barzilay and Elhadad, 1997; Marcu, 1997; Silber and McCoy, 2002; Steinberger et al., 2007), or graph-based methods (e.g., Radev et al., 2000; Mihalcea and Tarau, 2005; Erkan and Radev, 2004). These extractive approaches were developed both for single-document and multi-document news summarization, with far more work focusing on multi-document than the single-document task. Humans, however, rely on more abstractive operations (such as paraphrasing, generalizations, etc.) in order to write fluent summaries (Jing and McKeown, 1999). This has led to a push toward building abstractive summarization systems, with initial research focusing on designing post-processing algorithms for extractive summarizers that focused on specific operations such as sentence fusion (Barzilay and McKeown, 2005; Marsi and Krahmer, 2005; Krahmer et al., 2008; Filippova and Strube, 2008; Thadani and McKeown, 2013), generation (Barzilay et al., 1999) and sentence compression (Jing, 2000; Knight and Marcu, 2002; McDonald, 2006; Cohn and Lapata, 2008). More scalable, data-driven approaches for building abstractive summarization systems were made possible with more effective neural systems for conditional generation (Sutskever et al., 2014; Bahdanau et al., 2015) as well as large-scale datasets (Rush et al., 2015; Hermann et al., 2015), leading to steady progress over the years (See et al., 2017; Chen and Bansal, 2018; Dong et al., 2019; Liu and Lapata, 2019; Lewis et al., 2019; Zhang et al., 2020).

This work benchmarks LLMs on news summarization using two popular benchmarks, CNN/DM (Hermann et al., 2015) and XSUM (Narayan et al., 2018). These datasets contain hundreds of thousands of article-summary pairs but were created using “incidental supervision”, i.e., the reference summaries were not written specifically for the task but adapted from content on the websites. CNN/DM includes articles from the CNN and DailyMail websites as the source articles and adapts the bullet point highlights that accompany the articles as reference summaries. XSUM includes articles from BBC News and adapts the bolded introductory sentence(s) as reference summaries. As a result, the reference summaries in these datasets are known to have quality issues (Maynez et al., 2020; Kang and Hashimoto, 2020), motivating us to address these defects to improve LLM evaluation.

To contextualize the performance of LLMs, we mainly compare to previous state-of-the-art approaches that leveraged supervised finetuning (Liu and Lapata, 2019; Lewis et al., 2019; Zhang et al., 2020; Liu et al., 2022b). Summarization evaluation is another active area of research. Many automatic metrics have been proposed (Lin, 2004; Zhang et al., 2020; Sellam et al., 2020; Durmus et al., 2020; Maynez et al., 2020; Deutsch and Roth, 2021) but they do not always correlate with human evaluation of summarization systems (Fabbri et al., 2020; Durmus et al., 2022). In this work, we evaluate the effectiveness of automatic metrics for evaluating LLMs and show that the usefulness of reference-based evaluation is closely linked to the quality of the references.

2.2 Large Language Models

LLMs (Bommasani et al., 2021; Chowdhery et al., 2022; Brown et al., 2020) have two distinctive features compared to previous pretrained models. First, LLMs have a much larger scale in terms of model parameters and training data. Second, unlike previous pretrained models that require finetuning, LLMs can be prompted zero-shot or few-shot to solve a task. In the zero-shot setting, prompting presents the LLM with an input (e.g., a news article) and a natural language instruction (e.g., “summarize this news article in three sentences”) and solicits an output by having the LLM generate the answer directly. When few-shot training examples are available, LLMs can learn “in context”: in-context learning prepends training input-output pairs, written in the same instruction format, to the test input.
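To make the two prompting regimes concrete, the sketch below assembles a zero-shot prompt and a few-shot in-context prompt for summarization; the template wording and function names are illustrative assumptions rather than the exact prompts used in later sections.

```python
def zero_shot_prompt(article: str, instruction: str) -> str:
    # Zero-shot prompting: the input article plus a natural language instruction.
    return f"Article: {article}\n{instruction}\nSummary:"


def few_shot_prompt(demos: list[tuple[str, str]], article: str, instruction: str) -> str:
    # In-context learning: prepend training input-output pairs, written in the
    # same instruction format, before the test article.
    blocks = [f"Article: {demo_article}\n{instruction}\nSummary: {demo_summary}"
              for demo_article, demo_summary in demos]
    blocks.append(f"Article: {article}\n{instruction}\nSummary:")
    return "\n\n".join(blocks)
```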

Recently, instruction-tuning has emerged as an effective way to improve LLM prompting performance (Sanh et al., 2021; Wang et al., 2022; Ouyang et al., 2022). In this approach, a diverse set of natural language processing tasks are reformulated into the prompting format and the LLM’s parameters are updated for these tasks either through supervised finetuning or reinforcement learning.

Recent work (Goyal and Durrett, 2020) shows that the instruction-tuned GPT-3 Davinci model is better than finetuned LMs, but does not identify the design decisions that contribute to the improved performance. In our work, we carry out a more comprehensive benchmark of ten different LLMs to understand the effects of model scale, in-context learning, and instruction tuning. Given that automatic metrics may not be reliable, we focus on human evaluation as our benchmarking method.

In this section, we use human evaluation to systematically benchmark a diverse set of ten LLMs on news summarization. We observe that instruction tuning is the key to strong summarization capability and reference summaries in current benchmarks may underestimate few-shot or finetuning performance.

3.1 Experimental Setup

Data

We conduct our human evaluation on CNN/DM and XSUM by sampling a hundred examples from each validation set. For the few-shot in-context learning setting, we sample five examples from the training set to serve as demonstration examples. Due to the limited context window, we sample five articles that are between 50 and 150 tokens long according to the GPT-2 tokenizer. For XSUM, we find that uniform sampling occasionally yields articles that are unreadable due to data preprocessing, so we manually pick the demonstration examples from the training set.
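As a concrete illustration of the length filter described above, the following sketch keeps only candidate demonstration articles whose GPT-2 token count falls between 50 and 150; the Hugging Face tokenizer and the field names are assumptions of this sketch, not a description of our exact pipeline.

```python
import random

from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")


def within_token_budget(article: str, lo: int = 50, hi: int = 150) -> bool:
    # Count GPT-2 tokens in the article and check the 50-150 token window.
    return lo <= len(tokenizer(article)["input_ids"]) <= hi


def sample_demonstrations(training_set: list[dict], k: int = 5) -> list[dict]:
    # Sample five short articles (with their reference summaries) as demonstrations.
    candidates = [ex for ex in training_set if within_token_budget(ex["article"])]
    return random.sample(candidates, k)
```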

Model Details

We consider ten LLMs across different pretraining strategies and model scales.3 Table 1 lists the details of the LLMs we consider. Due to limited computational resources and model access, we benchmark all models in the five-shot setting but only benchmark three OpenAI GPT-3 models and three OpenAI instruction-tuned GPT-3 models in the zero-shot setting.

Table 1: 

List of large language models we benchmarked with human evaluation.

Model | Model Creator | # Parameters | Instruction Tuning | Reference
GPT-3 davinci v1 | OpenAI | 175B | ✗ | Brown et al. (2020)
GPT-3 curie v1 | OpenAI | 6.7B | ✗ | Brown et al. (2020)
GPT-3 ada v1 | OpenAI | 350M | ✗ | Brown et al. (2020)
InstructGPT davinci v2 | OpenAI | 175B | ✓ | Ouyang et al. (2022)
InstructGPT curie v1 | OpenAI | 6.7B | ✓ | Ouyang et al. (2022)
InstructGPT ada v1 | OpenAI | 350M | ✓ | Ouyang et al. (2022)
OPT 175B | Meta | 175B | ✗ | Zhang et al. (2022)
GLM | Tsinghua University | 130B | ✗ | Du et al. (2021)
Cohere xlarge v20220609 | Cohere | 52.4B | ✗ | Cohere (2022)
Anthropic-LM v4-s3 | Anthropic | 52B | ✓ | Bai et al. (2022)

For CNN/DM, we solicit LLM summaries with the following prompt template: “Article: [article]. Summarize the article in three sentences. Summary:”.

For XSUM, we modify the prompt template to summarize in one sentence to match the style of the reference summaries. For all LLMs we consider, we sample with temperature 0.3 following prior work (Wu et al., 2021).
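A minimal sketch of how these zero-shot prompts might be sent to the API is shown below; it assumes the legacy (pre-1.0) openai Python client and the text-davinci-002 endpoint for Instruct Davinci, and the max_tokens budget is an assumption not specified in the text.

```python
import openai  # legacy (<1.0) openai-python client assumed in this sketch

CNNDM_TEMPLATE = "Article: {article}. Summarize the article in three sentences. Summary:"
XSUM_TEMPLATE = "Article: {article}. Summarize the article in one sentence. Summary:"


def summarize(article: str, template: str = CNNDM_TEMPLATE,
              model: str = "text-davinci-002") -> str:
    response = openai.Completion.create(
        model=model,
        prompt=template.format(article=article),
        temperature=0.3,  # sampling temperature used throughout our experiments
        max_tokens=150,   # assumed generation budget
    )
    return response["choices"][0]["text"].strip()
```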

To contextualize our LLM benchmarking results, we also evaluate two state-of-the-art finetuned LMs: Pegasus (Zhang et al., 2020) and BRIO (Liu et al., 2022b). We decode the finetuned LMs using a beam size of 5 following prior work (Lewis et al., 2019). In addition, we also evaluate the existing reference summaries in the CNN/DM and XSUM validation sets.
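For the finetuned baselines, decoding with a beam size of 5 can be sketched as follows; the public google/pegasus-cnn_dailymail checkpoint from Hugging Face is an assumption of this sketch rather than necessarily the exact checkpoint we used.

```python
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

CHECKPOINT = "google/pegasus-cnn_dailymail"  # assumed public checkpoint
tokenizer = PegasusTokenizer.from_pretrained(CHECKPOINT)
model = PegasusForConditionalGeneration.from_pretrained(CHECKPOINT)


def pegasus_summarize(article: str) -> str:
    inputs = tokenizer(article, truncation=True, return_tensors="pt")
    output_ids = model.generate(**inputs, num_beams=5)  # beam size of 5
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```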

Human Evaluation Protocol

We recruit annotators from Amazon Mechanical Turk, compensating them at the California minimum wage of $15.00/hr using conservative time estimates, as recommended by Whiting et al. (2019). We recruited a total of 30 annotators from the US who have a lifetime HIT approval rate of 98% or above and at least 10,000 approved HITs (Figure 8).4 Summaries are presented in random order and are evaluated independently by three annotators. We report the average score for each summary based on the ratings from all three annotators.

Our annotators evaluate each summary on three criteria: faithfulness, coherence, and relevance. We define these terms and collect data according to the guidelines of Fabbri et al. (2020). Coherence and relevance ratings are collected on a 1 to 5 Likert scale, while faithfulness is collected as a binary value because it is inherently binary. Unlike Fabbri et al. (2020), we do not evaluate fluency because we find LLM outputs to be mostly fluent. The average pairwise agreement for the annotators in our pool was 75% for faithfulness, 81% for coherence, and 86% for relevance.5 The full annotation guidelines are included in our code release.
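The agreement numbers above can be reproduced from raw annotations with a simple average-pairwise-agreement computation; the sketch below assumes three ratings per summary and the Likert binarization described in footnote 5, and the exact aggregation is an assumption of this sketch.

```python
from itertools import combinations


def binarize(likert: int) -> int:
    # Footnote 5: Likert scores of 3 or above map to 1, others to 0.
    return 1 if likert >= 3 else 0


def avg_pairwise_agreement(ratings_per_summary: list[list[int]], likert: bool) -> float:
    # Average, over summaries, of the fraction of annotator pairs that agree.
    per_summary = []
    for ratings in ratings_per_summary:
        values = [binarize(r) for r in ratings] if likert else list(ratings)
        pairs = list(combinations(values, 2))
        per_summary.append(sum(a == b for a, b in pairs) / len(pairs))
    return sum(per_summary) / len(per_summary)
```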

3.2 Evaluation Results

Table 2 presents the evaluation results.6 We now discuss two main observations.

Table 2: 

Human evaluation results for zero-shot and five-shot LLMs, finetuned LMs, and reference summaries. We bold the entries that are not statistically significantly different from the best numbers in each column at p = 0.05, using a bootstrap-based paired mean difference test.

Setting | Model | CNN/Daily Mail (Faithfulness / Coherence / Relevance) | XSUM (Faithfulness / Coherence / Relevance)
Zero-shot language models | GPT-3 (350M) | 0.29 / 1.92 / 1.84 | 0.26 / 2.03 / 1.90
Zero-shot language models | GPT-3 (6.7B) | 0.29 / 1.77 / 1.93 | 0.77 / 3.16 / 3.39
Zero-shot language models | GPT-3 (175B) | 0.76 / 2.65 / 3.50 | 0.80 / 2.78 / 3.52
Zero-shot language models | Ada Instruct v1 (350M*) | 0.88 / 4.02 / 4.26 | 0.81 / 3.90 / 3.87
Zero-shot language models | Curie Instruct v1 (6.7B*) | 0.97 / 4.24 / 4.59 | 0.96 / 4.27 / 4.34
Zero-shot language models | Davinci Instruct v2 (175B*) | 0.99 / 4.15 / 4.60 | 0.97 / 4.41 / 4.28
Five-shot language models | Anthropic-LM (52B) | 0.94 / 3.88 / 4.33 | 0.70 / 4.77 / 4.14
Five-shot language models | Cohere XL (52.4B) | 0.99 / 3.42 / 4.48 | 0.63 / 4.79 / 4.00
Five-shot language models | GLM (130B) | 0.94 / 3.69 / 4.24 | 0.74 / 4.72 / 4.12
Five-shot language models | OPT (175B) | 0.96 / 3.64 / 4.33 | 0.67 / 4.80 / 4.01
Five-shot language models | GPT-3 (350M) | 0.86 / 3.73 / 3.85 | – / – / –
Five-shot language models | GPT-3 (6.7B) | 0.97 / 3.87 / 4.17 | 0.75 / 4.19 / 3.36
Five-shot language models | GPT-3 (175B) | 0.99 / 3.95 / 4.34 | 0.69 / 4.69 / 4.03
Five-shot language models | Ada Instruct v1 (350M*) | 0.84 / 3.84 / 4.07 | 0.63 / 3.54 / 3.07
Five-shot language models | Curie Instruct v1 (6.7B*) | 0.96 / 4.30 / 4.43 | 0.85 / 4.28 / 3.80
Five-shot language models | Davinci Instruct v2 (175B*) | 0.98 / 4.13 / 4.49 | 0.77 / 4.83 / 4.33
Fine-tuned language models | Brio | 0.94 / 3.94 / 4.40 | 0.58 / 4.68 / 3.89
Fine-tuned language models | Pegasus | 0.97 / 3.93 / 4.38 | 0.57 / 4.73 / 3.85
Existing references | – | 0.84 / 3.20 / 3.94 | 0.37 / 4.13 / 3.00
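The bolding described in the caption of Table 2 relies on a bootstrap-based paired mean difference test at p = 0.05. One common way to implement such a test is sketched below; the resampling details (number of resamples, two-sided null) are assumptions of this sketch rather than the exact procedure we ran.

```python
import numpy as np


def bootstrap_paired_pvalue(scores_a, scores_b, n_boot: int = 10_000, seed: int = 0) -> float:
    # Two-sided bootstrap test for the mean difference between two systems
    # rated on the same summaries (paired by article).
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    observed = diffs.mean()
    boot_means = np.array([rng.choice(diffs, size=len(diffs), replace=True).mean()
                           for _ in range(n_boot)])
    # Center the bootstrap distribution to simulate the null of no mean difference.
    return float(np.mean(np.abs(boot_means - observed) >= abs(observed)))
```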

Instruction Tuned Models Have Strong Summarization Ability.

Across the two datasets and three aspects, we find that the zero-shot instruction-tuned GPT-3 models, especially Instruct Curie and Davinci, perform the best overall. Compared to the fine-tuned LMs (e.g., Pegasus), Instruct Davinci achieves higher coherence and relevance scores (4.15 vs. 3.93 and 4.60 vs. 4.40) on CNN/DM and higher faithfulness and relevance scores (0.97 vs. 0.57 and 4.28 vs. 3.85) on XSUM, which is consistent with recent work (Goyal et al., 2022). In contrast to instruction tuning, we find scale to be less important. Even the largest 175B GPT-3 model often ignores the instruction and generates irrelevant content, while the much smaller Instruct Ada outperforms the 175B GPT-3 model on coherence and relevance.

In the five-shot setting, non-instruction-tuned LLMs can improve their summarization performance through in-context learning. For faithfulness scores on CNN/DM and coherence scores on XSUM, several non-instruction-tuned LLMs can perform as well as the instruction-tuned LLMs. However, for other aspects, we still find the instruction-tuned LLMs to be better.

Reference Summaries in Current Benchmarks Should Not Be Used for Training and Evaluating Generic News Summarization Systems.

We arrive at this conclusion based on two observations. First, most automatic summarization systems score better than the reference summaries across all three aspects. Second, applying in-context learning with the current reference summaries makes instruction-tuned models generate worse summaries. For example, on the XSUM dataset, after conditioning on five reference summaries, the faithfulness score of Instruct Davinci drops from 0.97 to 0.77.

The reference summaries make it difficult to compare LLMs to both finetuned models and humans. When comparing to finetuned models, the relatively poor performance of finetuned models can be attributed to the low quality of references in the training data. This suggests that we could be underestimating the potential performance of finetuning approaches. When comparing to humans, the existing low-quality references are not representative of actual human performance since they were created through heuristics. As a result, the differences between instruction-tuned LLMs and human performance are likely overstated in Table 2.

Table 3: 

System-level Kendall’s tau correlations with human scores across different aspects.

Metric | CNN/DailyMail (Faithfulness / Coherence / Relevance) | XSUM (Faithfulness / Coherence / Relevance)
Rouge-L | 0.54 / 0.48 / 0.72 | −0.27 / 0.71 / 0.30
METEOR | 0.58 / 0.37 / 0.66 | −0.22 / 0.68 / 0.38
BertScore | 0.54 / 0.47 / 0.70 | −0.23 / 0.70 / 0.30
BARTScore | 0.56 / 0.34 / 0.65 | −0.22 / 0.70 / 0.35
BLEURT | 0.56 / 0.62 / 0.81 | −0.08 / 0.67 / 0.41

SummaC | 0.54 / 0.11 / 0.26 | 0.26 / −0.41 / −0.29
QAFactEval | 0.64 / 0.16 / 0.35 | 0.55 / 0.16 / 0.37
BLANC | 0.54 / 0.31 / 0.50 | 0.50 / 0.10 / 0.32

Qualitative Examples.

Figure 2 showcases example summaries on an article from the CNN/DM validation set, comparing the summaries of zero-shot GPT-3 Davinci, instruction-tuned GPT-3 Davinci, and the CNN/DM reference summary.

Figure 2: Example summaries generated by GPT-3 models (Section 3) or written by freelance writers (Section 4) for an article from the CNN/DM dataset. We find that the instruction-tuned GPT-3 model generates a much better summary than the non-instruction-tuned variant. The reference summary from CNN/DM is not coherent, whereas the freelance writer summary is both coherent and relevant.

We start by noting that the zero-shot GPT-3 model cannot follow the instructions to summarize well. After the summary paragraph, the model generates an additional question that is completely irrelevant. In addition to the failure to follow instructions, the generated summary contains a factual error, stating that the handbag mentioned is the most expensive in the world, which contradicts the original article. In contrast, the instruction-tuned GPT-3 model generates a summary that is both faithful and coherent.

We also observe from Figure 2 that the reference summary is not coherent. The brand “Hermes” is not introduced until the end, and its connection to the rest of the story is unclear. This is unsurprising because the reference summaries in the CNN/DM dataset were originally bullet points accompanying the articles rather than coherent paragraphs. While such reference summaries might be suitable in their original context, we argue that they are not useful for evaluating generic news summarization.

3.3 Understanding Automatic Metrics

We compute system-level correlations against human ratings for eight popular automated evaluation metrics. For reference-based metrics, we consider Rouge-L (Lin, 2004), METEOR (Banerjee and Lavie, 2005), BertScore (Zhang et al., 2020), BLEURT (Sellam et al., 2020), and BARTScore (Yuan et al., 2021). For reference-free metrics, we consider SummaC (Laban et al., 2021), QAFactEval (Fabbri et al., 2022), and BLANC (Vasilyev et al., 2020).

Table 3 shows Kendall’s tau rank correlations between automated metrics and human judgments. We observe significantly different trends on CNN/DM and XSUM, so we discuss them separately in the following paragraphs.
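A system-level correlation here means one pair of numbers per system: the metric score and the mean human score, each averaged over the evaluated articles. A minimal sketch using SciPy is shown below; the example numbers in the comment are purely illustrative.

```python
from scipy.stats import kendalltau


def system_level_kendall(metric_scores: list[float], human_scores: list[float]) -> float:
    # One entry per system: metric score and human score averaged over articles.
    tau, _p_value = kendalltau(metric_scores, human_scores)
    return tau


# Illustrative numbers only: four systems scored by a metric and by annotators.
# system_level_kendall([0.21, 0.25, 0.27, 0.30], [3.9, 4.1, 4.3, 4.6])
```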

For CNN/DM, we observe that the reference-based automatic metrics have a moderate correlation with some aspects of human judgment, e.g., Rouge-L has a 0.72 Kendall’s tau correlation with relevance in Table 3. This level of correlation is comparable to that reported by Fabbri et al. (2020), who measure the correlation of automatic metrics when evaluating finetuned LMs and even earlier neural summarization systems. Therefore, we conclude that on CNN/DM, automatic reference-based metrics can still provide a useful signal for relevance.

Studying the results more closely, we find that Rouge-L and human evaluation are more correlated when comparing within each model group. We plot Rouge-L against the relevance ratings in Figure 3 as an example. First, we observe that Rouge-L still prefers finetuned LMs (green points at the top of the plots) to LLMs, consistent with prior work (Goyal et al., 2022). Despite this mistake, when comparing LLMs only with each other, we find that a Rouge-L difference larger than 0.05 usually translates to improved human evaluation.

Figure 3: System-level Rouge-L vs. annotator-rated relevance scores.

On XSUM, reference-based metrics have a very low correlation with faithfulness and relevance since the reference summaries themselves are terrible in these aspects (Table 3; also see Maynez et al., 2020). With such low-quality references, we do not expect reference-based metrics to extract useful information.

In general, across both datasets, we find that reference-based metrics correlate better with human judgments on the aspects for which reference summaries also have better scores (e.g., CNN/DM relevance, XSUM coherence). This points to the important role of quality reference summaries for reference-based metrics, as previously observed in machine translation (Freitag et al., 2020). Reference-free metrics are less handicapped by the low-quality references but they are mostly geared towards measuring faithfulness. Even BLANC, which is designed to measure overall summary quality, correlates best with faithfulness and much worse for relevance and coherence.

In Section 3, we see that the low-quality reference summaries make studying and benchmarking LLMs difficult. In this section, we address this by recruiting Upwork freelance writers to collect higher-quality summaries. With this data, we aim to answer two important questions. First, we would like to know whether the best LLM has reached human-level performance and how the summaries written by the best LLM differ from the ones written by humans. Second, we aim to examine the correlation between reference-based metrics and human judgments when the metrics are calculated using our higher-quality reference summaries.

4.1 Experimental Setup

In this section, we describe the recruitment process and instructions for the summary writing task.

Data.

For data used in our study, we select 50 articles from each of the CNN/DM and XSUM evaluation sets described in Section 3.1 and assign each article to three writers. For XSUM, we use the full articles rather than the preprocessed version where the first bolded sentence is removed.

Writer Recruitment.

We recruit six writers from the freelance work platform Upwork who have previous experience writing blog posts, landing page introductions, or product descriptions. After conducting a qualification round in which writers summarized five articles, we selected the best writers according to the faithfulness, coherence, and relevance of their summaries. Through an initial pilot study, we estimate that the time required to summarize a CNN/DM or XSUM article is around 12 to 15 minutes. Therefore, we pay our writers $4 for every article they summarize, following recommended practice (Whiting et al., 2019). We based the assignments on the writers’ availability, with the most prolific writer summarizing 100 articles and the least prolific writer summarizing 35 articles. We include our annotation guideline for freelance writers in Figure 7.

Summary Writing Instructions.

In the annotation instructions, we ask our writers to summarize each article in around 50 words.7 To give better task grounding, we ask the writers to summarize as if they were writing a newsletter to update their readers on the news. We release the full annotation guideline along with our code release.

LLM Summary Generation.

Recently, Liu et al. (2022a) showed that length is a confounding factor in the human evaluation of summarization. To control this potential length confound, we modify the zero-shot prompt in Section 3.1 to elicit summaries that are around 50 words, which is the same word limit provided to the freelance writers. We found that the Instruct Davinci model consistently produces summaries that exceed a given word limit. Therefore, we intentionally prompt the Instruct Davinci model with a 25-word limit to produce summaries with an average length of 50 words. With this new prompt, we generate the summaries using the same hyperparameters described in Section 3.1.
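A sketch of this length-controlled prompt, together with the word-count check used to verify that generations land near the 50-word target, is given below; the exact template wording is an assumption.

```python
LENGTH_CONTROLLED_TEMPLATE = "Article: {article}. Summarize the article in 25 words. Summary:"


def mean_word_count(summaries: list[str]) -> float:
    # Average whitespace-delimited word count, used to check the ~50-word target.
    return sum(len(s.split()) for s in summaries) / len(summaries)
```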

Quality Control.

To verify the quality of the summaries written by the freelance writers, we have Mechanical Turk annotators evaluate a random subset of 100 summaries using the same annotation scheme as in Section 3.1. Table 4 reports the evaluation results, which show that the freelance writer summaries have much higher quality than the original reference summaries in CNN/DM and XSUM. In addition, we see that the difference between the freelance writers and Instruct Davinci in this evaluation is small. Next, we carry out more targeted evaluations to compare the summaries written by freelance writers and Instruct Davinci.

Table 4: 

Amazon Mechanical Turker evaluation results of the freelance writer summaries. Results of zero-shot Instruct Davinci and reference summaries are taken from Table 2 after averaging the corresponding ratings.

Model | Faithfulness | Coherence | Relevance
Freelance Writer | 0.93 | 4.39 | 4.26
Zero-shot Instruct Davinci | 0.98 | 4.26 | 4.40
Reference Summaries | 0.64 | 3.59 | 3.45

4.2 Paired Comparison between LLM and Freelance Writers

Comparing Stylistic Differences.

Despite the similar performance in our quality control study, we find that the LLM summaries and the freelance writer summaries have distinctive styles. Figure 2 shows an example summary written by a freelance writer. Compared to the LLM-generated summary, the freelance writer summary contains more paraphrasing and copies less from the article.

To illustrate this stylistic difference, we compute two measures of extractiveness, coverage and density, following Grusky et al. (2018). Coverage is defined as the percentage of words in the summary that are also present in the article; density is defined as the average length of the continuous text spans in the summary that are copied from the article. Our analysis shows that the coverage and density of the Instruct Davinci summaries are 0.92 and 12.1, whereas those of the writers’ summaries are 0.81 and 2.07. These measures show that the summaries generated by Instruct Davinci are highly extractive, whereas the summaries written by the freelance writers are much more abstractive.
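A minimal sketch of these two measures, following the definitions as stated above (a simplification of the full Grusky et al. (2018) formulation), greedily matches maximal copied spans; whitespace tokenization and lowercasing are assumptions of this sketch.

```python
def extractive_fragments(article_tokens: list[str], summary_tokens: list[str]) -> list[int]:
    # Greedily find, at each summary position, the longest span that also
    # appears verbatim in the article; return the lengths of the matched spans.
    fragments, i = [], 0
    while i < len(summary_tokens):
        best = 0
        for j in range(len(article_tokens)):
            k = 0
            while (i + k < len(summary_tokens) and j + k < len(article_tokens)
                   and summary_tokens[i + k] == article_tokens[j + k]):
                k += 1
            best = max(best, k)
        if best > 0:
            fragments.append(best)
            i += best
        else:
            i += 1
    return fragments


def coverage_and_density(article: str, summary: str) -> tuple[float, float]:
    art, summ = article.lower().split(), summary.lower().split()
    frags = extractive_fragments(art, summ)
    coverage = sum(frags) / len(summ)                    # fraction of summary words copied
    density = sum(frags) / len(frags) if frags else 0.0  # average copied-span length
    return coverage, density
```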

To have a fine-grained understanding of these stylistic differences, we manually analyze the distribution of “cut-and-paste operations” in these two sets of summaries. Jing and McKeown (2000) identify a set of “cut and paste” operations for reusing text from the article, including sentence reduction, sentence combination, syntactic transformation, lexical paraphrasing, and generalization or specification. On top of these operations, we additionally include a sentence copy operation to account for summary sentences that are directly copied from the article. Using this guideline, we manually annotate ten randomly sampled summary pairs written by Instruct Davinci and the freelance writers.

Figure 4 reports the distribution of the cut-and-paste operations, showing the fraction of sentences that contain each operation. First, we observe that the freelance writer summaries use lexical paraphrasing and generalization/specification much more frequently than the Instruct Davinci generated summaries. Because both operations often involve using novel words that are not present in the article, this matches with the fact that the freelance writer summaries have lower coverage (0.81 vs. 0.92) than the Instruct Davinci summaries. Second, we find that sentence combination is a common strategy used by both the freelance writers and Instruct Davinci. Third, we find that the freelance writers never copy an entire sentence directly from the article but Instruct Davinci does this more frequently.

Figure 4: Distributions of cut-and-paste operations in the summaries written by freelance writers and by Instruct Davinci. By comparison, human-written summaries contain more lexical paraphrasing and sentence reduction, whereas the Instruct Davinci model copies more directly from the article.

In conclusion, we find that Instruct Davinci summarizes in a very different style than human writers. We emphasize here that the freelance writers write in an abstractive style despite the fact that we have not explicitly instructed them to do so. We also observe similarly abstractive styles across the six freelance writers.

Comparing Human Preference.

We now return to our original goal of understanding whether LLM-generated summaries have quality on par with the human-written ones. In the following paragraphs, we discuss our annotation design and recruitment process.

We conduct a blinded pairwise comparison between the best LLM, Instruct Davinci, and the freelance writers, similar to the evaluation in Goyal and Durrett (2020). Besides selecting the better summary within each pair, annotators can also judge the two summaries to be equally good. We release the full annotation instructions along with the code release for this project.

In order to compare the best LLM with the freelance writers, we annotate two aspects. First, we solicit annotators’ overall preference, which balances the multiple quality aspects such as faithfulness, coherence, and relevance. Second, we solicit a more targeted measure of informativeness by asking the annotators to compare the number of facts in each summary. For the informativeness measure, we are motivated by the hypothesis that a more abstractive writing style can pack more information into the summary given the same word count. While it is also interesting to compare summary coherence and relevance, we omit them because annotators were unable to differentiate these aspects from the overall preference in a pilot study.

For our recruitment process, we recruit five additional annotators through Upwork and retain one writer who participated in the previous round of summary writing.8 We carry out a qualification round and reject annotators whose ratings differ significantly from the authors’ on a set of control questions for informativeness. We give each annotator the same set of 100 summary pairs, where the average lengths of the freelance writer summaries and the Instruct Davinci summaries are 53.2 and 52.0 words, respectively.

Figure 5 shows the results of the paired comparison. While we hypothesized that the more abstractive writing style could lead to more informative summaries, we did not find a significant effect in our annotator pool, which rates the more abstractive summaries as more informative only 51.1% of the time. On the informativeness question, our annotators reached moderate agreement (Krippendorff’s alpha is 0.32), validating our annotation instructions and recruitment process. Moving on to the more subjective overall preference, we find that our annotators equally prefer the freelance writer summaries and the Instruct Davinci summaries. However, a closer analysis shows that there is significant variability in individual annotators’ preferences, and the inter-annotator agreement is low (Krippendorff’s alpha is 0.07). This suggests that the quality of the generated summaries is getting close to that of the freelance writer summaries and that the comparison depends on each annotator’s stylistic preference.
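For reference, the sketch below computes Krippendorff’s alpha for nominal labels (e.g., preferences for the writer summary, the LLM summary, or a tie). The exact formulation behind the reported numbers is not specified above, so this standard coincidence-matrix version for nominal data is an illustrative assumption.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units: list[list]) -> float:
    # `units` holds, per summary pair, the labels assigned by the annotators who rated it.
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue
        for a, b in permutations(labels, 2):  # ordered pairs of labels within a unit
            coincidences[(a, b)] += 1.0 / (m - 1)
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    observed = sum(w for (a, b), w in coincidences.items() if a != b) / n
    expected = sum(totals[a] * totals[b] for a in totals for b in totals if a != b) / (n * (n - 1))
    if expected == 0:
        return 1.0  # all annotators gave the same label everywhere
    return 1.0 - observed / expected
```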

Figure 5: Human evaluation results comparing summaries written by freelance writers and summaries generated by Instruct GPT-3 Davinci. On aggregate, annotators equally prefer freelance writers and Instruct Davinci. However, there is high variability in individual annotators' preferences. Notably, annotator 1 writes abstractive summaries but prefers the more extractive Instruct Davinci summaries.

One example of such stylistic preference is seen in the results from annotator 1, who also participated in the first round of summary writing. Like other writers, annotator 1 summarizes in an abstractive style (2.5 density and 0.86 coverage). However, annotator 1 prefers Instruct Davinci 57% of the time even though it generated much more extractive summaries. These results suggest an intriguing gap between annotator preferences when writing and evaluating summaries.

4.3 Reevaluating Reference-based Metrics

In Section 3.3, we saw that the performance of automated metrics may depend on the quality of reference summaries. With the freelance writer summaries, we now conduct an initial study on the effect of using better quality summaries. We focus on using Rouge-L for faithfulness evaluation on the XSUM dataset because the current reference summaries are known to be highly unfaithful (Maynez et al., 2020).

In Figure 6, we plot the system-level Rouge-L against the human ratings. The left plot shows the results of computing Rouge-L with existing reference summaries from XSUM, which has a negative correlation with human ratings. This result matches our expectations because the existing reference summaries are highly unfaithful. On the right, we see the results of computing Rouge-L with the freelance writer summaries, which leads to a much more positive correlation. Hence, we see that the usefulness of reference-based evaluation is closely linked to the quality of the references and we can improve metric correlation by using better reference summaries.
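The system-level Rouge-L scores behind Figure 6 can be computed against either reference set with the same routine; a sketch using the third-party rouge_score package is shown below.

```python
import numpy as np
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)


def system_rouge_l(system_summaries: list[str], references: list[str]) -> float:
    # Average Rouge-L F1 over articles, against either the original XSUM
    # references or the freelance writer summaries.
    scores = [scorer.score(ref, hyp)["rougeL"].fmeasure
              for ref, hyp in zip(references, system_summaries)]
    return float(np.mean(scores))
```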

Figure 6: System-level Rouge-L vs. annotator ratings of faithfulness. The left plot is computed with XSUM references, where the correlation is weak, and the right plot is computed with the freelance writer summaries, where the correlation is much improved.

Figure 7: Annotation guideline for freelance writers.

Figure 8: MTurk annotation guideline for summary quality evaluation.

Implication for Model Development.

In this study, we systematically evaluate a diverse set of LLMs and find that instruction tuning contributes the most to LLMs’ summarization capability. We believe that there is much research beyond our benchmarking effort that needs to be done to better understand the effect of instruction tuning. Here we hypothesize three aspects that could account for the success of instruction tuning.

First, the quality of the summarization data used in instruction tuning can serve an important role. Our findings in Section 3 show that currently, we are finetuning language models on low-quality training data, which can account for their ineffectiveness. At this point, we cannot rule out the possibility that when finetuned on higher quality data, finetuned LMs may perform much better.

Second, the learning algorithm used for instruction tuning can be important (Ouyang et al., 2022). While the exact training details are unknown, the success of Instruct Davinci might be credited to “learning from human feedback” (LHF; Stiennon et al., 2020; Ziegler et al., 2019). In contrast to supervised finetuning, which trains systems on written summaries, learning from human feedback trains systems from binary labels of human preferences. As we observe in Section 4.2, there is a discrepancy between how annotators write and how they rate summaries. While it is possible that LHF has merits over the supervised learning/finetuning approach in exploiting this discrepancy, more analysis is needed to validate this hypothesis.

Third, multi-task learning can be important. Instruct Davinci is trained on a diverse distribution of inputs and many previous studies have confirmed the effectiveness of multi-task learning. We look forward to understanding how summarization benefits from learning on other tasks.

Implication for Summarization Evaluation.

Our work also reveals the difficulties in evaluating high-performance LLMs. As LLMs become increasingly close to human-level performance, human evaluation requires a larger number of samples and less noisy measurements to evaluate the quality of LLMs. Recently, Liu et al. (2022a) also pointed out the difficulties in conducting human evaluation for summarization and advocated using fine-grained semantic units to match with reference summaries. However, as our evaluation points out, not only are the existing reference summaries unreliable but the summaries written by well-paid freelance writers also may not outperform LLM summaries significantly. Therefore, defining reference summaries as the ground truth may be overly restrictive as LLMs are approaching or even exceeding average human-level performance.

We acknowledge that summarization evaluation depends on the application scenario and that the existing reference summaries could be suitable in another context. For example, the bullet-point-style summaries in CNN/DM may suffice for display on news websites. The quality issues (such as coherence) we point out in this paper may not be a concern in specific application scenarios. However, we emphasize that research on single-document news summarization is often abstracted away from downstream applications and used for judging generic summarization capability. Our findings in this paper are tied to this research context, which is why the major results of our study rely on new summaries written by freelance writers.

Not only is human evaluation limited by reference quality, but it is also affected by the subjectivity of evaluation. Individual variation shows that there are many acceptable ways to summarize, and individuals may even show different preferences at different points in time (writing vs. rating). In combination, these factors suggest that we may have reached the limits of single-document news summarization. Existing benchmarks can still play a role in evaluating new models, but only if evaluation is done correctly. As LLMs improve, we believe that summarization can be better grounded in downstream applications where user values are better defined, so that annotators have a lower degree of freedom in balancing which quality aspects matter most to them.

Limitations.

Due to time constraints, this study has only evaluated systems on English news summarization where the summaries are designed to have around 50 words. We also acknowledge that as automatic systems improve, it becomes increasingly difficult for annotators to unambiguously rank summaries by quality due to differences in their individual preferences.

In this work, we conducted a comprehensive human evaluation of ten LLMs across the two most popular news summarization benchmarks. Through our experiments, we find that the state-of-the-art LLM performs on par with summaries written by freelance writers, with instruction tuning being the key factor for its success. Beyond these findings, our work highlights the crucial role of good reference summaries in both summarization model development and evaluation. Unless the reference quality issue is addressed, the comparison between zero-shot, few-shot, and finetuning performance will remain open, and the current benchmarks will provide limited value when used with reference-based evaluation. Even when we address the quality issue and conduct a human evaluation with high-quality references, we observe a significant amount of individual variation in our annotator pool. Due to these factors, evaluations for single-document news summarization may be reaching their limits.

This work is supported by an Open Philanthropy grant and partially supported by a gift from Northrop Grumman. We thank the reviewers and editors for their comments, as well as the Stanford NLP group and the Stanford Center for Research on Foundation Models community for their feedback.

3. We note that the training details of instruction-tuned GPT-3 models may differ from those mentioned in the publication and are inferred by us based on the API naming scheme.

4. We recruited annotators who were previously vetted for an earlier study (Liang et al., 2022).

5. To compute agreement for coherence and relevance, we first binarize the Likert scores, with a score of 3 or above being mapped to 1.

6. We note that the 350M GPT-3 consistently generates empty outputs on the XSUM dataset so we omit it from the human evaluation.

7. We conducted an initial study to pilot instructions and found that instructing writers with a sentence limit often resulted in summaries that differ significantly in length.

8. Other annotators left during the course of the study due to a change in freelance work schedule.

Dzmitry
Bahdanau
,
Kyung Hyun
Cho
, and
Yoshua
Bengio
.
2015
.
Neural machine translation by jointly learning to align and translate
. In
3rd International Conference on Learning Representations, ICLR 2015
.
Yushi
Bai
,
Andy
Jones
,
Kamal
Ndousse
,
Amanda
Askell
,
Anna
Chen
,
Nova
DasSarma
,
Dawn
Drain
,
Stanislav
Fort
,
Deep
Ganguli
,
T. J.
Henighan
,
Nicholas
Joseph
,
Saurav
Kadavath
,
John
Kernion
,
Tom
Conerly
,
Sheer
El-Showk
,
Nelson
Elhage
,
Zac
Hatfield-Dodds
,
Danny
Hernandez
,
Tristan
Hume
,
Scott
Johnston
,
Shauna
Kravec
,
Liane
Lovitt
,
Neel
Nanda
,
Catherine
Olsson
,
Dario
Amodei
,
Tom B.
Brown
,
Jack
Clark
,
Sam
McCandlish
,
Christopher
Olah
,
Benjamin
Mann
, and
Jared
Kaplan
.
2022
.
Training a helpful and harmless assistant with reinforcement learning from human feedback
.
arXiv
preprint arXiv:2204.05862
.
Satanjeev
Banerjee
and
Alon
Lavie
.
2005
.
Meteor: An automatic metric for mt evaluation with improved correlation with human judgments
. In
IEEvaluation@ACL
.
Regina
Barzilay
and
Michael
Elhadad
.
1997
.
Using lexical chains for text summarization
. In
Proceedings of ISTS, ACL 1997
.
Regina
Barzilay
and
Kathleen R.
McKeown
.
2005
.
Sentence fusion for multidocument news summarization
.
Computational Linguistics
,
31
(
3
):
297
328
.
Regina
Barzilay
,
Kathleen R.
McKeown
, and
Michael
Elhadad
.
1999
.
Information fusion in the context of multi-document summarization
. In
Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics
, pages
550
557
,
College Park, Maryland, USA
.
Association for Computational Linguistics
.
Rishi
Bommasani
,
Drew A.
Hudson
,
Ehsan
Adeli
,
Russ
Altman
,
Simran
Arora
,
Sydney
von Arx
,
Michael S.
Bernstein
,
Jeannette
Bohg
,
Antoine
Bosselut
,
Emma
Brunskill
,
Erik
Brynjolfsson
,
S.
Buch
,
Dallas
Card
,
Rodrigo
Castellon
,
Niladri S.
Chatterji
,
Annie S.
Chen
,
Kathleen A.
Creel
,
Jared
Davis
,
Dora
Demszky
,
Chris
Donahue
,
Moussa
Doumbouya
,
Esin
Durmus
,
Stefano
Ermon
,
John
Etchemendy
,
Kawin
Ethayarajh
,
Li
Fei-Fei
,
Chelsea
Finn
,
Trevor
Gale
,
Lauren E.
Gillespie
,
Karan
Goel
,
Noah D.
Goodman
,
Shelby
Grossman
,
Neel
Guha
,
Tatsunori
Hashimoto
,
Peter
Henderson
,
John
Hewitt
,
Daniel E.
Ho
,
Jenny
Hong
,
Kyle
Hsu
,
Jing
Huang
,
Thomas F.
Icard
,
Saahil
Jain
,
Dan
Jurafsky
,
Pratyusha
Kalluri
,
Siddharth
Karamcheti
,
Geoff
Keeling
,
Fereshte
Khani
,
O.
Khattab
,
Pang Wei
Koh
,
Mark S.
Krass
,
Ranjay
Krishna
,
Rohith
Kuditipudi
,
Ananya
Kumar
,
Faisal
Ladhak
,
Mina
Lee
,
Tony
Lee
,
Jure
Leskovec
,
Isabelle
Levent
,
Xiang Lisa
Li
,
Xuechen
Li
,
Tengyu
Ma
,
Ali
Malik
,
Christopher D.
Manning
,
Suvir
Mirchandani
,
Eric
Mitchell
,
Zanele
Munyikwa
,
Suraj
Nair
,
Avanika
Narayan
,
Deepak
Narayanan
,
Benjamin
Newman
,
Allen
Nie
,
Juan Carlos
Niebles
,
Hamed
Nilforoshan
,
J. F.
Nyarko
,
Giray
Ogut
,
Laurel J.
Orr
,
Isabel
Papadimitriou
,
Joon Sung
Park
,
Chris
Piech
,
Eva
Portelance
,
Christopher
Potts
,
Aditi
Raghunathan
,
Robert
Reich
,
Hongyu
Ren
,
Frieda
Rong
,
Yusuf H.
Roohani
,
Camilo
Ruiz
,
Jack
Ryan
,
Christopher
Re
,
Dorsa
Sadigh
,
Shiori
Sagawa
,
Keshav
Santhanam
,
Andy
Shih
,
Krishna Parasuram
Srinivasan
,
Alex
Tamkin
,
Rohan
Taori
,
Armin W.
Thomas
,
Florian
Tramer
,
Rose E.
Wang
,
William
Wang
,
Bohan
Wu
,
Jiajun
Wu
,
Yuhuai
Wu
,
Sang Michael
Xie
,
Michihiro
Yasunaga
,
Jiaxuan
You
,
Matei A.
Zaharia
,
Michael
Zhang
,
Tianyi
Zhang
,
Xikun
Zhang
,
Yuhui
Zhang
,
Lucia
Zheng
,
Kaitlyn
Zhou
, and
Percy
Liang
.
2021
.
On the opportunities and risks of foundation models
.
arXiv preprint arXiv:2108.07258
.
Tom B.
Brown
,
Benjamin
Mann
,
Nick
Ryder
,
Melanie
Subbiah
,
Jared
Kaplan
,
Prafulla
Dhariwal
,
Arvind
Neelakantan
,
Pranav
Shyam
,
Girish
Sastry
,
Amanda
Askell
,
Sandhini
Agarwal
,
Ariel
Herbert-Voss
,
Gretchen
Krueger
,
T. J.
Henighan
,
Rewon
Child
,
Aditya
Ramesh
,
Daniel M.
Ziegler
,
Jeff
Wu
,
Clemens
Winter
,
Christopher
Hesse
,
Mark
Chen
,
Eric
Sigler
,
Mateusz
Litwin
,
Scott
Gray
,
Benjamin
Chess
,
Jack
Clark
,
Christopher
Berner
,
Sam
McCandlish
,
Alec
Radford
,
Ilya
Sutskever
, and
Dario
Amodei
.
2020
.
Language models are few-shot learners
. In
NeurIPS
.
Yen-Chun
Chen
and
Mohit
Bansal
.
2018
.
Fast abstractive summarization with reinforce-selected sentence rewriting
. In
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
, pages
675
686
,
Melbourne, Australia
.
Association for Computational Linguistics
.
Aakanksha
Chowdhery
,
Sharan
Narang
,
Jacob
Devlin
,
Maarten
Bosma
,
Gaurav
Mishra
,
Adam
Roberts
,
Paul
Barham
,
Hyung Won
Chung
,
Charles
Sutton
,
Sebastian
Gehrmann
,
Parker
Schuh
,
Kensen
Shi
,
Sasha
Tsvyashchenko
,
Joshua
Maynez
,
Abhishek B.
Rao
,
Parker
Barnes
,
Yi
Tay
,
Noam M.
Shazeer
,
Vinodkumar
Prabhakaran
,
Emily
Reif
,
Nan
Du
,
Benton C.
Hutchinson
,
Reiner
Pope
,
James
Bradbury
,
Jacob
Austin
,
Michael
Isard
,
Guy
Gur-Ari
,
Pengcheng
Yin
,
Toju
Duke
,
Anselm
Levskaya
,
Sanjay
Ghemawat
,
Sunipa
Dev
,
Henryk
Michalewski
,
Xavier
Garcia
,
Vedant
Misra
,
Kevin
Robinson
,
Liam
Fedus
,
Denny
Zhou
,
Daphne
Ippolito
,
David
Luan
,
Hyeontaek
Lim
,
Barret
Zoph
,
Alexander
Spiridonov
,
Ryan
Sepassi
,
David
Dohan
,
Shivani
Agrawal
,
Mark
Omernick
,
Andrew M.
Dai
,
Thanumalayan Sankaranarayana
Pillai
,
Marie
Pellat
,
Aitor
Lewkowycz
,
Erica
Moreira
,
Rewon
Child
,
Oleksandr
Polozov
,
Katherine
Lee
,
Zongwei
Zhou
,
Xuezhi
Wang
,
Brennan
Saeta
,
Mark
Diaz
,
Orhan
Firat
,
Michele
Catasta
,
Jason
Wei
,
Kathleen S.
Meier-Hellstern
,
Douglas
Eck
,
Jeff
Dean
,
Slav
Petrov
, and
Noah
Fiedel
.
2022
.
Palm: Scaling language modeling with pathways
.
arXiv preprint arXiv:2204.02311
.
Cohere
.
2022
.
Introduction to large language models
. https://docs.cohere.ai/docs/introduction-to-large-language-models
Trevor
Cohn
and
Mirella
Lapata
.
2008
.
Sentence compression beyond word deletion
. In
Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)
, pages
137
144
,
Manchester, UK
.
Coling 2008 Organizing Committee
.
J.
Conroy
,
J.
Schlessinger
,
D.
O’Leary
, and
J.
Goldstein
.
2006
.
Back to basics: Classy 2006
. In
Proceedings of the Document Understanding Conference
.
Daniel
Deutsch
and
Dan
Roth
.
2021
.
Understanding the extent to which content quality metrics measure the information quality of summaries
. In
Proceedings of the 25th Conference on Computational Natural Language Learning
, pages
300
309
,
Online
.
Association for Computational Linguistics
.
Li
Dong
,
Nan
Yang
,
Wenhui
Wang
,
Furu
Wei
,
Xiaodong
Liu
,
Yu
Wang
,
Jianfeng
Gao
,
Ming
Zhou
, and
Hsiao-Wuen
Hon
.
2019
.
Unified language model pre-training for natural language understanding and generation
.
Zhengxiao
Du
,
Yujie
Qian
,
Xiao
Liu
,
Ming
Ding
,
Jiezhong
Qiu
,
Zhilin
Yang
, and
Jie
Tang
.
2021
.
Glm: General language model pretraining with autoregressive blank infilling
. In
ACL
.
Esin
Durmus
,
He
He
, and
Mona
Diab
.
2020
.
FEQA: A question answering evaluation framework for faithfulness assessment in abstractive summarization
. In
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
, pages
5055
5070
,
Online
.
Association for Computational Linguistics
.
Esin
Durmus
,
Faisal
Ladhak
, and
Tatsunori
Hashimoto
.
2022
.
Spurious correlations in reference-free evaluation of text generation
. In
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
.
Güneş
Erkan
and
Dragomir R.
Radev
.
2004
.
Lexrank: Graph-based centrality as salience in text summarization
.
Journal of Artificial Intelligence Research
.
Alexander
Fabbri
,
Chien-Sheng
Wu
,
Wenhao
Liu
, and
Caiming
Xiong
.
2022
.
QAFactEval: Improved QA-based factual consistency evaluation for summarization
. In
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
, pages
2587
2601
,
Seattle, United States
.
Association for Computational Linguistics
.
Alexander R.
Fabbri
,
Wojciech
Kryscinski
,
Bryan
McCann
,
Caiming
Xiong
,
Richard
Socher
, and
Dragomir
Radev
.
2020
.
Summeval: Re-evaluating summarization evaluation
.
arXiv preprint arXiv:2007.12626
.
Katja
Filippova
and
Michael
Strube
.
2008
.
Sentence fusion via dependency graph compression
. In
Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing
, pages
177
185
.
Markus
Freitag
,
David
Grangier
, and
Isaac
Caswell
.
2020
.
BLEU might be guilty but references are not innocent
. In
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
.
Tanya
Goyal
and
Greg
Durrett
.
2020
.
Evaluating factuality in generation with dependency-level entailment
. In
Findings of the Association for Computational Linguistics: EMNLP 2020
, pages
3592
3603
,
Online
.
Association for Computational Linguistics
.
Tanya
Goyal
,
Junyi Jessy
Li
, and
Greg
Durrett
.
2022
.
News summarization and evaluation in the era of gpt-3
.
ArXiv
,
abs/2209.12356
.
Max
Grusky
,
Mor
Naaman
, and
Yoav
Artzi
.
2018
.
Newsroom: A dataset of 1.3 million summaries with diverse extractive strategies
. In
North American Chapter of the Association for Computational Linguistics
.
Karl Moritz
Hermann
,
Tomas
Kocisky
,
Edward
Grefenstette
,
Lasse
Espeholt
,
Will
Kay
,
Mustafa
Suleyman
, and
Phil
Blunsom
.
2015
.
Teaching machines to read and comprehend
. In
NeurIPS
.
Eduard
Hovy
and
Chin-Yew
Lin
.
1999
.
Automated text summarization in summarist
. In
Advances in Automatic Text Summarization
, pages
82
94
.
Hongyan
Jing
.
2000
.
Sentence reduction for automatic text summarization
. In
Applied Natural Language Processing Conference
.
Hongyan
Jing
and
Kathleen
McKeown
.
2000
.
Cut and paste based text summarization
. In
Applied Natural Language Processing Conference
.
Hongyan
Jing
and
Kathleen R.
McKeown
.
1999
.
The decomposition of human-written summary sentences
. In
Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
, pages
129
136
.
Daniel
Kang
and
Tatsunori B.
Hashimoto
.
2020
.
Improved natural language generation via loss truncation
. In
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
, pages
718
731
,
Online
.
Association for Computational Linguistics
.
Kevin
Knight
and
Daniel
Marcu
.
2002
.
Summarization beyond sentence extraction: A probabilistic approach to sentence compression
.
Artificial Intelligence
,
139
(
1
):
91
107
.
Emiel Krahmer, Erwin Marsi, and Paul van Pelt. 2008. Query-based sentence fusion is better defined and leads to more preferred results than generic sentence fusion. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 193–196.

Philippe Laban, Tobias Schnabel, Paul N. Bennett, and Marti A. Hearst. 2021. SummaC: Re-visiting NLI-based models for inconsistency detection in summarization. Transactions of the Association for Computational Linguistics, 10:163–177.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2019. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Annual Meeting of the Association for Computational Linguistics.
Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Cosgrove, Christopher D. Manning, Christopher Re, Diana Acosta-Navas, Drew A. Hudson, E. Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hongyu Ren, Huaxiu Yao, Jue Wang, Keshav Santhanam, Laurel J. Orr, Lucia Zheng, Mert Yuksekgonul, Mirac Suzgun, Nathan S. Kim, Neel Guha, Niladri S. Chatterji, O. Khattab, Peter Henderson, Qian Huang, Ryan Chi, Sang Michael Xie, Shibani Santurkar, Surya Ganguli, Tatsunori Hashimoto, Thomas F. Icard, Tianyi Zhang, Vishrav Chaudhary, William Wang, Xuechen Li, Yifan Mai, Yuhui Zhang, and Yuta Koreeda. 2022. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110.
C. Lin and E. Hovy. 2002. From single to multi-document summarization: A prototype system and its evaluation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 457–464.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Annual Meeting of the Association for Computational Linguistics.

Yang Liu and Mirella Lapata. 2019. Text summarization with pretrained encoders. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3730–3740, Hong Kong, China. Association for Computational Linguistics.

Yixin Liu, Alexander R. Fabbri, Pengfei Liu, Yilun Zhao, Linyong Nan, Ruilin Han, Simeng Han, Shafiq R. Joty, Chien-Sheng Wu, Caiming Xiong, and Dragomir R. Radev. 2022a. Revisiting the gold standard: Grounding summarization evaluation with robust human evaluation. ArXiv, abs/2212.07981.

Yixin Liu, Pengfei Liu, Dragomir R. Radev, and Graham Neubig. 2022b. BRIO: Bringing order to abstractive summarization. In Annual Meeting of the Association for Computational Linguistics.
Inderjeet Mani and Eric Bloedorn. 1999. Summarizing similarities and differences among related documents. Information Retrieval, 1(1–2):35–67.

Daniel Marcu. 1997. From discourse structures to text summaries. In Intelligent Scalable Text Summarization.

Erwin Marsi and Emiel Krahmer. 2005. Explorations in sentence fusion. In Proceedings of the European Workshop on Natural Language Generation 2005, pages 109–117.

Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. 2020. On faithfulness and factuality in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1906–1919, Online. Association for Computational Linguistics.

Ryan McDonald. 2006. Discriminative sentence compression with soft syntactic evidence. In 11th Conference of the European Chapter of the Association for Computational Linguistics, pages 297–304.

Rada Mihalcea and Paul Tarau. 2005. Multi-document summarization with iterative graph-based algorithms. In Proceedings of the First International Conference on Intelligent Analysis Methods and Tools (IA 2005), McLean, VA.

Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018. Don’t give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.
Ani Nenkova and Kathleen McKeown. 2011. Automatic summarization. Foundations and Trends in Information Retrieval, 5(2–3):103–233.
Ani Nenkova, Lucy Vanderwende, and Kathleen McKeown. 2006. A compositional context sensitive multi-document summarizer: Exploring the factors that influence summarization. In Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 573–580.

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke E. Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Francis Christiano, Jan Leike, and Ryan J. Lowe. 2022. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155.

Dragomir R. Radev, Eduard H. Hovy, and Kathleen McKeown. 2002. Introduction to the special issue on summarization. Computational Linguistics, 28:399–408.

Dragomir R. Radev, Hongyan Jing, and Malgorzata Budzikowska. 2000. Centroid-based summarization of multiple documents: Sentence extraction, utility-based evaluation, and user studies. In NAACL-ANLP 2000 Workshop: Automatic Summarization.

Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing.

Gerard Salton, Amit Singhal, Mandar Mitra, and Chris Buckley. 1997. Automatic text structuring and summarization. Information Processing & Management, 33(2):193–207.
Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, Manan Dey, M. Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal V. Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang, Matteo Manica, Sheng Shen, Zheng Xin Yong, Harshit Pandey, Rachel Bawden, Thomas Wang, Trishala Neeraj, Jos Rozen, Abheesht Sharma, Andrea Santilli, Thibault Fevry, Jason Alan Fries, Ryan Teehan, Stella Rose Biderman, Leo Gao, Tali Bers, Thomas Wolf, and Alexander M. Rush. 2021. Multitask prompted training enables zero-shot task generalization. arXiv preprint arXiv:2110.08207.

Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083, Vancouver, Canada. Association for Computational Linguistics.

Thibault Sellam, Dipanjan Das, and Ankur P. Parikh. 2020. BLEURT: Learning robust metrics for text generation. In Annual Meeting of the Association for Computational Linguistics.
H. Gregory Silber and Kathleen F. McCoy. 2002. Efficiently computed lexical chains as an intermediate representation for automatic text summarization. Computational Linguistics, 28(4):487–496.
Josef Steinberger, Massimo Poesio, Mijail A. Kabadjov, and Karel Ježek. 2007. Two uses of anaphora resolution in summarization. Information Processing and Management, 43(6):1663–1680.
Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan J. Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul Christiano. 2020. Learning to summarize from human feedback. arXiv preprint arXiv:2009.01325.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems, 27.

Kapil Thadani and Kathleen McKeown. 2013. Supervised sentence fusion with single-stage inference. In Proceedings of IJCNLP, Nagoya, Japan.

Oleg Vasilyev, Vedant Dharnidharka, and John Bohannon. 2020. Fill in the BLANC: Human-free quality estimation of document summaries. In Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems, pages 11–20, Online. Association for Computational Linguistics.

Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Anjana Arunkumar, Arjun Ashok, Arut Selvan Dhanasekaran, Atharva Naik, David Stap, Eshaan Pathak, Giannis Karamanolakis, Haizhi Gary Lai, Ishan Purohit, Ishani Mondal, Jacob Anderson, Kirby Kuznia, Krima Doshi, Maitreya Patel, Kuntal Kumar Pal, M. Moradshahi, Mihir Parmar, Mirali Purohit, Neeraj Varshney, Phani Rohitha Kaza, Pulkit Verma, Ravsehaj Singh Puri, Rushang Karia, Shailaja Keyur Sampat, Savan Doshi, Siddharth Deepak Mishra, Sujan Reddy, Sumanta Patro, Tanay Dixit, Xudong Shen, Chitta Baral, Yejin Choi, Hannaneh Hajishirzi, Noah A. Smith, and Daniel Khashabi. 2022. Benchmarking generalization via in-context instructions on 1,600+ language tasks. arXiv preprint arXiv:2204.07705.
Mark E. Whiting, Grant Hugh, and Michael S. Bernstein. 2019. Fair work: Crowd work minimum wage with one line of code. In AAAI Conference on Human Computation & Crowdsourcing.

Jeff Wu, Long Ouyang, Daniel M. Ziegler, Nisan Stiennon, Ryan Lowe, Jan Leike, and Paul Christiano. 2021. Recursively summarizing books with human feedback. arXiv preprint arXiv:2109.10862.

Weizhe Yuan, Graham Neubig, and Pengfei Liu. 2021. BARTScore: Evaluating generated text as text generation. ArXiv, abs/2106.11520.

Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter J. Liu. 2020. PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization. In ICML.

Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. 2022. OPT: Open pre-trained transformer language models. ArXiv, abs/2205.01068.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations.

Daniel M. Ziegler, Nisan Stiennon, Jeff Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. 2019. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593.
