Abstract
Large language models (LLMs) have shown promise for automatic summarization, but the reasons behind their success are poorly understood. By conducting a human evaluation of ten LLMs across different pretraining methods, prompts, and model scales, we make two important observations. First, we find that instruction tuning, not model size, is the key to LLMs' zero-shot summarization capability. Second, existing studies have been limited by low-quality references, leading to underestimates of human performance and lower few-shot and finetuning performance. To better evaluate LLMs, we perform a human evaluation over high-quality summaries we collect from freelance writers. Despite major stylistic differences, such as the amount of paraphrasing, we find that LLM summaries are judged to be on par with human-written summaries.
1 Introduction
Large language models (LLMs) have shown promising results in zero-/few-shot tasks across a wide range of domains (Chowdhery et al., 2022; Bai et al., 2022; Brown et al., 2020; Zhang et al., 2022) and have raised significant interest for their potential for automatic summarization (Goyal et al., 2022; Liu et al., 2022a). However, the design decisions contributing to their success in summarization remain poorly understood, and while prior work has shown that LLMs outperform the prior state of the art, it remains unclear whether their outputs are comparable to those of human writers. Examining these questions is crucial for advancing future research in automatic summarization.
To answer the first question, we perform a systematic human evaluation of ten diverse LLMs on news summarization; our evaluation identifies instruction tuning as the key to zero-shot summarization capability. In contrast, self-supervised learning alone cannot induce strong summarization performance in the zero-shot setting (Figure 1). In fact, even a 350M-parameter instruction-tuned GPT-3 can perform on par with the 175B-parameter GPT-3.
To benchmark LLMs, we evaluate them on the standard CNN/DM (Hermann et al., 2015) and XSUM (Narayan et al., 2018) datasets but find that the existing reference summaries cause several issues. The reference summaries in these benchmarks were originally created in a different use context and, when evaluated as part of a generic news summarization benchmark, human annotators judge them to be worse than the outputs of most automatic systems (Figure 1). When automatic metrics are computed against these references, their poor quality reduces the correlation between metric scores and human judgment. Not only does this complicate evaluation, it also degrades the performance of systems that take supervision through finetuning or few-shot prompting, making comparisons across settings difficult.
To address the quality issues of reference summaries and better understand how LLMs compare to human summary writers, we recruit freelance writers from Upwork1 to re-annotate 100 articles from the test set of CNN/DM and XSUM. Comparing the best performing LLM, Instruct Davinci, to the freelance writers, we find that the Instruct Davinci summaries are much more extractive. By manually annotating the summarization operations (Jing and McKeown, 2000) used in these summaries, we find that Instruct Davinci paraphrases much less frequently although it is able to combine copied segments coherently.
Given their stylistic differences, we recruit annotators to compare the Instruct Davinci summaries to those written by freelance writers. On aggregate, we find that Instruct Davinci is rated as comparable to the freelance writers. Examination of the annotations from each individual rater shows that every rater has their own consistent preference for either Instruct Davinci or the freelance writers.
Together, our work makes the following key contributions. First, we identify instruction tuning, instead of model scale, as the key to LLMs' summarization capability. Second, we show that the reference summaries used in XSUM, which are simply the first sentence of the news article, are judged by humans to be worse than the best LLM-generated summaries. Third, to address these issues with references, we collect better-quality summaries from freelance writers and show that the best LLM is rated as comparable to the Upwork freelance writers. In combination, these results call into question recent claims made about LLM summarization. In particular, summarization progress cannot be measured using reference-based metrics applied on XSUM. Furthermore, whether fine-tuned, few-shot, or zero-shot models perform best remains an open question due to the poor quality of the training data. To encourage future work on improved evaluations, we release the high-quality summaries written by freelance writers and the evaluation data on 18 model settings and two datasets as resources.2
2 Background and Related Work
2.1 News Summarization
News summarization is the task of producing a concise paragraph that captures the main points of a news article and has been a core problem within the field of automatic summarization (Radev et al., 2002; Nenkova and McKeown, 2011). Early work focused mostly on extractive approaches, using unsupervised data-driven methods that relied on different variants of word frequency to determine salience (e.g., Salton et al., 1997; Hovy and Lin, 1999; Lin and Hovy, 2002; Mani and Bloedorn, 1999; Conroy et al., 2006; Nenkova et al., 2006). Other approaches to extractive summarization relied on aspects of discourse semantics (e.g., lexical chains and rhetorical structure theory) (Barzilay and Elhadad, 1997; Marcu, 1997; Silber and McCoy, 2002; Steinberger et al., 2007), or graph-based methods (e.g., Radev et al., 2000; Mihalcea and Tarau, 2005; Erkan and Radev, 2004). These extractive approaches were developed both for single-document and multi-document news summarization, with far more work focusing on multi-document than the single-document task. Humans, however, rely on more abstractive operations (such as paraphrasing, generalizations, etc.) in order to write fluent summaries (Jing and McKeown, 1999). This has led to a push toward building abstractive summarization systems, with initial research focusing on designing post-processing algorithms for extractive summarizers that focused on specific operations such as sentence fusion (Barzilay and McKeown, 2005; Marsi and Krahmer, 2005; Krahmer et al., 2008; Filippova and Strube, 2008; Thadani and McKeown, 2013), generation (Barzilay et al., 1999) and sentence compression (Jing, 2000; Knight and Marcu, 2002; McDonald, 2006; Cohn and Lapata, 2008). More scalable, data-driven approaches for building abstractive summarization systems were made possible with more effective neural systems for conditional generation (Sutskever et al., 2014; Bahdanau et al., 2015) as well as large-scale datasets (Rush et al., 2015; Hermann et al., 2015), leading to steady progress over the years (See et al., 2017; Chen and Bansal, 2018; Dong et al., 2019; Liu and Lapata, 2019; Lewis et al., 2019; Zhang et al., 2020).
This work benchmarks LLMs on news summarization using two popular benchmarks, CNN/DM (Hermann et al., 2015) and XSUM (Narayan et al., 2018). These datasets contain hundreds of thousands of article-summary pairs but were created using “incidental supervision”, i.e., the reference summaries were not written specifically for the task but adapted from content on the websites. CNN/DM includes articles from the CNN and DailyMail websites as the source articles and adapts the bullet point highlights that accompany the articles as reference summaries. XSUM includes articles from BBC News and adapts the bolded introductory sentence(s) as reference summaries. As a result, the reference summaries in these datasets are known to have quality issues (Maynez et al., 2020; Kang and Hashimoto, 2020), motivating us to address these defects to improve LLM evaluation.
To contextualize the performance of LLMs, we mainly compare to previous state-of-the-art approaches that leveraged supervised finetuning (Liu and Lapata, 2019; Lewis et al., 2019; Zhang et al., 2020; Liu et al., 2022b). Summarization evaluation is another active area of research. Many automatic metrics have been proposed (Lin, 2004; Zhang et al., 2020; Sellam et al., 2020; Durmus et al., 2020; Maynez et al., 2020; Deutsch and Roth, 2021) but they do not always correlate with human evaluation of summarization systems (Fabbri et al., 2020; Durmus et al., 2022). In this work, we evaluate the effectiveness of automatic metrics for evaluating LLMs and show that the usefulness of reference-based evaluation is closely linked to the quality of the references.
2.2 Large Language Models
LLMs (Bommasani et al., 2021; Chowdhery et al., 2022; Brown et al., 2020) have two distinctive features over previous pretrained models. First, LLMs have a much larger scale in terms of model parameters and training data. Second, unlike previous pretrained models that require finetuning, LLMs can be prompted zero-shot or few-shot to solve a task. In the zero-shot setting, prompting presents the LLM with an input (e.g., a news article) and a natural language instruction (e.g., “summarize this news article in three sentences”) and solicits an output by having the LLM generate an answer directly. When few-shot training examples are available, LLMs can learn “in context”: in-context learning prepends training input-output pairs, along with the same style of instruction, to the test input.
Recently, instruction-tuning has emerged as an effective way to improve LLM prompting performance (Sanh et al., 2021; Wang et al., 2022; Ouyang et al., 2022). In this approach, a diverse set of natural language processing tasks are reformulated into the prompting format and the LLM’s parameters are updated for these tasks either through supervised finetuning or reinforcement learning.
Recent work (Goyal and Durrett, 2020) shows that the instruction-tuned GPT-3 Davinci model is better than finetuned LMs, but does not identify which design decisions contribute to the improved performance. In our work, we carry out a more comprehensive benchmark of ten different LLMs to understand the effects of model scale, in-context learning, and instruction tuning. Given that automatic metrics may not be reliable, we focus on human evaluation as our benchmarking method.
3 Human Evaluation on News Summarization Benchmarks
In this section, we use human evaluation to systematically benchmark a diverse set of ten LLMs on news summarization. We observe that instruction tuning is the key to strong summarization capability and reference summaries in current benchmarks may underestimate few-shot or finetuning performance.
3.1 Experimental Setup
Data
We conduct our human evaluation on CNN/DM and XSUM by sampling 100 examples from the validation set of each dataset. For the few-shot in-context learning setting, we sample five examples from the training set to serve as demonstration examples. Due to the limited context window, we restrict these to articles that are between 50 and 150 tokens in length according to the GPT-2 tokenizer, as sketched below. For XSUM, we find that uniform sampling occasionally results in articles that are unreadable due to data preprocessing, so we manually pick the demonstration examples from the training set.
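For concreteness, the following sketch shows one way the demonstration-example length filter described above could be implemented; the use of the Hugging Face GPT-2 tokenizer and all function names are our illustrative assumptions, not the original implementation.

```python
# Hypothetical sketch of the demonstration-example filter: keep training articles
# whose GPT-2 token count falls between 50 and 150, then sample five of them.
import random
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def is_valid_demonstration(article: str, low: int = 50, high: int = 150) -> bool:
    """Return True if the article's GPT-2 token length lies within [low, high]."""
    return low <= len(tokenizer.encode(article)) <= high

def sample_demonstrations(train_articles, k: int = 5, seed: int = 0):
    """Sample k length-filtered demonstration articles from the training set."""
    valid = [a for a in train_articles if is_valid_demonstration(a)]
    return random.Random(seed).sample(valid, k)
```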
Model Details
We consider ten LLMs across different pretraining strategies and model scales.3 Table 1 lists the details of the LLMs we consider. Due to limited computational resources and model access, we benchmark all models in the five-shot setting but only benchmark three OpenAI GPT-3 models and three OpenAI instruction-tuned GPT-3 models in the zero-shot setting.
Table 1: The ten LLMs considered in this study.

| Model | Model Creator | # Parameters | Instruction Tuning | Reference |
|---|---|---|---|---|
| GPT-3 davinci v1 | OpenAI | 175B | ✗ | Brown et al. (2020) |
| GPT-3 curie v1 | OpenAI | 6.7B | ✗ | Brown et al. (2020) |
| GPT-3 ada v1 | OpenAI | 350M | ✗ | Brown et al. (2020) |
| InstructGPT davinci v2 | OpenAI | 175B | ✓ | Ouyang et al. (2022) |
| InstructGPT curie v1 | OpenAI | 6.7B | ✓ | Ouyang et al. (2022) |
| InstructGPT ada v1 | OpenAI | 350M | ✓ | Ouyang et al. (2022) |
| OPT 175B | Meta | 175B | ✗ | Zhang et al. (2022) |
| GLM | Tsinghua University | 130B | ✗ | Du et al. (2021) |
| Cohere xlarge v20220609 | Cohere | 52.4B | ✗ | Cohere (2022) |
| Anthropic-LM v4-s3 | Anthropic | 52B | ✓ | Bai et al. (2022) |
For CNN/DM, we solicit LLM summaries with the following prompt template: “Article: [article]. Summarize the article in three sentences. Summary:”
For XSUM, we modify the prompt template to summarize in one sentence to match the style of the reference summaries. For all LLMs we consider, we sample with temperature 0.3 following prior work (Wu et al., 2021).
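As a rough illustration of this setup, the sketch below queries a completion-style LLM with the zero-shot prompt template and a sampling temperature of 0.3. It assumes the legacy OpenAI completions API (openai-python < 1.0); the model name and token limit are placeholders rather than the exact configuration used here.

```python
# Minimal sketch of the zero-shot prompting setup; the engine name and max_tokens
# are illustrative assumptions, not the exact configuration used in the paper.
import openai

CNN_DM_TEMPLATE = "Article: {article}. Summarize the article in three sentences. Summary:"
XSUM_TEMPLATE = "Article: {article}. Summarize the article in one sentence. Summary:"

def zero_shot_summary(article: str, template: str = CNN_DM_TEMPLATE,
                      model: str = "text-davinci-002") -> str:
    """Sample one summary with temperature 0.3, as described in the text."""
    response = openai.Completion.create(
        model=model,
        prompt=template.format(article=article),
        temperature=0.3,   # sampling temperature used for all LLMs in this study
        max_tokens=150,    # illustrative cap on summary length
    )
    return response["choices"][0]["text"].strip()
```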
To contextualize our LLM benchmarking results, we also evaluate two state-of-the-art finetuned LMs: Pegasus (Zhang et al., 2020) and BRIO (Liu et al., 2022b). We decode the finetuned LMs using a beam size of 5 following prior work (Lewis et al., 2019). In addition, we also evaluate the existing reference summaries in the CNN/DM and XSUM validation sets.
Human Evaluation Protocol
We recruit annotators from Amazon Mechanical Turk, compensating them at the California minimum wage of $15.00/hr using conservative time estimates, as recommended by Whiting et al. (2019). We recruited a total of 30 annotators from the US who have a lifetime HIT approval rate of 98% or above and at least 10,000 approved HITs (Figure 8).4 Summaries are presented in random order and are evaluated independently by three annotators. We report the average score for each summary based on the ratings from all three annotators.
Our annotators evaluate each summary along three criteria: faithfulness, coherence, and relevance. We define these terms and collect data according to the guidelines in Fabbri et al. (2020). Coherence and relevance ratings are collected on a 1-to-5 Likert scale, while faithfulness is collected as a binary judgment because its definition is inherently binary. Unlike Fabbri et al. (2020), we do not evaluate fluency because we find LLM outputs to be mostly fluent. The average pairwise agreement for the annotators in our annotator pool was 75% for faithfulness, 81% for coherence, and 86% for relevance.5 The full annotation guidelines are included in our code release.
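A minimal sketch of how this pairwise agreement can be computed, assuming the binarization described in the notes (Likert scores of 3 or above mapped to 1); the data layout and function name are ours.

```python
# Sketch of average pairwise agreement across the three annotators of each summary.
# Likert ratings are binarized (>= 3 -> 1); faithfulness labels are already binary,
# so pass binarize=False for them.
from itertools import combinations
from typing import List

def pairwise_agreement(ratings_per_item: List[List[int]], binarize: bool = True) -> float:
    """ratings_per_item holds one list of three annotator scores per summary."""
    agreements = []
    for scores in ratings_per_item:
        if binarize:
            scores = [1 if s >= 3 else 0 for s in scores]
        pairs = list(combinations(scores, 2))
        agreements.append(sum(a == b for a, b in pairs) / len(pairs))
    return sum(agreements) / len(agreements)

# e.g., coherence ratings for two summaries, three annotators each
print(pairwise_agreement([[4, 5, 2], [3, 3, 4]]))  # -> 0.67
```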
3.2 Evaluation Results
Table 2: Human evaluation results (faithfulness, coherence, relevance) for zero-shot LLMs, five-shot LLMs, finetuned LMs, and existing references on CNN/DM and XSUM.

| Setting | Model | CNN/DM Faithfulness | CNN/DM Coherence | CNN/DM Relevance | XSUM Faithfulness | XSUM Coherence | XSUM Relevance |
|---|---|---|---|---|---|---|---|
| Zero-shot language models | GPT-3 (350M) | 0.29 | 1.92 | 1.84 | 0.26 | 2.03 | 1.90 |
| | GPT-3 (6.7B) | 0.29 | 1.77 | 1.93 | 0.77 | 3.16 | 3.39 |
| | GPT-3 (175B) | 0.76 | 2.65 | 3.50 | 0.80 | 2.78 | 3.52 |
| | Ada Instruct v1 (350M*) | 0.88 | 4.02 | 4.26 | 0.81 | 3.90 | 3.87 |
| | Curie Instruct v1 (6.7B*) | 0.97 | 4.24 | 4.59 | 0.96 | 4.27 | 4.34 |
| | Davinci Instruct v2 (175B*) | 0.99 | 4.15 | 4.60 | 0.97 | 4.41 | 4.28 |
| Five-shot language models | Anthropic-LM (52B) | 0.94 | 3.88 | 4.33 | 0.70 | 4.77 | 4.14 |
| | Cohere XL (52.4B) | 0.99 | 3.42 | 4.48 | 0.63 | 4.79 | 4.00 |
| | GLM (130B) | 0.94 | 3.69 | 4.24 | 0.74 | 4.72 | 4.12 |
| | OPT (175B) | 0.96 | 3.64 | 4.33 | 0.67 | 4.80 | 4.01 |
| | GPT-3 (350M) | 0.86 | 3.73 | 3.85 | – | – | – |
| | GPT-3 (6.7B) | 0.97 | 3.87 | 4.17 | 0.75 | 4.19 | 3.36 |
| | GPT-3 (175B) | 0.99 | 3.95 | 4.34 | 0.69 | 4.69 | 4.03 |
| | Ada Instruct v1 (350M*) | 0.84 | 3.84 | 4.07 | 0.63 | 3.54 | 3.07 |
| | Curie Instruct v1 (6.7B*) | 0.96 | 4.30 | 4.43 | 0.85 | 4.28 | 3.80 |
| | Davinci Instruct v2 (175B*) | 0.98 | 4.13 | 4.49 | 0.77 | 4.83 | 4.33 |
| Fine-tuned language models | Brio | 0.94 | 3.94 | 4.40 | 0.58 | 4.68 | 3.89 |
| | Pegasus | 0.97 | 3.93 | 4.38 | 0.57 | 4.73 | 3.85 |
| Existing references | – | 0.84 | 3.20 | 3.94 | 0.37 | 4.13 | 3.00 |
Instruction Tuned Models Have Strong Summarization Ability.
Across the two datasets and three aspects, we find that the zero-shot instruction-tuned GPT-3 models, especially Instruct Curie and Instruct Davinci, perform the best overall. Compared to the finetuned LMs (e.g., Pegasus), Instruct Davinci achieves higher coherence and relevance scores (4.15 vs. 3.93 and 4.60 vs. 4.40) on CNN/DM and higher faithfulness and relevance scores (0.97 vs. 0.57 and 4.28 vs. 3.85) on XSUM, which is consistent with recent work (Goyal et al., 2022). In contrast to instruction tuning, we find scale to be less important: even the largest 175B GPT-3 model often ignores the instruction and generates irrelevant content, while the much smaller Instruct Ada outperforms the 175B GPT-3 model on coherence and relevance.
In the five-shot setting, non-instruction-tuned LLMs can improve their summarization performance through in-context learning. For faithfulness scores on CNN/DM and coherence scores on XSUM, several non-instruction-tuned LLMs can perform as well as the instruction-tuned LLMs. However, for other aspects, we still find the instruction-tuned LLMs to be better.
Reference Summaries in Current Benchmarks Should Not Be Used for Training and Evaluating Generic News Summarization Systems.
We arrive at this conclusion based on two observations. First, most automatic summarization systems score better than the reference summaries across all three aspects. Second, applying in-context learning with the current reference summaries makes instruction-tuned models generate worse summaries. For example, on the XSUM dataset, after conditioning on five reference summaries, the faithfulness score of Instruct Davinci drops from 0.97 to 0.77.
The reference summaries make it difficult to compare LLMs to both finetuned models and humans. When comparing to finetuned models, the relatively poor performance of finetuned models can be attributed to the low quality of references in the training data. This suggests we could be underestimating the potential performance of finetuning approaches. When comparing to humans, the existing low-quality references are not representative of actual human performance since they were created through heuristics. As a result, the differences between instruction-tuned LLMs and human performance are likely overstated in Table 2.
Table 3: System-level Kendall's tau correlation between automatic metrics and human judgments on CNN/DM and XSUM.

| Metric | CNN/DM Faithfulness | CNN/DM Coherence | CNN/DM Relevance | XSUM Faithfulness | XSUM Coherence | XSUM Relevance |
|---|---|---|---|---|---|---|
| Rouge-L | 0.54 | 0.48 | 0.72 | −0.27 | 0.71 | 0.30 |
| METEOR | 0.58 | 0.37 | 0.66 | −0.22 | 0.68 | 0.38 |
| BertScore | 0.54 | 0.47 | 0.70 | −0.23 | 0.70 | 0.30 |
| BARTScore | 0.56 | 0.34 | 0.65 | −0.22 | 0.70 | 0.35 |
| BLEURT | 0.56 | 0.62 | 0.81 | −0.08 | 0.67 | 0.41 |
| SummaC | 0.54 | 0.11 | 0.26 | 0.26 | −0.41 | −0.29 |
| QAFactEval | 0.64 | 0.16 | 0.35 | 0.55 | 0.16 | 0.37 |
| BLANC | 0.54 | 0.31 | 0.50 | 0.50 | 0.10 | 0.32 |
Qualitative Examples.
Figure 2 showcases example summaries on an article from the CNN/DM validation set, comparing the summaries of zero-shot GPT-3 Davinci, instruction-tuned GPT-3 Davinci, and the CNN/DM reference summary.
We start by noting that the zero-shot GPT-3 model cannot follow the instructions to summarize well. After the summary paragraph, the model generates an additional question that is completely irrelevant. In addition to the failure to follow instructions, the generated summary contains a factual error, stating that the handbag mentioned is the most expensive in the world, which contradicts the original article. In contrast, the instruction-tuned GPT-3 model generates a summary that is both faithful and coherent.
We also observe from Figure 2 that the reference summary is not coherent. The brand “Hermes” is not introduced until the end and its connection to the rest of the story is unclear. This is unsurprising, as reference summaries in the CNN/DM dataset were originally bullet points accompanying the articles rather than coherent paragraphs. While such reference summaries might be suitable in their original context, we argue that they are not useful for evaluating generic news summarization.
3.3 Understanding Automatic Metrics
We compute system-level correlations against human ratings for eight popular automated evaluation metrics. For reference-based metrics we consider Rouge-L (Lin, 2004), METEOR (Banerjee and Lavie, 2005), BertScore (Zhang et al., 2020), BLEURT (Sellam et al., 2020), and BARTScore (Yuan et al., 2021). For reference-free metrics we consider SummaC (Laban et al., 2021), QAFactEval (Fabbri et al., 2022), and BLANC (Vasilyev et al., 2020).
Table 3 shows the Kendall's tau rank correlations between automated metrics and human judgments. We observe significantly different trends on CNN/DM and XSUM, so we discuss them separately in the following paragraphs.
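The sketch below illustrates the system-level correlation computation assumed here: each system's metric scores and human ratings are averaged over the evaluated articles, and Kendall's tau is then computed across systems. The variable names and data layout are illustrative, not taken from our codebase.

```python
# Illustrative system-level Kendall's tau between an automatic metric and human ratings.
import numpy as np
from scipy.stats import kendalltau

def system_level_tau(metric_scores: dict, human_scores: dict) -> float:
    """Both dicts map a system name to its list of per-article scores."""
    systems = sorted(metric_scores)
    metric_means = [np.mean(metric_scores[s]) for s in systems]
    human_means = [np.mean(human_scores[s]) for s in systems]
    tau, _p_value = kendalltau(metric_means, human_means)
    return tau
```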
For CNN/DM, we observe that the reference-based automatic metrics have a moderate correlation with some aspects of human judgments; for example, Rouge-L has a 0.72 Kendall's tau correlation coefficient with relevance in Table 3. This level of correlation is comparable to that reported in the study of Fabbri et al. (2020), which measures the correlation of automatic metrics when evaluating finetuned LMs and even earlier neural summarization systems. Therefore, we conclude that on CNN/DM, automatic reference-based metrics can still provide useful signals for relevance.
Studying the results more closely, we find that Rouge-L and human evaluation are more correlated when comparing within each model group. As an example, we plot Rouge-L against the relevance rating in Figure 3. First, we observe that Rouge-L still prefers finetuned LMs (green points on top of the plots) to LLMs, consistent with prior work (Goyal et al., 2022). Despite this mistake, when comparing LLMs only with each other, we find that a Rouge-L difference larger than 0.05 usually translates to an improvement in human evaluation.
On XSUM, reference-based metrics have a very low correlation with faithfulness and relevance because the reference summaries themselves score poorly on these aspects (Table 3; also see Maynez et al., 2020). With such low-quality references, we do not expect reference-based metrics to extract useful information.
In general, across both datasets, we find that reference-based metrics correlate better with human judgments on the aspects for which reference summaries also have better scores (e.g., CNN/DM relevance, XSUM coherence). This points to the important role of quality reference summaries for reference-based metrics, as previously observed in machine translation (Freitag et al., 2020). Reference-free metrics are less handicapped by the low-quality references but they are mostly geared towards measuring faithfulness. Even BLANC, which is designed to measure overall summary quality, correlates best with faithfulness and much worse for relevance and coherence.
4 Comparing to Freelance Writers
In Section 3, we see that the low-quality reference summaries make studying and benchmarking LLMs difficult. In this section, we address this by recruiting Upwork freelance writers to collect higher-quality summaries. With this data, we aim to answer two important questions. First, we would like to know whether the best LLM has reached human-level performance and how the summaries written by the best LLM differ from the ones written by humans. Second, we aim to examine the correlation between reference-based metrics and human judgments when the metrics are calculated using our higher-quality reference summaries.
4.1 Experimental Setup
In this section, we describe the recruitment process and instructions for the summary writing task.
Data.
We select 50 articles from each of the CNN/DM and XSUM evaluation sets described in Section 3.1 and assign each article to three writers. For XSUM, we use the full articles rather than the preprocessed version in which the first bolded sentence is removed.
Writer Recruitment.
We recruit six writers from the freelance work platform Upwork who have prior experience writing blog posts, landing page introductions, or product descriptions. After a qualification round in which candidates summarized five articles, we selected the best writers according to the faithfulness, coherence, and relevance of their summaries. Through an initial pilot study, we estimate that summarizing a CNN/DM or XSUM article takes around 12 to 15 minutes, so we pay our writers $4 for every article they summarize, following recommended practice (Whiting et al., 2019). We based the assignments on writers' availability, with the most prolific writer summarizing 100 articles and the least prolific summarizing 35. We include our annotation guideline for freelance writers in Figure 7.
Summary Writing Instructions.
We instruct our writers to summarize each article in around 50 words.7 To give better task grounding, we ask the writers to summarize as if they were writing a newsletter to update their readers on the news. We release the full annotation guideline along with our code release.
LLM Summaries Generation.
Recently, Liu et al. (2022a) showed that length is a confounding factor in the human evaluation of summarization. To control for this potential length confound, we modify the zero-shot prompt in Section 3.1 to elicit summaries of around 50 words, the same word limit given to the freelance writers. We find that the Instruct Davinci model consistently produces summaries that exceed a given word limit, so we intentionally prompt it with a 25-word limit to produce summaries with an average length of 50 words. With this new prompt, we generate the summaries using the same hyperparameters described in Section 3.1.
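The exact wording of the length-controlled prompt is not reproduced here; the sketch below is our illustrative rendering of the idea that prompting Instruct Davinci for roughly 25 words yields summaries averaging around 50 words, together with a simple length check.

```python
# Hypothetical length-controlled prompt; the precise wording is our assumption.
LENGTH_CONTROLLED_TEMPLATE = (
    "Article: {article}. Summarize the article in 25 words. Summary:"
)

def average_word_count(summaries) -> float:
    """Check realized summary lengths against the ~50-word target."""
    return sum(len(s.split()) for s in summaries) / len(summaries)
```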
Quality Control.
To verify the quality of the summaries written by freelance writers, we have Mechanical Turk annotators evaluate a random subset of 100 summaries using the same annotation scheme as in Section 3.1. Table 4 reports the evaluation results, where we see that the freelance writer summaries are of much higher quality than the original reference summaries in CNN/DM and XSUM. In addition, we see that the difference between the freelance writers and Instruct Davinci in this evaluation is small. Next, we carry out more targeted evaluations to compare the summaries written by freelance writers and Instruct Davinci.
Table 4: Quality control evaluation of freelance writer summaries, zero-shot Instruct Davinci summaries, and existing reference summaries.

| Model | Faithfulness | Coherence | Relevance |
|---|---|---|---|
| Freelance Writer | 0.93 | 4.39 | 4.26 |
| Zero-shot Instruct Davinci | 0.98 | 4.26 | 4.40 |
| Reference Summaries | 0.64 | 3.59 | 3.45 |
4.2 Paired Comparison between LLM and Freelance Writers
Comparing Stylistic Differences.
Despite the similar performance in our quality control study, we find that LLM summaries and freelance writer summaries have distinctive styles. Figure 2 shows an example summary written by the freelance writer. Compared to the LLM-generated summary, we find the freelance writer summary often contains more paraphrasing and copies less from the article.
To quantify this stylistic difference, we compute two extractiveness measures, coverage and density, following Grusky et al. (2018). Coverage is defined as the percentage of words in the summary that are also present in the article; density is defined as the average length of the continuous text spans in the summary that are copied from the article. Our analysis shows that the coverage and density of the Instruct Davinci summaries are 0.92 and 12.1, respectively, whereas those of the writers' summaries are 0.81 and 2.07. These measures show that the summaries generated by Instruct Davinci are highly extractive whereas the summaries written by the freelance writers are much more abstractive.
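A simplified sketch of these two measures, following the definitions as stated above (Grusky et al.'s original formulation of density uses squared fragment lengths; the greedy matching and whitespace tokenization here are deliberate simplifications):

```python
# Naive extractiveness measures: coverage = fraction of summary words copied from
# the article; density = average length of the contiguous copied spans.
def extractive_fragment_lengths(article_tokens, summary_tokens):
    """Greedily match maximal contiguous summary spans against the article."""
    lengths, i = [], 0
    while i < len(summary_tokens):
        best = 0
        for j in range(len(article_tokens)):
            k = 0
            while (i + k < len(summary_tokens) and j + k < len(article_tokens)
                   and summary_tokens[i + k] == article_tokens[j + k]):
                k += 1
            best = max(best, k)
        if best > 0:
            lengths.append(best)
            i += best
        else:
            i += 1
    return lengths

def coverage_and_density(article: str, summary: str):
    a, s = article.lower().split(), summary.lower().split()
    if not s:
        return 0.0, 0.0
    fragments = extractive_fragment_lengths(a, s)
    coverage = sum(fragments) / len(s)
    density = sum(fragments) / len(fragments) if fragments else 0.0
    return coverage, density
```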
To have a fine-grained understanding of these stylistic differences, we manually analyze the distribution of “cut-and-paste operations” in these two sets of summaries. Jing and McKeown (2000) identify a set of “cut and paste” operations for reusing text from the article, including sentence reduction, sentence combination, syntactic transformation, lexical paraphrasing, and generalization or specification. On top of these operations, we additionally include a sentence copy operation to account for summary sentences that are directly copied from the article. Using this guideline, we manually annotate ten randomly sampled summary pairs written by Instruct Davinci and the freelance writers.
Figure 4 reports the distribution of the cut-and-paste operations, showing the fraction of sentences that contain each operation. First, we observe that the freelance writer summaries use lexical paraphrasing and generalization/specification much more frequently than the Instruct Davinci generated summaries. Because both operations often involve using novel words that are not present in the article, this matches with the fact that the freelance writer summaries have lower coverage (0.81 vs. 0.92) than the Instruct Davinci summaries. Second, we find that sentence combination is a common strategy used by both the freelance writers and Instruct Davinci. Third, we find that the freelance writers never copy an entire sentence directly from the article but Instruct Davinci does this more frequently.
In conclusion, we find that Instruct Davinci summarizes in a very different style than human writers. We emphasize here that the freelance writers write in an abstractive style despite the fact that we have not explicitly instructed them to do so. We also observe similarly abstractive styles across the six freelance writers.
Comparing Human Preference.
We now return to our original goal of understanding whether LLM-generated summaries have quality on par with the human-written ones. In the following paragraphs, we discuss our annotation design and recruitment process.
We conduct a blinded pairwise comparison between the best LLM, Instruct Davinci, and the freelance writers, similar to the evaluation in Goyal and Durrett (2020). Besides selecting the better summary within each pair, the annotators can also judge the two summaries to be equally good. We release the full annotation instructions along with the code release for this project.
In order to compare the best LLM with the freelance writers, we annotate two aspects. First, we solicit annotators’ overall preference, which balances the multiple quality aspects such as faithfulness, coherence, and relevance. Second, we solicit a more targeted measure of informativeness by asking the annotators to compare the number of facts in each summary. For the informativeness measure, we are motivated by the hypothesis that a more abstractive writing style can pack more information into the summary given the same word count. While it is also interesting to compare summary coherence and relevance, we omit them because annotators were unable to differentiate these aspects from the overall preference in a pilot study.
We recruit five additional annotators through Upwork and retain one writer who participated in the previous round of summary writing.8 We carry out a qualification round and reject annotators whose ratings differ significantly from the authors' on a set of control questions for informativeness. We give each annotator the same set of 100 summary pairs, where the average lengths of the freelance writer summaries and the Instruct Davinci summaries are 53.2 and 52.0 words, respectively.
Figure 5 shows the results of the paired comparison. While we hypothesized that the more abstractive writing style could lead to more informative summaries, we did not find a significant effect in our annotator pool, who rate the more abstractive summaries as more informative only 51.1% of the time. On the informativeness question, our annotators reached moderate agreement (Krippendorff's alpha of 0.32), validating our annotation instructions and recruitment process. Moving on to the more subjective overall preference, we find that our annotators prefer the freelance writer summaries and the Instruct Davinci summaries about equally. However, a closer analysis shows significant variability in individual annotators' preferences, and the inter-annotator agreement is low (Krippendorff's alpha of 0.07). This suggests that the quality of generated summaries is approaching that of the freelance writer summaries and that the comparison depends on each annotator's stylistic preference.
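For reference, the sketch below shows one way such agreement numbers can be computed with the third-party krippendorff package; the package choice and the preference coding are our assumptions, not the tooling used in the study.

```python
# Hypothetical agreement computation over pairwise preferences (nominal data):
# rows are annotators, columns are summary pairs; 0 = prefer writer,
# 1 = prefer Instruct Davinci, 2 = tie; np.nan would mark skipped pairs.
import numpy as np
import krippendorff

preferences = np.array([
    [0, 1, 2, 1, 0],
    [1, 1, 2, 0, 0],
    [0, 1, 0, 1, 2],
], dtype=float)

alpha = krippendorff.alpha(reliability_data=preferences,
                           level_of_measurement="nominal")
print(f"Krippendorff's alpha: {alpha:.2f}")
```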
One example of such stylistic preference is seen in the results from annotator 1, who also participated in the first round of summary writing. Like other writers, annotator 1 summarizes in an abstractive style (2.5 density and 0.86 coverage). However, annotator 1 prefers Instruct Davinci 57% of the time even though it generated much more extractive summaries. These results suggest an intriguing gap between annotator preferences when writing and evaluating summaries.
4.3 Reevaluating Reference-based Metrics
In Section 3.3, we saw that the performance of automated metrics may depend on the quality of reference summaries. With the freelance writer summaries, we now conduct an initial study on the effect of using better quality summaries. We focus on using Rouge-L for faithfulness evaluation on the XSUM dataset because the current reference summaries are known to be highly unfaithful (Maynez et al., 2020).
In Figure 6, we plot the system-level Rouge-L against the human ratings. The left plot shows the results of computing Rouge-L with existing reference summaries from XSUM, which has a negative correlation with human ratings. This result matches our expectations because the existing reference summaries are highly unfaithful. On the right, we see the results of computing Rouge-L with the freelance writer summaries, which leads to a much more positive correlation. Hence, we see that the usefulness of reference-based evaluation is closely linked to the quality of the references and we can improve metric correlation by using better reference summaries.
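As an illustration, the snippet below computes a system-level Rouge-L score against the freelance writer references using Google's rouge-score package; the package choice is our assumption, and for simplicity it scores each system summary against a single writer reference per article rather than all three.

```python
# Illustrative system-level Rouge-L against the freelance writer references.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def system_rouge_l(system_summaries, writer_references) -> float:
    """Average Rouge-L F1 of one system, one writer reference per article."""
    scores = [
        scorer.score(ref, hyp)["rougeL"].fmeasure
        for hyp, ref in zip(system_summaries, writer_references)
    ]
    return sum(scores) / len(scores)
```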
5 Discussion
Implication for Model Development.
In this study, we systematically evaluate a diverse set of LLMs and find that instruction tuning contributes the most to LLMs’ summarization capability. We believe that there is much research beyond our benchmarking effort that needs to be done to better understand the effect of instruction tuning. Here we hypothesize three aspects that could account for the success of instruction tuning.
First, the quality of the summarization data used in instruction tuning can serve an important role. Our findings in Section 3 show that currently, we are finetuning language models on low-quality training data, which can account for their ineffectiveness. At this point, we cannot rule out the possibility that when finetuned on higher quality data, finetuned LMs may perform much better.
Second, the learning algorithm used for instruction tuning can be important (Ouyang et al., 2022). While the exact training details are unknown, the success of Instruct Davinci might be credited to “learning from human feedback” (LHF; Stiennon et al., 2020; Ziegler et al., 2019). Contrary to supervised finetuning that trains systems on written summaries, learning from human feedback trains systems from binary labels of human preferences. As we observe in Section 4.2, there is a discrepancy in how annotators write and rate summaries. While it is possible that LHF has merits over the supervised learning/finetuning approach in exploiting this discrepancy, more analysis is needed to validate this hypothesis.
Third, multi-task learning can be important. Instruct Davinci is trained on a diverse distribution of inputs and many previous studies have confirmed the effectiveness of multi-task learning. We look forward to understanding how summarization benefits from learning on other tasks.
Implication for Summarization Evaluation.
Our work also reveals the difficulties in evaluating high-performance LLMs. As LLMs become increasingly close to human-level performance, human evaluation requires a larger number of samples and less noisy measurements to evaluate the quality of LLMs. Recently, Liu et al. (2022a) also pointed out the difficulties in conducting human evaluation for summarization and advocated using fine-grained semantic units to match with reference summaries. However, as our evaluation points out, not only are the existing reference summaries unreliable but the summaries written by well-paid freelance writers also may not outperform LLM summaries significantly. Therefore, defining reference summaries as the ground truth may be overly restrictive as LLMs are approaching or even exceeding average human-level performance.
We acknowledge that summarization evaluation depends on the application scenario and that the existing reference summaries could be suitable in another context. For example, the bullet-point-style summaries in CNN/DM may suffice for display on news websites. The quality issues (such as coherence) we point out in this paper may not constitute a concern in specific application scenarios. However, we emphasize that research on single-document news summarization is often abstracted away from downstream applications and used to judge generic summarization capability. Our findings in this paper are tied to this research context, which is why the major results of our study rely on new summaries written by freelance writers.
Not only is human evaluation limited by reference quality, it is also affected by the subjectivity of evaluation. Individual variation shows that there are many acceptable ways to summarize, and individuals may even show different preferences at different points in time (writing vs. rating). Taken together, these factors suggest that we may have reached the limits of single-document news summarization evaluation. Existing benchmarks can still play a role in evaluating new models, but only if evaluation is done correctly. As LLMs improve, we believe that summarization can be better grounded in downstream applications where user values are better defined, so that annotators have a lower degree of freedom in balancing which quality aspects matter most to them.
Limitations.
Due to time constraints, this study has only evaluated systems on English news summarization where the summaries are designed to have around 50 words. We also acknowledge that as automatic systems improve, it becomes increasingly difficult for annotators to unambiguously rank summaries by quality due to differences in their individual preferences.
6 Conclusion
In this work, we conducted a comprehensive human evaluation of ten LLMs, across the two most popular news summarization benchmarks. Through our experiments, we find that the state-of-the-art LLM performs on par with summaries written by freelance writers, with instruction tuning being the key factor for success. Beyond these findings, our work highlights the crucial role of good reference summaries in both summarization model development and evaluation. Unless the reference quality issue is addressed, comparing zero-shot, few-shot, and finetuning performance will remain an open question, and the current benchmarks will provide limited value when used with reference-based evaluation. Even when we address the quality issue and conduct a human evaluation with high-quality references, we observe a significant amount of individual variation from our annotator pool. Due to these factors, evaluations for single document news summarization may be reaching their limits.
Acknowledgments
This work is supported by an Open Philanthropy grant and partially supported by a gift from Northrup Grumman. We thank the reviewers and editors for their comments, as well as the Stanford NLP group and the Stanford Center for Research on Foundation Models community for their feedback.
Notes
3. We note that the training details of instruction-tuned GPT-3 models may differ from those mentioned in the publication and are inferred by us based on the API naming scheme.
4. We recruited annotators who were previously vetted for an earlier study (Liang et al., 2022).
5. To compute agreement for coherence and relevance, we first binarize the Likert scores, with a score of 3 or above being mapped to 1.
6. We note that the 350M GPT-3 consistently generates empty outputs on the XSUM dataset, so we omit it from the human evaluation.
7. We conducted an initial study to pilot the instructions and found that instructing writers with a sentence limit often resulted in summaries that differed significantly in length.
8. Other annotators left during the course of the study due to changes in their freelance work schedules.
References
Author notes
Action Editor: Dan Goldwasser