Benchmarking Large Language Models for News Summarization

Large language models (LLMs) have shown promise for automatic summarization, but the reasons behind their successes are poorly understood. By conducting a human evaluation on ten LLMs across different pretraining methods, prompts, and model scales, we make two important observations. First, we find that instruction tuning, not model size, is the key to LLMs' zero-shot summarization capability. Second, existing studies have been limited by low-quality references, leading to underestimates of human performance and lower few-shot and finetuning performance. To better evaluate LLMs, we perform a human evaluation over high-quality summaries we collect from freelance writers. Despite major stylistic differences such as the amount of paraphrasing, we find that LLM summaries are judged to be on par with human-written summaries.


Introduction
Large language models (LLMs) have shown promising results in zero-/few-shot tasks across a wide range of domains (Chowdhery et al., 2022; Bai et al., 2022; Brown et al., 2020; Zhang et al., 2022) and have raised significant interest in their potential for automatic summarization (Goyal et al., 2022; Liu et al., 2022a). However, the design decisions contributing to their success on summarization remain poorly understood, and while prior work has shown that LLMs outperform the prior state of the art, it remains unclear whether their outputs are comparable to those of human writers. Examining these questions is crucial for advancing future research in automatic summarization. To answer the first question, we perform a systematic evaluation of ten diverse LLMs with human evaluation on news summarization, and our evaluation identifies instruction tuning as the key to zero-shot summarization capability. In contrast, self-supervised learning alone cannot induce strong summarization performance in the zero-shot setting (Figure 1). In fact, even a 350M parameter instruction-tuned GPT-3 can perform on par with the 175B parameter GPT-3.
To benchmark LLMs, we evaluated on the standard CNN/DM (Hermann et al., 2015) and XSUM (Narayan et al., 2018) datasets, but found that low-quality reference summaries caused several issues. The reference summaries in these benchmarks are of such poor quality that human annotators judge them to be worse than the outputs of most automatic systems (Figure 1). When computing automatic metrics using these references, their poor quality reduces the correlation between metric results and human judgment. Not only does this make evaluation difficult, but it also degrades the performance of systems that take supervision either through finetuning or few-shot prompting, making comparison difficult.
To address the quality issues of reference summaries and better understand how LLMs compare to human summary writers, we recruit freelance writers from Upwork to re-annotate 100 articles from the test sets of CNN/DM and XSUM. Comparing the best performing LLM, Instruct Davinci, to the freelance writers, we find that the Instruct Davinci summaries are much more extractive. By manually annotating the summarization operations (Jing and McKeown, 2000) used in these summaries, we find that Instruct Davinci paraphrases much less frequently, although it is able to combine copied segments coherently.
Given their stylistic differences, we recruit annotators to compare the Instruct Davinci summaries to those written by freelance writers. On aggregate, we find that Instruct Davinci is rated as comparable to the freelance writers. However, analysis of individual annotators reveals that each annotator has a stable preference for either Instruct Davinci or the freelance writers, and these preferences vary across annotators.
Together, our work makes the following key contributions. First, we identify instruction tuning, rather than model scale, as the key to LLMs' summarization capability. Second, we show that the reference summaries used in XSUM are judged by humans to be worse than the best LLM-generated summaries. Third, to address the issue of low-quality references, we collect better quality summaries from freelance writers, and we show that the best LLM is rated as comparable to the Upwork freelance writers. In combination, these results call into question recent claims made about LLM summarization. In particular, summarization progress cannot be measured using reference-based metrics applied on XSUM. Furthermore, whether fine-tuned, few-shot, or zero-shot models perform better remains an open question due to the poor quality of training data. To encourage future work on improved evaluations, we release the high-quality summaries written by freelance writers and the evaluation data on 18 model settings and two datasets as resources.
Background and Related Work

News Summarization
News summarization is the task of producing a concise paragraph that captures the main points of a news article and has been a core problem within the field of automatic summarization (Radev et al., 2002; Rush et al., 2015; Nallapati et al., 2016; See et al., 2017; Chen and Bansal, 2018; Dong et al., 2019). In this work, we benchmark LLMs on news summarization to understand their potential for automatic summarization and focus on two popular news summarization benchmarks, CNN/DM (Hermann et al., 2015) and XSUM (Narayan et al., 2018).
These two benchmarks contain large-scale data, on the order of hundreds of thousands of summaries, but are created via "incidental supervision". CNN/DM includes articles from the CNN and DailyMail websites as the source articles and adapts the bullet-point highlights that come with the website articles as reference summaries. XSUM includes articles from BBC News and adapts the bolded sentence(s) that appear in the first paragraph as reference summaries. As a result, the reference summaries in these datasets are known to have quality issues (Maynez et al., 2020; Kang and Hashimoto, 2020), motivating us to address these defects to improve LLM evaluation.
To contextualize the performance of LLMs, we mainly compare to previous state-of-the-art approaches that leveraged supervised finetuning (Liu and Lapata, 2019; Lewis et al., 2019; Zhang et al., 2020; Liu et al., 2022b). Summarization evaluation is another active area of research. Many automatic metrics have been proposed (Lin, 2004; Zhang* et al., 2020; Sellam et al., 2020; Durmus et al., 2020; Maynez et al., 2020; Deutsch and Roth, 2021), but they do not always correlate with human evaluation of summarization systems (Fabbri et al., 2020; Durmus et al., 2022). In this work, we evaluate the effectiveness of automatic metrics for evaluating LLMs and show that the usefulness of reference-based evaluation is closely linked to the quality of the references.

Large Language Models
LLMs (Bommasani et al., 2021; Chowdhery et al., 2022; Brown et al., 2020) have two distinctive features over previous pretrained models. First, LLMs have much larger scale in terms of model parameters and training data. Second, unlike previous pretrained models that require finetuning, LLMs can be prompted zero-shot or few-shot to solve a task. In the zero-shot setting, prompting presents the LLM with inputs (e.g., news articles) and a natural language instruction (e.g., "summarize this news article in three sentences") and solicits outputs by having the LLM generate answers directly. Recently, instruction tuning has emerged as an effective way to improve LLM prompting performance (Sanh et al., 2021; Wang et al., 2022; Ouyang et al., 2022). In this approach, a diverse set of natural language processing tasks is reformulated into the prompting format, and the LLM's parameters are updated on these tasks either through supervised finetuning or reinforcement learning.
Recent work (Goyal and Durrett, 2020) shows that the instruction-tuned GPT-3 Davinci model is better than finetuned LMs but does not identify the design decisions that contribute to the improved performance. In our work, we carry out a more comprehensive benchmark of ten different LLMs to understand the effects of model scale, in-context learning, and instruction tuning. Given that automatic metrics may not be reliable, we focus on human evaluation as our benchmarking method.

Human Evaluation on News Summarization Benchmarks
In this section, we use human evaluation to systematically benchmark a diverse set of ten LLMs on news summarization. We observe that instruction tuning is the key to strong summarization capability and that the low-quality reference summaries in current benchmarks may underestimate few-shot or finetuning performance.

Experimental Setup
Data. We conduct our human evaluation on CNN/DM and XSUM by sampling a hundred examples from each validation set. For the few-shot in-context learning setting, we sample five examples from the training set as demonstration examples. Due to the limited context window, we sample five articles that are between 50 and 150 tokens in length according to the GPT-2 tokenizer. For XSUM, we find that uniform sampling occasionally results in articles that are unreadable due to data preprocessing, so we manually pick from the training set.

Model Details
We consider ten LLMs across different pretraining strategies and model scales. Table 1 lists the details of the LLMs we consider. Due to limited computational resources and model access, we benchmark all models in the five-shot setting but only benchmark three OpenAI GPT-3 models and three OpenAI instruction-tuned GPT-3 models in the zero-shot setting.
For CNN/DM, we solicit LLM summaries with the following prompt template: "Article: [article]. Summarize the article in three sentences. Summary:" For XSUM, we modify the prompt template to ask for a one-sentence summary to match the style of the reference summaries. For all LLMs we consider, we sample with temperature 0.3, following prior work (Wu et al., 2021).
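As a concrete illustration, the prompting setup can be sketched as follows. The template mirrors the one described in this section; the `query_llm` call mentioned in the comment is a hypothetical stand-in for whichever LLM API is being benchmarked, not an actual library function.

```python
def build_prompt(article: str, dataset: str) -> str:
    """Build a zero-shot summarization prompt in the style described above.

    CNN/DM asks for three sentences; XSUM asks for one sentence to match
    the single-sentence style of its reference summaries.
    """
    length = "three sentences" if dataset == "cnndm" else "one sentence"
    return f"Article: {article} Summarize the article in {length}. Summary:"

# Example usage (the article text is a stand-in):
prompt = build_prompt("A pink Hermes Birkin bag went on sale for £140,000.", "cnndm")
# The prompt would then be sent to the model with temperature 0.3, e.g.:
# summary = query_llm(prompt, temperature=0.3)  # hypothetical API call
```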
To contextualize our LLM benchmarking results, we also evaluate two state-of-the-art fine-tuned LMs: Pegasus (Zhang et al., 2020) and BRIO (Liu et al., 2022b). We decode the finetuned LMs using a beam size of 5, following prior work (Lewis et al., 2019). In addition, we also evaluate the existing reference summaries in the CNN/DM and XSUM validation sets.

Human Evaluation Protocol
We recruit annotators from Amazon Mechanical Turk, compensating them at the California minimum wage of $15.00/hr using conservative time estimates, as recommended by Whiting et al. (2019). Each model summary was evaluated by three annotators, and we report results based on their average score for each summary.
Our annotators evaluate each summary on three criteria: faithfulness, coherence, and relevance. We define these terms and collect data according to the guidelines in Fabbri et al. (2020). Coherence and relevance ratings are collected on a 1-to-5 Likert scale, while faithfulness is collected as a binary rating given its binary nature. Unlike Fabbri et al. (2020), we omit evaluating fluency because we find LLM outputs to be mostly fluent. The full annotation guidelines are included in our code release.

Evaluation Results
Table 2 presents the evaluation results. We now discuss two main observations. Instruction-tuned models have strong summarization ability. Across the two datasets and three aspects, we find that the zero-shot instruction-tuned GPT-3 models, especially Instruct Curie and Davinci, perform the best overall. Compared to the fine-tuned LMs (e.g., Pegasus), Instruct Davinci achieves higher coherence and relevance scores (4.15 vs. 3.93 and 4.60 vs. 4.40) on CNN/DM and higher faithfulness and relevance scores (0.97 vs. 0.57 and 4.28 vs. 3.85) on XSUM, which is consistent with recent work (Goyal et al., 2022). In contrast to instruction tuning, we find scale to be less important: even the largest 175B model often ignores the instruction and generates irrelevant content, while the much smaller Instruct Ada outperforms the 175B GPT-3 model on coherence and relevance.
In the five-shot setting, non-instruction-tuned LLMs can improve their summarization performance through in-context learning. For faithfulness scores on CNN/DM and coherence scores on XSUM, several non-instruction-tuned LLMs perform as well as the instruction-tuned LLMs. However, on other aspects, we still find the instruction-tuned LLMs to be better.
Reference summaries in current benchmarks are extremely low quality. We arrive at this conclusion based on two observations. First, most automatic summarization systems score better than the reference summaries across all three aspects. Second, applying in-context learning with the current reference summaries makes instruction-tuned models generate worse summaries. For example, on the XSUM dataset, after conditioning on five reference summaries, the faithfulness score of Instruct Davinci drops from 0.97 to 0.77.

Figure 2: Example summaries of an article (https://www.dailymail.co.uk/femail/article-3052124) from the CNN/DM dataset, generated by GPT-3 models (Section 3) or written by freelance writers (Section 4). The instruction-tuned GPT-3 model generates a much better summary than the non-instruction-tuned variant; the reference summary from CNN/DM is not coherent, whereas the freelance writer summary is both coherent and relevant.

Reference summary: Rare tote is so exclusive that it can't even be brought in store. Experts say it's an extremely good investment as only increases in value. Kim Kardashian and Victoria Beckham are both fans of Hermes.

0-shot GPT-3 Davinci summary: The most expensive handbag in the world is a pink Hermes Birkin bag, which has been put on sale for £140,000. The bag is so exclusive that it can't even be brought in store and experts say it's an extremely good investment. With the same amount of money that the handbag costs, you could splash out on an Aston Martin DB9 Volante. What is the main idea of the article?

0-shot Instruct Davinci summary: The article discusses a pink Hermes Birkin bag that is on sale for £140,000. The bag is made-to-measure and embellished with rubies. The article also discusses how the bag is a collector's item and an investment.

Freelance writer summary: Hermes is a brand rich in history and value, so much so that experts are saying their new £140,000 handbag is an extremely good investment. The bag, studded in rubies, is only the latest in Hermes Birkin's long line of highly expensive handbags for highly stylish women, which include celebrities and royalty.
The low-quality reference summaries make it difficult to compare LLMs to both fine-tuned models and humans. When comparing to fine-tuned models, their poor performance can be attributed to the low-quality references in the training data, so we may be underestimating finetuning performance. When comparing to humans, the low-quality references are not representative of human performance because they are created through heuristics. As a result, the differences between instruction-tuned LLMs and human performance are likely overstated in Table 3.
Qualitative Examples. Figure 2 showcases example summaries for an article from the CNN/DM validation set, comparing the summaries of zero-shot GPT-3 Davinci, instruction-tuned GPT-3 Davinci, and the CNN/DM reference summary.
We start by noting that the zero-shot GPT-3 model cannot follow the instruction to summarize well. After the summary paragraph, the model generates an additional question that is completely irrelevant. Beyond this failure of instruction following, the generated summary contains a factual error, stating that the handbag mentioned is the most expensive in the world, which contradicts the original article. In contrast, the instruction-tuned GPT-3 model generates a summary that is both faithful and coherent.
We also observe from Figure 2 that the reference summary is not coherent. The brand "Hermes" is not introduced until the end, and its connection to the rest of the story is unclear. This is unsurprising, as reference summaries in the CNN/DM dataset were originally bullet points accompanying the articles rather than coherent paragraphs.

Understanding Automatic Metrics
We compute five popular automatic metrics and measure their system-level correlations against human ratings: Rouge-L (Lin, 2004), METEOR (Banerjee and Lavie, 2005), BertScore (Zhang* et al., 2020), BLEURT (Sellam et al., 2020), and BARTScore (Yuan et al., 2021). The metrics behave differently on CNN/DM and XSUM, so we discuss the two datasets separately in the following paragraphs.
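For reference, Rouge-L scores a candidate against a reference by the length of their longest common subsequence (LCS). A minimal sketch of the F-measure variant, assuming simple whitespace tokenization (real implementations add stemming and other normalization):

```python
def lcs_length(a, b):
    """Length of the longest common subsequence, via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

def rouge_l(candidate: str, reference: str) -> float:
    """Rouge-L F1 between a candidate summary and one reference summary."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)
```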
For CNN/DM, we observe that the reference-based automatic metrics have a moderate correlation with some aspects of human judgment; e.g., Rouge-L has a 0.72 Kendall's tau correlation coefficient with relevance in Table 3. This level of correlation is comparable to that reported in Fabbri et al. (2020), which measures the correlation of automatic metrics when evaluating finetuned LMs and even earlier neural summarization systems. Therefore, we conclude that on CNN/DM automatic metrics can still provide a useful signal on relevance.
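System-level correlation treats each system's average metric score and average human rating as one paired observation. A minimal sketch of Kendall's tau over such pairs (the tau-a variant, assuming no tied scores; library implementations such as tau-b also handle ties):

```python
from itertools import combinations

def kendall_tau(metric_scores, human_scores):
    """Kendall's tau-a over paired system-level scores (assumes no ties)."""
    concordant = discordant = 0
    for (m1, h1), (m2, h2) in combinations(zip(metric_scores, human_scores), 2):
        if (m1 - m2) * (h1 - h2) > 0:       # pair ranked the same way by both
            concordant += 1
        elif (m1 - m2) * (h1 - h2) < 0:     # pair ranked oppositely
            discordant += 1
    n = len(metric_scores)
    return (concordant - discordant) / (n * (n - 1) / 2)

# Toy example: one (metric score, mean human rating) point per system;
# tau is 1.0 when the metric ranks the systems exactly as humans do.
```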
Studying the results more closely, we find that Rouge-L and human evaluation are more correlated when comparing within each model group. We plot Rouge-L against the relevance rating in Figure 3 as an example. First, we observe that Rouge-L still prefers finetuned LMs (green points at the top of the plots) to LLMs, consistent with prior work (Goyal et al., 2022). Despite this error, when comparing LLMs only with each other, we find that a Rouge-L difference larger than 0.05 usually translates to improved human evaluation.
On XSUM, the metrics have very low correlation with faithfulness and relevance, but this is also because the reference summaries are terrible in these aspects (Table 3; see also Maynez et al., 2020). With such low-quality references, we do not expect reference-based metrics to extract useful information.
Combining the results from the two datasets, we find that reference-based metrics correlate better with human judgments on the aspects for which the reference summaries themselves score better (e.g., CNN/DM relevance, XSUM coherence). This points to the important role of high-quality reference summaries for reference-based metrics, as previously observed in machine translation (Freitag et al., 2020).

Comparing the Best LLM to Freelance Writers
In Section 3, we saw that low-quality reference summaries make studying and benchmarking LLMs difficult. In this section, we address this by recruiting Upwork freelance writers to collect better quality summaries. With this data, we aim to answer two important questions. First, we would like to know whether the best LLM has reached human-level performance and how summaries written by the best LLM differ from those written by humans. Second, we want to understand how well reference-based metrics correlate with human judgments once we compute them with higher quality reference summaries.

Experimental Setup
In this section, we describe the process of recruiting summary writers and our summary writing instructions.
Data. For our study, we select 50 articles from each of the CNN/DM and XSUM evaluation sets described in Section 3.1 and assign each article to three writers. For XSUM, we use the full articles rather than the preprocessed version in which the first bolded sentence is removed.
Writer recruitment. We recruit six writers with previous experience writing blog posts, landing page introductions, or product descriptions from the freelance work platform Upwork. After a qualification round in which writers summarized five articles, we selected the best writers according to the faithfulness, coherence, and relevance of their summaries.
Through an initial pilot study, we estimate that the time required to summarize a CNN/DM or XSUM article is around 12 to 15 minutes. Therefore, we pay our writers $4 for every article they summarize, following recommended practice (Whiting et al., 2019). We based the assignments on writers' availability, with the most prolific writer summarizing 100 articles and the least prolific summarizing 35.
Summary writing instructions. We instruct our writers to summarize each article in around 50 words. To provide better task grounding, we ask the writers to summarize as if they were writing a newsletter to update their readers on the news. We release the full annotation guidelines along with our code release.
LLM Summary Generation. Recently, Liu et al. (2022a) showed that length is a confounding factor in summarization human evaluation. To control for this potential length confound, we modify the zero-shot prompt in Section 3.1 to elicit summaries of around 50 words, the same word limit given to the freelance writers. We found that the Instruct Davinci model consistently produces summaries that exceed a given word limit, so we intentionally prompt it with a 25-word limit to produce summaries with an average length of 50 words. With this new prompt, we generate the summaries using the same hyperparameters described in Section 3.1.
Quality Control. To verify the quality of the summaries written by freelance writers, we evaluate a random subset of 100 summaries with Mechanical Turk workers using the same annotation scheme as in Section 3.1 (results in Table 4). We see that the difference between the freelance writers and Instruct Davinci in this evaluation is small. Next, we carry out more targeted evaluations to compare the summaries written by freelance writers and Instruct Davinci.

Paired Comparison between LLM and Freelance Writers
Comparing Stylistic Differences. Despite the similar performance in our quality control study, we find that the LLM summaries and the freelance writer summaries have distinctive styles. Figure 2 shows an example summary written by a freelance writer. Compared to the LLM-generated summary, the freelance writer summary contains more paraphrasing and copies less from the article.
Figure 5: Human evaluation results comparing summaries written by freelance writers and summaries generated by Instruct GPT-3 Davinci. On aggregate, annotators equally prefer the freelance writers and Instruct Davinci. However, there is high variability in individual annotators' preferences. Notably, annotator 1 writes abstractive summaries but prefers the more extractive Instruct Davinci summaries. (Informativeness preference agreement: 0.32.)

To illustrate this stylistic difference, we measure two extractiveness measures, coverage and density, following Grusky et al. (2018). Coverage is defined as the percentage of words in the summary that are also present in the article; density is defined as the average length of the continuous text spans in the summary that are copied from the article. Our analysis shows that the coverage and density of the Instruct Davinci summaries are 0.92 and 12.1, whereas those of the writer-written summaries are 0.81 and 2.07. These measures show that the summaries generated by Instruct Davinci are highly extractive, whereas the summaries written by the freelance writers are much more abstractive.
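A simplified sketch of these measures, in the spirit of Grusky et al.'s greedy extractive-fragment matching; it assumes whitespace tokenization, and density here follows the per-word convention (sum of squared fragment lengths over summary length), so copied spans are weighted by their length:

```python
def extractive_fragments(article_toks, summary_toks):
    """Greedy fragment matching: at each summary position, take the
    longest matching article span; return the fragment lengths."""
    frags, i = [], 0
    while i < len(summary_toks):
        best = 0
        for j in range(len(article_toks)):
            k = 0
            while (i + k < len(summary_toks) and j + k < len(article_toks)
                   and summary_toks[i + k] == article_toks[j + k]):
                k += 1
            best = max(best, k)
        if best > 0:
            frags.append(best)
            i += best
        else:
            i += 1
    return frags

def coverage_density(article: str, summary: str):
    a, s = article.lower().split(), summary.lower().split()
    frags = extractive_fragments(a, s)
    coverage = sum(frags) / len(s)                 # fraction of copied words
    density = sum(f * f for f in frags) / len(s)   # avg copied-span length per word
    return coverage, density
```

A fully copied summary scores coverage 1.0 and high density, while a heavily paraphrased one scores low on both.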
To gain a fine-grained understanding of these stylistic differences, we manually analyze the distribution of "cut and paste" operations in these two sets of summaries. Jing and McKeown (2000) identify a set of "cut and paste" operations for reusing text from the article, including sentence reduction, sentence combination, syntactic transformation, lexical paraphrasing, and generalization or specification. On top of these operations, we additionally include a sentence copy operation to account for summary sentences that are directly copied from the article. Using this guideline, we manually annotate ten randomly sampled summary pairs written by Instruct Davinci and the freelance writers.
Figure 4 reports the distribution of the cut and paste operations, showing the fraction of sentences that contain each operation. First, we observe that the freelance writer summaries use lexical paraphrasing and generalization/specification much more frequently than the Instruct Davinci summaries. Because both operations often involve using novel words that are not present in the article, this matches the fact that the freelance writer summaries have lower coverage (0.81 vs. 0.92) than the Instruct Davinci summaries. Second, we find that sentence combination is a common strategy used by both the freelance writers and Instruct Davinci. Third, we find that the freelance writers never copy an entire sentence directly from the article, whereas Instruct Davinci does so frequently.
In conclusion, we find that Instruct Davinci summarizes in a very different style from human writers. We emphasize that the freelance writers wrote in an abstractive style even though we did not explicitly instruct them to do so, and we observe similarly abstractive styles across all six freelance writers.
Comparing Human Preference. We now return to our original goal of understanding whether LLM-generated summaries are on par in quality with human-written ones. In the following paragraphs, we discuss our annotation design and recruitment process.
We conduct a blinded pairwise comparison between the best LLM, Instruct Davinci, and the freelance writers, similar to the evaluation in Goyal and Durrett (2020). Besides selecting the better summary within each pair, annotators can judge the two summaries to be equally good. We release the full annotation instructions along with the code release for this project.
To compare the best LLM with the freelance writers, we annotate two aspects. First, we solicit annotators' overall preference, which balances multiple quality aspects such as faithfulness, coherence, and relevance. Second, we solicit a more targeted measure of informativeness by asking annotators to compare the number of facts in each summary. For the informativeness measure, we are motivated by the hypothesis that a more abstractive writing style can pack more information into a summary given the same word count. While it would also be interesting to compare summary coherence and relevance, we omit them because annotators were unable to differentiate these aspects from the overall preference in a pilot study.
For our recruitment process, we recruit five additional annotators through Upwork and retain one writer who participated in the previous round of summary writing. We carry out a qualification round and reject annotators whose ratings differ significantly from the authors' on a set of control questions for informativeness. We give each annotator the same set of 100 summary pairs, where the average lengths of the freelance writer summaries and the Instruct Davinci summaries are 53.2 and 52.0 words, respectively.
Figure 5 shows the results of the paired comparison. While we hypothesized that a more abstractive writing style leads to more informative summaries, we do not find a significant effect in our annotator pool, who rate the more abstractive summaries as more informative only 51.1% of the time. On the informativeness question, our annotators reached a moderate agreement (Krippendorff's alpha of 0.32), validating our annotation instructions and recruitment process. Moving on to the more subjective overall preference, we find that our annotators equally prefer the freelance writer summaries and the Instruct Davinci summaries. However, a closer analysis shows significant variability in individual annotators' preferences, and the inter-annotator agreement is low (Krippendorff's alpha of 0.07). This suggests that the quality of generated summaries is approaching that of the freelance writer summaries and that the comparison depends on each annotator's stylistic preference.
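The agreement figures above are Krippendorff's alpha, which compares observed to expected disagreement. A minimal sketch for nominal labels (library implementations also handle ordinal and interval data):

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(ratings):
    """Krippendorff's alpha for nominal labels.

    ratings: list of units, each a list of labels from the raters who
    rated that unit. Units with fewer than two ratings are ignored.
    """
    units = [u for u in ratings if len(u) >= 2]
    # Coincidence matrix: ordered label pairs within each unit, weighted
    # by 1/(m-1) so each unit contributes m pairable values.
    o = Counter()
    for u in units:
        m = len(u)
        for a, b in permutations(range(m), 2):
            o[(u[a], u[b])] += 1 / (m - 1)
    n_c = Counter()
    for (c, _), w in o.items():
        n_c[c] += w
    n = sum(n_c.values())
    d_obs = sum(w for (c, k), w in o.items() if c != k) / n
    d_exp = sum(n_c[c] * n_c[k] for c in n_c for k in n_c if c != k) / (n * (n - 1))
    return 1 - d_obs / d_exp if d_exp > 0 else 1.0
```

Perfect agreement yields alpha of 1.0; systematic disagreement can push alpha below zero.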
One example of such stylistic preference is seen in the results from annotator 1, who also participated in the first round of summary writing. Like the other writers, annotator 1 summarizes in an abstractive style (2.5 density and 0.86 coverage). However, annotator 1 prefers Instruct Davinci 57% of the time even though it generated much more extractive summaries. These results suggest an intriguing gap between annotators' preferences when writing and when evaluating summaries. (The other annotators from the writing round left during the course of the study due to changes in their freelance work schedules.)

Reevaluating Reference-based Metrics
In Section 3.3, we saw that the performance of automatic metrics may depend on the quality of the reference summaries. With the freelance writer summaries, we now conduct an initial study of the effect of using better quality references. We focus on Rouge-L for faithfulness evaluation on the XSUM dataset because the current reference summaries are known to be highly unfaithful (Maynez et al., 2020).
In Figure 6, we plot the system-level Rouge-L against the human ratings. The left plot shows the results of computing Rouge-L with the existing reference summaries from XSUM, which yields a negative correlation with human ratings. This matches our expectation because the existing reference summaries are highly unfaithful. On the right, we see the results of computing Rouge-L with the freelance writer summaries, which leads to a much more positive correlation. Hence, the usefulness of reference-based evaluation is closely linked to the quality of the references, and we can improve metric correlation by using better reference summaries.

Discussion
Implications for model development. In this study, we conduct a systematic evaluation of a diverse set of LLMs and find that instruction tuning contributes the most to LLMs' summarization capability. We believe that much research beyond our benchmarking effort is needed to better understand the effect of instruction tuning. Here we hypothesize three factors that could account for its success.
First, the quality of the summarization data used in instruction tuning can play an important role. Our findings in Section 3 show that we are currently finetuning language models on low-quality training data, which can account for their ineffectiveness. At this point, we cannot rule out the possibility that, when finetuned on higher quality data, finetuned LMs may perform much better.
Second, the learning algorithm used for instruction tuning can be important (Ouyang et al., 2022). While the exact training details are unknown, the success of Instruct Davinci might be credited to "learning from human feedback" (LHF; Stiennon et al., 2020; Ziegler et al., 2019). Contrary to supervised finetuning, which trains systems on written summaries, learning from human feedback trains systems from binary labels of human preferences. As we observe in Section 4.2, there is a discrepancy between how annotators write and how they rate summaries.
While it is possible that LHF has merits over the supervised learning/finetuning approach in exploiting this discrepancy, more analysis is needed to validate this hypothesis. Third, multi-task learning can be important. Instruct Davinci is trained on a diverse distribution of inputs, and many previous studies have confirmed the effectiveness of multi-task learning. We look forward to understanding how summarization benefits from learning on other tasks.
Implications for Summarization Evaluation. Our work also reveals the difficulties in evaluating high-performance LLMs. As LLMs become increasingly close to human-level performance, human evaluation requires larger sample sizes and less noisy measurements to assess the quality of LLMs. Recently, Liu et al. (2022a) also pointed out the difficulties of conducting human evaluation for summarization and advocated using fine-grained semantic units to match against reference summaries. However, as our evaluation points out, not only are the existing reference summaries unreliable, but even the summaries written by well-paid freelance writers may not significantly outperform LLM summaries. Therefore, defining reference summaries as the ground truth may be overly restrictive as LLMs approach or even exceed average human-level performance.
Human evaluation is limited not only by reference quality but also by subjectivity. Individual variation shows that there are many acceptable ways to summarize, and individuals may even show different preferences at different points in time (writing vs. rating). In combination, these factors suggest that we may have reached the limits of single-document news summarization. Existing benchmarks can still play a role in evaluating new models, but only if evaluation is done correctly. As LLMs improve, we believe that summarization can be better grounded in downstream applications where user values are better defined, so that annotators have less freedom in deciding which quality aspects matter most.

Conclusion
In this work, we conducted a comprehensive human evaluation of ten LLMs across the two most popular news summarization benchmarks. Through our experiments, we find that the state-of-the-art LLM performs on par with freelance writers, with instruction tuning being the key factor for its success. Beyond these findings, our work highlights the crucial role of good reference summaries in both summarization model development and evaluation. Unless the reference quality issue is addressed, comparing zero-shot, few-shot, and finetuning performance will remain an open question, and the current benchmarks will provide limited value when used with reference-based evaluation. Even when we address the quality issue and conduct a human evaluation with high-quality references, we observe a significant amount of individual variation in our annotator pool. Due to these factors, evaluations for single-document news summarization may be reaching their limits.

Figure 1 :
Figure 1: Selected annotator ratings of summary coherence on a 1 to 5 Likert scale.

Figure 4 :
Figure 4: Distributions of cut and paste operations in the summaries written by freelance writers and by Instruct Davinci. By comparison, human-written summaries contain more lexical paraphrasing and sentence reduction, whereas the Instruct Davinci summaries contain more direct copying from the article.

Figure 6 :
Figure 6: System-level Rouge-L vs. annotator ratings of faithfulness. The left plot is computed with XSUM references, where the correlation is weak, and the right plot is computed with the freelance writer summaries, where the correlation is much improved.

Table 2 :
Human evaluation results for zero-shot and five-shot LLMs, finetuned LMs, and reference summaries. We bold the entries that are not statistically significantly different from the best numbers in each column.

Table 3 :
System-level Kendall's tau correlation with human scores across different axes.

Table 4 :
Amazon Mechanical Turk evaluation results for the freelance writer summaries. Results for zero-shot Instruct Davinci and reference summaries are taken from Table 2 after averaging the corresponding ratings.