SummEval: Re-evaluating Summarization Evaluation

The scarcity of comprehensive, up-to-date studies on evaluation metrics for text summarization and the lack of consensus regarding evaluation protocols continue to inhibit progress. We address the existing shortcomings of summarization evaluation methods along five dimensions: 1) we re-evaluate 12 automatic evaluation metrics in a comprehensive and consistent fashion using neural summarization model outputs along with expert and crowd-sourced human annotations, 2) we consistently benchmark 23 recent summarization models using the aforementioned automatic evaluation metrics, 3) we assemble the largest collection of summaries generated by models trained on the CNN/DailyMail news dataset and share it in a unified format, 4) we implement and share a toolkit that provides an extensible and unified API for evaluating summarization models across a broad range of automatic metrics, 5) we assemble and share the largest and most diverse (in terms of model types) collection of human judgments of model-generated summaries on the CNN/DailyMail dataset, annotated by both expert judges and crowd-sourced workers. We hope that this work will help promote a more complete evaluation protocol for text summarization as well as advance research in developing evaluation metrics that better correlate with human judgments.


Introduction
Text summarization aims to compress long document(s) into a short, fluent, and human-readable form that preserves the most salient information from the source document.
A standard dataset for training summarization models is the CNN/DailyMail corpus (Hermann et al., 2015), originally a question answering task, which was repurposed for summarization by Nallapati et al. (2016). The dataset consists of news articles and associated human-created bullet-point summaries. The ROUGE (Lin, 2004b) metric, which measures lexical overlap between generated and target summaries, is then typically used together with crowd-sourced human annotations for model evaluation. While the current setup has become standardized, we believe several factors prevent a more complete comparison of models, thus negatively impacting the progress of the field.
As noted by Hardy et al. (2019), recent papers vastly differ in their evaluation protocol. Existing work often limits model comparisons to only a few baselines and offers human evaluations which are largely inconsistent with prior work. Additionally, despite problems associated with ROUGE when used outside of its original setting (Liu and Liu, 2008;Cohan and Goharian, 2016) as well as the introduction of many variations on ROUGE (Zhou et al., 2006;Ng and Abrecht, 2015;Ganesan, 2015;ShafieiBavani et al., 2018) and other text generation metrics (Peyrard, 2019;Zhao et al., 2019;Zhang* et al., 2020;Scialom et al., 2019;Clark et al., 2019), ROUGE has remained the default automatic evaluation metric. We believe that the shortcomings of the current evaluation protocol are partially caused by the lack of easy-to-use resources for evaluation, both in the form of simplified evaluation toolkits and large collections of model outputs.
In parallel, there is an issue with how evaluation metrics are evaluated themselves. Many of the currently used metrics were developed and assessed using the Document Understanding Conference (DUC) and Text Analysis Conference (TAC) shared-task datasets (Dang and Owczarzak, 2008, 2009). However, it has recently been shown that these datasets contain human judgments for model outputs that score considerably lower than current summarization systems, putting into question the true performance of those metrics in the new setting (Peyrard, 2019).
We address these gaps in complementary ways: 1) We re-evaluate 12 automatic evaluation metrics in a comprehensive and consistent fashion using outputs from recent neural summarization models along with expert and crowd-sourced human annotations, 2) We consistently benchmark 23 recent summarization models using the aforementioned automatic evaluation metrics, 3) We release aligned summarization model outputs from 23 papers (44 model outputs) published between 2017 and 2019 and trained on the CNN/DailyMail dataset to allow for large-scale comparisons of recent summarization models, 4) We release a toolkit of 12 evaluation metrics with an extensible and unified API to promote the reporting of additional metrics in papers, 5) We collect and release expert, as well as crowd-sourced, human judgments for 16 model outputs on 100 articles over 4 dimensions to further research into human-correlated evaluation metrics. Code and data associated with this work are available at https://github.com/Yale-LILY/SummEval.

Related Work
Previous work examining the research setup of text summarization can be broadly categorized into three groups, based on the subject of analysis: evaluation metrics, datasets, and models.
Dealing with evaluation methods, Lin (2004a) examined the effectiveness of the ROUGE metric on various DUC tasks. The author concluded that evaluating against multiple references results in higher correlation with human judgments; however, a single-reference setting is sufficient for the metric to be effective. Owczarzak et al. (2012) studied the effects of inconsistencies in human annotations on the rankings of evaluated summarization systems. Results showed that system-level rankings were robust against annotation inconsistencies; however, summary-level rankings were not stable in such settings and benefited largely from improved annotator consistency. Rankel et al. (2013) analyzed the performance of different variants of the ROUGE metric using TAC datasets. The authors found that higher-order and less commonly reported ROUGE settings showed higher correlation with human judgments. In a similar line of work, Graham (2015) conducted a large-scale study of the effectiveness of different ROUGE metric variants and compared them against the BLEU metric on the DUC datasets. The results highlighted several superior, non-standard ROUGE settings that achieved strong correlations with human judgments on model-generated summaries. In (Chaganty et al., 2018) the authors investigated using an automatic metric to reduce the cost of human evaluation without introducing bias. Together with the study, the authors released a set of human judgments over several model outputs, limited to a small set of model types. Peyrard (2019) showed that standard metrics are in agreement when dealing with summaries in the scoring range found in TAC summaries, but vastly differ in the higher scoring range found in current models. The author reported that additional human annotations on modern model outputs are necessary to conduct a conclusive study of evaluation metrics. Hardy et al. (2019) underscore the differences in approaches to human summary evaluation while proposing a highlight-based reference-less evaluation metric. Other work has examined the problems with applying ROUGE in settings such as meeting summarization (Liu and Liu, 2008) and summarization of scientific articles (Cohan and Goharian, 2016). We build upon this line of research by examining the performance of several automatic evaluation methods, including ROUGE and its variants, against the performance of expert human annotators.
In relation to datasets, Dernoncourt et al. (2018) presented a detailed taxonomy of existing summarization datasets. The authors highlighted the differences in formats of available corpora and called for creating a unified data standard. In a similar line of research, Grusky et al. (2018) offered a thorough analysis of existing corpora, focusing their efforts on news summarization datasets. The authors also introduced several metrics for evaluating the extractiveness of summaries, which are included in the toolkit implemented as part of this work. Kryściński et al. (2019a) showed that news-related summarization datasets, such as CNN/DailyMail, contain strong layout biases. The authors revealed that datasets in the current format, where each news article is associated with a single reference summary, leave the task of summarization underconstrained. The paper also highlighted the problem of noisy, low-quality data in automatically collected news datasets.

Table 1: Example summaries with the corresponding averaged expert and crowd-sourced annotations for coherence, consistency, fluency, and relevance. Examples illustrate common problems found in generated summaries, such as ambiguous pronouns, incorrect references, and repetitive content. Expert annotations better differentiate coherence, consistency, and fluency among the examples when compared to the crowd-sourced annotations.
Looking into models, Zhang et al. (2018a) analyzed the level of abstraction of several recent abstractive summarization models. The authors showed that word-level extractive models achieved a similar level of abstraction to fully abstractive models. In (Kedzie et al., 2018) the authors examined the influence of various model components on the quality of content selection. The study revealed that in the current setting the training signal is dominated by biases present in summarization datasets, preventing models from learning accurate content selection. Kryściński et al. (2019b) investigated the problem of factual correctness of text summarization models. The authors concluded that the issue of hallucinating facts affects up to 30% of generated summaries and listed common types of errors made by generative models. Closely related to that work, Maynez et al. (2020) conducted a large-scale study of abstractive summarizers from the perspective of faithfulness. The authors reached similar conclusions, stating that improving factual faithfulness is a critical issue in summarization. The results also showed that currently available evaluation methods, such as ROUGE and BERTScore, are not sufficient to study the problem at hand. Durmus et al. (2020) and Wang et al. (2020) similarly examine faithfulness evaluation, both proposing question answering frameworks as a means of evaluating factual consistency.
Insights and contributions coming from our work are complementary to the conclusions of previous efforts described in this section. To the best of our knowledge, this is the first work in neural text summarization to offer a large-scale, consistent, side-by-side re-evaluation of summarization model outputs and evaluation methods. We also share resources that we hope will prove useful for future work in analyzing and improving summarization models and metrics.
Shortly before publishing this manuscript, a library for developing summarization metrics was released by Deutsch and Roth (2020). Our toolkit is complementary to their work, as their toolkit includes only 3 of our 12 evaluation metrics.

Evaluation Metrics and Summarization Models
We briefly introduce the metrics included in our evaluation toolkit as well as the summarization models for which outputs were collected at the time of releasing this manuscript. Extractive fragment coverage is the percentage of words in the summary that are from the source article, measuring the extent to which a summary is a derivative of the text. Density is defined as the average length of the extractive fragment to which each summary word belongs. Compression ratio is defined as the word ratio between the article and its summary: Compression(A, S) = |A| / |S|. In addition to these measures, we also include the percentage of n-grams in the summary not found in the input document as a novelty score and the percentage of n-grams in the summary which repeat as a redundancy score. For a comprehensive explanation of each metric, please refer to the corresponding paper.
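For illustration, the sketch below shows how the compression, novelty, and redundancy statistics can be approximated under simple whitespace tokenization; it is not the toolkit's implementation, and extractive coverage and density additionally require the greedy fragment matching of Grusky et al. (2018), which is omitted here.

```python
from collections import Counter

def tokens(text):
    # Naive whitespace tokenization; the released toolkit may tokenize differently.
    return text.lower().split()

def ngrams(toks, n):
    return [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]

def compression_ratio(article, summary):
    # Word ratio between the article and its summary.
    return len(tokens(article)) / max(len(tokens(summary)), 1)

def novel_ngram_fraction(article, summary, n=2):
    # Fraction of summary n-grams that never appear in the source article (novelty).
    article_ngrams = set(ngrams(tokens(article), n))
    summary_ngrams = ngrams(tokens(summary), n)
    if not summary_ngrams:
        return 0.0
    return sum(g not in article_ngrams for g in summary_ngrams) / len(summary_ngrams)

def repeated_ngram_fraction(summary, n=3):
    # Fraction of summary n-grams that occur more than once (redundancy).
    counts = Counter(ngrams(tokens(summary), n))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(c for c in counts.values() if c > 1) / total

article = ("The quick brown fox jumped over the lazy dog near the river "
           "bank early on Sunday morning before running back into the woods.")
summary = "The quick brown fox jumped over the dog. The quick brown fox jumped."
print(compression_ratio(article, summary))        # > 1 for compressive summaries
print(novel_ngram_fraction(article, summary, 2))  # higher = more abstractive
print(repeated_ngram_fraction(summary, 3))        # higher = more redundant
```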

Summarization models
We broadly categorize the models included in this study into extractive and abstractive approaches. For each model we provide a model code (M*) as well as a descriptive model name, which allows for easy matching with the released data. -Pointer Generator (See et al., 2017) propose a variation of encoder-decoder models, the Pointer Generator Network, in which the decoder can choose to generate a word from the vocabulary or copy a word from the input. A coverage mechanism is also proposed to prevent the model from repeatedly attending to the same part of the source document. M9 -Fast-abs-rl (Chen and Bansal, 2018) propose a model which first extracts salient sentences with a Pointer Network and rewrites these sentences with a Pointer Generator Network. In addition to maximum likelihood training, a ROUGE-L reward is used to update the extractor via REINFORCE (Williams, 1992).

Resources
We now describe the resources collected and released together with this manuscript.

Model Outputs
The model output collection contains summaries associated with 23 recent papers on neural text summarization described in Section 3.2. We obtained a total of 44 model outputs, as many papers include variations of the main model. All models were trained on the CNN/DailyMail news corpus, and the collected summaries were generated using the test split of the dataset. Outputs were solicited from the authors of papers to ensure comparability between results presented in this paper and those in the original works. They are shared publicly with the consent of the authors.

Evaluation Toolkit
The evaluation toolkit contains the 12 automatic evaluation metrics described in Section 3.1, consolidated into a Python package. The package provides a high-level, easy-to-use interface unifying all of the underlying metrics. For each metric, we implement both evaluate_example and evaluate_batch functions that return the metric's score at the example and corpus levels, respectively. Function inputs and outputs are also unified across all metrics to streamline multi-metric evaluation and result processing. The toolkit comes with a standard configuration resembling the most popular settings for each of the metrics to enable easy, out-of-the-box use. However, each metric can be further configured using external gin configuration files. We also provide a command-line tool to evaluate a summarization model with several metrics in parallel.
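A minimal usage sketch of this interface is given below; the import paths and class names (summ_eval.rouge_metric.RougeMetric, summ_eval.bert_score_metric.BertScoreMetric) are assumptions made for illustration, while the evaluate_example and evaluate_batch entry points follow the description above.

```python
# Hypothetical import paths and class names, shown for illustration only.
from summ_eval.rouge_metric import RougeMetric
from summ_eval.bert_score_metric import BertScoreMetric

summaries = ["the cat was found under the bed"]
references = ["the cat was hiding under the bed"]

# Each metric exposes the same two entry points described above.
for metric in (RougeMetric(), BertScoreMetric()):
    # Corpus-level scores; evaluate_example(summary, reference) works on a single pair.
    print(metric.evaluate_batch(summaries, references))
```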

Human Annotations
The collection of human annotations contains summary evaluations of 16 recent neural summarization models solicited from crowd-source and expert judges. Annotations were collected for 100 articles randomly picked from the CNN/DailyMail test set. To ensure high quality of annotations, each summary was scored by 5 crowd-source and 3 expert workers, amounting to 12,800 summary-level annotations. Model outputs were evaluated along the following four dimensions, as in Kryściński et al. (2019a):

Coherence - the collective quality of all sentences. We align this dimension with the DUC quality question (Dang, 2005) of structure and coherence, whereby "the summary should be well-structured and well-organized. The summary should not just be a heap of related information, but should build from sentence to sentence to a coherent body of information about a topic."

Consistency - the factual alignment between the summary and the summarized source. A factually consistent summary contains only statements that are entailed by the source document.

Fluency - the quality of individual sentences. Drawing again from the DUC quality guidelines, sentences in the summary "should have no formatting problems, capitalization errors or obviously ungrammatical sentences (e.g., fragments, missing components) that make the text difficult to read."

Relevance - selection of important content from the source. The summary should include only important information from the source document. Annotators were instructed to penalize summaries which were too long or contained redundancies and excess information.
The data collection interface provided judges with the source article and associated summaries grouped in sets of 5. Each group of summaries contained the reference summary associated with the source article in order to establish a common point of reference between groups. Summary grouping and the order within groups were randomized for each annotator. Judges were asked to rate the summaries on a Likert scale from 1 to 5 (higher is better) along the four mentioned dimensions.
Table 3: Human ratings of summaries along four evaluation dimensions, averaged over three expert annotators, broken down by extractive and abstractive models. The M* codes follow the notation described in Section 3.2. The three highest-rated models in each column are in bold.

Crowd-source annotators were hired through the Amazon Mechanical Turk platform. The hiring criteria were set to a minimum of 10,000 approved HITs and an approval rate of 97% or higher. Geographic constraints for workers were set to the United States, United Kingdom, and Australia to ensure that summaries were evaluated by native English speakers. Compensation was carefully calculated to ensure an average wage of 12 USD per hour. Gillick and Liu (2010) showed that summary judgments obtained through non-experts may differ greatly from expert annotations and could exhibit worse inter-annotator agreement. As a result, in addition to the hired crowd-source workers, we enlisted three expert annotators who have written papers on summarization either for academic conferences (2) or as part of a senior thesis (1). The expert annotators were asked to evaluate the same set of summaries under the same instructions as the hired crowd-source workers. For expert judgments, we proceeded with two rounds of annotation. In the second round, annotators were asked to check all examples for which their score on a dimension differed from that of another annotator by more than 2 points while the other annotators were within 1 point of each other. In cases where a score differed by more than 2 points but such a pattern did not exist, all annotators examined the annotation. The second round of annotation was carried out to correct any obvious mistakes as well as to confirm judgments and ensure a higher quality of annotations.
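One possible reading of this flagging rule is sketched below: an example is sent back for re-annotation when a single rating stands apart from the others while the remaining ratings agree with each other; the exact criterion applied during collection may have differed.

```python
import numpy as np

def needs_second_round(scores, outlier_gap=2, consensus_gap=1):
    # scores: the three expert ratings for one (summary, dimension) pair.
    # Flag the example if one annotator differs from another by more than
    # `outlier_gap` points while the remaining annotators agree within
    # `consensus_gap` points of each other.
    scores = np.asarray(scores, dtype=float)
    for i, s in enumerate(scores):
        others = np.delete(scores, i)
        if np.any(np.abs(others - s) > outlier_gap) and \
           (others.max() - others.min()) <= consensus_gap:
            return True
    return False

print(needs_second_round([5, 2, 2]))  # True: one outlier, the others agree
print(needs_second_round([5, 3, 1]))  # False: no consensus among the others
```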

Human Annotations
Considering the concerns raised in previous work (Gillick and Liu, 2010) about the quality differences between crowd-sourced and expert annotations, we study this issue using the human annotations collected as part of this work.
To evaluate the inter-annotator agreement of the collected crowd-source and expert annotations, we computed Krippendorff's alpha coefficient (Krippendorff, 2011). We found the inter-annotator agreement to be below an acceptable range, with interval-level alpha coefficients of 0.4920 and 0.4286 for the crowd-source workers and the first round of expert annotations, respectively. However, the second round of expert annotations improved the inter-annotator agreement, achieving an alpha coefficient of 0.7187.
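As an illustration, agreement of this kind can be computed from a rater-by-item matrix of Likert scores, for example with the third-party krippendorff package; the tooling and the toy data below are assumptions and do not reproduce the reported coefficients.

```python
import numpy as np
import krippendorff  # pip install krippendorff

# Rows are annotators, columns are (summary, dimension) items; values are 1-5
# Likert ratings, with np.nan marking items an annotator did not rate.
ratings = np.array([
    [3, 4, 2, 5, np.nan, 4],
    [3, 5, 2, 4, 3,      4],
    [2, 4, 1, 5, 3,      np.nan],
], dtype=float)

alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="interval")
print(f"Krippendorff's alpha (interval): {alpha:.4f}")
```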
To assess the similarity of annotations between the two groups of annotators, we averaged the assigned scores per example within each annotator group and computed the Pearson correlation coefficient between the resulting averages. The statistic returned a value close to 0, indicating no correlation between expert and crowd-source judges.
We also manually inspected the human annotations and present examples of annotated summaries, as well as the differences in human judgments, in Table 1. The first row shows a well-written, comprehensive summary. The high quality of the summary is reflected by the top scores assigned by expert annotators, while it was rated as only average by crowd-source workers. The second row shows a summary with ambiguous pronoun usage and factual inconsistencies. The errors result in a decrease in the coherence, consistency, and relevance scores in the expert annotations, but are not matched by a corresponding decrease in the crowd-worker annotations. The last row presents a factually correct summary that contains token and phrase repetitions. The errors were caught by the expert annotators, resulting in a low fluency score, while crowd-source annotators incorrectly classified them as issues with factual consistency. These examples again illustrate the disparities in how the evaluated dimensions were understood by the two groups of judges.
Results presented in this section highlight the difficulties of crowd-sourcing high-quality annotations and the necessity of protocols for improving human evaluation in text summarization.

Automatic Metrics
Many automatic metrics have been proposed for evaluating both summarization and other text generation models. However, the field lacks a comprehensive study that would offer a consistent side-by-side comparison of their performance. We address this issue with the following experiments.
In Table 2 we show the correlations between automatic metrics and human judgments. The statistics were computed using the available expert annotations to avoid possible quality problems associated with crowd-sourced ratings, as highlighted in the previous subsection. Automatic metrics were computed in a multi-reference setting, using the original reference summary included in the CNN/DailyMail dataset and 10 additional summaries coming from Kryściński et al. (2019a). We report correlations without differentiating between abstractive and extractive models, as most metrics did not exhibit large differences in correlation when reported separately. For completeness, we include correlation tables for a setting with 1 and 6 reference summaries with a separation by model type in the Appendix.
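Correlations of this kind can be computed per metric and per dimension from aligned score vectors, as in the sketch below; the numbers shown are illustrative placeholders rather than values from our annotations.

```python
import numpy as np
from scipy.stats import pearsonr, kendalltau

# Hypothetical per-summary scores: one automatic metric vs. averaged expert ratings
# for a single dimension (e.g., relevance). In the full study this computation is
# repeated for every metric / dimension pair.
metric_scores = np.array([0.31, 0.44, 0.27, 0.52, 0.38, 0.49])
expert_relevance = np.array([3.3, 4.0, 2.7, 4.7, 3.0, 4.3])

r, _ = pearsonr(metric_scores, expert_relevance)
tau, _ = kendalltau(metric_scores, expert_relevance)
print(f"Pearson r = {r:.3f}, Kendall tau = {tau:.3f}")
```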
Correlation results show several trends. We find that most metrics correlate most strongly with the relevance dimension, although the correlation strength can be classified as either weak or moderate. This finding follows intuition, as most metrics either explicitly calculate token overlap, which is seen as a proxy for relevance, or implicitly calculate overlap using token embeddings. Metric correlations decrease considerably across the other dimensions, with the notable exception of measures of extractiveness, such as the percentage of novel n-grams in the summary and the extractive coverage. Extractive coverage and the percentage of novel bi-grams in the summary correlate moderately with consistency, which shows how, within the current frameworks, abstraction may be at odds with faithfulness. The highest correlation for coherence is found when examining repeated n-grams, as repetition signals a lack of coherence. However, most metric correlations are considerably worse along this dimension as well as along fluency, suggesting that developing metrics to measure these dimensions is a necessary area of future work.
To examine the dependencies between different metrics, we computed Pearson's correlation coefficients, pairwise, between all metrics. Results are presented as a correlation matrix in Figure 1. Following intuition, we observe a strong correlation between all metrics that compute, implicitly or explicitly, the lexical overlap between generated and reference summaries. Metrics measuring n-gram novelty and repetitiveness show a weak negative correlation with all ROUGE-related metrics. Length as a feature is weakly correlated with most metrics apart from S3, which might suggest that this metric is biased towards longer summaries. Worth noting is also the weak correlation of SummaQA with all other evaluated metrics, which calls for additional investigation.
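Such a matrix can be derived directly from a table of per-summary metric scores, for example as sketched below with placeholder values.

```python
import pandas as pd

# Illustrative per-summary scores for a handful of metrics; the actual matrix
# covers all metrics included in the toolkit.
df = pd.DataFrame({
    "rouge_1":      [0.41, 0.35, 0.48, 0.30, 0.44],
    "rouge_l":      [0.38, 0.31, 0.45, 0.27, 0.40],
    "novel_bigram": [0.05, 0.22, 0.03, 0.30, 0.08],
    "length":       [55, 68, 49, 72, 60],
})

# Pairwise Pearson correlations between metrics, as visualized in the correlation matrix.
print(df.corr(method="pearson").round(2))
```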
Results presented in this section highlight the evaluation dimensions that are not reliably covered by currently available metrics and pave the way for future work in model evaluation.

Model Re-evaluation
We now turn to an analysis of model scores across human evaluations and automatic metrics. The evaluated models were released between 2017 and 2019, represent different approaches to summarization (abstractive, extractive, and hybrid), and their architectures reflect the trends in summarization research. Although in many cases we obtained multiple variants of the same model, in this study we focus on the versions with the highest ROUGE-L scores. Table 3 contains the results of human evaluation across the four dimensions described in Section 4.3. Scores for ground truth summaries are included as a point of reference. We find that pretrained models such as Pegasus, BART, and T5 consistently performed best on most dimensions. Notably, the mentioned models scored highest on consistency and fluency while obtaining lower scores for relevance and coherence. Scores for extractive models highlight the known shortcomings of such approaches, namely the lack of coherence of summaries and issues with selecting relevant content. Abstractive model ratings show an increasing trend with respect to the date of publication. This is a promising result as it suggests that the quality of models is improving with time. Worth noting is also the fact that reference summaries did not score well on consistency. Upon examination of the annotations, we found that the reference summaries often contained extraneous information, such as hyperlinks and click-bait descriptions of other articles. As this information was not present in the source document, the annotators interpreted it as hallucination and gave such examples a low consistency score.

Table 4 shows scores for model outputs across all automatic evaluation metrics. Parameters of the metrics used in this study can be found in the evaluation toolkit repository listed in Section 1. The results align with insights coming from the human evaluation of models. We found that for most metrics, the highest scores were assigned to large models pretrained on vast quantities of data. However, several metrics, such as S3, SummaQA, SMS, CHRF, and METEOR, tended to favor extractive models, assigning the highest scores to their outputs. Presented results provide a comprehensive perspective on the current state of the field and highlight directions for future work on summarization models.

1 The zero-shot model was used for evaluation. 2 Annotation will be updated with the best-performing model.

Conclusions
We introduced SummEval, a set of resources for research on summarization models and evaluation that includes: a collection of summaries generated by recent summarization models on the CNN/DailyMail dataset, an extensible and unified toolkit for summarization model evaluation, and a diverse collection of human annotations of model outputs collected from crowd-source and expert annotators. Using the accumulated resources, we re-evaluated a broad selection of current models and evaluation metrics in a consistent and comprehensive manner. We hope that this work will prove to be a valuable resource for future research on text summarization evaluation and models. We also encourage the research community to join our efforts by contributing model outputs and extending the evaluation toolkit with new metrics.

Appendix
In the pages which follow we provide the remaining tables for correlations between automatic metrics and human judgments across the four dimensions.

Table 7: Pearson correlations of automatic metrics with our expert annotations along four quality dimensions using 6 reference summaries. The five most-correlated metrics in each column are bolded. Excluded from this chart are metrics which only use the source document. Please see Table 2 for these correlations.