Abstract
Recently proposed BERT-based evaluation metrics for text generation perform well on standard benchmarks but are vulnerable to adversarial attacks, e.g., relating to information correctness. We argue that this stems (in part) from the fact that they are models of semantic similarity. In contrast, we develop evaluation metrics based on Natural Language Inference (NLI), which we deem a more appropriate modeling. We design a preference-based adversarial attack framework and show that our NLI based metrics are much more robust to the attacks than the recent BERT-based metrics. On standard benchmarks, our NLI based metrics outperform existing summarization metrics, but perform below SOTA MT metrics. However, when combining existing metrics with our NLI metrics, we obtain both higher adversarial robustness (15%–30%) and higher quality metrics as measured on standard benchmarks (+5% to 30%).
1 Introduction
Proper evaluation is key to fields such as machine learning and Natural Language Processing (NLP). Evaluation is particularly challenging for natural language generation (NLG) tasks, as there may be an infinitude of correct solutions (e.g., translations or summaries) for a given source text. While human evaluation is often considered the gold standard, it is slow and costly; thus, researchers resort to automatic evaluation. Previously, this was done using simple lexical overlap metrics such as BLEU and ROUGE, but these exhibit low correlations with human judgments, particularly for state-of-the-art NLG systems (Mathur et al., 2020a; Peyrard, 2019). Thus, a popular recent trend is to design automatic evaluation metrics based on large language models such as BERT and its many extensions (Zhang et al., 2020; Zhao et al., 2019; Sellam et al., 2020; Wan et al., 2022).
Nonetheless, these novel metrics also have key limitations. For example, Sai et al. (2021) and Kaster et al. (2021) show that they are not robust to various adversarial attacks including lexical overlap and factuality errors. Taking the currently most popular metric—BERTScore1 —as an example, this adversarial vulnerability is unsurprising. BERTScore computes the semantic similarity between a reference and a system output (the candidate), using a simplified token matching procedure. However, a good candidate is typically not appropriately identified by semantic similarity. For example, a candidate “5 Ukrainian soldiers wounded in Russia” is not an adequate translation of a source corresponding to the reference “50000 Russian soldiers killed in Ukraine”, although the two texts are of course semantically very similar.2 While there have been many attempts to improve BERTScore using better token matching, e.g., using Word Mover Distance (Zhao et al., 2019; Chen et al., 2020; Colombo et al., 2021), we argue that this line of research is a dead-end, as the underlying model of semantic similarity, originally proposed to address issues of lexical variation in BLEU/ROUGE, is simply not (fully) appropriate.
An intuitively more suitable idea to model evaluation metrics is via natural language inference (NLI) (Dagan et al., 2013). For example, in reference-based settings, in which candidates are compared to human references, a candidate is intuitively good if it is equivalent to a human reference via the concept of bi-implication. NLI systems are also promising alternatives because NLI is one of the most researched upstream tasks in NLP, where a lot of emphasis has been placed on concepts such as biases, generalization and adversarial conditions (Poliak et al., 2018; Utama et al., 2020).
In this paper, we ask whether we can directly use pre-trained NLI models as evaluation metrics, thereby establishing a new paradigm (but with predecessors, as indicated in §2). Our contributions:
We design: a novel preference-based adversarial test suite for MT and summarization metrics. Our adversarial benchmark does not need human annotators, is suitable for reference-free (where the candidate is directly compared to the source text, without human reference) and reference-based evaluation, and is challenging: e.g., BLEU, ROUGE, MoverScore, and BERTScore perform below or at random level.
We explore: (i) how NLI metrics can be induced from existing NLI models; (ii) how they perform on benchmark and adversarial datasets, across (iii) two NLG problems, MT and summarization.
We show: (iv) NLI metrics perform particularly well in summarization, but below standard metrics in MT. (v) They substantially outperform existing metrics on our adversarial attacks (e.g., ∼30%–50% margin over the best unsupervised standard metric in MT). (vi) Combining existing metrics with our NLI metrics yields both better (+5%–30%) and more robust metrics (+15%–30%).
We point out that some current metrics already leverage NLI systems—so using NLI for evaluation is not in itself new—but they do so indirectly and thus (we argue) inadequately: e.g., MoverScore (Zhao et al., 2019) leverages BERT representations fine-tuned on NLI, and Mathur et al. (2019) train (pre-BERT) NLI-inspired architectures on MT datasets. In contrast, we show that by directly leveraging NLI systems, much better adversarial and standard benchmark performances can be obtained. We call our novel metrics MENLI (MEtrics from NLI).3
2 Related Work
Our work connects to evaluation metrics and NLI.
Evaluation Metrics for NLG
In the last few years, researchers have come up with a plethora of different BERT-based metrics for varying tasks and setups: e.g., for MT and summarization, reference-based trained (Sellam et al., 2020; Rei et al., 2020a) and untrained approaches (Zhao et al., 2019; Zhang et al., 2020) have been suggested, and the same is true for reference-free setups, where both supervised (Ranasinghe et al., 2020) and unsupervised metrics have been explored (Zhao et al., 2020; Song et al., 2021; Belouadi and Eger, 2023). In our work, we consider both reference-based and reference-free metrics. The two setups differ in important ways: Reference-free setups are more challenging, as they require comparing texts in different languages (in MT) or of vastly different lengths (in summarization). On the other hand, they are more ‘resource-efficient’, take humans out of the loop, and promise web-scale evaluation. The two setups also differ in terms of NLI. For example, while reference-based approaches require equivalence between reference and hypothesis, the concept of equivalence is not always appropriate in reference-free situations (e.g., in summarization, source and summary are intuitively not equivalent; rather, the source should entail the summary).
To realize metrics, different high-level approaches have been suggested as we outline in Table 1 (e.g., metrics from semantic similarity, from text generation or from question answering). There are also some predecessor works on metrics from NLI which we discuss below.
| Concept | Examples |
|---|---|
| Semantic Similarity | BERTScore, MoverScore, BaryScore, ... |
| Text Generation | BARTScore, PRISM (Thompson and Post, 2020) |
| Question Answering | QAEval (Deutsch et al., 2021) |
| NLI | MENLI |
Robustness of Evaluation Metrics
has been a central issue of recent interest: Sai et al. (2021) test metrics across several CheckList (Ribeiro et al., 2020) inspired templates, finding that most common standard metrics are not robust even to simple perturbations. Kaster et al. (2021) probe metrics in an adversarial setting with lexical overlap, finding that they can be fooled by text that has high lexical overlap but low semantic similarity (indicating that the proposed BERT-based metrics are not even good models of semantic similarity). We combine the approaches of Sai et al. (2021) and Kaster et al. (2021): While Sai et al. (2021) use human crowd-workers to evaluate robustness, Kaster et al. (2021) use a simpler preference-based setup, which does not need human annotators. We will also use the preference-based setup, but our attacks are largely inspired by Sai et al. (2021).
More recently (contemporaneously with us and after the first arXiv submission of our work), several other papers have explored the robustness of recent evaluation metrics. For example, He et al. (2022) develop stress test suites according to potential errors arising from certain choices of metric design and the pretrained language models used, showing that metrics are biased towards their underlying models—e.g., BARTScore assigns higher scores to texts generated by the models of the metric itself.4 Karpinska et al. (2022) explore the sensitivity of MT metrics to errors of different categories (regarding semantics, syntax, and morphology) and severity, using a preference-based setting; they show that recent metrics like BERTScore dramatically outperform lexical overlap-based metrics such as BLEU and ROUGE, mostly obtaining over 95% accuracy in their experiments. Our setup differs from those of Karpinska et al. (2022) and He et al. (2022) in the tasks considered, the preference specifications, the results, and the solutions proposed. Karpinska et al. (2022) only evaluate metrics for MT, while we consider both MT and summarization. They design their preferences in such a way that recent metrics appear quite robust, while our more elaborate preferences expose their weak spots much better. Finally, we propose solutions (e.g., metrics from NLI) to address the lack of robustness. Like us, He et al. (2022) also consider summarization and MT. Instead of designing preferences, however, they manually introspect how metric scores change as various perturbations are introduced. In this way, they expose blind spots of metrics. As remedies, they suggest combining heterogeneous metrics to shield against varying blind spots (without performing concrete experiments)—we show that combining metrics with NLI-based metrics yields additional robustness.
Finally, Rony et al. (2022) develop RoMe as a robust metric in the context of semantic similarity, fluency and grammatical variability. They evaluate it on an adversarial dataset with five phenomena (entity, adjective and random word replacement; as well as text transformation and passive forms) by correlating against human judgments. Their model is a rather complicated trained metric leveraging semantic and grammatical features—we compare to it in §6.
NLI
NLI is one of the core upstream tasks in the NLP community. Due to its popularity, NLI has been investigated in-depth, where researchers found that trained models often overfit to low-level statistical cues instead of learning generalizable concepts of logical relationships between sentences (Poliak et al., 2018; Gururangan et al., 2018). As a consequence, many approaches to improve generalization have been investigated (e.g., Belinkov et al., 2019; Utama et al., 2020; Zhou and Bansal, 2020). We argue that a high-quality NLI model would be an excellent candidate for an evaluation metric and explore this in this work.
Like us, Mathur et al. (2019) note the similarity of (MT) evaluation and logical equivalence via NLI. They design supervised MT metrics leveraging different pre-BERT architectures, including one from the NLI community called ESIM (Chen et al., 2017) (which performs on par with an LSTM with attention in their experiments). Thus, in contrast to us, they do not leverage NLI models out-of-the-box as evaluation metrics but only fine-tune an NLI-inspired architecture on human scores from MT. MoverScore (Zhao et al., 2019) fine-tunes BERT on NLI, which leads to better metrics; thus, they, too, use NLI only indirectly. Dušek and Kasner (2020) use NLI to evaluate hallucinations and omissions in reference-free data-to-text generation scenarios. They do not compare to any other metrics and do not consider NLI as a general paradigm for evaluation metrics. While the summarization community uses NLI models for consistency evaluation (Fabbri et al., 2021; Laban et al., 2022), to our knowledge, we are the first to verify the usefulness of NLI systems as general evaluation metrics against a range of strong competitors, both in standard evaluation and adversarial attack settings.
3 Adversarial Setup
Following Sai et al. (2021) and others, we consider an array of adversarial attacks on evaluation metrics—we motivate our attacks from the perspective of errors committed by real text generation systems below. In contrast to Sai et al. (2021) and similar to the later-published work of Karpinska et al. (2022), we implement a preference-based setup, which does not need human annotators. The advantages of the preference-based setup are (i) lower cost (e.g., no annotation costs), which (ii) can be especially relevant for non-English languages (e.g., in ref-free situations for MT) and (iii) allows adversarial evaluation at larger scale, yielding more robust estimates of performance. The challenge of the preference setup is to cleverly determine the text pairs to compare.
The outcome of the preferences in Eq. (1) depends on how we choose candadv and candpara, which we describe below. In general, a challenging test suite has candadv maximally similar to ref/src, but with a key error. In contrast, candpara should be maximally dissimilar to ref/src (e.g., on the surface level) but meaning-equivalent. Table 2 illustrates the general structure of our adversarial test suite.
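For concreteness, a minimal sketch of how accuracy under this preference setup can be computed (function and variable names are ours, not taken from the released code):

```python
# Minimal sketch of the preference-based protocol. A metric "defends" a
# test case if it scores the meaning-preserving paraphrase above the
# adversarial candidate that contains a key error.

def preference_accuracy(metric, test_cases):
    """metric(anchor, cand) -> float, where anchor is ref (ref-based)
    or src (ref-free); test_cases: list of (anchor, cand_para, cand_adv)."""
    wins = sum(
        metric(anchor, cand_para) > metric(anchor, cand_adv)
        for anchor, cand_para, cand_adv in test_cases
    )
    return wins / len(test_cases)
```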
candadv
To obtain candadv, we consider the following attacks (nine regarding information adequacy/correctness in candidates and three regarding text fluency), which we deem (to a large degree) representative of errors in different NLG tasks:
Addition: We randomly add a noun after an existing one and connect them with “and”. For example, “I love dogs” → “I love dogs and cats.”
Omission: We use the framework of Sai et al. (2021) to randomly drop ∼1%–20% words in the sentence.
Mismatch: We consider mismatching nouns, verbs, and adjectives, which can lead to misunderstanding of an entity, an action, and the speakers’ emotion, respectively. Following Chen et al. (2021), we replace a specific word having the POS tag of noun/verb/adjective with another word having the same POS tag randomly selected from our collected words for that POS tag.
Negation: We use the perturbation tool of Ribeiro et al. (2020) to add/remove negations to/from the verb for generating candadv with contrary claims.
Number error: We replace all numbers (except for those related to dates) in the sentence with random numbers in the same format (e.g., integer to integer, decimal to decimal).
Pronoun error: We replace all pronouns in the sentence with other ones without causing syntax errors (e.g., “he” to “she” and “us” to “them”).
Name error: We use the tool of Ribeiro et al. (2020) to replace exactly one name with a random one of the same gender.
Fluency: We also include three phenomena from Sai et al. (2021) to examine metrics’ robustness against attacks on text fluency: (i) Jumbling word order: Randomly shuffle the word order in a sentence. (ii) Spelling error: Add a typo to a word in a sentence. (iii) Subject-verb disagreement: Make the subject and verb disagree (e.g., “He like dogs.”).
For ref-based metrics, we apply the perturbation templates to ref to construct candadv. In contrast, for ref-free MT metrics, we first translate the source src using Google Translate to a translation r and then perturb r to obtain candadv. We introduce r to increase the similarity of candadv to src; e.g., we assume that Google Translate translates more literally, i.e., closer to word-by-word translations, than human translators. This may be important to construct challenging test cases, cf. §6 and our above discussion. For ref-free summarization, we apply the perturbation templates to a document r which is maximally similar to src; details follow.
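To make the templates concrete, here is a minimal sketch of two of them (number error and pronoun error); the actual test suite uses the tooling cited above (Sai et al., 2021; Ribeiro et al., 2020), and the regex and pronoun dictionary below are simplifying assumptions (e.g., the date exception for number error and decimals are omitted):

```python
import random
import re

def number_error(text: str) -> str:
    """Replace every integer with a random integer of the same number of digits
    (simplified: dates and decimals are not treated specially here)."""
    def repl(match):
        digits = len(match.group())
        low, high = 10 ** (digits - 1), 10 ** digits - 1
        return str(random.randint(low, high))
    return re.sub(r"\d+", repl, text)

def pronoun_error(text: str) -> str:
    """Swap pronouns with counterparts of the same grammatical case
    so that no syntax errors are introduced (simplified dictionary)."""
    swaps = {"he": "she", "she": "he", "him": "her", "her": "him",
             "us": "them", "them": "us"}
    tokens = text.split()
    return " ".join(swaps.get(t.lower(), t) for t in tokens)

print(number_error("50000 Russian soldiers killed in Ukraine"))
print(pronoun_error("She told us about him"))
```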
candpara
We use different ways to obtain candpara, because different kinds of paraphrases may yield more/less difficult test cases for metrics. We will analyze this in §6.
In particular, we use data from (1) PAWS (Zhang et al., 2019), (2) PAWS-X (Yang et al., 2019), (3) WMT20-news-commentary-v15 German-to-English (Mathur et al., 2020b) to generate candpara for MT evaluation metrics, and (4) SummEval for summarization metrics. A summary with attributes is shown in Table 3.
| dataset | task | ref-based | ref-free | candpara | #examples |
|---|---|---|---|---|---|
| PAWSori | MT | yes | no | ORI | 2,000 |
| PAWSback | MT | yes | no | BACK | 2,000 |
| XPAWSx | MT | yes | yes | ORI | 455–474 |
| WMT20de | MT | yes | yes | BACK | 200 |
| SEadv | SUM | yes | yes | BACK | 199 |
(1) PAWS contains sentence pairs created by word swapping and backtranslation, labeled as (non-)paraphrases by human raters. From sentence pairs labeled as paraphrase, we derive two datasets for ref-based evaluation metrics:
PAWSori: We take the first sentence of a PAWS sentence pair as ref and the second as candpara.
PAWSback: We use the first sentence of a PAWS sentence pair as ref and generate candpara based on ref using backtranslation (we use German as the pivot language) except for number error, for which we replace the numbers in ref with the corresponding words, using the Python library num2words.
(2) PAWS-X is the multilingual version of PAWS, which includes PAWS sentence pairs in six languages, translated from English PAWS, allowing us to generate test suites for both ref-free and ref-based metrics. We use the first sentence in PAWS-X (e.g., German) as src and the second sentence with the same ID in English PAWS as ref. We select the data for two closer language pairs: German-to-English and French-to-English, and two more distant language pairs: Chinese-to-English and Japanese-to-English. Accordingly, we create 4 datasets: XPAWSde, XPAWSfr, XPAWSzh, and XPAWSja, each of which contains src (first sentence of X-PAWS pair in source language), ref (first sentence of English PAWS pair), and candpara (second sentence of English PAWS pair).
(3) WMT20-news-commentary-v15 contains sentence pairs of source and human reference. From this, we create WMT20de, directly taking the source and reference sentences as src and ref. We obtain candpara as in the case of PAWSback.
(4) SummEval (Fabbri et al., 2021) contains documents and references from CNN DailyMail (CNNDM) (Hermann et al., 2015), with 10 additional human references. We rank the 11 references using ROUGE-L (Lin, 2004) and use the reference r with highest ROUGE score to generate candadv for ref-free setting, while the remaining 10 references serve as ref. We refer to the adversarial dataset induced from SummEval as SEadv in the remainder. We obtain candpara as in the case of PAWSback.5
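For the number-error test cases of PAWSback, WMT20de, and SEadv, candpara thus spells out the numbers in ref, keeping it meaning-equivalent while lexically dissimilar from candadv; a minimal sketch using num2words (the regex is a simplification and ignores decimals and thousands separators):

```python
import re
from num2words import num2words

def spell_out_numbers(ref: str) -> str:
    """Replace integers in the reference with their word form to obtain
    candpara for the number-error phenomenon (simplified sketch)."""
    return re.sub(r"\d+", lambda m: num2words(int(m.group())), ref)

print(spell_out_numbers("50000 Russian soldiers killed in Ukraine"))
# fifty thousand Russian soldiers killed in Ukraine
```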
Real-world Motivation of Attacks
Modern text generation systems are prone to many of the errors we investigate in this work. For example, Freitag et al. (2021a, b, 2022) show, based on fine-grained human error annotations (Lommel et al., 2014), that translations generated by state-of-the-art MT models still contain many accuracy-related errors (e.g., addition and omission of information, inappropriately informal pronouns) and sometimes even fluency-related errors (e.g., wrong spelling). Negation handling is also frequently discussed as an issue of modern MT systems (Bentivogli et al., 2016; Sennrich, 2017; Hossain et al., 2020; Tang et al., 2021). In summarization, system summaries are often factually inconsistent with source documents in terms of numbers, named entities, assigning quotations to a particular person, etc. (Falke et al., 2019; Kryscinski et al., 2020; Chen et al., 2021). More generally, hallucination (of which addition/mismatches/etc. may be considered special cases) is a particularly worrisome limitation of recent large language models (Ji et al., 2022). In Table 4, we show selected system translations from real MT systems with specific errors (following WMT MQM annotations) that are very similar to the ones we consider.6 The frequency of errors may differ across source-target language pairs (e.g., depending on their language distance) and formal/informal contexts. For example, when translating Chinese to English for news, names are often directly translated to their Pinyin format (see the 4th row) instead of their official translations; in contrast, this rarely happens in English-to-German translation. But even for such closely related languages, NLG systems may omit information, or choose wrong pronouns or mismatching nouns, particularly when a word has multiple senses.
4 Experimental Setup
4.1 Evaluation Metrics
We explore a large array of recent state-of-the-art transformer based metrics, summarized in Table 5. The variants used are briefly introduced below; further details (e.g., model checkpoints and implementation) can be found on our Github.
| Task |  | Metrics |
|---|---|---|
| MT | ref-based | MoverScore (Zhao et al., 2019), BERTScore (Zhang et al., 2020), BARTScore (Yuan et al., 2021), SentSim (Song et al., 2021), COMET (Rei et al., 2020b), BLEURT (Sellam et al., 2020) |
|  | ref-free | COMET, SentSim, XMoverScore (Zhao et al., 2020) |
| Summarization | ref-based | BARTScore, DiscoScore (Zhao et al., 2023), MoverScore, BERTScore |
|  | ref-free | BARTScore, SUPERT (Gao et al., 2020) |
We report BERTScore F1 employing a RoBERTa-large model. For MoverScore, we use the unigram variant with a BERT-base model fine-tuned on MNLI (Williams et al., 2018). We use two variants of BARTScore (Precision and F1) for ref-based MT and summarization and BARTScore-FN (FN stands for Faithfulness) for ref-free summarization. We consider two variants of XMoverScore with different remapping strategies for multilingual embeddings (CLP, UMD) and two variants of SentSim with different word matching paradigms (BERTScore, WMD). We report the DiscoScore variant with feature ‘Focus Frequency’.
4.2 Datasets & Evaluation Protocol
We summarize our used datasets in Table 6. To evaluate the metrics’ robustness under adversarial conditions, we use the datasets introduced in §3 and additionally Rank19 (Falke et al., 2019) (only for ref-free summarization), which contains examples composed of documents paired with one correct and one incorrect candidate summary with real-world factuality errors. In general, we check the metrics’ preference between the two candidates and calculate accuracy: the relative frequency with which the metrics correctly choose between the two alternatives.
| Task |  | Datasets |
|---|---|---|
| MT | segment-level | WMT15-17, WMT20-21 |
|  | system-level | WMT20-21 |
|  | adversary | ref-based: PAWSori/back, WMT20de, XPAWSde; ref-free: XPAWSde/fr/zh/ja, WMT20de |
| Summarization | summary-level | RealSumm (Bhandari et al., 2020) |
|  | system-level | RealSumm, SummEval |
|  | adversary | SEadv, Rank19 (Falke et al., 2019) (ref-free only) |
On MT standard benchmarks, we evaluate the metrics on both segment-level (where we correlate metrics scores to human judgments for individual sentences/segments in the datasets) and system-level (where we correlate the average metric scores to the average human scores over the segments generated by each system), using Pearson correlation as the performance indicator. On SummEval for summarization, we compute Kendall correlation with system-level human judgements on four criteria: coherence, consistency, fluency and relevance (we apply two aggregation methods for the multi-reference setting, max and mean). We calculate Pearson correlation with both summary-level (analogous to segment-level in MT) and system-level LitePyramids (Shapira et al., 2019) human ratings in RealSumm.
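A minimal sketch of the two correlation protocols using scipy (variable names are illustrative):

```python
import numpy as np
from scipy.stats import kendalltau, pearsonr

def segment_level(metric_scores, human_scores):
    """Segment-level: correlate per-segment metric scores with human judgments."""
    return pearsonr(metric_scores, human_scores)[0]

def system_level(metric_scores, human_scores, system_ids, corr=pearsonr):
    """System-level: average per system first, then correlate.
    Use corr=kendalltau for the SummEval protocol described above."""
    systems = sorted(set(system_ids))
    sys_metric = [np.mean([m for m, s in zip(metric_scores, system_ids) if s == sid])
                  for sid in systems]
    sys_human = [np.mean([h for h, s in zip(human_scores, system_ids) if s == sid])
                 for sid in systems]
    return corr(sys_metric, sys_human)[0]
```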
4.3 NLI as a Metric
NLI systems yield probability distributions over Entailment, Contradiction, and Neutral. We denote the probability values as e, c, and n, where e + c + n = 1 and e, c, n ≥ 0. We first determine how to leverage the three values as NLI metrics.
To do so, we evaluate five simple formulas of their arithmetic combination in a heuristic way: (1) e, (2) -c, (3) e-n, (4) e-c, and (5) e-n-2c, and inspect their effect in three directions, which correspond to the entailment directions implication, reverse implication, and bi-implication: (i) ref/src→cand, where ref or src acts as premise and cand as hypothesis; (ii) ref/src←cand, where cand acts as premise and ref or src as hypothesis; and (iii) ref/src↔cand, the arithmetic average over the two above cases.
For example, to obtain e-n in the direction ref/src↔cand, we first average the three probability scores over the directions ref/src→cand and ref/src←cand, then calculate e-n based on the averaged scores. We only consider the direction src→cand for ref-free summarization, since the candidate summary does not need to entail the source document. The various selections of formulas and directions result in 15 pooling strategies for NLI-based metrics.
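A minimal sketch of these pooling strategies, assuming an nli(premise, hypothesis) function that returns the probabilities (e, n, c):

```python
# Sketch of the 15 pooling strategies (5 formulas x 3 directions).
# `nli` is assumed to return (e, n, c) probabilities for (premise, hypothesis).

FORMULAS = {
    "e":      lambda e, n, c: e,
    "-c":     lambda e, n, c: -c,
    "e-n":    lambda e, n, c: e - n,
    "e-c":    lambda e, n, c: e - c,
    "e-n-2c": lambda e, n, c: e - n - 2 * c,
}

def nli_metric(nli, anchor, cand, formula="e", direction="bi"):
    """anchor is ref (ref-based) or src (ref-free)."""
    fwd = nli(anchor, cand)   # anchor -> cand
    bwd = nli(cand, anchor)   # anchor <- cand
    if direction == "fwd":
        e, n, c = fwd
    elif direction == "bwd":
        e, n, c = bwd
    else:  # bi-implication: average the probabilities over both directions
        e, n, c = [(a + b) / 2 for a, b in zip(fwd, bwd)]
    return FORMULAS[formula](e, n, c)
```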
NLI Systems
We explore both monolingual and cross-lingual NLI-based metrics. For each setup, we choose two NLI models, which are either obtained from Hugging Face or fine-tuned by ourselves.
For monolingual NLI metrics, we choose (1) a RoBERTa-large model (Liu et al., 2019) fine-tuned on SNLI (Bowman et al., 2015), MNLI, Fever (Nie et al., 2019), and ANLI (Nie et al., 2020) by Nie et al. (2020) and (2) a DeBERTa-large model fine-tuned by He et al. (2021), using MNLI. We denote the NLI metrics induced from these two models as NLI-R and NLI-D. They will be used for ref-based MT evaluation as well as both ref-based and ref-free summarization evaluation. Note that, while NLI-R has been fine-tuned on adversarial NLI (ANLI), which has been shown to increase robustness on (for example) negation and numerical reasoning, NLI-D has not been trained on ANLI. Cross-lingual NLI metrics must handle premises and hypotheses in different languages, so we select the multilingual versions of the underlying models of NLI-R/NLI-D: (1) we fine-tune an XLM-RoBERTa-base model (Conneau et al., 2019), using the datasets for fine-tuning NLI-R as well as the XNLI dataset (Conneau et al., 2018); (2) we select an mDeBERTa-base model fine-tuned on MNLI and XNLI. We denote the corresponding cross-lingual NLI metrics as XNLI-R and XNLI-D.
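A sketch of obtaining (e, n, c) from a Hugging Face sequence-classification checkpoint; the checkpoint name below is a placeholder (the checkpoints actually used are listed on our Github), and the label order must be read off the respective model card:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

CKPT = "some-org/roberta-large-nli"  # placeholder, not a real checkpoint name

tokenizer = AutoTokenizer.from_pretrained(CKPT)
model = AutoModelForSequenceClassification.from_pretrained(CKPT)
model.eval()

def nli(premise: str, hypothesis: str):
    """Return (e, n, c) probabilities for one premise-hypothesis pair."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1).squeeze(0)
    # Assumes the config provides named labels ("entailment", ...); some
    # checkpoints use generic labels, in which case map indices manually.
    id2label = {i: lab.lower() for i, lab in model.config.id2label.items()}
    by_name = {id2label[i]: probs[i].item() for i in range(len(probs))}
    return by_name["entailment"], by_name["neutral"], by_name["contradiction"]
```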
5 Experiment Results
Pooling Strategy
We determine the pooling strategy for NLI metrics in MT evaluation from (1) the accuracy on the adversarial datasets and (2) the correlation with human judgments on the standard (segment-level) MT datasets. We leverage the winning frequency of the pooling strategies to choose the best one; a strategy wins if it works best for an NLI metric among all 15 strategies. Overall, we find that the simple formula e in the direction src/ref↔cand is a good choice which works well for both standard and adversarial benchmarks, even though slightly better formulas could be chosen in selected subsettings (e.g., ref-based vs. ref-free evaluation); see Table 7 for examples.
(a) Reference-based

|  | e | -c | e-n | e-c | e-n-2c |
|---|---|---|---|---|---|
| ref→cand | 3+0 | 3+0 | 2+0 |  |  |
| ref←cand |  |  |  |  |  |
| ref↔cand | 0+4 | 0+3 | 0+1 | 0+2 |  |

(b) Reference-free

|  | e | -c | e-n | e-c | e-n-2c |
|---|---|---|---|---|---|
| src→cand | 2+0 |  |  |  |  |
| src←cand | 0+1 | 0+2 |  |  |  |
| src↔cand | 0+1 | 4+6 | 4+0 |  |  |
For summarization, the situation is slightly more complex: (1) e-c from direction ref←cand performs best for ref-based NLI metrics; (2) -c from direction src→cand is the best strategy for ref-free NLI metrics. Thus, we compare NLI metrics adopting these strategies with classic metrics.
Note that our method of identifying the pooling strategies above leveraged the data on which we later evaluate the NLI metrics (even though we only looked at global aggregate statistics). To avoid leaking information from the test set, in §6 we additionally evaluate NLI metrics on each dataset with the pooling strategy selected from the remaining datasets for that task.
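A minimal sketch of this leave-one-out selection (simplified to one winning strategy per dataset; the paper counts wins per NLI metric and dataset):

```python
from collections import Counter

def select_pooling_strategy(winners, held_out):
    """winners: dict mapping dataset name -> the (formula, direction) strategy
    that won on that dataset; held_out: the dataset to be evaluated.
    Returns the strategy winning most often on the remaining datasets."""
    counts = Counter(strategy for dataset, strategy in winners.items()
                     if dataset != held_out)
    return counts.most_common(1)[0][0]
```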
5.1 Machine Translation
5.1.1 Adversarial Evaluation
We now compare our NLI metrics with the best pooling strategy to our baseline metrics under adversarial conditions.
From Table 8 (columns “Adv.”), we observe that in the ref-based setup: (1) NLI metrics outperform the great majority of metrics by a huge margin: over 85% vs. 32%–78% (all phenomena) and 92% vs. 27%–80% (adequacy phenomena only) on average. (2) Further, the two NLI metrics perform similarly. In the ref-free setup, the best cross-lingual NLI metric (XNLI-D) is still most robust under our attacks. However, NLI metrics do not as substantially outperform the other metrics as in the ref-based setup. A potential reason is that the cross-lingual NLI models underperform compared to the monolingual setup (the preferences we query for in the reference-free setup may also play a role). Nevertheless, when excluding the fluency-related phenomena from the adversarial datasets, XNLI-D is still on average 10 points better than the best standard metric, COMET (86% vs. 75%).
| Metric | Adv. ref-based: all | Adv. ref-based: adeq. | Adv. ref-free: all | Adv. ref-free: adeq. | MT ref-based: seg | MT ref-based: sys | MT ref-free: seg | MT ref-free: sys |
|---|---|---|---|---|---|---|---|---|
| Supervised |  |  |  |  |  |  |  |  |
| COMET | 67.4 | 67.0 | 76.8 | 74.5 | 0.676 | 0.808 | 0.620 | 0.698 |
| BLEURT | 74.8 | 79.8 | – | – | 0.708 | 0.807 | – | – |
| Unsupervised |  |  |  |  |  |  |  |  |
| sentBLEU | 32.9 | 27.2 | – | – | 0.380 | 0.757 | – | – |
| Rouge | 34.3 | 28.7 | – | – | 0.425 | 0.774 | – | – |
| MoverScore | 48.3 | 46.9 | – | – | 0.567 | 0.806 | – | – |
| XMoverS(UMD) | – | – | 74.5 | 71.7 | – | – | 0.400 | 0.672 |
| XMoverS(CLP) | – | – | 73.8 | 70.9 | – | – | 0.422 | 0.673 |
| BERTS | 65.3 | 60.9 | – | – | 0.620 | 0.799 | – | – |
| BARTS-P | 67.4 | 64.2 | – | – | 0.587 | 0.761 | – | – |
| BARTS-F | 78.4 | 77.8 | – | – | 0.593 | 0.802 | – | – |
| SentS(BERTS) | 68.1 | 67.8 | 62.7 | 65.5 | 0.612 | 0.401 | 0.421 | −0.021 |
| SentS(WMD) | 62.1 | 61.9 | 63.0 | 65.8 | 0.607 | – | 0.427 | – |
| NLI-based |  |  |  |  |  |  |  |  |
| X(NLI)-R | 85.0 | 92.1 | 70.5 | 75.8 | 0.451 | 0.756 | 0.221 | 0.335 |
| X(NLI)-D | 86.6 | 92.3 | 79.3 | 85.8 | 0.439 | 0.770 | 0.149 | 0.581 |
Moreover, our results reveal that: (1) most standard metrics are particularly incapable of detecting name error, number error, and pronoun error (∼29%–70%); (2) standard metrics, especially BLEURT and COMET, are most competitive regarding omission, addition, and jumbling (∼80%–100%); (3) NLI metrics are suboptimal for fluency attacks (mostly at random level), especially the reference-free NLI metrics; and (4) NLI metrics are much better at name error, negation, number error, pronoun error, and adj. mismatch than most of the other metrics, especially ref-based (>90% vs. ∼10%–80%), as shown in Figure 1.
Our observations are inconsistent with Karpinska et al. (2022), where the state-of-the-art MT metrics mostly obtain >95% accuracy in the preference-based evaluation. The reason is that our test suites are much more difficult for the evaluation metrics because we challenge them by lexical overlap between source/reference and candidate sentences during attacks: Metrics must choose between high lexical overlap adversarial candidates (with key errors) over low lexical overlap paraphrases. In contrast, in Karpinska et al. (2022), metrics are challenged to assign correct preferences for score(ref, t) vs. score(ref, t′) where t is a candidate and t′ the perturbed candidate. This is a much easier comparison because neither are ref and t maximally dissimilar (but meaning equivalent) nor are ref and t′ maximally similar. This is an important lesson: How to design the adversarial preferences may critically affect the assessment of whether recent metrics are robust or not.
5.1.2 Standard Benchmarks
Ref-based
We give average results over all datasets in Table 8 (columns ‘MT’; individual results are available in our Github). For segment-level evaluation, we observe: (1) trained metrics (COMET and BLEURT) substantially outperform the others, with an average performance of ∼0.7 Pearson. (2) Unsupervised SOTA metrics have an average correlation of ∼0.6 Pearson; BERTScore is the best among them. (3) Our NLI-based metrics are not competitive, with correlations of ∼0.45 Pearson. When correlating with system-level human judgments, NLI metrics still underperform most of the SOTA metrics, but the margin is much smaller.
Ref-free
Trained metrics also dominate in segment-level evaluation (>0.6 Pearson), whereas the two NLI-based metrics perform much worse than the others (0.15-0.22 Pearson). Nevertheless, XNLI-D performs on par with COMET and better than the others on WMT20 at system-level.
Overall, we conclude that our NLI metrics are not competitive with state-of-the-art evaluation metrics on standard MT datasets, especially at segment-level and ref-free.
5.1.3 Combined Metrics
We illustrate the performance of the combined evaluation metrics with (X)NLI-R on both adversarial and standard benchmarks (segment-level) in Figure 2; the results for (X)NLI-D and for system-level are similar. The x-axis denotes the average accuracy over the adversarial datasets, while y-axis is the average Pearson correlation over the standard benchmarks (MT datasets). Each dot in each graph shows the value C(wnli) for a specific weight wnli. As seen from Figure 2, the graphs show an intriguing concave curvature. In standard MT evaluation, the combination boosts the metric performance when wnli is small (from 0.1 to 0.4) in virtually all cases. We then see a simultaneous increase of adversarial robustness and quality on standard benchmarks. In ref-based setup, e.g., for wnli = 0.2, we observe: (1) MoverScore and BARTScore-P improve most, with ∼8% (from 0.57/0.59 to 0.61/0.64 Pearson, respectively) and 21%–36% improvements on adversarial datasets (from 48%/67% to 66%/82% accuracy on average). (2) The best unsupervised metric on segment-level MT, BERTScore, increases ∼4% Pearson on standard benchmarks and ∼24% accuracy on adversarial datasets. (3) The most robust untrained metric, BARTScore-F, improves about ∼11% in robustness, whereas its performance on standard benchmarks also rises ∼5%. (4) The improvements on MT for trained metrics are smaller compared to those untrained metrics, with COMET improving only 1.5% and BLEURT even becoming worse with the choice wnli = 0.2. However, their performance in defending adversarial attacks still improves ∼10%–20%. In ref-free setups, all metrics improve ∼6%–7% on adversarial datasets. Such setting only substantially boosts XMoverScore’s performance on standard benchmarks, with ∼6%–9%.
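For reference, C(wnli) is the combined metric obtained by linearly interpolating a standard metric with an NLI metric on min-max-normalized scores (cf. the Rescaling paragraph in §6); a minimal sketch with illustrative names:

```python
import numpy as np

def minmax(scores):
    """Min-max normalize a batch of metric scores to [0, 1]
    (assumes the batch is not constant)."""
    s = np.asarray(scores, dtype=float)
    return (s - s.min()) / (s.max() - s.min())

def combined_metric(metric_scores, nli_scores, w_nli):
    """C(w_nli): interpolate a standard metric with an NLI metric.
    Both arguments hold per-segment scores for the same batch of candidates."""
    return w_nli * minmax(nli_scores) + (1 - w_nli) * minmax(metric_scores)
```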
We summarize the improvements for all combinations in Figure 3(a), which are averages over all experiments considered here. We observe that the line denoting improvements on standard benchmarks peaks at wnli = 0.2, and the average improvements are positive when wnli ≤ 0.5. Further, on the adversarial datasets, the improvement increases monotonically with wnli; the gain is a concave function of wnli which saturates as wnli becomes larger. The sweet spots are wnli ∈ [0.2,0.3], which lead to 5%–6% improvement on standard benchmarks and 14%–16% improvement in adversarial robustness on average. When excluding the fluency phenomena from the adversarial datasets, the combined metrics consistently gain larger improvements in adversarial robustness, with 20%–24% improvements at the sweet spots.
5.2 Summarization
Evaluation
As Table 9 shows, similar to MT evaluation, NLI-based metrics exhibit much stronger robustness under adversarial conditions (our best NLI metrics have at least ∼8 points higher accuracy than the best standard metrics; right-most columns). The difference is that the vanilla NLI metrics are now also comparably effective to the SOTA metrics on standard benchmarks. For instance, in the ref-based setup, NLI-D with max aggregation beats all metrics except for DiscoScore with mean on SummEval, and both NLI metrics correlate highly with system-level human ratings in RealSumm (above 0.8 Pearson), where most standard metrics obtain only 0.5–0.7 Pearson correlations. When considering all evaluation dimensions of SummEval and RealSumm, NLI-D outperforms all other metrics, followed by NLI-R. Moreover, we observe that NLI metrics correlate much better with human judgments regarding consistency and (somewhat surprisingly) fluency in SummEval compared to the other metrics. For the ref-free setup, BARTScore-FN performs best on SummEval—it outperforms the other metrics by over 0.1 Kendall on average. However, it does not correlate well with either summary-level or system-level human judgments in RealSumm. NLI metrics are comparable to or better than standard metrics at system-level. For example, NLI-R performs best among the examined metrics and is about 0.06 Pearson better than the best standard metric (SUPERT) at system-level in RealSumm. Nevertheless, as in MT, the reference-free NLI metrics perform worse than the reference-based ones; an obvious bottleneck for the two NLI metrics is that they were only trained on NLI data with short sentences, whereas reference-free summarization evaluation requires metrics to deal with source documents containing many more sentences.
(a) Reference-based (columns: SummEval criteria with mean/max aggregation; RealSumm LitePyramid summary-/system-level; adversarial SEadv all/adequacy-only)

| metric | coherence (mean) | coherence (max) | consistency (mean) | consistency (max) | fluency (mean) | fluency (max) | relevance (mean) | relevance (max) | avg (mean) | avg (max) | litePyr (sum) | litePyr (sys) | SEadv (all) | SEadv (adeq.) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BLEU | 0.294 | 0.279 | 0.044 | −0.029 | 0.244 | 0.229 | 0.397 | 0.382 | 0.245 | 0.215 | 0.480 | 0.124 | 0.182 | 0.109 |
| Rouge | 0.191 | 0.176 | 0.088 | −0.279 | −0.037 | −0.081 | 0.118 | 0.103 | 0.090 | −0.020 | 0.540 | 0.457 | 0.185 | 0.117 |
| MoverS | 0.206 | 0.324 | 0.456 | 0.103 | 0.421 | 0.362 | 0.368 | 0.515 | 0.363 | 0.326 | 0.585 | 0.501 | 0.287 | 0.251 |
| BERTS | 0.618 | 0.618 | 0.221 | 0.044 | 0.273 | 0.185 | 0.603 | 0.515 | 0.429 | 0.340 | 0.574 | 0.380 | 0.598 | 0.574 |
| BARTS-P | 0.485 | 0.441 | 0.176 | −0.044 | 0.376 | 0.185 | 0.500 | 0.368 | 0.385 | 0.237 | 0.478 | 0.531 | 0.697 | 0.692 |
| BARTS-F | 0.515 | 0.647 | 0.206 | 0.250 | 0.317 | 0.450 | 0.529 | 0.632 | 0.392 | 0.495 | 0.583 | 0.687 | 0.788 | 0.792 |
| DiscoS | 0.676 | 0.279 | 0.279 | 0.676 | 0.539 | 0.554 | 0.632 | 0.353 | 0.532 | 0.466 | −0.199 | −0.066 | 0.334 | 0.294 |
| NLI-based |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| NLI-R | 0.147 | 0.074 | 0.632 | 0.676 | 0.494 | 0.450 | 0.279 | 0.206 | 0.388 | 0.352 | 0.525 | 0.856 | 0.864 | 0.905 |
| NLI-D | 0.250 | 0.265 | 0.706 | 0.750 | 0.568 | 0.613 | 0.471 | 0.397 | 0.499 | 0.506 | 0.489 | 0.840 | 0.806 | 0.843 |

(b) Reference-free (columns: SummEval criteria; RealSumm LitePyramid summary-/system-level; adversarial SEadv all/adeq., Rank19, and avg)

| metric | coherence | consistency | fluency | relevance | avg | litePyr (summary) | litePyr (system) | SEadv (all) | SEadv (adeq.) | Rank19 | Adv. avg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| BARTS-FN | 0.735 | 0.132 | 0.391 | 0.662 | 0.480 | 0.178 | −0.023 | 0.427 | 0.389 | 0.796 | 0.612 |
| SUPERT | 0.147 | 0.603 | 0.465 | 0.279 | 0.374 | 0.522 | 0.626 | 0.296 | 0.273 | 0.668 | 0.482 |
| NLI-based |  |  |  |  |  |  |  |  |  |  |  |
| NLI-R | 0.221 | 0.235 | 0.391 | 0.500 | 0.337 | 0.300 | 0.688 | 0.720 | 0.722 | 0.866 | 0.793 |
| NLI-D | 0.162 | 0.647 | 0.332 | 0.324 | 0.366 | −0.076 | 0.568 | 0.624 | 0.629 | 0.885 | 0.755 |
Combined Metrics
In Figure 3(b), we summarize the median improvements of the combined summarization metrics (the median smooths some outliers). In contrast to MT, the combination brings almost equal benefits on standard benchmarks and on the adversarial benchmarks restricted to adequacy phenomena—we again observe a decrease in adversarial improvements when adding our fluency phenomena. We identify a best wnli, namely 0.8, with which the standard metrics gain about 25%–30% improvement in both types of performance (adversarial and standard).
6 Discussion & Analysis
Selected Failure Cases of Metrics:
Table 10 shows selected failure cases of four popular metrics (BERTScore, BARTScore, BLEURT, COMET), where the NLI metrics are correct in each case. In the examples, BERTScore prefers text with the wrong gendered pronoun over a legitimate paraphrase and even trained metrics like BLEURT fail on severe name changes such as “Melissa” (a person name) vs. “Mali” (a country name). Leveraging more subtle cases (e.g., mismatches based on wrong word senses instead of random mismatches with the same POS or replacing names with names of the same ‘type’) would likely constitute even harder test cases for future metrics.
No Metric is Good Everywhere:
Across distinct dimensions, different metrics perform differently, indicating that they capture varying aspects. For example, NLI metrics are not so good on fluency adversarial attacks, e.g., typos. This may be unsurprising, given that fluency is a low-level phenomenon while NLI concerns high-level logical relationships between sentences (some fluency phenomena would best be treated by switching to a lower-level representation space, such as character-level [Vu et al., 2022]; this could seamlessly be integrated in existing NLI models). The NLI metrics are also weaker concerning segment-level MT evaluation on standard benchmarks. However, NLI metrics alone perform surprisingly well: In ref-based MT, they win on 7 out of 19 dimensions (12 adversarial phenomena and 7 standard datasets, evaluated segment- and system-level), only beaten by BLEURT (8 wins); ref-free, they win 5 out of 19 dimensions, second only to COMET (11 wins). In ref-based summarization, they are clearly ahead of all standard metrics, winning not only 8 out of 12 adversarial dimensions, but also system-level LitePyramid, consistency and fluency (thus, 11 out of 18 wins), clearly ahead of BARTScore-P (4 of 18); ref-free, they are also best and win 13 out of 18 dimensions. The best overall metrics, measured as average performance over standard and adversarial datasets, always include NLI: for ref-based MT, this is BLEURT+0.2 ×NLI-R, for ref-free MT, it is COMET+0.3 ×NLI-D. For summarization, NLI-R alone and combined with BARTScore-F perform best on average.
Rescaling:
The min-max normalization we use for metric combination (a standard technique for normalizing data in machine learning, typically applied to input features) requires batch processing. It is necessary to account for the different ranges of metrics, e.g., some metrics take negative values. An alternative would be to enforce more formal constraints on evaluation metrics, i.e., that their outputs lie in [0,1]. When applying our combined metrics in practice, one could also replace them by surrogate metrics trained on the outputs of the original combined metrics, or simply take the min-max values inferred from the datasets already evaluated on—the larger these datasets, the more reliably min and max are estimated.
Sensitivity to wnli:
Having different weights wnli for different tasks is undesirable, because it requires considering each task individually. However, in our experiments, we found that all small wnli (below 0.5) yield good performances and are thus safe choices: They increase adversarial robustness and also lead to better metrics on standard benchmarks.
Adversarial Performance vs. Standard Performance:
From our experiments, it might seem that adversarial and standard performance are anti-correlated: A metric with higher adversarial performance may have lower performance on standard benchmarks and vice versa. While this would not necessarily be a major surprise as adversarial conditions oftentimes test phenomena that are otherwise not represented in standard benchmarks (Niven and Kao, 2019), a statistical analysis reveals that standard performance generally positively correlates to the adversarial performance in our case, consistent with our earlier argument that existing NLG systems in the real world do commit similar errors as we check for. To do so, we first convert the metrics’ standard performance to rankings for each performance category (e.g., ref-based/-free segment/system-level MT performance, performance on SummEval/RealSumm), then we correlate the ranking-based standard performance to the corresponding adversarial performance rankings, obtaining 0.37 Spearman. When excluding NLI metrics, the correlation increases to 0.60.
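A minimal sketch of this ranking-based analysis (shown per performance category; the rankings from several categories can be concatenated before correlating to obtain a single coefficient):

```python
from scipy.stats import rankdata, spearmanr

def rank_correlation(standard_scores, adversarial_scores):
    """standard_scores / adversarial_scores: per-metric performance numbers
    within one category (e.g., ref-based segment-level MT)."""
    return spearmanr(rankdata(standard_scores), rankdata(adversarial_scores))[0]
```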
The Choice of candpara Matters:
As indicated in §3, we speculate that a good adversarial setting maximizes the (surface) dissimilarity between ref and candpara (which can better trick the metrics). To investigate, we compute the normalized edit distance between ref and candpara;7 a larger edit distance means a greater dissimilarity. If our assumption is true, then larger edit distances represent harder test cases for the metrics. We find: (1) the average edit distance for the test cases where the metrics fail to defend against the adversarial attacks is 0.01–0.6 larger than for the cases where they succeed, averaged over metrics; (2) for PAWSback and PAWSori (both induced from PAWS), where candpara is obtained in different ways, all metrics achieve 0.02–0.15 lower accuracy on PAWSori, which in turn has a 0.46 larger average edit distance than PAWSback. Both findings confirm our assumption. In addition, we observe that NLI metrics have the smallest difference between the edit distances for failure and success cases (0.01–0.26) as well as the smallest difference between the accuracy on PAWSback and PAWSori (0.02) among all evaluation metrics. This implies that they are least affected by surface overlap and instead better consider the logical relationship between sentences. This is what makes them attractive as evaluation metrics.
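The exact normalization is an implementation detail; a minimal sketch, assuming character-level Levenshtein distance divided by the length of the longer string:

```python
def normalized_edit_distance(a: str, b: str) -> float:
    """Levenshtein distance divided by the length of the longer string
    (one common normalization, assumed here for illustration)."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))  # distances for the empty prefix of a
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,        # deletion
                          curr[j - 1] + 1,    # insertion
                          prev[j - 1] + cost) # substitution
        prev = curr
    return prev[n] / max(m, n, 1)
```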
The Choice of candadv Matters, Too:
We evaluate on one complex attack combining Number error with Negation, which increases the difference between ref and candadv, based on the test cases for Number error in WMT20de. The accuracy increases by an average of 0.28 over all metrics. This confirms our assumption that maximizing the (surface) similarity between ref and candadv (but with key errors) leads to harder test suites and vice versa.
Ensemble with NLI Metrics Are More Effective:
We compare the ensembles with NLI metrics to ensembles with standard metrics, i.e., w · A + (1 − w) · M, where A is a fixed standard metric and M is any of the remaining metrics. To do so, we combine standard metrics with the remaining metrics for each category of MT/summarization and ref-based/-free setting. We take the arithmetic average of the accuracy on adversarial benchmarks and the correlations on standard benchmarks as the overall metric performance here. We calculate the mean/maximal improvement of the ensembles over the original metric M for w ∈ [0.1,0.9] and observe: (i) while the ensembles with standard metrics are better for ref-free MT metrics, because cross-lingual NLI metrics perform very poorly in our experiments, (ii) the monolingual NLI metrics lead to much better ensembles—17/15 points larger mean/max improvement—compared to the standard metrics. (iii) Overall, the ensembles with NLI metrics yield 10/7 points larger mean/max improvement in overall performance than those with standard metrics (averaged over all four tasks: ref-based/-free MT/summarization). Thus, (monolingual) NLI metrics have unique properties compared to standard metrics, making them attractive in ensembles.
To illustrate, Figure 4 shows ensembles with BERTScore. These show minor or no improvements on standard benchmarks and also mixed (often negative) results for adversarial robustness.
SummaCZS and Falsesum:
In §5, we applied NLI systems on whole input texts, not taking into account the multi-sentence nature of source texts and outputs, especially in summarization.
To remedy the mismatch between the granularity of the training data of NLI models and the input data of summarization evaluation, i.e., sentence- vs. document-level, Laban et al. (2022) propose both supervised and unsupervised NLI-based summarization metrics for inconsistency detection. We test their unsupervised variant (SummaCZS),8 which segments documents into sentence units and aggregates scores between pairs of sentences, with the underlying model of NLI-R. However, SummaCZS does not consistently outperform NLI-R across all datasets; in particular, NLI-R performs much better in our adversarial test than SummaCZS (72% vs. 53%). Furthermore, to match the training data of NLI models with the task of factual inconsistency detection in summarization, Utama et al. (2022) introduce Falsesum, an augmented NLI dataset with task-oriented examples based on CNNDM; we evaluate three RoBERTa-large models fine-tuned on it and MNLI. Similar to SummaCZS, this also does not always yield better performance compared to the simple NLI metrics (∼55%–68% vs. 72% on the adversarial datasets). Overall, both approaches work well on SummEval, but not so well on RealSumm and our adversarial benchmark.
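A hedged sketch of sentence-level aggregation in the spirit of SummaCZS (our simplified reading, not the authors' code), reusing the nli function sketched in §4.3:

```python
import re

def summac_zs_style(nli, document: str, summary: str) -> float:
    """Score each summary sentence by the document sentence that supports it
    best, then average over summary sentences. Pair scoring (entailment
    probability) and sentence splitting are simplifying assumptions."""
    split = lambda text: [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    doc_sents, sum_sents = split(document), split(summary)
    sent_scores = []
    for hyp in sum_sents:
        # e from each (document sentence, summary sentence) pair; take the max.
        best = max(nli(prem, hyp)[0] for prem in doc_sents)
        sent_scores.append(best)
    return sum(sent_scores) / len(sent_scores)
```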
Choice of Pooling Strategy:
To examine the issue of data leakage discussed in §5, we now evaluate the NLI metrics on each dataset with the pooling strategy selected from the remaining datasets (excluding the one used for evaluation) based on winning frequency. For example, for the segment-level MT evaluation on WMT15, we choose the pooling strategy which wins most often on all MT datasets (including all standard datasets for both segment- and system-level evaluation and the adversarial datasets) except for WMT15. We observe that this change in pooling strategy induction results in minor performance variation: −1.9% for segment-level evaluation, +0.8% for system-level evaluation, and −0.7% for adversarial evaluation. For summarization, as only one direction—i.e., src→cand—is considered for ref-free NLI metrics, we separately select the pooling strategy for ref-based and ref-free NLI metrics. Overall, we observe no performance change for the ref-free setting and a −3.6% change on average over all five criteria (correlations on SummEval with max/mean aggregation, summary-/system-level correlations on RealSumm, and accuracy on SEadv) for the ref-based setting. Thus, the changes are again minor.
Comparison to RoMe:
As the authors of RoMe did not publish their adversarial dataset, we compare RoMe’s performance with our metrics on one of our adversarial datasets, WMT20de, instead. RoMe has an average accuracy of 43%, with >90% accuracy only on the phenomena SVD and omission, which are the easiest for most standard metrics. In contrast, our NLI metrics have above 80% average accuracy. As RoMe does not evaluate on MT or summarization, we also evaluate our NLI metrics on one (randomly chosen) data-to-text generation dataset used in Rony et al. (2022)—BAGEL (Mairesse et al., 2010). RoMe and our NLI metrics perform on par here (∼0.23 Spearman’s ρ). Overall, this seems to imply that simple NLI models taken out of the box are better and more robust metrics than a specially trained approach such as RoMe.
7 Concluding Remarks
In this work, we explored NLI as a general paradigm for evaluation metrics. We showed that NLI metrics yield adversarial robustness, and are also strong—though not always state-of-the-art—when it comes to standard metric evaluation benchmarks. By linearly interpolating established (BERT-based) metrics with our NLI metrics, we obtained high-quality metrics along both axes: adversarial robustness and standard benchmarks, with substantial gains over recent BERT-based metrics.
A potential reason why NLI-based metrics perform below par on some standard benchmarks (especially in MT) is the training data mismatch, i.e., typical NLI datasets contain many artificial sentences of the type “A girl is playing on a piano”. A further limitation is that cross-lingual NLI models are not yet of high enough quality and that most current NLI models are sentence-level, not document-level—with a few recent exceptions (Yin et al., 2021). Once these limitations of NLI are overcome, we expect even better performance from NLI-based metrics, which we believe is one of the most promising directions for future high-quality and robust evaluation metric design. Future work should also consider NLI metrics for other text generation tasks; the NLI paradigm looks especially promising for tasks that require comparison with human references, which oftentimes involves the concept of logical equivalence.
Acknowledgments
We thank Zuojun Shi for conducting initial experiments related to this paper as part of her Bachelor thesis at TU Darmstadt. We appreciate the reviewers and editors from TACL for their time, effort, and greatly helpful comments. We also thankfully acknowledge support from the BMBF via the grant “Metrics4NLG”. Steffen Eger is financed by DFG grant EG 375/5–1.
Notes
Published in 2020, BERTScore has more than 1700 citations as of March 2023.
That semantic similarity metrics are inherently incapable of identifying this puts them at great risk of being attacked by malicious agents, with serious real-world consequences, as the metrics cannot distinguish between truthful translations and semantically similar but factually incorrect translations.
Code +data: http://github.com/cyr19/MENLI.
Robustness is also related to model biases. For example, Sun et al. (2022) show that BERTScore encodes social biases such as gender biases. And Deutsch et al. (2022) claim that reference-free metrics are inherently biased, which implies that they have unreasonable preferences. Our results show that many current reference-based metrics also have unreasonable preferences. Robustness checks are also related to explainability (Leiter et al., 2022; Golovneva et al., 2023) of evaluation metrics as they help to understand metric limitations.
As we generate our adversarial test instances fully automatically from backtranslation or automatic tools, they may contain some errors (including upper-/lower-case). For example, we note that in candpara, “…billion dollars” is sometimes incorrectly formulated as “…dollars billion”; however, such cases occur only in ∼1% of all test cases for number error, which we argue is still on an acceptable noise level.
Ref-free, the edit distance between r and ref is considered.
For a fairer comparison, we do not compare to the supervised variant, as it is trained on a consistency dataset for the summarization task.