MENLI: Robust Evaluation Metrics from Natural Language Inference

Abstract Recently proposed BERT-based evaluation metrics for text generation perform well on standard benchmarks but are vulnerable to adversarial attacks, e.g., relating to information correctness. We argue that this stems (in part) from the fact that they are models of semantic similarity. In contrast, we develop evaluation metrics based on Natural Language Inference (NLI), which we deem a more appropriate modeling choice. We design a preference-based adversarial attack framework and show that our NLI-based metrics are much more robust to the attacks than the recent BERT-based metrics. On standard benchmarks, our NLI-based metrics outperform existing summarization metrics, but perform below SOTA MT metrics. However, when combining existing metrics with our NLI metrics, we obtain both higher adversarial robustness (+15% to +30%) and higher quality metrics as measured on standard benchmarks (+5% to +30%).


Introduction
Proper evaluation is key to fields such as machine learning and Natural Language Processing (NLP). Evaluation is particularly challenging for natural language generation (NLG) tasks, as there may be an infinitude of correct solutions (e.g., translations or summaries) for a given source text. While human evaluation is often considered the gold standard, it is slow and costly, so researchers resort to automatic evaluation. Previously, this was done using simple lexical overlap metrics such as BLEU and ROUGE, but these exhibit low correlations with human judgments, particularly for state-of-the-art NLG systems (Mathur et al., 2020a; Peyrard, 2019). Thus, a popular recent trend is to design automatic evaluation metrics based on large language models such as BERT and its many extensions (Zhang et al., 2020; Zhao et al., 2019; Sellam et al., 2020; Wan et al., 2022).

Table 1: High-level concepts from which evaluation metrics have been induced, with examples.

Concept             | Examples
--------------------|--------------------------------------------
Semantic Similarity | BERTScore, MoverScore, BaryScore, ...
Text Generation     | BARTScore, PRISM (Thompson and Post, 2020)
Question Answering  | QAEval (Deutsch et al., 2021)
NLI                 | MENLI (ours)

Nonetheless, these novel metrics also have key limitations. For example, Sai et al. (2021) and Kaster et al. (2021) show that they are not robust to various adversarial attacks including lexical overlap and factuality errors. Taking the currently most popular metric, BERTScore,1 as an example, this adversarial vulnerability is unsurprising. BERTScore computes the semantic similarity between a reference and a system output (the candidate), using a simplified token matching procedure. However, a good candidate is typically not appropriately identified by semantic similarity. For example, a candidate "5 Ukrainian soldiers wounded in Russia" is not an adequate translation of a source corresponding to the reference "50000 Russian soldiers killed in Ukraine", although the two texts are of course semantically very similar.2 While there have been many attempts to improve BERTScore using better token matching, e.g., using Word Mover Distance (Zhao et al., 2019; Chen et al., 2020; Colombo et al., 2021), we argue that this line of research is a dead end, as the underlying model of semantic similarity, originally proposed to address issues of lexical variation in BLEU/ROUGE, is simply not (fully) appropriate.

1 Published in 2020, BERTScore has more than 1700 citations as of March 2023.
2 That semantic similarity metrics are inherently incapable of identifying this puts them at great risk of being attacked by malicious agents, with serious real-world consequences, as the metrics cannot distinguish between truthful translations and semantically similar but factually incorrect translations.
An intuitively more suitable idea to model evaluation metrics is via natural language inference (NLI) (Dagan et al., 2013). For example, in reference-based settings, in which candidates are compared to human references, a candidate is intuitively good if it is equivalent to a human reference via the concept of bi-implication. NLI systems are also promising alternatives because NLI is one of the most researched upstream tasks in NLP, where a lot of emphasis has been placed on concepts such as biases, generalization and adversarial conditions (Poliak et al., 2018; Utama et al., 2020).
In this paper, we ask whether we can directly use pre-trained NLI models as evaluation metrics, thereby establishing a new paradigm (but with predecessors, as indicated in §2). Our contributions:
• We design a novel preference-based adversarial test suite for MT and summarization metrics. Our adversarial benchmark does not need human annotators, is suitable for reference-free (where the candidate is directly compared to the source text, without human reference) and reference-based evaluation, and is challenging: e.g., BLEU, ROUGE, MoverScore and BERTScore perform below or at random level.
• We explore (i) how NLI metrics can be induced from existing NLI models and (ii) how they perform on benchmark and adversarial datasets, (iii) across two NLG problems, MT and summarization.
We point out that some current metrics already leverage NLI systems, but only indirectly and thus (we argue) inadequately: e.g., MoverScore (Zhao et al., 2019) leverages BERT representations fine-tuned on NLI, and Mathur et al. (2019) train (pre-BERT) NLI-inspired architectures on MT datasets. In contrast, we show that by directly leveraging NLI systems, much better adversarial and standard benchmark performance can be obtained. We call our novel metrics MENLI (MEtrics from NLI).3

Related Work
Our work connects to evaluation metrics and NLI.
Evaluation Metrics for NLG In the last few years, researchers have come up with a plethora of different BERT-based metrics for varying tasks and setups: e.g., for MT and summarization, reference-based trained (Sellam et al., 2020; Rei et al., 2020a) and untrained approaches (Zhao et al., 2019; Zhang et al., 2020) have been suggested, and the same is true for reference-free setups, where both supervised (Ranasinghe et al., 2020) and unsupervised metrics have been explored (Zhao et al., 2020; Song et al., 2021; Belouadi and Eger, 2023). In our work, we consider both reference-based and reference-free metrics. The two setups have important differences: reference-free setups are more challenging, as they require comparing text in different languages (in MT) or of vastly different lengths (in summarization). On the other hand, they are more 'resource-efficient', take humans out of the loop, and promise web-scale evaluation. The two approaches also differ in terms of NLI. For example, while reference-based approaches require equivalence between reference and hypothesis, the concept of equivalence is not always appropriate in reference-free situations (e.g., in summarization, source and summary are intuitively not equivalent; rather, the source should entail the summary).
To realize metrics, different high-level approaches have been suggested as we outline in Table 1 (e.g., metrics from semantic similarity, from text generation or from question answering).There are also some predecessor works on metrics from NLI which we discuss below.

Robustness of Evaluation Metrics has been a central issue of recent interest: Sai et al. (2021) test metrics across several CheckList-inspired (Ribeiro et al., 2020) templates, finding that most common standard metrics are not robust even to simple perturbations. Kaster et al. (2021) probe metrics in an adversarial setting with lexical overlap, finding that they can be fooled by text that has high lexical overlap but low semantic similarity (indicating that the proposed BERT-based metrics are not even good models of semantic similarity). We combine the approaches of Sai et al. (2021) and Kaster et al. (2021): while Sai et al. (2021) use human crowdworkers to evaluate robustness, Kaster et al. (2021) use a simpler preference-based setup, which does not need human annotators. We also use the preference-based setup, but our attacks are largely inspired by Sai et al. (2021).
More recently (contemporaneously with us and after the first arXiv submission of our work), several other papers have explored the robustness of recent evaluation metrics. For example, He et al. (2022) develop stress test suites according to potential errors arising from certain choices of metric design and the pretrained language models used, showing that metrics are biased towards their underlying models; e.g., BARTScore assigns higher scores to texts generated by the models of the metric itself.4 Karpinska et al. (2022) explore the sensitivity of MT metrics to errors of different categories (regarding semantics, syntax and morphology) and severity, using a preference-based setting; they show that recent metrics like BERTScore dramatically outperform lexical overlap-based metrics such as BLEU and ROUGE, mostly obtaining over 95% accuracy in their experiments. Our setup and those of Karpinska et al. (2022) and He et al. (2022) are differentiated by the tasks considered, the preference specifications, the results, and the solutions proposed. Karpinska et al. (2022) only evaluate metrics for MT while we consider both MT and summarization. They design their preferences in such a way that recent metrics seem quite robust, while our more elaborate preferences expose their weak spots much better. Finally, we propose solutions (e.g., metrics from NLI) to address the lack of robustness. Like us, He et al. (2022) also consider summarization and MT. Instead of designing preferences, however, they manually introspect how metric scores change as various perturbations are introduced. In this way, they expose blind spots of metrics. As remedies, they suggest combining heterogeneous metrics to shield against varying blind spots (without performing concrete experiments); we show that combining metrics with NLI-based metrics yields additional robustness.

4 Robustness is also related to model biases. For example, Sun et al. (2022) show that BERTScore encodes social biases such as gender biases. And Deutsch et al. (2022) claim that reference-free metrics are inherently biased, which implies that they have unreasonable preferences. Our results show that many current reference-based metrics also have unreasonable preferences. Robustness checks are also related to explainability (Leiter et al., 2022; Golovneva et al., 2023) of evaluation metrics as they help to understand metric limitations.
Finally, Rony et al. (2022) develop RoMe as a robust metric in the context of semantic similarity, fluency and grammatical variability. They evaluate it on an adversarial dataset with five phenomena (entity, adjective and random word replacement, as well as text transformation and passive forms) by correlating against human judgments. Their model is a rather complicated trained metric leveraging semantic and grammatical features; we compare to it in §6.
NLI NLI is one of the core upstream tasks in the NLP community. Due to its popularity, NLI has been investigated in depth, and researchers found that trained models often overfit to low-level statistical cues instead of learning generalizable concepts of logical relationships between sentences (Poliak et al., 2018; Gururangan et al., 2018). As a consequence, many approaches to improve generalization have been investigated (Belinkov et al., 2019; Utama et al., 2020; Zhou and Bansal, 2020). We argue that a high-quality NLI model would be an excellent candidate for an evaluation metric and explore this in this work.
Like us, Mathur et al. (2019) note the similarity of (MT) evaluation and logical equivalence via NLI. They design supervised MT metrics leveraging different pre-BERT architectures, including one from the NLI community called ESIM (Chen et al., 2017) (which performs on par with an LSTM with attention in their experiments). Thus, in contrast to us, they do not leverage NLI models out of the box as evaluation metrics but only fine-tune an NLI-inspired architecture on human scores from MT. MoverScore (Zhao et al., 2019) fine-tunes BERT on NLI, which leads to better metrics; thus, they, too, use NLI only indirectly. Dušek and Kasner (2020) use NLI to evaluate hallucinations and omissions in reference-free data-to-text generation scenarios. They do not compare to any other metrics and do not consider NLI as a general paradigm for evaluation metrics. While the summarization community uses NLI models for consistency evaluation (Fabbri et al., 2021; Laban et al., 2022), to our knowledge, we are the first to verify the usefulness of NLI systems as general evaluation metrics against a range of strong competitors, both in standard evaluation and adversarial attack settings.
Following Sai et al. (2021) and others, we consider an array of adversarial attacks on evaluation metrics; we motivate our attacks from the perspective of errors committed by real text generation systems below. In contrast to Sai et al. (2021) and similar to the later published work of Karpinska et al. (2022), we implement a preference-based setup, which does not need human annotators. The advantages of the preference-based setup are: (i) lower cost (e.g., no annotation costs), which (ii) can be especially relevant for non-English languages (e.g., in ref-free situations for MT), and (iii) adversarial evaluation at larger scale, yielding more robust estimates of performance. The challenge of the preference setup is to cleverly determine text pairs to compare.
In our design, we use an anchor text (either the reference ref or the source src), a paraphrase cand_para of the anchor text, and an adversarial text cand_adv which is maximally similar to the anchor text but contains an adversarial attack. We expect a good metric m to prefer cand_para over cand_adv:

m(anchor, cand_para) > m(anchor, cand_adv)    (1)

The outcome of the preferences in Eq. (1) depends on how we choose cand_adv and cand_para, which we describe below. In general, a challenging test suite has cand_adv maximally similar to ref/src, but with a key error. In contrast, cand_para should be maximally dissimilar to ref/src (e.g., on the surface level) but meaning-equivalent. Table 2 illustrates the general structure of our adversarial test suite.

cand_adv To obtain cand_adv, we consider the following attacks (nine regarding information adequacy/correctness in candidates and three regarding text fluency), which we deem (to a large degree) representative of errors in different NLG tasks:
• Addition: We randomly add a noun after an existing one and connect them with "and". For example, "I love dogs" → "I love dogs and cats."
• Omission: We use the framework of Sai et al. (2021).
• Mismatch: We consider mismatching nouns, verbs and adjectives, which can lead to misunderstanding of an entity, an action, and the speaker's emotion, respectively. Following Chen et al. (2021), we replace a specific word having the POS tag noun/verb/adjective with another word having the same POS tag, randomly selected from our collected words for that POS tag.
• Negation: We use the perturbation tool of Ribeiro et al. (2020) to add/remove negations to/from the verb, generating cand_adv with contrary claims.
• Number error: We replace all numbers (except those related to dates) in the sentence with random numbers in the same format (e.g., integer to integer, decimal to decimal).
• Pronoun error: We replace all pronouns in the sentence with other ones without causing syntax errors (e.g., "he" to "she" and "us" to "them").
• Name error: We use the tool of Ribeiro et al. (2020) to replace exactly one name with a random one of the same gender.
• Fluency: We also include three phenomena from Sai et al. (2021) to examine metrics' robustness against attacks on text fluency: (i) Jumbling word order: randomly shuffle the word order in a sentence. (ii) Spelling error: add a typo to a word in a sentence. (iii) Subject-verb disagreement: make the subject and verb disagree (e.g., "He like dogs.").
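To make the preference test of Eq. (1) concrete, the following minimal sketch computes a metric's accuracy over (anchor, cand_para, cand_adv) triples. The `overlap_metric` and the two triples are hypothetical toy examples (not part of our actual test suite); the overlap scorer merely stands in for a BLEU-like lexical metric.

```python
from typing import Callable, List, Tuple

def adversarial_accuracy(
    metric: Callable[[str, str], float],
    cases: List[Tuple[str, str, str]],
) -> float:
    """Fraction of (anchor, cand_para, cand_adv) triples for which
    the metric strictly prefers the paraphrase over the adversarial
    candidate, i.e., satisfies Eq. (1)."""
    wins = sum(
        1 for anchor, cand_para, cand_adv in cases
        if metric(anchor, cand_para) > metric(anchor, cand_adv)
    )
    return wins / len(cases)

def overlap_metric(anchor: str, cand: str) -> float:
    """Toy lexical-overlap metric (Jaccard over lowercased tokens)."""
    a, c = set(anchor.lower().split()), set(cand.lower().split())
    return len(a & c) / max(len(a | c), 1)

cases = [
    # (anchor, paraphrase, adversarial candidate with a key error)
    ("I love dogs",
     "Dogs are something I love",
     "I love dogs and cats"),                     # addition attack
    ("50000 Russian soldiers killed in Ukraine",
     "In Ukraine 50000 soldiers of Russia were killed",
     "50000 Russian soldiers killed in Russia"),  # mismatch attack
]
acc = adversarial_accuracy(overlap_metric, cases)
```

On these two triples the overlap metric never prefers the paraphrase, illustrating why lexical-overlap metrics can score at or below random level under such preferences.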
For ref-based metrics, we apply the perturbation templates to ref to construct cand_adv. In contrast, for ref-free MT metrics, we first translate the source src using Google Translate to a translation r and then perturb r to obtain cand_adv. We introduce r to increase the similarity of cand_adv to src; e.g., we assume that Google Translate translates more literally, i.e., closer to word-by-word translations, than human translators. This may be important to construct challenging test cases, cf. §6 and our above discussion. For ref-free summarization, we apply the perturbation templates to a document r which is maximally similar to src; details follow.

cand_para We use different ways to obtain cand_para, because different kinds of paraphrases may yield more or less difficult test cases for metrics. We analyze this in §6.
In particular, we use data from (1) PAWS (Zhang et al., 2019), (2) PAWS-X (Yang et al., 2019), and (3) WMT20-news-commentary-v15 German-to-English (Mathur et al., 2020b) to generate cand_para for MT evaluation metrics, and (4) SummEval for summarization metrics. A summary with attributes is shown in Table 3.

(1) PAWS contains sentence pairs created by word swapping and backtranslation, labeled as (non-)paraphrases by human raters. From sentence pairs labeled as paraphrases, we derive two datasets for ref-based evaluation metrics:
• PAWS_ori: We take the first sentence of a PAWS sentence pair as ref and the second as cand_para.
• PAWS_back: We use the first sentence of a PAWS sentence pair as ref and generate cand_para based on ref using backtranslation (with German as the pivot language), except for number error, for which we replace the numbers in ref with the corresponding words, using the Python library num2words.
(2) PAWS-X is the multilingual version of PAWS, which includes PAWS sentence pairs in six languages, translated from English PAWS, allowing us to generate test suites for both ref-free and ref-based metrics. We use the first sentence in PAWS-X (e.g., German) as src and the second sentence with the same ID in English PAWS as ref.
We select data for two closer language pairs, German-to-English and French-to-English, and two more distant language pairs, Chinese-to-English and Japanese-to-English. Accordingly, we create 4 datasets: XPAWS_de, XPAWS_fr, XPAWS_zh and XPAWS_ja, each of which contains src (first sentence of the XPAWS pair in the source language), ref (first sentence of the English PAWS pair), and cand_para (second sentence of the English PAWS pair).
(3) WMT20-news-commentary-v15 contains sentence pairs of source and human reference. From this, we create WMT20_de, directly taking the source and reference sentences as src and ref. We obtain cand_para as in the case of PAWS_back.

Mismatch/noun
Listen, I don't want to make my people mad," she said.

Pronoun
Williams wasn't the only one who received a fine at this year's Wimbledon, though hers was the most costly.
Table 4: Examples of errors in WMT MQM annotations for Chinese-to-English and English-to-German.Red texts are the annotated errors ("[...]" indicates the missing translation) and the green texts in the bracket refer to a more correct translation accordingly; the green texts in source sentences denote the parts being mistranslated or omitted.
Real-world Motivation of Attacks Modern text generation systems are prone to many of the errors we investigate in this work. For example, Freitag et al. (2021a,b, 2022) show, based on fine-grained human error annotations (Lommel et al., 2014), that translations generated by state-of-the-art MT models still contain many accuracy-related errors (e.g., addition and omission of information, inappropriately informal pronouns) and sometimes even fluency-related errors (e.g., wrong spelling). Negation handling is also frequently discussed as an issue of modern MT systems (Bentivogli et al., 2016; Sennrich, 2017; Hossain et al., 2020; Tang et al., 2021). In summarization, system summaries are often factually inconsistent with source documents in terms of numbers, named entities, assigning quotations to a particular person, etc. (Falke et al., 2019; Kryscinski et al., 2020; Chen et al., 2021). More generally, hallucination (of which addition/mismatches/etc. may be considered special cases) is a particularly worrisome limitation of recent large language models (Ji et al., 2022). In Table 4, we show selected system translations from real MT systems with specific errors (following WMT MQM annotations) that are very similar to the ones we consider.6 The frequency of errors may differ for various source-target language pairs (e.g., depending on their language distance) and formal/informal contexts. For example, when translating Chinese to English for news, names are often directly translated to their Pinyin format (see the 4th row) instead of their official translations; in contrast, this rarely happens in English-to-German translations. But even for such closely related languages, NLG systems may omit information, choose wrong pronouns, or mismatch nouns, particularly when a word has multiple senses.

6 … incorrectly formulated as "... dollars billion"; however, such cases occur only in ∼1% of all test cases for number error, which we argue is still on an acceptable noise level.
4 Experimental Setup

Evaluation Metrics
We explore a large array of recent state-of-the-art transformer-based metrics, summarized in Table 5. The variants used are briefly introduced below; further details (e.g., model checkpoints and implementation) can be found on our Github. We report BERTScore F1 employing a RoBERTa-large model. For MoverScore, we use the unigram variant with a BERT-base model fine-tuned on MNLI (Williams et al., 2018). We use two variants of BARTScore (Precision and F1) for ref-based MT and summarization, and BARTScore-FN (FN stands for Faithfulness) for ref-free summarization. We consider two variants of XMoverScore with different remapping strategies for multilingual embeddings (CLP, UMD) and two variants of SentSim with different word matching paradigms (BERTScore, WMD). We report the DiscoScore variant with the feature 'Focus Frequency'.

Table 6: We use the to-English language pairs in the WMT15-17 datasets (Stanojević et al., 2015; Bojar et al., 2016, 2017). In segment-level evaluation on WMT20-21 (Mathur et al., 2020b; Freitag et al., 2021a,b), we use the data with MQM scores for zh-en, while in system-level evaluation, we correlate the metrics with DA scores for all to-English language pairs. The datasets for system-level evaluation before WMT20 are skipped, as all metrics mostly get very high correlations on them.

Datasets & Evaluation Protocol
We summarize our used datasets in Table 6. To evaluate the metrics' robustness under adversarial conditions, we use the datasets introduced in §3 and additionally Rank19 (Falke et al., 2019) (only for ref-free summarization), which contains examples composed of a document paired with one correct and one incorrect candidate summary with real-world factuality errors. In general, we check the metrics' preference between the two candidates and calculate accuracy: the relative frequency with which the metrics correctly choose among the two alternatives.
On MT standard benchmarks, we evaluate the metrics on both segment level (where we correlate metric scores with human judgements for individual sentences/segments in the datasets) and system level (where we correlate the average metric scores with the average human scores over the segments generated by each system), using Pearson correlation as the performance indicator. On SummEval for summarization, we compute Kendall correlation with system-level human judgements on four criteria: coherence, consistency, fluency and relevance (we apply two aggregation methods for the multi-reference setting, max and mean). We calculate Pearson correlation with both summary-level (analogous to segment-level in MT) and system-level LitePyramids (Shapira et al., 2019) human ratings in RealSumm.
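The difference between segment-level and system-level correlation can be sketched as follows; the per-segment scores are hypothetical toy values, and the Pearson implementation is a plain stdlib version of the statistic we use.

```python
import math
from collections import defaultdict

def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# (system_id, metric_score, human_score) per segment -- hypothetical data
segments = [
    ("sysA", 0.9, 0.8), ("sysA", 0.7, 0.6),
    ("sysB", 0.4, 0.3), ("sysB", 0.2, 0.4),
]

# Segment level: correlate the per-segment scores directly.
seg_r = pearson([m for _, m, _ in segments], [h for _, _, h in segments])

# System level: average per system first, then correlate the averages.
by_sys = defaultdict(lambda: ([], []))
for sys_id, m, h in segments:
    by_sys[sys_id][0].append(m)
    by_sys[sys_id][1].append(h)
sys_m = [sum(ms) / len(ms) for ms, _ in by_sys.values()]
sys_h = [sum(hs) / len(hs) for _, hs in by_sys.values()]
sys_r = pearson(sys_m, sys_h)
```

System-level averaging smooths out per-segment disagreements, which is one reason metrics tend to look stronger at system level than at segment level.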

NLI as a Metric
NLI systems yield probability distributions over Entailment, Contradiction and Neutral. We denote the probability values as e, c, and n, where e + c + n = 1 and e, c, n ≥ 0. We first determine how to leverage the three values as NLI metrics.
To do so, we evaluate five simple formulas for their arithmetic combination in a heuristic way: (1) e, (2) -c, (3) e-n, (4) e-c and (5) e-n-2c, and inspect their effect in three directions, which correspond to the entailment directions implication, reverse implication and bi-implication: (i) ref/src → cand, where ref or src acts as premise and cand as hypothesis; (ii) ref/src ← cand, where cand acts as premise and ref or src as hypothesis; and (iii) ref/src ↔ cand, the arithmetic average over the two above cases.
For example, to obtain e-n for ref/src ↔ cand, we first average the three probability scores over the directions ref/src → cand and ref/src ← cand, then calculate e-n based on the averaged scores. We only consider the direction src → cand for ref-free summarization, since the hypothesis does not need to entail the source document. The various selections of formulas and directions result in 15 pooling strategies for NLI-based metrics.
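The pooling described above can be sketched as follows; `pool` is an illustrative helper (not our actual implementation), and the two probability triples are made-up NLI outputs.

```python
def pool(p_fwd, p_bwd=None, formula="e", direction="bi"):
    """Combine NLI probabilities into a metric score.

    p_fwd: (e, c, n) for ref/src -> cand; p_bwd: (e, c, n) for cand -> ref/src.
    direction: "fwd", "bwd", or "bi" (average of both directions).
    """
    if direction == "fwd":
        e, c, n = p_fwd
    elif direction == "bwd":
        e, c, n = p_bwd
    else:  # bi-implication: average the probabilities, then apply the formula
        e, c, n = ((a + b) / 2 for a, b in zip(p_fwd, p_bwd))
    formulas = {
        "e": e,
        "-c": -c,
        "e-n": e - n,
        "e-c": e - c,
        "e-n-2c": e - n - 2 * c,
    }
    return formulas[formula]

# hypothetical probability distributions (each sums to 1)
fwd = (0.7, 0.1, 0.2)   # ref -> cand
bwd = (0.5, 0.3, 0.2)   # cand -> ref
score = pool(fwd, bwd, formula="e", direction="bi")   # (0.7 + 0.5) / 2
```

Five formulas times three directions gives the 15 strategies; the one-directional src → cand variant used for ref-free summarization corresponds to `direction="fwd"` here.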

NLI Systems
We explore both monolingual and cross-lingual NLI-based metrics. For each setup, we choose two NLI models, obtained from Hugging Face or by fine-tuning ourselves.
For monolingual NLI metrics, we choose (1) a RoBERTa-large model (Liu et al., 2019) fine-tuned on SNLI (Bowman et al., 2015), MNLI, Fever (Nie et al., 2019) and ANLI (Nie et al., 2020) by Nie et al. (2020) and (2) a DeBERTa-large model fine-tuned by He et al. (2021), using MNLI. We denote the NLI metrics induced from these two models as NLI-R and NLI-D. They will be used for ref-based MT evaluation and for both ref-based and ref-free summarization evaluation. Note that, while NLI-R has been fine-tuned on adversarial NLI (ANLI), which has been shown to increase robustness on (for example) negation and numerical reasoning, NLI-D has not been trained on ANLI.

Cross-lingual NLI metrics should handle premises and hypotheses in different languages, so we select multilingual versions of the underlying models of NLI-R/NLI-D. (1) We fine-tune an XLM-RoBERTa-base model (Conneau et al., 2019), using the datasets for fine-tuning NLI-R as well as the XNLI dataset (Conneau et al., 2018). (2) We select an mDeBERTa-base model fine-tuned on MNLI and XNLI. We denote the corresponding cross-lingual NLI metrics as XNLI-R and XNLI-D.

Experiment Results
Before outlining our main results in §5.1 (MT) and §5.2 (summarization), we first discuss good pooling strategies for NLI metrics.

Pooling Strategy
We determine the pooling strategy for NLI metrics in MT evaluation from (1) the accuracy on the adversarial datasets and (2) the correlation with human judgements on the standard (segment-level) MT datasets. We leverage the winning frequency of the pooling strategies to choose the best one; a strategy wins if it works best for an NLI metric among all 15 strategies. Overall, we find that the simple formula e with the direction src/ref ↔ cand is a good choice which works well for both standard and adversarial benchmarks, even though slightly better formulas could be chosen in selected subsettings (e.g., ref-based vs. ref-free evaluation); see Table 7 for examples. For summarization, the situation is slightly more complex: (1) e-c with direction ref ← cand performs best for ref-based NLI metrics; (2) -c with direction src → cand is the best strategy for ref-free NLI metrics. Thus, we compare NLI metrics adopting these strategies with classic metrics.
Even though we only looked at global aggregate statistics, our method of identifying the pooling strategies above leveraged the data on which we later evaluate the NLI metrics. To avoid leaking information from the test set, in §6 we evaluate NLI metrics on each dataset with the pooling strategy selected from the remaining datasets for that task.
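This leave-one-out selection can be sketched as follows; `select_strategy` and the per-dataset accuracy table are hypothetical illustrations of the winning-frequency criterion, not our actual numbers.

```python
def select_strategy(results, held_out):
    """Pick the strategy that wins most often on all datasets except
    the held-out one (leave-one-out selection, avoiding test leakage).

    results: {dataset: {strategy: score}}
    """
    wins = {}
    for ds, by_strategy in results.items():
        if ds == held_out:
            continue
        best = max(by_strategy, key=by_strategy.get)  # winner on this dataset
        wins[best] = wins.get(best, 0) + 1
    return max(wins, key=wins.get)

# hypothetical per-dataset accuracies for three pooling strategies
results = {
    "wmt15": {"e": 0.81, "e-c": 0.78, "-c": 0.70},
    "wmt16": {"e": 0.79, "e-c": 0.80, "-c": 0.68},
    "wmt17": {"e": 0.83, "e-c": 0.77, "-c": 0.71},
}
strategy_for_wmt16 = select_strategy(results, held_out="wmt16")
```

Here "e" wins on both wmt15 and wmt17, so it is the strategy applied when evaluating on the held-out wmt16.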

Adversarial Evaluation
We now compare our NLI metrics with the best pooling strategy to our baseline metrics under adversarial conditions.
(2) Further, the two NLI metrics perform similarly. In the ref-free setup, the best cross-lingual NLI metric (XNLI-D) is still the most robust under our attacks. However, NLI metrics do not outperform the other metrics as substantially as in the ref-based setup. A potential reason is that the cross-lingual NLI models underperform compared to the monolingual setup (the preferences we query in the reference-free setup may also play a role). Nevertheless, when excluding the fluency-related phenomena from the adversarial datasets, XNLI-D is still on average 10 points better than the best standard metric, COMET (86% vs. 75%).
Our observations are inconsistent with Karpinska et al. (2022), where state-of-the-art MT metrics mostly obtain >95% accuracy in preference-based evaluation. The reason is that our test suites are much more difficult for the evaluation metrics, because we challenge them via lexical overlap between source/reference and candidate sentences during attacks: metrics must choose between high-lexical-overlap adversarial candidates (with key errors) and low-lexical-overlap paraphrases. In contrast, in Karpinska et al. (2022), metrics are challenged to assign correct preferences for score(ref, t) vs. score(ref, t′), where t is a candidate and t′ the perturbed candidate. This is a much easier comparison, because neither are ref and t maximally dissimilar (but meaning-equivalent) nor are ref and t′ maximally similar. This is an important lesson: how the adversarial preferences are designed may critically affect the assessment of whether recent metrics are robust.

Standard Benchmarks
Ref-based We give average results over all datasets in Table 8 (columns 'MT'; individual results are available on our Github). For segment-level evaluation, we observe: (1) trained metrics (COMET and BLEURT) substantially outperform the others, with an average performance of ∼0.7 Pearson. (2) Unsupervised SOTA metrics have an average correlation of ∼0.6 Pearson; BERTScore is the best among them. (3) Our NLI-based metrics are not competitive, with correlations of ∼0.45 Pearson. When correlating with system-level human judgments, NLI metrics still underperform most of the SOTA metrics, but the margin is much smaller.

Ref-free
Trained metrics also dominate in segment-level evaluation (>0.6 Pearson), whereas the two NLI-based metrics perform much worse than the others (0.15-0.22 Pearson). Nevertheless, XNLI-D performs on par with COMET and better than the others on WMT20 at system level.
Overall, we conclude that our NLI metrics are not competitive with state-of-the-art evaluation metrics on standard MT datasets, especially at segment level and in ref-free settings.

Combined Metrics
Observing that NLI metrics are strong in adversarial setups, but comparatively weaker in standard evaluation, we examine how to obtain more robust metrics which also perform well on standard benchmarks. To do so, we take the weighted average of NLI and classical metrics:

C(w_nli) = w_nli · N + (1 − w_nli) · M

where w_nli ∈ [0, 1] is the weight for the NLI metric N and M is a classical metric. Before combination, we rescale M and N to [0, 1], using min-max normalization.
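The combination can be sketched as follows; `combine` is an illustrative name, and the two score lists are hypothetical segment scores rather than real metric outputs.

```python
def minmax(scores):
    """Rescale a list of scores to [0, 1] (assumes non-constant scores)."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def combine(classic_scores, nli_scores, w_nli=0.2):
    """Weighted average of a classical metric M and an NLI metric N,
    each min-max normalized before combination."""
    m = minmax(classic_scores)
    n = minmax(nli_scores)
    return [w_nli * ni + (1 - w_nli) * mi for mi, ni in zip(m, n)]

# hypothetical segment scores for three candidates
classic = [0.2, 0.5, 0.8]   # e.g., a BERTScore-like metric
nli = [0.1, 0.9, 0.5]       # e.g., an entailment-based NLI score
combined = combine(classic, nli, w_nli=0.2)
```

With a small weight such as w_nli = 0.2, the classical metric still dominates the ranking while the NLI component shifts scores on candidates the two metrics disagree about.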
We illustrate the performance of the combined evaluation metrics with (X)NLI-R on both adversarial and standard benchmarks (segment-level) in Figure 2; the results for (X)NLI-D and for system level are similar. The x-axis denotes the average accuracy over the adversarial datasets, while the y-axis is the average Pearson correlation over the standard benchmarks (MT datasets). Each dot in each graph shows the value C(w_nli) for a specific weight w_nli. As seen from Figure 2, the graphs show an intriguing concave curvature. In standard MT evaluation, the combination boosts metric performance when w_nli is small (from 0.1 to 0.4) in virtually all cases. We then see a simultaneous increase of adversarial robustness and quality on standard benchmarks. In the ref-based setup, e.g., for w_nli = 0.2, we observe: (1) MoverScore and BARTScore-P improve most, with ∼8% on standard benchmarks (from 0.57/0.59 to 0.61/0.64 Pearson, respectively) and 21%-36% improvements on adversarial datasets (from 48%/67% to 66%/82% accuracy on average). (2) The best unsupervised metric on segment-level MT, BERTScore, increases ∼4% Pearson on standard benchmarks and ∼24% accuracy on adversarial datasets. (3) The most robust untrained metric, BARTScore-F, improves about ∼11% in robustness, while its performance on standard benchmarks also rises ∼5%. (4) The improvements on MT for trained metrics are smaller compared to the untrained metrics, with COMET improving only 1.5% and BLEURT even becoming worse with the choice w_nli = 0.2. However, their performance in defending against adversarial attacks still improves ∼10%-20%. In ref-free setups, all metrics improve ∼6%-7% on adversarial datasets. This setting only substantially boosts XMoverScore's performance on standard benchmarks, by ∼6%-9%.

Figure 3: Improvements of all metrics on standard benchmarks and adversarial datasets for w_nli = 0.1, ..., 0.9, averaged over all experiments. We show 95% confidence intervals.
We summarize the improvements for all combinations in Figure 3(a), which are averages over all experiments considered here. We observe that the line denoting improvements on standard benchmarks peaks at w_nli = 0.2, and the average improvements are positive when w_nli ≤ 0.5. Further, on the adversarial datasets, the improvement increases monotonically with w_nli; the gain is a concave function of w_nli which saturates as w_nli becomes larger. The sweet spots are w_nli ∈ [0.2, 0.3], which lead to 5%-6% improvement on standard benchmarks and 14%-16% improvement in adversarial robustness on average. When excluding the fluency phenomena from the adversarial datasets, the combined metrics consistently gain larger improvements in adversarial robustness, with 20%-24% improvements at the sweet spots.

Summarization
Evaluation: As Table 9 shows, similar to MT evaluation, NLI-based metrics exhibit much stronger robustness under adversarial conditions (our best NLI metrics have at least ∼8 points higher accuracy than the best standard metrics; right-most columns). The difference is that the vanilla NLI metrics are now also comparably effective to the SOTA metrics on standard benchmarks. For instance, in the ref-based setup, NLI-D with max aggregation beats all metrics except for DiscoScore with mean on SummEval, and both NLI metrics correlate highly with system-level human ratings in RealSumm (above 0.8 Pearson), where most standard metrics obtain only 0.5-0.7 Pearson correlations. When considering all evaluation dimensions of SummEval and RealSumm, NLI-D outperforms all other metrics, followed by NLI-R. Besides, we observe that NLI metrics correlate much better with human judgments regarding consistency and (somewhat surprisingly) fluency in SummEval compared to the other metrics. In the ref-free setup, BARTScore-FN performs best on SummEval; it outperforms the other metrics by above 0.1 Kendall on average. However, it correlates well with neither summary-level nor system-level human judgments in RealSumm. NLI metrics are comparable to or better than standard metrics at system level. For example, NLI-R performs best among the examined metrics and is about 0.06 Pearson better than the best standard metric (SUPERT) at system level in RealSumm. Nevertheless, reference-free NLI metrics again perform worse than the reference-based ones, as in MT; an obvious bottleneck for the two NLI metrics is that they were only trained on NLI data with short sentences, whereas reference-free summarization evaluation requires metrics to deal with source documents containing many more sentences.
Combined Metrics: In Figure 3(b), we summarize the median improvements of combined summarization metrics (the median smooths some outliers). In contrast to MT, the combination brings almost equal benefits to the performance of standard metrics on standard benchmarks and on adversarial benchmarks concerning only adequacy; we again observe a decrease in improvements on adversarial datasets when adding our fluency phenomena. We identify a best w_nli, namely 0.8, with which the standard metrics gain about 25%-30% improvement in both types of performance (adversarial and standard).

Discussion & Analysis
Selected Failure Cases of Metrics: Table 10 shows selected failure cases of four popular metrics (BERTScore, BARTScore, BLEURT, COMET), where the NLI metrics are correct in each case.
In the examples, BERTScore prefers text with the wrong gendered pronoun over a legitimate paraphrase, and even trained metrics like BLEURT fail on severe name changes such as "Melissa" (a person name) vs. "Mali" (a country name). Leveraging more subtle cases (e.g., mismatches based on wrong word senses instead of random mismatches with the same POS, or replacing names with names of the same 'type') would likely constitute even harder test cases for future metrics.
No Metric is Good Everywhere: Across distinct dimensions, different metrics perform differently, indicating that they capture varying aspects. For example, NLI metrics are not so good on fluency-related adversarial attacks, e.g., typos. This may be unsurprising, given that fluency is a low-level phenomenon while NLI concerns high-level logical relationships between sentences (some fluency phenomena would best be treated by switching to a lower-level representation space, such as character-level (Vu et al., 2022); this could seamlessly be integrated into existing NLI models). The NLI metrics are also weaker concerning segment-level MT evaluation on standard benchmarks. However, NLI metrics alone perform surprisingly well elsewhere, e.g., on standard summarization benchmarks.

Rescaling: The min-max normalization we used for metric combination (a standard technique for normalizing data in machine learning, typically applied to input features) requires batch processing. It is necessary to account for the different ranges of metrics, e.g., some metrics take negative values. An alternative would be to enforce more formal constraints on evaluation metrics, i.e., that they should take outputs in [0, 1]. When applying our combined metrics in practice, one could also replace them with surrogate metrics trained on the outputs of the original combined metrics, or simply take the min-max values inferred from the datasets already evaluated on; the larger these datasets, the more reliably min and max are estimated.
Sensitivity to w_nli: Having different weights w_nli for different tasks is undesirable, because it requires considering each task individually. However, in our experiments, we found that all small w_nli (below 0.5) yield good performance and are thus safe choices: they increase adversarial robustness and also lead to better metrics on standard benchmarks.
Adversarial Performance vs. Standard Performance: From our experiments, it might seem that adversarial and standard performance are anti-correlated: a metric with higher adversarial performance may have lower performance on standard benchmarks and vice versa. While this would not necessarily be a major surprise, as adversarial conditions oftentimes test phenomena that are otherwise not represented in standard benchmarks (Niven and Kao, 2019), a statistical analysis reveals that standard performance generally correlates positively with adversarial performance in our case, consistent with our earlier argument that existing real-world NLG systems do commit errors similar to those we check for. For this analysis, we first convert the metrics' standard performance to rankings for each performance category (e.g., ref-based/free segment/system-level MT performance, performance on SummEval/RealSumm); then we correlate the ranking-based standard performance with the corresponding adversarial performance rankings, obtaining 0.37 Spearman. When excluding NLI metrics, the correlation increases to 0.60.
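This rank-then-correlate procedure amounts to a plain Spearman correlation; a dependency-free sketch (average ranks for ties, then Pearson on the rank vectors; the input lists are illustrative, not the paper's actual numbers):

```python
def rankdata(values):
    """Average ranks (1-based); tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend the group while values are tied
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    rx, ry = rankdata(x), rankdata(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)
```

In practice `scipy.stats.spearmanr` gives the same result; the sketch just makes the ranking step explicit.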
The Choice of cand para Matters: As indicated in §3, we speculate that a good adversarial setting maximizes (surface) dissimilarity between ref and cand para (which can better trick the metrics).
To investigate, we compute the normalized edit distance between ref and cand para;7 a larger edit distance means a greater dissimilarity. If our assumption is true, then larger edit distances represent harder test cases for the metrics. We find: (1) the average edit distance for the test cases where the metrics fail to defend against the adversarial attacks is 0.01-0.6 larger than for those where they succeed, averaged over metrics; (2) for PAWS back and PAWS ori (both induced from PAWS), where the cand para are obtained in different ways, all metrics achieve 0.02-0.15 lower accuracy on PAWS ori, which in turn has a 0.46 larger average edit distance than PAWS back. Both findings confirm our assumption above. In addition, we observe that NLI metrics have the smallest difference between the edit distances for failure and success cases (0.01-0.26), as well as the smallest difference between the accuracy on PAWS back and PAWS ori (0.02), among all evaluation metrics. This implies that they are least affected by surface overlap and instead better consider the logical relationship between sentences. This is what makes them attractive as evaluation metrics.
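A normalized edit distance of this kind can be computed as the Levenshtein distance divided by the length of the longer string (a sketch; the paper's exact normalization may differ):

```python
def normalized_edit_distance(a, b):
    """Levenshtein distance between strings a and b, divided by max(len(a), len(b)),
    so the result lies in [0, 1]."""
    m, n = len(a), len(b)
    if max(m, n) == 0:
        return 0.0
    # classic dynamic program, keeping only the previous row
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1,        # deletion
                         cur[j - 1] + 1,     # insertion
                         prev[j - 1] + cost) # substitution
        prev = cur
    return prev[n] / max(m, n)
```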
The Choice of cand adv Matters, Too: We evaluate one complex attack combining Number error with Negation, which increases the difference between ref and cand adv, based on the test cases for Number error in WMT20 de. The accuracy increases by an average of 0.28 over all metrics. This confirms our assumption that maximizing the (surface) similarity between ref and cand adv (but with key errors) leads to harder test suites, and vice versa.
Ensembles with NLI Metrics Are More Effective: We compare ensembles with NLI metrics to ensembles with standard metrics, i.e., w · A + (1 − w) · M, where A is a fixed standard metric and M is any of the remaining metrics. To do so, we combine standard metrics with the remaining metrics for each category of MT/summarization and ref-based/-free setting. We take the arithmetic average of the accuracy on adversarial benchmarks and the correlations on standard benchmarks as the overall metric performance here. We calculate the mean/maximal improvement of the ensembles over the original metric M for w ∈ [0.1, 0.9] and observe: (i) while the ensembles with standard metrics are better for ref-free MT metrics, because cross-lingual NLI metrics perform very poorly in our experiments, (ii) the monolingual NLI metrics lead to much better ensembles (17/15 points larger mean/max improvement) compared to the standard metrics. (iii) Overall, the ensembles with NLI metrics yield 10/7 points larger mean/max improvement in overall performance than those with standard metrics (averaged over all 4 tasks: ref-based/-free MT/summarization). Thus, (monolingual) NLI metrics have unique properties compared to standard metrics, making them attractive in ensembles.
To illustrate, Figure 4 shows ensembles with BERTScore. These show minor or no improvements on standard benchmarks and also mixed (often negative) results for adversarial robustness.
SummaCZS and Falsesum: In §5, we applied NLI systems to whole input texts, not taking into account the multi-sentence nature of source texts and outputs, especially in summarization.
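One way to bridge this granularity gap is to split both texts into sentences and aggregate pairwise NLI scores, e.g., taking for each output sentence the maximum entailment score over source sentences and averaging (a sketch of this aggregation scheme only; `nli_entails` is a hypothetical stand-in for a real NLI model):

```python
def sentence_level_nli(doc_sents, summary_sents, nli_entails):
    """For each summary sentence, find its best-supported document sentence
    (max entailment score), then average over summary sentences."""
    per_sent = [
        max(nli_entails(d, s) for d in doc_sents)
        for s in summary_sents
    ]
    return sum(per_sent) / len(per_sent)
```

The max picks out the single document sentence that best supports each claim; the mean then rewards summaries all of whose sentences are supported somewhere in the document.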
To remedy the mismatch between the granularities of the training data of NLI models and the input data of summarization evaluation, i.e., sentence- vs. document-level, Laban et al. (2022) propose both supervised and unsupervised NLI-based summarization metrics for inconsistency detection. We test their unsupervised variant (SummaCZS),8 which segments documents into sentence units and aggregates scores between pairs of sentences, with NLI-R as the underlying model. However, SummaCZS does not consistently outperform NLI-R across all datasets; in contrast, NLI-R performs much better in our adversarial test compared to SummaCZS (72% vs. 53%). Besides, to match the training data of NLI models with the task of factual inconsistency detection in summarization, Utama et al. (2022) introduce Falsesum, an augmented NLI dataset with task-oriented examples based on CNNDM; we evaluate three RoBERTa-large models finetuned on it and MNLI. Similar to SummaCZS, this also does not always yield better performance.

Conclusion

In this work, we explored NLI as a general paradigm for evaluation metrics. We showed that NLI metrics yield adversarial robustness, and are also strong, though not always state-of-the-art, when it comes to standard metric evaluation benchmarks. By linearly interpolating established (BERT-based) metrics with our NLI metrics, we obtained high-quality metrics along both axes: adversarial robustness and standard benchmarks, with substantial gains over recent BERT-based metrics.
A potential reason why NLI-based metrics perform subpar on some standard benchmarks (especially in MT) is the training data mismatch, i.e., typical NLI datasets contain many artificial sentences of the type "A girl is playing on a piano". A further limitation is that cross-lingual NLI models are not yet of high enough quality and that most current NLI models are sentence-level, not document-level, with a few recent exceptions (Yin et al., 2021). Once these limitations of NLI are overcome, we believe that even better performance from NLI-based metrics can be expected; we consider this one of the most promising directions for future high-quality and robust evaluation metric design. Future work should also consider NLI metrics for other text generation tasks; the NLI paradigm looks especially promising for tasks that require comparison with human references, which oftentimes involve the concept of logical equivalence.

Figure 1: Average accuracy (values in each block) of all metrics per phenomenon over the adversarial datasets for ref-based MT evaluation. Darker color indicates higher accuracy and vice versa.

Figure 2: Accuracy on adversarial datasets and Pearson correlation with segment-level human judgements in WMT datasets of combined metrics with (X)NLI-R, averaged over datasets. The points on each path from the original metric to the NLI metric indicate w_nli = 0, 0.1, ..., 1. The purple line denoting the combination with ref-based COMET ends at a different point, since the corresponding adversarial performance is averaged over the 2 adversarial datasets containing source texts.

Figure 4: Accuracy on adversarial datasets and Pearson correlation with segment-level human judgements in WMT datasets of combined metrics with BERTScore, averaged over datasets. The green line denoting the combination with COMET ends at a different point, since the corresponding adversarial performance is only averaged over the 2 adversarial datasets containing source texts.

Table 1: Different paradigms for metric induction proposed in recent years.

Table 2: Examples of our adversarial test suite taken from WMT20 de. Red words indicate specific adversarial perturbations of the words in green. cand adv (ref-based) builds on ref, whereas cand adv (ref-free) builds on r (indicated by corresponding coloring in the first column). The preferences we query for are given in Eq. (1).

Table 3: Adversarial datasets. "Yes/no" indicates whether the dataset supports ref-based/-free adversarial evaluation. "ORI/BACK" denotes whether cand para (except for number error) is from the original datasets or from backtranslation. "#examples" refers to the avg. number of examples per phenomenon. XPAWS x denotes XPAWS de/fr/zh/ja.
We use the reference r with the highest ROUGE score to generate cand adv for the ref-free setting, while the remaining 10 references serve as ref. We refer to the adversarial dataset induced from SummEval as SE adv in the remainder. We obtain cand para as in the case of PAWS back.

Table 9: Kendall correlation with system-level human judgments in SummEval; Pearson correlation with summary-/system-level LitePyramid in RealSumm; accuracy on adversarial benchmarks, averaged over phenomena in SE adv. We bold the best performance on each criterion. "max/mean" denotes the aggregation method used for the multi-reference setting in ref-based evaluation on SummEval.

Table 10: Sample instances in adversarial datasets where standard metrics failed while NLI-R succeeded; ref-based setup. In the 4th and 5th columns, we show [score assigned to cand para]: [score assigned to cand adv] by standard metrics and NLI-R, respectively; robust metrics should give cand para higher scores. Green bold text indicates the anchor words/phrases to be perturbed, and the red text in cand adv refers to the corresponding perturbed text.