Abstract
Evaluating natural language generation (NLG) is a vital but challenging problem in natural language processing. Traditional evaluation metrics, which mainly capture content overlap (e.g., n-gram overlap) between system outputs and references, are far from satisfactory. In recent years, large language models (LLMs) such as ChatGPT have demonstrated great potential for NLG evaluation, and various LLM-based automatic evaluation methods have been proposed, including metrics derived from LLMs, prompting LLMs, fine-tuning LLMs, and human–LLM collaborative evaluation. In this survey, we first present a taxonomy of LLM-based NLG evaluation methods and discuss the pros and cons of each category. We then discuss several open problems in this area and point out future research directions.
1 Introduction
The evaluation of natural language generation (NLG) is an important but challenging issue. The lack of a single standard answer and the presence of multiple quality criteria make evaluating NLG more challenging than other NLP tasks. For example, in news summarization, a good summary should capture the key information from the source document, remain faithful to the source document, and be expressed in logically coherent and fluent language, but there is not a single “correct” way to achieve this. The inherent difficulty of NLG evaluation means that human evaluation is always needed and regarded as the gold standard. However, due to the high cost and time-consuming nature of human evaluation, automatic evaluation metrics remain indispensable and play a crucial role in model development. Over the past two decades, many automatic evaluation metrics such as BLEU (Papineni et al. 2002) and BARTScore (Yuan, Neubig, and Liu 2021) have been developed, but none have been fully satisfactory. Some studies (Sai et al. 2021; He et al. 2023) have highlighted their deficiencies in robustness, such as insensitivities, biases, or even loopholes when evaluating challenging texts.
Recently, large language models (LLMs) have emerged and demonstrated unprecedented capacities in following instructions, understanding content, and generating text, which inspires researchers to utilize LLMs for NLG evaluation. Although this is a research direction that only emerged in 2023, the past year has seen an enormous amount of relevant work. While there have been surveys on automatic evaluation metrics or human evaluation practices in NLG evaluation (Celikyilmaz, Clark, and Gao 2020; Hämäläinen and Al-Najjar 2021; Zhou et al. 2022; Sai, Mohankumar, and Khapra 2023; Gehrmann, Clark, and Sellam 2023; Zhou, Ringeval, and Portet 2023), none of them addresses the LLM-based evaluation approach, and a comprehensive survey of this area is urgently needed.
This survey focuses on research on LLM-based approaches for NLG evaluation, where LLMs refer to language models with over one billion parameters. Moreover, it mainly considers the typical scope of NLG tasks where both input and output are natural language, including machine translation, text summarization, story generation, dialogue response generation, data-to-text, text simplification, paraphrase generation, grammatical error correction, and creative writing. Broader areas such as the evaluation of LLMs themselves (Zhuang et al. 2023; Chang et al. 2024) are not included, because this work focuses on LLMs used for evaluation rather than the evaluation of LLMs’ capabilities. We searched the literature on Google Scholar by keyword, with an end date of June 2024. Because this direction only emerged in 2023, a considerable number of arXiv preprints are included in addition to papers published at *ACL and other related venues; about 100 pieces of work are covered in total. To maintain focus, this article neither discusses datasets and benchmarks in NLG evaluation (Gehrmann et al. 2021; Kim et al. 2024a) nor analyzes evaluation metrics statistically (Ni’mah et al. 2023; Xiao et al. 2023).
As shown in Figure 1, we categorize related studies into four categories according to how LLMs are utilized for NLG evaluation:
LLM-derived Metrics (§ 2): developing or deriving evaluation metrics from embeddings or generation probabilities of LLMs.
Prompting LLMs (§ 3): directly inquiring of existing LLMs via specific prompts and processes designed for evaluation.
Fine-tuning LLMs (§ 4): using labeled evaluation data to fine-tune existing LLMs and improve their NLG evaluation capabilities.
Human–LLM Collaborative Evaluation (§ 5): leveraging distinctive strengths of both human evaluators and LLMs to achieve robust and nuanced evaluations through human–LLM collaboration.
Figure 1: Schematic representation of our proposed four categories of LLM-based NLG evaluation.
LLMs have driven NLG evaluation toward a more human-centered direction, and the four categories we propose reflect this evolution: LLM-derived metrics are a continuation of traditional evaluation metrics and can only handle coarse-grained evaluation; prompting and fine-tuning methods enable users to express flexible evaluation requirements in natural language; collaborative evaluation goes a step further, making it possible for humans and LLMs to leverage their respective strengths. We review each type of evaluation method and discuss its pros and cons. Finally, we provide our conclusions and suggestions and discuss future directions in this area (§ 6).
It is worth stating that since LLM-based evaluation has shown unprecedented generality across NLG tasks, we do not summarize the literature for each task separately. Nevertheless, we provide a list documenting all the approaches in this survey, indicating the NLG tasks on which each approach has been tested.
2 LLM-derived Metrics
LLM-derived metrics can be viewed as a continuation of early model-based NLG evaluation metrics such as BERTScore and BARTScore, replacing traditional pre-trained language models with stronger LLMs. Such work can be categorized into two main types: embedding-based metrics (Es et al. 2023) and probability-based metrics. The latter can be further divided into two categories based on different ways of using probabilities: directly converting the probabilities into scores (Fu et al. 2023a; Varshney et al. 2023) and leveraging the variation in probabilities under changed conditions (Jia et al. 2023; Xie et al. 2023).
2.1 Embedding-based Metrics
Embedding-based methods, like BERTScore, generally utilize the representations of language models to compute the semantic similarity between the reference and the target text to be evaluated, with various possible implementations. However, unlike traditional embedding-based evaluation metrics, which require references, many LLM-based embedding metrics do not, because their application scenarios and implementation methods differ from those of traditional metrics. For example, when Es et al. (2023) evaluate the answer relevance of Retrieval Augmented Generation, given the original question q and the answer Y to be evaluated, they first prompt the LLM to generate n possible questions qi for Y. Then, the relevance of Y is represented by the average similarity between the generated questions and q, that is, (1/n) Σi sim(qi, q), where sim(qi, q) refers to the cosine similarity of the embeddings of qi and q. The embeddings are generated by OpenAI text-embedding-ada-002, which converts text into a 1536-dimensional vector capturing semantic information, so that similar texts are positioned close to each other in the vector space. Furthermore, Sheng et al. (2024) develop a more sophisticated method based on embeddings from an open-source decoder-only LLM, utilizing Principal Component Analysis to adapt it for both pointwise scoring and pairwise comparison.
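A minimal sketch of this answer-relevance computation is given below. The `embed` and `generate_questions` callables are hypothetical stand-ins for the embedding model and the prompted LLM, respectively; only the averaging of cosine similarities follows the description above.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def answer_relevance(question: str, answer: str, embed, generate_questions, n: int = 3) -> float:
    """Average similarity between the original question and n questions that an
    LLM generates from the answer (higher = more relevant answer).

    `embed` maps text to a vector (e.g., an embedding API) and
    `generate_questions` prompts an LLM to produce n plausible questions for
    the answer; both are hypothetical callables standing in for real models.
    """
    candidate_questions = generate_questions(answer, n=n)
    q_emb = embed(question)
    sims = [cosine_similarity(embed(q_i), q_emb) for q_i in candidate_questions]
    return float(np.mean(sims))
```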
2.2 Probability-based Metrics
One line of work converts generation probabilities directly into scores. For instance, Murugadoss et al. (2024) score the task output Y to be evaluated by its perplexity under the corresponding large language model θ, given only the task context X. They argue that this approach is unbiased by prompts and transparently measures alignment with model training data. Such methods have also been applied to hallucination detection in LLM-generated text (Varshney et al. 2023), with three different ways of calculating the probability score.
On the other hand, some works leverage the variation in probabilities under changed conditions as the evaluation metric. FFLM (Jia et al. 2023) evaluates the faithfulness of the target text by calculating a combination of probability changes, based on the intuition that the generation probability of a given text segment increases when more consistent information is provided, and vice versa. Similarly, DELTASCORE (Xie et al. 2023) measures the quality of different story aspects according to the likelihood difference between pre- and post-perturbation states, using LLMs that expose logits, including GPT-3.5 (text-davinci-003). The authors believe that sensitivity to specific perturbations indicates quality on the related aspects, and their experiments demonstrate the effectiveness of this approach.
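To make the probability-based family concrete, the sketch below computes the average per-token log-probability of an output conditioned on its task context with a Hugging Face causal LM, and then a DELTASCORE-style difference between an original and a perturbed output. GPT-2 and the example strings are illustrative stand-ins for the much larger LLMs and data used in the cited work.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 is a small stand-in for the much larger LLMs used in the cited work.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()


@torch.no_grad()
def avg_logprob(context: str, target: str) -> float:
    """Average per-token log-probability of `target` conditioned on `context`."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    tgt_ids = tokenizer(target, return_tensors="pt").input_ids
    input_ids = torch.cat([ctx_ids, tgt_ids], dim=1)
    logits = model(input_ids).logits
    # Positions ctx_len-1 .. end-2 are the ones that predict the target tokens.
    log_probs = torch.log_softmax(logits[0, ctx_ids.size(1) - 1:-1], dim=-1)
    token_lp = log_probs.gather(1, tgt_ids[0].unsqueeze(1)).squeeze(1)
    return token_lp.mean().item()


# Probability-as-score: a higher average likelihood of the output under the
# task context is taken to indicate higher quality.
context = "Article: The city council approved the park budget on Monday.\nSummary:"
print(avg_logprob(context, " The council approved the park budget."))

# DELTASCORE-style variation: the likelihood change after an aspect-targeted
# perturbation of the output is used as the quality score for that aspect.
original = " The council approved the park budget."
perturbed = " The council rejected the park budget."
print(avg_logprob(context, original) - avg_logprob(context, perturbed))
```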
2.3 Pros and Cons
Traditional NLG evaluation approaches often fall short because they rely on surface-form similarity, failing when the target text and reference convey the same meaning with different expressions. In contrast, LLM-derived metrics offer a remedy for this limitation and demonstrate stronger correlations with human judgments, benefiting from evolving modeling techniques. However, flaws within LLMs can lead to the issues discussed below.
Robustness
Some research has investigated the robustness of LLM-derived metrics and found that they lack robustness under different attack scenarios. Specifically, He et al. (2023) develop a set of stress tests to assess the robustness of various model-based metrics on common NLG tasks, presenting a catalogue of blind spots and potential errors that these metrics fail to detect.
Efficiency
Compared with traditional metrics, LLM-derived evaluation methods are more time-consuming and require more computational resources, especially when adopting LLMs with very large parameter scales. To address this, Eddine et al. (2022) propose an approach to learning a lightweight version of LLM-derived metrics, and fast LLM inference and serving tools such as the popular vLLM (Kwon et al. 2023) have been launched. vLLM improves memory utilization during inference through the PagedAttention algorithm, together with optimized memory management and batching strategies, thereby increasing LLMs’ generation throughput. However, closed-source LLMs often do not make their parameters, representations, or logits publicly available, making it impossible to apply LLM-derived methods to them.
Fairness
Sun et al. (2022) assess the social bias across various metrics for NLG evaluation on six sensitive attributes: race, gender, religion, physical appearance, age, and socioeconomic status. Their findings reveal that model-based metrics carry noticeably more social bias than traditional metrics. Relevant biases can be categorized into two types: intrinsic bias encoded within pre-trained language models and extrinsic bias injected during the computation of similarity. Therefore, current LLM-derived methods may have similar issues.
3 Prompting LLMs
The remarkable generation abilities of LLMs have expanded the possibilities for NLG evaluation. For a long time, human evaluation has been viewed as the gold standard for NLG evaluation. Recently, some studies claim that LLMs are on par with crowdsourcing annotators in several tasks (Törnberg 2023; Gilardi, Alizadeh, and Kubli 2023; Ostyakova et al. 2023; Cegin, Simko, and Brusilovsky 2023). This raises questions about whether LLMs could replace human evaluators. Studies in this area often involve feeding LLMs with detailed prompts that include both instructions and the text to be evaluated, with LLMs producing the evaluation outcomes. An example of prompting LLMs is shown in Figure 2. From this example, we can see that such a prompt is quite similar to the guidelines given to human evaluators. The main differences between this prompting method for LLMs and LLM-derived metrics are twofold: (1) LLM-derived metrics generally do not involve highly human-like prompts that require the LLM to perform an evaluation. (2) The evaluation results from prompting LLMs are typically generated directly by the LLM, whereas LLM-derived metrics require further transformation from embeddings and probabilities. We will describe existing works according to the five elements that they mainly focus on:
Evaluation Methods: The way the evaluation results of LLM evaluators are obtained, such as scoring and comparison.
Task Instructions: How LLM evaluators should read or manipulate different parts to complete the annotation.
Input Content: The target text to be evaluated, together with other required content such as source documents, references, and external knowledge, provided as needed.
Evaluation Criteria: The general definition of how good or bad the text to be evaluated is in a particular aspect of quality, e.g., fluency, faithfulness.
Role and Interaction: The roles LLM evaluators play in the evaluation and the interactions between them.
Figure 2: An example of prompting LLMs to evaluate the consistency of a summary. The prompt contains the role and interaction, task instructions, evaluation criteria, input content, and evaluation method, followed by the evaluation results, including the rating and explanation generated by the LLM.
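To make this concrete, the sketch below assembles the five elements into a consistency-evaluation prompt in the spirit of Figure 2 and sends it to an OpenAI-style chat client (assuming the openai Python package, v1+, and an API key). The prompt wording, model name, and example texts are illustrative assumptions, not the exact prompt used in any cited study.

```python
from openai import OpenAI  # any OpenAI-compatible chat client works similarly


def build_consistency_prompt(source: str, summary: str) -> str:
    """Assemble the five prompt elements for one summary-consistency example."""
    return (
        # Role and interaction
        "You will act as a careful human evaluator of summaries.\n\n"
        # Task instructions
        "Read the source document, then the summary, and evaluate the summary "
        "step by step before giving your rating.\n\n"
        # Evaluation criteria
        "Criterion - Consistency (1-5): the factual alignment between the summary "
        "and the source; a consistent summary contains only statements entailed "
        "by the source document.\n\n"
        # Input content
        f"Source document:\n{source}\n\nSummary:\n{summary}\n\n"
        # Evaluation method (scoring) and output format
        "Return a rating from 1 to 5 followed by a one-sentence explanation, "
        "in the format: 'Rating: <score>. Explanation: <reason>'."
    )


client = OpenAI()  # assumes OPENAI_API_KEY is set; the model name is illustrative
response = client.chat.completions.create(
    model="gpt-4o",
    temperature=0,
    messages=[{"role": "user", "content": build_consistency_prompt(
        source="The city council approved the new park budget on Monday.",
        summary="The council approved the park budget.")}],
)
print(response.choices[0].message.content)
```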
3.1 Evaluation Methods
Diverse evaluation methods have been used in prompting LLMs to obtain their preferences for the text to be evaluated: scoring, comparison, ranking, Boolean QA, and error analysis (Table 1).
Table 1: Representative studies on prompting LLMs for NLG evaluation.
Related Work | Evaluation Method | NLG Task |
---|---|---|
Chiang and Lee (2023a) | Scoring | Story Generation |
Wang et al. (2023a) | Scoring | Summarization, Data-to-text & Story Generation |
Kocmi and Federmann (2023b) | Scoring | Translation |
Lin and Chen (2023) | Scoring | Dialogue |
Mendonça et al. (2023) | Scoring | Dialogue |
Naismith, Mulcaire, and Burstein (2023) | Scoring | Discourse Generation |
Liusie, Manakul, and Gales (2023) | Scoring & Comparison | Summarization, Dialogue & Data-to-text |
Wang et al. (2023d) | Comparison | Personalized Text Generation |
Ji et al. (2023) | Ranking | Open-ended Text Generation |
Liu et al. (2023c) | Scoring, Ranking & Comparison | Summarization |
Wang, Funakoshi, and Okumura (2023) | Boolean QA | Question Generation |
Manakul, Liusie, and Gales (2023) | Boolean QA | Fact Verification |
Guan et al. (2023) | Boolean QA | Fact Verification |
Es et al. (2023) | Boolean QA | Retrieval Augmented Generation |
Kocmi and Federmann (2023a) | Error Analysis | Translation |
Lu et al. (2023) | Error Analysis | Translation |
Chang et al. (2023) | Error Analysis | Summarization |
Scoring
Scoring is the most commonly used evaluation method in human evaluation for NLG (van der Lee et al. 2021), and it is naturally applied to LLM-based evaluation. Chiang and Lee (2023a) conducted early studies, using a Likert scale from 1 to 5 to evaluate story generation and adversarial attacks with InstructGPT (Ouyang et al. 2022) and ChatGPT, showing that the evaluation results of LLMs are consistent with those of expert human evaluators. Kocmi and Federmann (2023b) find that GPT-3.5 and GPT-4, with rating scales from 1 to 5 or 0 to 100, achieve state-of-the-art accuracy in assessing translation quality against human labels, outperforming all results from the metrics shared task of WMT22 (Freitag et al. 2022). Furthermore, Wang et al. (2023a) experiment on five datasets across summarization, story generation, and data-to-text; with similar rating scales, ChatGPT achieves state-of-the-art or comparable correlations with human judgments in most settings compared with prior metrics. Similar conclusions are also observed in open-domain dialogue response generation (Lin and Chen 2023). Beyond English, Mendonça et al. (2023) show that ChatGPT with simple rating prompts is a strong evaluator for multilingual dialogue evaluation, surpassing prior encoder-based metrics.
Comparison
Different from absolute scoring, comparison refers to choosing the better of the two. Luo, Xie, and Ananiadou (2023) and Gao et al. (2023) use ChatGPT to compare the factual consistency of two summaries. AuPEL (Wang et al. 2023d) evaluates personalized text generation from three aspects in the form of comparison with the PaLM 2 family (Anil et al. 2023). According to Liusie, Manakul, and Gales (2023), pairwise comparison is better than scoring when medium-sized LLMs (e.g., FlanT5 [Chung et al. 2022] and LLaMa2 [Touvron et al. 2023]) are adopted as evaluators.
Ranking
Ranking can be viewed as an extended form of comparison. In comparison, only two examples are involved at a time, whereas in ranking, the order of more than two examples needs to be decided at once. Ji et al. (2023) use ChatGPT to rank five model-generated responses across several use cases at once, indicating the ranking preferences of ChatGPT align with those of humans to some degree. Similarly, GPTRank is a method to rank summaries in a list-wise manner (Liu et al. 2023c). Moreover, Liu et al. (2023b) compare different evaluation methods in LLM-based summarization including scoring, comparison, and ranking, showing that the optimal evaluation method for each backbone LLM may vary.
Boolean QA
Boolean QA requires LLMs to answer “Yes” or “No” to a question. It is adopted more in scenarios where human annotations are binary, such as grammaticality (Hu et al. 2023), faithfulness of summaries and statements (Luo, Xie, and Ananiadou 2023; Gao et al. 2023; Es et al. 2023; Hu et al. 2023), factuality of generated text (Fu et al. 2023b; Guan et al. 2023; Manakul, Liusie, and Gales 2023), and answerability of generated questions (Wang, Funakoshi, and Okumura 2023).
Error Analysis
Error Analysis refers to the evaluation of a text by looking for errors that occur in the text according to a set of predefined error categories. Multidimensional Quality Metrics (MQM) (Jain et al. 2023) is an error analysis framework prevalent in machine translation evaluation. According to MQM, Lu et al. (2023) and Kocmi and Federmann (2023a) use ChatGPT or GPT-4 to automatically detect translation quality error spans. BooookScore (Chang et al. 2023), an LLM-based evaluation metric, assesses the coherence of book summaries by identifying eight types of errors.
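As a concrete illustration of how error analysis turns into a score, the sketch below aggregates LLM-identified error spans with severity weights. The categories and weights follow common MQM practice but are assumptions here; exact values differ across studies.

```python
from dataclasses import dataclass

# Severity weights in the spirit of MQM (minor=1, major=5, critical=10);
# the exact values vary between studies and are only illustrative here.
SEVERITY_WEIGHTS = {"minor": 1.0, "major": 5.0, "critical": 10.0}


@dataclass
class ErrorSpan:
    category: str   # e.g., "accuracy/mistranslation", "fluency/grammar"
    severity: str   # "minor", "major", or "critical"
    span: str       # the offending text span identified by the LLM


def mqm_penalty(errors: list) -> float:
    """Aggregate LLM-identified error spans into a single penalty score
    (lower is better, 0 means no detected errors)."""
    return sum(SEVERITY_WEIGHTS[e.severity] for e in errors)


# Example: error spans parsed from an LLM's structured error-analysis output.
errors = [
    ErrorSpan("accuracy/mistranslation", "major", "bank of the river"),
    ErrorSpan("fluency/grammar", "minor", "he go"),
]
print(mqm_penalty(errors))  # 6.0
```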
3.2 Task Instructions
In human evaluation, task instructions usually come in the form of a task description, evaluation steps, or both. The task description states the annotation task in a more general way, while the evaluation steps, which can be regarded as a Chain-of-Thought, explicitly describe what to do at each step. In the context of prompting LLMs for NLG evaluation, we discuss three broad categories of influences: various prompt templates (Leiter et al. 2023; Kim et al. 2023a; Kotonya et al. 2023; He, Zhang, and Roth 2023), in-context examples (Jain et al. 2023; Kotonya et al. 2023; Hasanbeig et al. 2023), and whether LLMs are required to provide analyses or explanations (Chiang and Lee 2023b; Naismith, Mulcaire, and Burstein 2023).
Form and Requirements
Several studies from the Eval4NLP 2023 shared task (Leiter et al. 2023) have explored task instructions in various settings. For example, Kim et al. (2023a) experiment with different templates and lengths of task descriptions and evaluation steps, finding that providing clear and straightforward instructions, akin to those explained to humans, is more effective. Kotonya et al. (2023) generate task instructions with LLMs or use LLMs to improve existing ones. Moreover, Leiter and Eger (2024) conduct a larger-scale prompt exploration for the evaluation of machine translation and summarization based on the Eval4NLP 2023 shared task. Somewhat differently, He, Zhang, and Roth (2023) evaluate generative reasoning by asking LLMs first to generate their own answers and then to conduct a quantitative analysis of the text to be evaluated. Additionally, explicit evaluation requirements and output formats are typically included in the instructions, and the evaluation results are extracted using regular expression matching. Early LLMs sometimes provided unrecognizable evaluation results or refused to conduct the evaluation due to their limited instruction-following capabilities (Gao et al. 2023); this issue can be mitigated by sampling multiple times or assigning a random result as a fallback, and it rarely arises with the more advanced and powerful current LLMs.
Analysis and Explanations
LLMs are able to include analyses or explanations in their evaluations, which is a key point distinguishing them from previous automatic evaluation metrics. Early explorations into prompting LLMs for NLG evaluation mostly do not examine the impact of requiring LLMs to analyze and explain their evaluation results. However, Chiang and Lee (2023b) explore different types of evaluation instructions in summarization and dialogue evaluation, finding that explicitly asking LLMs to provide analysis or explanations achieves higher correlations with human judgments. Besides, the quality of the analyses and explanations generated by LLMs itself requires additional manual evaluation (Leiter et al. 2023). Naismith, Mulcaire, and Burstein (2023) compare explanations written by humans with those generated by GPT-4 and conduct a simple corpus analysis of the generated explanations, finding that GPT-4 has strong potential to produce ratings on discourse coherence that are comparable to human ratings, accompanied by clear rationales.
In-context Examples
As in other fields, demonstrations are sometimes needed when prompting LLMs for NLG evaluation. Specifically, Jain et al. (2023) use only in-context examples as task instructions, relying on LLMs to evaluate the quality of summaries. In scenarios where task descriptions or evaluation steps are included, Kotonya et al. (2023) compare the performance of LLMs as evaluators in both zero-shot and one-shot settings, finding that one-shot prompting does not bring improvements. Moreover, Hasanbeig et al. (2023) improve the performance of LLM evaluators by updating the in-context examples iteratively.
3.3 Input Content
The types of input content mainly depend on the evaluation criteria and are relatively fixed. For most task-specific evaluation criteria, such as the faithfulness of a summary (Luo, Xie, and Ananiadou 2023; Gao et al. 2023), the source document is needed in addition to the target text to be evaluated. For task-independent criteria, such as fluency (Hu et al. 2023; Chiang and Lee 2023b), only the text to be evaluated needs to be provided, though many studies also provide the source document (Wang et al. 2023a; Liusie, Manakul, and Gales 2023). Other types of input content can be provided as required by the specific task. Kocmi and Federmann (2023b) evaluate machine translation in two settings, with and without references, and find that even without references GPT-4 can outperform all existing reference-based metrics. Guan et al. (2023) provide relevant facts and context when evaluating whether a text conforms to the facts. As an exception, Shu et al. (2023) add the outputs of other automatic evaluation metrics to the input of the LLM.
3.4 Evaluation Criteria
Aspect-targeted evaluation is used in numerous studies of human evaluation for NLG tasks such as text summarization, story generation, dialogue, and text simplification. Evaluation criteria, that is, the definitions of aspects, are key in this context. Most evaluation criteria in LLM-based evaluation are directly derived from human evaluation, but a few studies have attempted to let LLMs generate or improve evaluation criteria. Liu et al. (2023e) use a few human-rated examples as seeds to let LLMs draft candidate evaluation criteria, and then filter them based on the performance of LLMs using these criteria on a validation set to obtain the final criteria. Kim et al. (2023c) design an LLM-based interactive evaluation system that uses LLMs to review the evaluation criteria provided by users, including eliminating ambiguities, merging criteria with overlapping meanings, and decomposing overly broad criteria. Additionally, Ye et al. (2023a) propose a hierarchical aspect classification system with 12 subcategories, demonstrating that under the proposed fine-grained aspect definitions, human evaluation and LLM-based evaluation are highly correlated. Moreover, the chain-of-aspects approach improves LLMs’ ability to evaluate a specific aspect by having them score related aspects before generating the final score (Gong and Mao 2023).
3.5 Role and Interaction
We include in this section the evaluation strategies that either use the same LLMs in different ways or involve different LLMs (Bai et al. 2023; Li, Patel, and Du 2023; Cohen et al. 2023). The former can be further divided into chain-style (Yuan et al. 2024; Fu et al. 2023b; Hu et al. 2023) and network-style interactions (Chan et al. 2023; Zhang et al. 2023b; Saha et al. 2023; Wu et al. 2023).
Chain-style Interaction
Inspired by human evaluators, Yuan et al. (2024) have LLMs score a batch of examples at a time, dividing the evaluation process into three stages: analysis, ranking, and scoring. Similar to QA-based evaluation metrics (Durmus, He, and Diab 2020), Fu et al. (2023b) assess the faithfulness of summaries in two stages: first treating the LLM as a question generator to produce questions from the summary, and then having the LLM answer these questions using the source document. In contrast, when Hu et al. (2023) use GPT-4 to evaluate the faithfulness of summaries, they first ask GPT-4 to extract event units from the summary, then verify whether these event units meet the requirements, and finally judge whether the event units are faithful to the source document.
Network-style Interaction
Unlike chain-style interactions, network-style interactions involve the dispersion and aggregation of information. In network-style interactions, LLMs on the same layer play similar roles. ChatEval (Chan et al. 2023) is a framework for evaluating content through debates among multiple LLMs, with three communication strategies designed among the three types of LLMs: One-By-One, Simultaneous-Talk, and Simultaneous-Talk-with-Summarizer. Zhang et al. (2023b) find that under certain conditions, widening and deepening the network of LLMs can better align its evaluation with human judgments. Saha et al. (2023) propose a Branch-Solve-Merge strategy, assigning LLMs the roles of decomposing problems, solving them, and aggregating answers, thereby improving the accuracy and reliability of evaluations. Wu et al. (2023) assume that different people such as politicians and the general public have different concerns about the quality of news summaries, use LLMs to play different roles in evaluation accordingly, and aggregate the results finally.
Different LLMs
Different from having the same LLM play different roles, some research has used different LLMs (such as GPT-4 and Claude) in their studies. The use of a single LLM as evaluator may introduce bias, resulting in unfair evaluation results. In light of this, Bai et al. (2023) design a decentralized peer-examination method, using different LLMs as evaluators and then aggregating the results. Further, Li, Patel, and Du (2023) let different LLMs serve as evaluators in pairwise comparisons and then have them go through a round of discussion to reach the final result. Additionally, Cohen et al. (2023) evaluate the factuality of texts through the interaction of two LLMs, where the LLM that generated the text acts as the examinee and the other LLM as the examiner.
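A minimal sketch of aggregating verdicts from several different LLM evaluators is shown below. The evaluator callables and the simple majority vote are illustrative simplifications of the peer-examination and discussion schemes described above.

```python
from collections import Counter


def aggregate_pairwise_verdicts(evaluators, comparison_prompt: str) -> str:
    """Majority vote over pairwise preferences from several different LLM
    evaluators (e.g., clients for GPT-4, Claude, and an open model).

    Each element of `evaluators` is a hypothetical callable that takes the
    comparison prompt and returns "A", "B", or "tie".
    """
    verdicts = [evaluate(comparison_prompt) for evaluate in evaluators]
    counts = Counter(verdicts)
    winner, support = counts.most_common(1)[0]
    # Fall back to "tie" when no verdict has a strict majority.
    return winner if support > len(verdicts) / 2 else "tie"
```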
3.6 Pros and Cons
The benefits of prompting LLMs for NLG evaluation are exciting. First, people can now express evaluation criteria and evaluation methods in natural language within the prompts given to LLMs, providing great flexibility: where previously specific evaluation metrics had to be designed for different NLG tasks or even different aspects of a single task, now only the prompts need to be modified. Second, LLMs are able to generate explanations while assessing texts, making this approach somewhat interpretable. Furthermore, in many NLG tasks, prompting LLMs for evaluation has achieved state-of-the-art correlations with human judgments.
However, as many studies have pointed out, this type of approach still has many limitations. Wang et al. (2023b) note that when using ChatGPT and GPT-4 for pairwise comparisons, the order of the two texts can affect the evaluation results, which is known as position bias. To alleviate this issue, Li et al. (2023c) propose a strategy of splitting, aligning, and then merging the two texts to be evaluated into the prompt. Also, LLM evaluators tend to favor longer, more verbose responses (Zheng et al. 2023) and responses generated by themselves (Liu et al. 2023a). Wu and Aji (2023) show that compared with answers that are too short or grammatically incorrect, answers with factual errors are considered better by LLMs. Liu et al. (2023d) demonstrate through adversarial meta-evaluation that LLMs without references are not suitable for evaluating dialogue responses in closed-ended scenarios: They tend to score highly on responses that conflict with the facts in the dialogue history. Zhang et al. (2023a) also present the robustness issues of LLMs in dialogue evaluation through adversarial perturbations. Shen et al. (2023) indicate that LLM evaluators have a lower correlation with human assessments when scoring high-quality summaries. In addition, Hada et al. (2023) state that LLM-based evaluators have a bias towards high scores, especially in non-Latin languages like Chinese and Japanese. Bavaresco et al. (2024) find that the performance of LLM-based evaluators exhibits significant variance depending on the dataset, evaluation criteria, and whether the evaluated texts are human-generated. Beyond these shortcomings of performance, both ChatGPT and GPT-4 are proprietary models, and their opacity could lead to irreproducible evaluation results.
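One widely used mitigation for position bias, distinct from the split-align-merge strategy above, is to query the judge with both candidate orders and accept only consistent verdicts. A minimal sketch with a hypothetical `judge` callable follows.

```python
def debiased_comparison(judge, instruction: str, answer_1: str, answer_2: str) -> str:
    """Query the LLM judge with both candidate orders and keep the verdict
    only if it is consistent; otherwise declare a tie.

    `judge(instruction, first, second)` is a hypothetical callable returning
    "first" or "second" for the better of the two presented answers.
    """
    forward = judge(instruction, answer_1, answer_2)   # answer_1 shown first
    backward = judge(instruction, answer_2, answer_1)  # answer_1 shown second
    if forward == "first" and backward == "second":
        return "answer_1"
    if forward == "second" and backward == "first":
        return "answer_2"
    return "tie"  # the verdict flipped with the order: likely position bias
```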
4 Fine-tuning LLMs
As mentioned above, despite the exciting performance of prompting LLMs like ChatGPT and GPT-4 for NLG evaluation, several shortcomings in practice are inevitable, such as high costs, possibly irreproducible results, and potential biases in LLMs. In response, recent research has shifted towards fine-tuning smaller, open-source LLMs specifically for evaluation purposes, aiming to achieve performance close to GPT-4 in NLG evaluation. Representative works of this type include PandaLM (Wang et al. 2023e), Prometheus (Kim et al. 2023b), Prometheus 2 (Kim et al. 2024b), Shepherd (Wang et al. 2023c), TIGERScore (Jiang et al. 2023), INSTRUCTSCORE (Xu et al. 2023), Auto-J (Li et al. 2023a), CritiqueLLM (Ke et al. 2023), JudgeLM (Zhu, Wang, and Wang 2023), Themis (Hu et al. 2024), CompassJudger-1 (Cao et al. 2024), and Self-Taught (Wang et al. 2024). Their main ideas are similar, involving the elaborate construction of high-quality evaluation data, followed by fine-tuning open-source foundation LLMs with specific methods. Nevertheless, there are certain discrepancies in the designs across different works, such as the usage of references and evaluation criteria. We have summarized the key different components of these methods in Table 2 and Table 3 for comparison, and we will elaborate on these in the following sections.
Table 2: Comparison of the different key components among the representative methods of fine-tuning LLMs (Part 1): data construction (instruction source, annotator, and scale) and the foundation LLM.
Method | Instruction Source | Annotator | Scale | Foundation LLM |
---|---|---|---|---|
PandaLM | Alpaca 52K | GPT-3.5 | 300K | LLaMA 7B |
Prometheus | GPT-4 Construction | GPT-4 | 100K | LLaMA-2-Chat 7B & 13B |
Prometheus 2 | FEEDBACK COLLECTION | GPT-4 | 200K | Mistral-7B Mixtral-8×7B |
Shepherd | Community Critique Data & 9 NLP Tasks Data | Human | 1317 | LLaMA 7B |
TIGERScore | 23 Distinctive Text Generation Datasets | GPT-4 | 48K | LLaMA-2 7B & 13B |
INSTRUCTSCORE | GPT-4 Construction | GPT-4 | 40K | LLaMA 7B |
AUTO-J | Real-world User Queries from Preference Datasets | GPT-4 | 4,396 | LLaMA-2-Chat 13B |
CritiqueLLM | AlignBench & ChatGPT Augmentation | GPT-4 | 9,332 | ChatGLM-2 6B, 12B & 66B |
JudgeLM | GPT4All-LAION, ShareGPT Alpaca-GPT4 & Dolly-15K | GPT-4 | 100K | Vicuna 7B, 13B & 33B |
Themis | NLG-Eval with 58 NLG Evaluation Datasets | Human & GPT-4 | 67K | LLaMa-3-8B |
Self-Taught | Screened WildChat | LLaMa-3-70B | 20K | LLaMa-3-70B |
CompassJudger-1 | Sampling from existing datasets | Mixture | 900K | Qwen-2.5 1.5B, 7B, 14B & 32B |
Table 3: Comparison of the different key components among the representative methods of fine-tuning LLMs (Part 2): evaluation method (result mode, details, and specific criteria) and whether a reference is required.
Method | Result Mode | Details | Specific Criteria | Reference Required |
---|---|---|---|---|
PandaLM | Comparison | Reason & Reference | Unified | No |
Prometheus | Scoring | Reason | Explicit | Yes |
Prometheus 2 | Scoring & Comparison | Reason | Explicit | Yes |
Shepherd | Overall Judgment | Error Identifying & Refinement | Unified | No |
TIGERScore | MQM | Error Analysis | Implicit | No |
INSTRUCTSCORE | MQM | Error Analysis | Implicit | Yes |
AUTO-J | Scoring & Comparison | Reason | Implicit | No |
CritiqueLLM | Scoring | Reason | Unified | Flexible |
JudgeLM | Scoring & Comparison | Reason | Unified | Flexible |
Themis | Scoring | Reason | Explicit | No |
Self-Taught | Comparison | Reason | Unified | No |
CompassJudger-1 | Scoring & Comparison | Reason | Explicit | No |
4.1 Data Construction
Diverse data with high-quality annotations is crucial for the fine-tuning of evaluation models, which mainly involves task scenarios, inputs, target texts to evaluate, and evaluation results. Early NLG evaluation research primarily focused on conventional NLG tasks, such as summarization and dialogue generation. Thus, the task scenarios, inputs, and target texts refer to the corresponding NLP task, source inputs of the task, and outputs generated by specialized systems based on task requirements, respectively. Mainstream datasets for these tasks predominantly use human annotators to provide evaluation results, which are often considered reliable.
With the recent rise of LLMs, the spectrum of NLG tasks has broadened to instruction-response scenarios that are more aligned with human needs, and traditional tasks like summarization, together with their source inputs, can be viewed as particular kinds of instructions. Meanwhile, responses generated by various general-purpose LLMs now typically serve as the target texts and require more flexible evaluation so that the performance of different LLMs can be compared, promoting further development. Therefore, to keep pace with current modeling techniques, most evaluation methods have adopted a similar instruction-response scenario.
The primary differences among these works lie in the construction of instructions, with the purpose of improving either diversity or reliability for better generalization of the fine-tuned model. PandaLM and JudgeLM sample entirely from common instruction datasets, such as Alpaca 52K, while CritiqueLLM adopts small-scale sampling followed by ChatGPT augmentation. In contrast, Prometheus and INSTRUCTSCORE rely on GPT-4 to generate all the instructions based on seed data, whereas Auto-J and Shepherd use real-world data. Moreover, because large-scale human annotation is impractical, most works utilize GPT-4 as a powerful annotator, except for PandaLM and Shepherd, which use GPT-3.5 and human annotation on small-scale community data, respectively. Themis focuses on NLG tasks and combines existing human evaluations with additional evaluations from GPT-4, selecting the more consistent instances as training data. Self-Taught uses the evaluation results of the model itself (LLaMa-3-70B) for fine-tuning, considering that it already possesses strong capabilities. During construction, these studies generally design detailed prompts or guidance and apply heuristic filtering and post-processing to mitigate noise. Overall, human annotation may offer higher quality but is difficult to scale, which in turn may hinder adequate model training, while construction with LLMs faces the opposite trade-off.
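As an illustration of what such constructed data can look like, the sketch below formats one annotated instance into an instruction-tuning example. The field names, prompt wording, and 1-5 rating scale are illustrative and do not follow any specific dataset above.

```python
import json
from typing import Optional


def build_training_example(instruction: str, response: str, criterion: str,
                           rating: int, reason: str,
                           reference: Optional[str] = None) -> dict:
    """Format one annotated evaluation instance as an instruction-tuning pair."""
    prompt = (
        "You are an evaluator. Assess the response to the instruction below on "
        f"the criterion '{criterion}', giving a rating from 1 to 5 and a reason.\n\n"
        f"Instruction:\n{instruction}\n\nResponse:\n{response}\n"
    )
    if reference is not None:
        prompt += f"\nReference answer:\n{reference}\n"
    completion = f"Rating: {rating}\nReason: {reason}"
    return {"prompt": prompt, "completion": completion}


# Write one example to a JSONL training file.
example = build_training_example(
    instruction="Summarize the article in one sentence.",
    response="The council approved the park budget on Monday.",
    criterion="consistency",
    rating=5,
    reason="Every statement in the summary is supported by the article.",
)
with open("eval_sft_data.jsonl", "w") as f:
    f.write(json.dumps(example) + "\n")
```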
4.2 Evaluation Method
As with prompting LLMs, the evaluation methods adopted in these studies are highly diversified, involving different evaluation criteria, result modes, and usages of references. Given that current instruction-response scenarios encompass different types of tasks, it is unsuitable to specify unified evaluation criteria as in traditional NLG tasks. Some work, like PandaLM, nevertheless adopts unified criteria, while other methods, such as TIGERScore and AUTO-J, let LLM annotators adaptively and implicitly reflect the required criteria in their evaluations. In particular, AUTO-J has meticulously crafted 332 evaluation criteria matched to different tasks. Furthermore, Prometheus and Themis explicitly incorporate evaluation criteria into the evaluation instructions, and CompassJudger-1 can work either with or without evaluation criteria, enabling flexible evaluation based on various customized criteria.
More details about the evaluation methods are shown in Table 3. All the methods require models to provide detailed information, such as the reasons behind their evaluation results, and the MQM mode yields more informative error analysis, offering stronger interpretability. Moreover, some methods do not require references and therefore have greater practical value; an even better option is to support both reference-based and reference-free evaluation, as JudgeLM and CritiqueLLM do.
4.3 Fine-tuning Implementation
Different studies implement the fine-tuning process on their selected open-source foundation LLMs, such as LLaMA, with their respective constructed data and some targeted settings. Prometheus maintains balanced data distributions during fine-tuning, including length and label. JudgeLM eliminates potential biases by randomly swapping the sample pairs to be compared and randomly removing references. INSTRUCTSCORE utilizes GPT-4 to provide error annotations for the intermediate outputs of the fine-tuned model for further supervised reinforcement. Based on preliminary experiments and manual analysis, TIGERScore determines appropriate ratios of different types of data during fine-tuning, which it claims to be crucial. CritiqueLLM trains separate models with and without references and explores the effects of data and model scale. Themis applies additional rating-guided preference optimization after fine-tuning, and Self-Taught utilizes the evaluation results of the fine-tuned model itself for self-iterative optimization, leading to surprising improvements. Compared with the vanilla fine-tuning setting, these methods improve the efficiency of model training and the robustness of evaluations.
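For concreteness, the sketch below prepares an open-source foundation LLM for such fine-tuning with a parameter-efficient LoRA configuration using the peft library. The base model, target modules, and hyperparameters are illustrative assumptions; several of the surveyed methods instead perform full fine-tuning.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Illustrative base model; any open foundation LLM (LLaMA, Mistral, Qwen, ...)
# can be substituted here.
base = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections only
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()
# The adapted model is then trained with standard supervised fine-tuning on the
# (prompt, rating + reason) pairs constructed in Section 4.1, e.g., using
# transformers.Trainer or a similar training loop.
```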
4.4 Pros and Cons
The shortcomings of prompting LLMs for NLG evaluation can be significantly alleviated by the customized construction of training data and specifically fine-tuned LLMs. For instance, most models in Table 2 have fewer than 14B parameters, facilitating low-cost inference in practice and good reproducibility, with performance comparable to GPT-4. Moreover, specific measures can be adopted at different stages to prevent certain biases found in GPT-4, such as randomly swapping the order of training pairs to counter position bias. Furthermore, this type of approach allows for continuous iteration and improvement of the model to address potential deficiencies or emerging issues discovered in future applications.
However, some inherent biases associated with GPT-4, such as self-bias, may still persist, as the data construction of most methods relies on GPT-4 for the critical evaluation annotation. On the other hand, many studies have chosen open-source foundation LLMs spanning three generations of the LLaMA series. With the recent rapid updates and improvements of open-source LLMs, it is intuitive that utilizing a more powerful foundation LLM should lead to better evaluation performance of the fine-tuned model. However, this entails repeating the fine-tuning process from scratch, with the associated computational expense, since directly migrating existing fine-tuned models to a new foundation LLM is challenging.
Additionally, although many existing methods aspire to more flexible and comprehensive evaluation through fine-tuning, demanding excessive evaluation settings may ultimately lead to poor performance or failure in model training, as AUTO-J and CritiqueLLM were found to have difficulties with criteria and references, respectively. However, there is some disagreement here, since Prometheus, JudgeLM, and CompassJudger-1 show different results, indicating that such a seemingly straightforward fine-tuning process is actually quite complex. Moreover, considering the different evaluation settings in existing work, conducting a horizontal comparison among them is challenging. These issues require further exploration in future research.
5 Human–LLM Collaborative Evaluation
Human evaluation remains the gold standard for NLG due to its ability to capture nuanced aspects of quality. However, it is expensive, time-consuming, and prone to subjective biases (van der Lee et al. 2021; Deriu et al. 2021; Li et al. 2023b). While LLMs offer a promising avenue for automated evaluation, their reliability and correlation with human judgment are still areas of active development (Li et al. 2023c; Liu et al. 2023d). Human–LLM collaborative evaluation seeks to leverage the strengths of both: the nuanced judgment of humans and the scalability and efficiency of LLMs. This section explores emerging paradigms in this collaborative space, focusing on how humans and LLMs can work together to improve the accuracy, efficiency, and trustworthiness of NLG evaluation. These collaborative approaches include traditional evaluation tasks such as scoring and explaining (Zhang, Ren, and de Rijke 2021; Li et al. 2023b), general evaluation tasks such as testing and debugging (Ribeiro and Lundberg 2022), auditing NLG models to ensure fairness (Rastogi et al. 2023), aligning LLM-assisted evaluation of LLM outputs with human preferences (Shankar et al. 2024), and addressing the intricate challenge of scalable oversight (Amodei et al. 2016; Saunders et al. 2022).
5.1 Human-guided LLM Evaluation
Some work (Zhang, Ren, and de Rijke 2021; Li et al. 2023b) focuses on approaches where LLMs perform the primary evaluation task, but with significant guidance and oversight from humans. This guidance can take several forms, from designing detailed evaluation criteria to refining LLM outputs.
One common method is checklist-based evaluation. A key challenge in open-ended NLG tasks is the lack of consistent evaluation criteria. Li et al. (2023b) address this with COEVAL, a collaborative pipeline where humans design a task-specific checklist. LLMs then use this checklist to generate initial evaluations and explanations, drawing on developments in explainable NLP (Yin and Neubig 2022; Jung et al. 2022; Ribeiro and Lundberg 2022; Ye et al. 2023b). Humans then scrutinize these LLM-generated evaluations, refining scores and explanations. This approach leverages the LLM’s ability to process large amounts of text while retaining human oversight to ensure accuracy and reduce outliers. Notably, human review still leads to revisions in approximately 20% of LLM scores, highlighting the importance of human judgment. Furthermore, InteractEval (Chu, Kim, and Yi 2025) combines human- and LLM-generated attributes using the Think Aloud method to create questions and produce final prediction scores: human experts verbalize their thoughts and LLMs articulate their knowledge to derive insights about text attributes from sample texts and evaluation rubrics, which highlights the value of effectively combining humans and LLMs in automated checklist-based text evaluation.
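A minimal sketch of this checklist-based pipeline is given below. The checklist items and the `llm_judge` and `human_review` callables are hypothetical placeholders for the LLM call and the human review step in COEVAL.

```python
def checklist_collaborative_eval(checklist, llm_judge, human_review,
                                 source: str, text: str) -> list:
    """Checklist-based human-LLM evaluation in the spirit of COEVAL: the LLM
    drafts per-item judgments, then a human reviews and may revise them.

    `llm_judge(item, source, text)` and `human_review(item, draft)` are
    hypothetical callables for the LLM call and the human-in-the-loop step.
    """
    results = []
    for item in checklist:                      # e.g., "Is every claim supported by the source?"
        draft = llm_judge(item, source, text)   # initial score + explanation from the LLM
        final = human_review(item, draft)       # human keeps or revises the draft
        results.append({"item": item, "llm_draft": draft, "final": final})
    return results
```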
Collaborative assignment is also useful for human-guided LLM evaluation. Zhang, Ren, and de Rijke (2021) propose HMCEval, a framework that frames dialogue evaluation as a sample assignment problem. This approach aims to optimize the allocation of evaluation tasks between humans and machines to maximize accuracy while minimizing human effort. HMCEval achieves high accuracy (99%) with significantly reduced human involvement (half the effort). Additionally, EvalAssist (Ashktorab et al. 2024) can help practitioners refine evaluation criteria using both direct and pairwise assessment strategies. Ashktorab et al. (2024) also examine how users refine their criteria and identify key differences between the two evaluation approaches examined.
5.2 LLM-assisted Human Evaluation
Some studies (Ribeiro and Lundberg 2022; Rastogi et al. 2023; Pozdniakov et al. 2024) explore scenarios where humans remain the primary evaluators, but LLMs provide assistance to improve efficiency, identify flaws, or audit for biases.
Ribeiro and Lundberg (2022) introduce AdaTest, a system where LLMs generate unit tests to identify bugs in a target NLG model. Human feedback guides the LLM, significantly increasing the effectiveness of bug detection (a 5x–10x improvement). This demonstrates the power of LLMs in generating diverse test cases guided by human intuition. For evaluating machine translation systems, Zouhar, Kocmi, and Sachan (2025) assist annotators by pre-filling error annotations with recall-oriented automatic quality estimation, halving the time per span annotation while maintaining the same annotation quality and cutting the annotation budget by almost 25%.
Addressing biases and irresponsible behavior in LLMs is crucial (Blodgett et al. 2020; Jones and Steinhardt 2022). AdaTest++ (Rastogi et al. 2023), drawing on human-AI collaboration research, facilitates collaborative auditing. Humans leverage their strengths in schematization and hypothesis testing, while LLMs assist in identifying a wide range of failure modes. This collaborative approach uncovered both previously known and under-reported issues.
Evaluating LLMs on complex tasks can be challenging even for humans (Chen et al. 2021; Nakano et al. 2021; Li et al. 2022; Menick et al. 2022). The concept of scalable oversight (Amodei et al. 2016) suggests using AI to assist in evaluation. Saunders et al. (2022) explore using LLM-generated critiques to help humans identify flaws in model outputs, demonstrating that this form of assistance improves human performance. Moreover, Pozdniakov et al. (2024) focus on designing conversational user interfaces that help educators use LLMs to evaluate student assignments.
5.3 Pros and Cons
Human–LLM collaborative evaluation offers a compelling balance between the accuracy of human judgment and the efficiency of automated methods. Key advantages include: (1) Efficiency and Cost-Effectiveness: LLMs can significantly reduce the time and resources required for evaluation. (2) Complementary Strengths: Humans excel at nuanced judgment and critical thinking, while LLMs excel at processing large amounts of data and generating diverse outputs. (3) Improved Accuracy: Combining human and LLM strengths can lead to more accurate and reliable evaluations than either approach alone.
However, challenges remain: (1) Prompt Sensitivity: LLM evaluation results can be sensitive to the phrasing of prompts, requiring careful prompt engineering (Li et al. 2023b; Rastogi et al. 2023). (2) Confidence Calibration: LLMs’ ability to accurately assess their own confidence is still limited, making it difficult to know when to trust their judgments. (3) Need for Human Oversight: Although reduced, human supervision is still necessary, limiting the potential for full automation. (4) Explainability: Ensuring the collaborative process is transparent and understandable can be challenging.
6 Conclusions and Future Trends
6.1 Comparison with Traditional Evaluation Metrics
Traditional evaluation metrics are criticized for their poor correlation with human judgments (Stent, Marge, and Singhai 2005), uninterpretable evaluation results (Zhang, Vogel, and Waibel 2004), and inability to adapt to specific evaluation criteria (Wiseman, Shieber, and Rush 2017), shortcomings that LLM-based evaluation greatly mitigates. However, higher costs, greater computing-resource requirements, and reproducibility issues are the downsides.
6.2 Comparison Between Different Types of LLM-based NLG Evaluation
We compare different types of LLM-based evaluation according to flexibility and reproducibility due to the difficulty of comparing the effectiveness of different types of methods in various scenarios.
Flexibility
Human–LLM Collaborative Evaluation > Prompting LLMs > Fine-tuning LLMs > LLM-derived Metrics. Human–LLM Collaborative Evaluation involves human annotators, which provides the highest flexibility. LLM-derived Metrics are typically designed to evaluate specific aspects, such as text similarity, and do not fully allow evaluation criteria to be expressed in natural language, making them the least flexible. Between Prompting LLMs and Fine-tuning LLMs, the former typically relies on proprietary models, which generally follow instructions better than the smaller open-source models used for fine-tuning.
Reproducibility
LLM-derived Metrics ≈ Prompting LLMs > Fine-tuning LLMs > Human–LLM Collaborative Evaluation. Human–LLM Collaborative Evaluation requires human annotators, and the recruitment and training of these annotators pose greater challenges to reproducibility. LLM-derived Metrics and Prompting LLMs do not modify the existing models, and therefore have better reproducibility than Fine-tuning LLMs. However, they may still become non-reproducible if proprietary models are deprecated.
Performance
Human–LLM Collaborative Evaluation > Fine-tuning LLMs ≈ Prompting LLMs > LLM-derived Metrics. We compare the performance of different LLM-based evaluation approaches on the most commonly used NLG evaluation benchmark on summarization, SummEval (Fabbri et al. 2021), as shown in Table 4. When using the same LLMs, LLM-derived metrics perform worse than directly prompting LLMs for evaluation, and the latter is more convenient. Moreover, among methods of fine-tuning LLMs, only models focused on NLG evaluation scenarios, such as Themis, outperform prompting-based methods, including those using GPT-4. Other studies either use relatively outdated foundation LLMs or lack training on specific evaluation aspects like those in SummEval, leading to relatively weaker performance. Furthermore, Human–LLM Collaborative Evaluation enhances the LLM evaluation by incorporating checklists elaborated with human expert insights and LLM knowledge, resulting in the strongest performance.
Method | Parameters | Coherence | Consistency | Fluency | Relevance | Overall |
---|---|---|---|---|---|---|
Traditional Metrics | ||||||
BERTScore | 355M | 0.285 | 0.151 | 0.186 | 0.302 | 0.231 |
BARTScore | 400M | 0.474 | 0.266 | 0.258 | 0.318 | 0.329 |
LLM-derived Metrics | ||||||
GPTScore (FT5) | 11B | 0.456 | 0.438 | 0.424 | 0.343 | 0.415 |
GPTScore (OPT) | 66B | 0.359 | 0.453 | 0.380 | 0.337 | 0.382 |
GPTScore (GPT-3) | 175B | 0.434 | 0.449 | 0.403 | 0.381 | 0.417 |
GPTScore (Phi-4) | 14B | 0.319 | 0.436 | 0.386 | 0.154 | 0.324 |
GPTScore (LLaMa-3.1) | 70B | 0.415 | 0.478 | 0.437 | 0.288 | 0.405 |
GPTScore (Qwen-2.5) | 72B | 0.447 | 0.486 | 0.437 | 0.376 | 0.436 |
Prompting LLMs | ||||||
G-Eval (GPT-3.5) | – | 0.440 | 0.386 | 0.424 | 0.385 | 0.409 |
G-Eval (GPT-4) | – | 0.582 | 0.507 | 0.455 | 0.548 | 0.523 |
Phi-4 | 14B | 0.479 | 0.454 | 0.421 | 0.452 | 0.451 |
LLaMa-3.1 | 70B | 0.510 | 0.387 | 0.317 | 0.494 | 0.427 |
Qwen-2.5 | 72B | 0.515 | 0.509 | 0.435 | 0.528 | 0.497 |
Fine-tuning LLMs | ||||||
INSTRUCTSCORE | 7B | 0.328 | 0.232 | 0.260 | 0.211 | 0.258 |
Prometheus 2 | 7B | 0.403 | 0.318 | 0.269 | 0.356 | 0.336 |
Themis | 8B | 0.566 | 0.600 | 0.571 | 0.474 | 0.553 |
TIGERScore | 13B | 0.381 | 0.427 | 0.363 | 0.366 | 0.384 |
CompassJudger-1 | 32B | 0.494 | 0.424 | 0.318 | 0.410 | 0.411 |
Human–LLM Collaborative Evaluation | ||||||
InteractEval (GPT-3.5 1st) | – | 0.583 | 0.630 | 0.734 | 0.614 | 0.640 |
InteractEval (GPT-3.5 2nd) | – | 0.590 | 0.614 | 0.726 | 0.623 | 0.638 |
InteractEval (GPT-4 1st) | – | 0.649 | 0.799 | 0.783 | 0.626 | 0.714 |
InteractEval (GPT-4 2nd) | – | 0.660 | 0.781 | 0.816 | 0.642 | 0.725 |
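The numbers in Table 4 come from meta-evaluation, that is, correlating automatic scores with human judgments on the same texts. A minimal sketch of such a correlation computation is shown below; the scores are made up purely to illustrate the calculation, and the exact correlation type and aggregation level vary across studies.

```python
from scipy.stats import kendalltau, spearmanr

# Aspect scores for the same set of summaries, one list from human annotators
# and one from an automatic evaluator (values are illustrative only).
human_scores = [4.0, 2.5, 5.0, 3.0, 1.5]
metric_scores = [3.6, 2.9, 4.8, 3.1, 2.0]

rho, _ = spearmanr(human_scores, metric_scores)
tau, _ = kendalltau(human_scores, metric_scores)
print(f"Spearman: {rho:.3f}  Kendall: {tau:.3f}")
```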
Cost
LLM-derived Metrics ≈ Prompting open-source LLMs < Fine-tuning LLMs ≈ Prompting proprietary LLMs < Human–LLM Collaborative Evaluation. When using the same open-source LLM, the inference costs of LLM-derived metrics, prompting LLM, and fine-tuning LLM methods are the same, while fine-tuning LLM incurs additional training costs. When prompting proprietary LLMs, the cost is high and mainly concentrated in API calls during evaluation, making it difficult to directly compare with the training cost required for fine-tuning LLM. Moreover, human–LLM collaborative evaluation requires the involvement of human experts for each task, making it the most expensive approach.
6.3 Future Directions
Unified Benchmarks for LLM-based NLG Evaluation Approaches
As mentioned above, each of the studies that fine-tuned LLMs to construct specialized evaluation models uses different settings and data during testing, making them incomparable. In the research on prompting LLMs for NLG evaluation, there are some publicly available human judgments on the same NLG task, such as SummEval for summarization. However, the existing human judgments have many problems. First, most of the existing data only involve one type of NLG task and a single human evaluation method (e.g., scoring), making it difficult to evaluate LLMs’ performance on different tasks, as well as using different evaluation methods on the same task. Second, many of the texts in these human judgments are generated by outdated models (such as Pointer Network) and do not include texts generated by more advanced LLMs. Lastly, many human evaluation datasets are too small in scale. There is an urgent need for large-scale, high-quality human evaluation data covering various NLG tasks and evaluation methods as a benchmark.
NLG Evaluation for Low-resource Languages and New Task Scenarios
Almost all existing research focuses on English data. However, it is doubtful whether LLMs have similar levels of NLG evaluation capability for texts in other languages, especially low-resource languages. As Zhang et al. (2023a) point out, we should be more cautious about using LLMs to evaluate texts in non-Latin languages. We believe that the lack of evaluation capability of LLM-based evaluators on low-resource languages may be due to the insufficient presence of these languages in the pretraining corpus. Therefore, further fine-tuning on certain low-resource languages may be a potential strategy to address this issue, and Hada et al. (2024) have already shown promising preliminary results. Additionally, existing research mainly focuses on more traditional NLG tasks such as translation, summarization, and dialogue. However, there are many new scenarios in reality with different requirements and evaluation criteria. For example, using LLMs to automatically evaluate scientific reviews could be valuable in identifying and flagging content that is unfaithful or unclear, alerting reviewers to potential issues. Research on low-resource languages and new task scenarios will provide a more comprehensive understanding of LLMs’ evaluation capabilities.
Diverse Forms of Human–LLM Collaborative NLG Evaluation
According to the literature reviewed above, there is still little research on collaborative evaluation between humans and LLMs. Neither humans nor LLMs are perfect, and each has its strengths. Because the ultimate goal of NLG evaluation research is to assess text quality more accurately and efficiently, we believe that collaboration between humans and LLMs can achieve better results than purely human or purely automatic evaluation. In such collaboration, technologies from the field of human–computer interaction may bring new implementation methods. In addition, what roles humans and LLMs should play in the evaluation and how they can better complement each other are still worth researching.
Acknowledgments
This work was supported by Beijing Science and Technology Program (Z231100007423011) and Key Laboratory of Science, Technology and Standard in Press Industry (Key Laboratory of Intelligent Press Media Technology). We appreciate the anonymous reviewers for their helpful comments. Xiaojun Wan is the corresponding author.