Abstract
Recent advances in LLMs have led to an abundance of evaluation benchmarks, which typically rely on a single instruction template per task. We create a large-scale collection of instruction paraphrases and comprehensively analyze the brittleness introduced by single-prompt evaluations across 6.5M instances, involving 20 different LLMs and 39 tasks from 3 benchmarks. We find that different instruction templates lead to very different performance, both in absolute terms and in relative model rankings. In response, we propose a set of diverse metrics computed over multiple instruction paraphrases, specifically tailored for different use cases (e.g., LLM vs. downstream development), ensuring a more reliable and meaningful assessment of LLM capabilities. We show that our metrics provide new insights into the strengths and limitations of current LLMs.
1 Introduction
Recent years have seen an explosion of large language models (LLMs), which generalize to unseen tasks via natural language instructions. Various LLM evaluation benchmarks, such as BIG-bench and HELM, use a single instruction template per task, evaluating all models against it (Srivastava et al., 2023; Liang et al., 2023). However, there could be a myriad of ways to phrase an instruction template for a given task; see Figure 1 for examples of different templates for the task of recognizing homophones. Naturally, LLM performance depends on the chosen template.
We explore the question of robustly comparing different models on a given task. We first create a dataset of paraphrased instructions, employing three automatic paraphrasing methods based on recent techniques such as chain-of-thought prompting. We manually verify and filter a large collection of more than 175 paraphrases per task (5K instruction paraphrases in total), which we make publicly available for future research.1
Next, we use our dataset to perform a large-scale statistical evaluation of over 6.5M instances, involving 20 different LLMs and 39 tasks from 3 benchmarks. We find that models perform very differently on different instruction paraphrases. For example, Figure 1 shows four models evaluated on four semantically equivalent prompts, with both absolute and relative performance varying widely; one can even observe cases where the same model performs best on one instruction and worst on a semantically equivalent instruction (e.g., GPT-3.5-Turbo on P1 vs. P4). Consequently, we argue that very little can be said about either absolute or relative performance based on single-instruction evaluation. This may also partially explain why some models seem less accurate in practice than their formal evaluation suggests.
Note that while the claim that evaluating against a single instruction template leads to brittle results is not surprising per se, to the best of our knowledge it has never been subjected to rigorous empirical testing before.
To address the limitations of single-instruction evaluation, we propose to take a step back and consider multi-prompt LLM evaluation — a set of metrics which measure aggregated performance over a set of instruction template paraphrases.
We argue that different use cases should entail different evaluation metrics. For example, LLM developers may be interested in measuring the robustness of performance across multiple instruction templates. In contrast, developers aiming to integrate an LLM into a specific downstream task may be interested in comparing models according to their corresponding top-performing instruction.
We evaluate 20 LLMs with our metrics, finding that their absolute and relative performance differ from results obtained with the benchmarks’ original instructions. We demonstrate that different models excel in different metrics: For instance, in the LMentry benchmark, LLaMA-based models are comparable to T5-based models when looking at top-performing instructions, but lag behind when average performance is considered, due to poor performance on a large number of paraphrases. We also show that our automatic paraphrasing method is effective, and there is no need to manually verify the paraphrases.
Our results suggest that future work should use multi-prompt LLM evaluations and choose a metric for aggregating the results according to the extrinsic needs of the evaluators. We hope that our work will help spur more consistency and comparability in LLM evaluation, which is strongly tied to real-world usage of LLMs.
2 Background and Definitions
Below we survey how generalization to a new task format is evaluated and compared between LLMs, finding that the common practice involves a single (or very few) task instruction templates. In the rest of the paper, we will argue that such practice leads to brittle, unreliable results.
Task Instruction Templates.
Following Mishra et al. (2022) and Chung et al. (2024), we distinguish between the task instruction, the input samples, and the input-output exemplars that may be provided during in-context learning. We define an instruction template for a given task as a string with placeholders into which the input samples are inserted. As seen in Figure 1, the same task can be described using different task instruction templates.
Evaluation Benchmarks.
Several recent efforts aim to standardize LLM evaluation. Notable examples include MMLU (Hendrycks et al., 2020), BIG-bench (Srivastava et al., 2023; Suzgun et al., 2023), and HELM (Liang et al., 2023). In all of these, each task has a single instruction template, against which all models are evaluated. Another benchmark, LMentry (Efrat et al., 2023), reports models’ average performance on three instruction templates. The instruction templates are provided with these benchmarks, allowing new models to be tested against the same template.
We note that many notable works do not disclose the instruction templates used for evaluation (e.g., LLaMA [Touvron et al., 2023], PaLM [Chowdhery et al., 2023], GPT-4 [Achiam et al., 2023], Gemini [Team et al., 2023]). While there are reasons to withhold instructions (e.g., to avoid potential leakage), this practice exacerbates the challenge of meaningful comparative evaluation.
Prompt Robustness.
Related to this study is a line of work measuring LLMs’ robustness to prompt (or instruction template) modifications. Unlike our work, these typically aim to measure model performance against adversarial paraphrasing approaches. PromptBench (Zhu et al., 2023) measures performance on erroneous instructions (e.g., instructions written by non-native English speakers). They then compare performance on perturbed instructions vs. the benchmark’s original instructions, which are considered the gold-standard reference. Gu et al. (2023) examined a single LLM’s robustness under various instruction perturbations, including word-, sentence-, and instruction-level changes. Sun et al. (2023) show that LLMs perform better on instructions they have seen in training compared to manual paraphrases. We later incorporate their manual paraphrases in our evaluation of BIG-bench Lite.
In contrast to works on prompt robustness, we analyze the impact of the choice of prompt in terms of both absolute and relative model performance, covering a wide range of models and several different metrics.
3 Experimental Setup
3.1 Tasks
We evaluate 39 diverse tasks from three evaluation benchmarks, as itemized below.
10 Tasks from LMentry (Efrat et al., 2023).
LMentry consists of simple linguistic tasks (e.g., “write a word that doesn’t contain the letter l”), each accompanied by three associated instruction templates. The tasks are designed to capture explainable and controllable linguistic phenomena. We choose the 10 tasks that received the lowest scores in the original paper, as these more challenging tasks are likely to better highlight the differences between models.
14 Tasks from BIG-bench Lite (BBL; Srivastava et al., 2023).
15 Tasks from BIG-bench Hard (BBH; Suzgun et al., 2023).
This is another curated subset of BIG-bench, containing particularly challenging tasks on which LLMs underperform the average human score. We focus on a set of 15 classification and multiple-choice tasks to streamline the evaluation process. Each task in BBH is associated with a single instruction template.
Measuring Performance.
In LMentry we measure performance using the official evaluation script, while in BIG-bench we perform exact string matching. We note that while exact matching is somewhat strict, we believe it is also fair and straightforward.
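As an illustration, a minimal exact-match scorer might look as follows; whether predictions are stripped of surrounding whitespace or otherwise normalized is our assumption, as the text only states that exact string matching is used.

```python
def exact_match(prediction: str, target: str) -> int:
    """Minimal exact-match scorer in the spirit of the BIG-bench evaluation.

    Stripping surrounding whitespace is an assumption; any further
    normalization (e.g., lowercasing) would make the metric less strict.
    """
    return int(prediction.strip() == target.strip())

# toy usage: two predictions, one correct
accuracy = sum(exact_match(p, t) for p, t in [("Yes", "Yes"), ("No ", "Yes")]) / 2
```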
3.2 Models
We evaluate 16 instruction-tuned LLMs from 11 diverse model families (Chung et al., 2024; Sanh et al., 2021; Taori et al., 2023; Zheng et al., 2024; Durbin, 2023; Ding et al., 2023; NousResearch, 2023; Almazrouei et al., 2023; Team, 2023; Collective, 2023) (see Table 1). We refrain from including closed, API-based models (e.g., OpenAI models) in our main evaluation for two reasons. First, using them at scale is expensive; for example, running our entire evaluation suite on GPT-4 would cost thousands of dollars. Second, and more importantly, the closed API for these models reportedly manipulates the input prompts in an undisclosed manner (e.g., wrapping them with meta-prompts, or rerouting to other models) (Rao et al., 2023), which interferes with our evaluation. We do, however, perform a small-scale evaluation of OpenAI models in Section 7 to show that they are also sensitive to prompt paraphrasing.
| Model | Model size | Base model | # Params |
|---|---|---|---|
| Flan-T5 | Small | T5 | 80M |
| Flan-T5 | Base | T5 | 250M |
| Flan-T5 | Large | T5 | 780M |
| Flan-T5 | XL | T5 | 3B |
| Flan-T5 | XXL | T5 | 11B |
| T0 | Small | T5 | 3B |
| T0 | T0pp | T5 | 11B |
| Alpaca | Small | LLaMA | 7B |
| Alpaca | Big | LLaMA | 13B |
| Vicuna | – | LLaMA | 13B |
| Airoboros | – | LLaMA | 13B |
| UltraLM | – | LLaMA | 13B |
| Nous-Hermes | – | LLaMA | 13B |
| Falcon-Instruct | – | Falcon | 7B |
| MPT | – | MPT | 7B |
| Minotaur | – | StarCoder Plus | 15B |
4 Evaluating against a Single Prompt Leads to Instability in Results
As discussed in the previous section, LLMs are usually evaluated against a single instruction template. In this section, we will show that this approach is quite brittle. Indeed, a simple rephrasing of the instruction template can lead to drastic changes in both absolute and relative model performance.
In Section 4.1 we create a large number of automatically generated instruction paraphrases for tasks from the LMentry and BBH benchmarks. Paraphrases are created using an LLM and verified by human annotators. In Section 4.2, we statistically analyze the performance of various LLMs against these instruction templates and quantify the variation in model performance. Finally, in Section 4.3, we show that models exhibit similar brittleness with manually written paraphrases for tasks from the BBL benchmark.
4.1 Paraphrasing Instruction Templates
We use three prompting methods that were found useful in previous work: (1) Instruction template rephrasing: asking an LLM to rephrase a seed prompt (Lester et al., 2021; Gonen et al., 2023; Honovich et al., 2023a); (2) Chain-of-Thought (CoT) prompting (Wei et al., 2022): we provide the model with a sequence of steps, first asking it to produce a task description and then to generate various instruction templates for the task; and (3) Gradual template generation: inspired by Honovich et al. (2023b), we split the CoT approach into three LLM calls: the first generates a task description from a seed instruction template, the second generates an instruction given input-output examples, and the third turns the instruction and examples into an instruction template.
In all of the above, we use GPT-3.5-Turbo for generation, and the original instruction templates for each of our tasks to seed these three generation methods, resulting on average in more than 200 automatically generated instruction template paraphrases for each of our tasks (see Table 2). We make this collection, as well as the code used to generate it, publicly available for reproducibility and to enable future work.
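The sketch below illustrates the gradual (three-call) generation method with the OpenAI Python client; the prompts and helper names are our own illustrative choices, not the authors' exact wording or code.

```python
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

def chat(prompt: str, model: str = "gpt-3.5-turbo") -> str:
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

def gradual_template_generation(seed_template: str, examples: str) -> str:
    """Three-call 'gradual' pipeline: description -> instruction -> template."""
    # 1) seed instruction template -> task description
    description = chat(
        "Describe in one sentence the task defined by this instruction "
        f"template:\n{seed_template}"
    )
    # 2) task description + input-output examples -> a new instruction
    instruction = chat(
        f"Task description: {description}\nExamples:\n{examples}\n"
        "Write a new instruction for this task."
    )
    # 3) instruction + examples -> an instruction template with a placeholder
    return chat(
        "Rewrite the following instruction as a template containing the "
        f"placeholder {{input}}:\n{instruction}\nExamples:\n{examples}"
    )
```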
| Benchmark | Method | # Automatic Paraphrases | # Correct Paraphrases | Correct Ratio |
|---|---|---|---|---|
| LMentry | All | 2429 | 2186 | 90.00% |
| LMentry | Rephrase | 461 | 408 | 88.50% |
| LMentry | CoT | 1286 | 1234 | 95.96% |
| LMentry | Gradual | 652 | 514 | 78.83% |
| BBH | All | 2615 | 2209 | 84.47% |
| BBH | Rephrase | 734 | 627 | 85.42% |
| BBH | CoT | 775 | 630 | 81.29% |
| BBH | Gradual | 1091 | 937 | 85.88% |
Manual Validation and Filtering of Automatic Instruction Paraphrases.
All automatically generated paraphrases were manually verified and filtered by an annotator from our group to ensure their coherence and relevance to the task. A subset of the data, comprising 15 randomly selected templates from each task (375 instructions in total), was also given to a second annotator; the results show reliable agreement (Table 3), indicating that our validation process is well calibrated.
| Benchmark | Correct (%) | Agreement (accuracy) | Agreement (Cohen’s κ) |
|---|---|---|---|
| LMentry | 86.0 | .953 | .774 |
| BBH | 86.7 | .916 | .491 |
See Table 2 for a fine-grained breakdown across the different generation methods. Overall, we found that 90% of the paraphrases generated for LMentry were correct, and roughly 84% of the paraphrases for BBH were correct.
On average, the validation process yields 240 validated instruction paraphrases per task for LMentry and 175 paraphrases per task for BBH. Next, we use these paraphrases to quantify performance variability due to instruction template paraphrasing across ∼ 6.5M instances.2
4.2 Quantifying Performance Variance due to Instruction Paraphrasing
We leverage the collection of validated paraphrases to assess how model performance varies with paraphrasing. Our main finding is that the common approach of evaluating against a single prompt is unstable, leading to unreliable results.
Instance Sampling and Prompt Construction.
Our study involves a large number of tasks, models, and instruction paraphrases. However, evaluating LLMs can become prohibitively expensive as the number of samples, datasets, models, and instruction templates grows (Perlitz et al., 2023). To keep our evaluation feasible, we evaluate each instruction template on a randomly selected subset of 100 task samples. Furthermore, we found that all models struggle on BBH, beyond the point of meaningful comparison. To address this, we evaluate 11 of the 16 models on it (those with the largest number of parameters) and add an example of the prediction format to all instruction template paraphrases.
Using a Single-instruction Template Leads to Brittle Ranking.
To measure how consistently different instruction templates rank the models, we compute Kendall’s W (the coefficient of concordance) for each task. Kendall’s W would be 1 for a task if model ranking were identical across all instruction templates (in other words, if the templates were interchangeable for the sake of evaluation). Conversely, the closer W is to 0, the less the rankings induced by different instructions agree.
The results (Table 4) demonstrate that a single instruction template leads to unreliable rankings for many of the tasks, with 10 of the 25 tasks exhibiting only weak ranking agreement and only two exhibiting strong agreement. To complement the analysis, we performed the Friedman test with tied data (Corder and Foreman, 2011), showing that different instructions lead to statistically significant differences in performance for 21 out of the 25 tasks.
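These statistics can be reproduced with standard libraries; the sketch below (our illustration, not the authors' analysis code) computes Kendall's W from a templates × models accuracy matrix and runs the Friedman test with SciPy, here treating models as blocks.

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

def kendalls_w(scores: np.ndarray) -> float:
    """Kendall's coefficient of concordance (no tie correction, for brevity).

    scores[i, j] is the accuracy of model j under instruction template i.
    Each template acts as a 'rater' ranking the models: W = 1 means every
    template induces the same model ranking; W near 0 means no agreement.
    """
    m, n = scores.shape                               # m templates, n models
    ranks = np.apply_along_axis(rankdata, 1, scores)  # rank models per template
    rank_sums = ranks.sum(axis=0)
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()
    return 12 * s / (m ** 2 * (n ** 3 - n))

# Toy example: 5 paraphrases x 4 models with random accuracies.
rng = np.random.default_rng(0)
scores = rng.random((5, 4))
print("Kendall's W:", kendalls_w(scores))

# Friedman test: do different templates (treatments) yield significantly
# different scores?  Here each model is a block; blocking over task samples
# instead would be an equally plausible reading of the setup.
stat, p_value = friedmanchisquare(*scores)
print("Friedman p-value:", p_value)
```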
| Task | Kendall’s W | Friedman p-val |
|---|---|---|
| LMentry | | |
| not containing | .271 (weak) | 0.0* |
| word before | .367 (weak) | 0.0* |
| first alphabet | .436 (weak) | 0.0* |
| less letters | .485 (weak) | 0.0* |
| rhyming word | .496 (weak) | 0.0* |
| ends with word | .518 (weak) | 0.0* |
| homophones | .518 (weak) | 0.0* |
| all words | .522 (weak) | 0.0* |
| any words | .527 (weak) | 0.0* |
| more letters | .540 (weak) | 0.0* |
| BIG-bench Hard | | |
| recommendations | .628 (medium) | .897 |
| formal fallacies | .704 (medium) | 5.6E-13 |
| geometric shapes | .710 (medium) | .167 |
| hyperbaton | .730 (medium) | 1.0E-4 |
| logical deduction 3 | .740 (medium) | 4.9E-16 |
| disambiguation qa | .764 (medium) | 2.1E-17 |
| ruin names | .776 (medium) | .366 |
| logical deduction 7 | .778 (medium) | 1.4E-13 |
| translation error | .800 (medium) | 6.9E-9 |
| logical deduction 5 | .818 (medium) | 3.0E-9 |
| snarks | .823 (medium) | .604 |
| penguins in a table | .830 (medium) | 7.3E-15 |
| navigate | .838 (medium) | 5.6E-10 |
| causal judgement | .851 (strong) | 4.9E-7 |
| sports | .873 (strong) | 8.0E-13 |
| BIG-bench Lite | | |
| known unknown | .316 (weak) | 4.4E-5 |
| play dialog | .355 (weak) | 4.3E-5 |
| winowhy | .520 (weak) | 6.0E-4 |
| strategic qa | .529 (weak) | .191 |
| hindu knowledge | .560 (weak) | .569 |
| conceptual | .731 (medium) | .132 |
| strange stories | .731 (medium) | .431 |
| code desc | .756 (medium) | .002 |
| novel concepts | .787 (medium) | .620 |
| logic grid puzzle | .796 (medium) | .010 |
| lang. identification | .811 (medium) | .002 |
| vitaminc | .888 (strong) | .772 |
| bbq lite | .890 (strong) | .023 |
| logical deduction | .913 (strong) | .895 |
Examples of Differences in Model Ranking.
We illustrate the implications of ranking differences in Figure 2. In all three cases, P1 and P2 are valid paraphrases, yet they lead to vastly different rankings. For example, T0pp ranks first on the BBH task (center) according to P1 and only 9th according to P2. Similarly, Alpaca-13B and Alpaca-7B are among the top-performing models on the LMentry task according to P2, while they rank last for P1.
Absolute Model Performance Varies Widely on Single-Instruction Templates.
Aside from vastly different relative model rankings, instruction template paraphrases often result in varying absolute model performances. To quantify this variance, we calculated divergence, defined as the number of standard deviations by which the performance, as assessed using the original instruction templates, deviates from the model’s average performance over all paraphrases.
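Concretely, in our notation (a restatement of the definition above, with the population standard deviation shown; whether Bessel's correction is applied is not specified), if s_orig is the score obtained with the original instruction template and s_1, …, s_k are the scores over all k paraphrases, then:

```latex
\mathrm{divergence} \;=\; \frac{s_{\mathrm{orig}} - \bar{s}}{\sigma(s)},
\qquad
\bar{s} = \frac{1}{k}\sum_{j=1}^{k} s_j,
\qquad
\sigma(s) = \sqrt{\frac{1}{k}\sum_{j=1}^{k} \bigl(s_j - \bar{s}\bigr)^2}.
```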
The results in Figure 3 reveal noticeable divergence for the LMentry benchmark, where we define noticeable as surpassing one standard deviation (Kazmier et al., 2003). For instance, Alpaca-13B’s performance with the original instruction templates exceeded its average performance by more than one standard deviation in 7 out of 10 LMentry tasks. Due to space constraints, the figure does not depict the BBH benchmark, but similar patterns of divergence were observed there as well.
In line with Lou et al. (2023), we find that major differences in performance can occur even for very similar paraphrase pairs. For example, the Flan-T5-large model showed an average performance degradation of 28% when the word ‘excludes’ was changed to ‘lacks’, while the Flan-T5-XL model showed an average performance improvement of 46% on that same edit. See a comprehensive comparison of minimal-edit paraphrase pairs in Figure 4 and Table 5.
| Change | Model | P1 | Acc. | P2 | Acc. | Diff. |
|---|---|---|---|---|---|---|
| ‘.’ → ‘:’ | nous-hermes | Create a word that does not include the letter “{letter}”. | .04 | Create a word that does not include the letter “{letter}”: | .65 | +.61 |
| ‘.’ → ‘:’ | alpaca-13b | Create a sentence that concludes with the term “{word}”. | .61 | Create a sentence that concludes with the term “{word}”: | .19 | −.42 |
| + ‘.’ | alpaca-13b | Write a word that lacks the letter “letter” | .04 | Write a word that lacks the letter “letter”. | .42 | +.38 |
| + ‘.’ | flan-t5-xl | Write a word that omits the letter “letter” | .77 | Write a word that omits the letter “letter”. | .54 | −.23 |
| + ‘using’ | flan-t5-large | Your task is to write a word without the letter “{letter}” | .46 | Your task is to write a word without using the letter “{letter}” | .12 | −.35 |
| + ‘using’ | falcon-7b | Write a word without the letter {letter}.\nOutput word: | .12 | Write a word without using the letter {letter}.\nOutput word: | .35 | +.23 |
| omits → lacks | ultralm-13b | Write a word that omits the letter “{letter}”. | .62 | Write a word that lacks the letter “{letter}”. | .19 | −.42 |
| omits → lacks | flan-t5-xl | Write a word that omits the letter “{letter}”. | .54 | Write a word that lacks the letter “{letter}”. | .81 | +.27 |
| contain → have | falcon-7b | Write a word that does not contain the letter “{letter}” | .81 | Write a word that does not have the letter “{letter}” | .19 | −.62 |
| contain → have | flan-t5-xxl | Please write a word that does not contain the letter “{letter}”. | .62 | Please write a word that does not have the letter “{letter}”. | .88 | +.27 |
| include → have | falcon-7b | Write a word that does not include the letter “{letter}”. | .81 | Write a word that does not have the letter “{letter}”. | .19 | −.62 |
| include → have | flan-t5-xl | Write a word that does not include the letter “{letter}”. | .42 | Write a word that does not have the letter “{letter}”. | .73 | +.31 |
| include → have | ultralm-13b | Please write a word that does not include the letter “{letter}”. | .46 | Please write a word that does not have the letter “{letter}”. | .12 | −.35 |
| excludes → lacks | flan-t5-large | Write a word that excludes the letter “{letter}”. | .54 | Write a word that lacks the letter “{letter}”. | .12 | −.42 |
| excludes → lacks | flan-t5-xl | Write a word that excludes the letter “{letter}”. | .19 | Write a word that lacks the letter “{letter}”. | .81 | +.62 |
4.3 LLMs are Also Sensitive to Manual Paraphrases
The inconsistencies observed in our analyses could stem from paraphrases that had leaked into the models’ training data. To address this, we extended our analysis with instruction paraphrases recently written by Sun et al. (2023) for the BBL tasks (7–12 instruction templates per task). Importantly, these human-crafted paraphrases were written after the models were trained.
Using these paraphrases to examine model performance, we find inconsistencies similar to those observed with automatic paraphrases, demonstrating model sensitivity to paraphrasing even when the potential for instruction leakage is minimized. See Table 4 for the Kendall’s W values for all BBL tasks, and Figure 2 for the pair of instruction templates exhibiting the minimal Kendall’s τ correlation across all BBL tasks.
5 Different Use Cases Merit Different Metrics
We have shown that LLM performance is greatly affected by paraphrasing of instruction templates. This calls into question current evaluation practices, which typically rely on LLM performance on a single instruction template. In this section we explore ways to evaluate LLMs using a diverse set of instruction templates.
Most importantly, we argue that the answer should depend on the purpose of the evaluation, and that different extrinsic needs should lead to different evaluation metrics, rather than striving for a coarse catch-all metric. We introduce a set of metrics, each tailored to specific scenarios and realistic user needs.
Notations.
In the following, M is a pretrained LLM, T = {(x_i, y_i)} denotes an evaluation dataset for M, I_T is a set of natural language task instruction paraphrases for T (e.g., obtained via automatic paraphrasing), and ε(M, T, i) ∈ [0, 1] denotes the aggregated performance of M on samples from T when using a single instruction template i ∈ I_T, according to a standard metric, e.g., accuracy or F1.
5.1 Maximum Performance Metric – For Particular Downstream Applications
Use Case: This metric is useful for developers aiming to integrate an LLM into a specific downstream task and domain (e.g., sentiment analysis in the news domain). In such cases, a user input is often embedded within a fixed instruction template. As such, it makes sense to find the best-performing instruction template for a given model (Wei et al., 2021). To mitigate overfitting, we advise developers to use a new sample set for the task. This ensures the chosen prompt is validated by its ability to maximize performance on these held-out samples irrespective of prior exposure during training.
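Using the notation introduced above, the metric implied by the name (our restatement; the formula itself is not spelled out in this excerpt) is:

```latex
\mathrm{MaxP}(M, T) \;=\; \max_{i \in I_T} \, \varepsilon(M, T, i)
```

As advised above, the maximizing template should be selected on held-out samples to avoid overfitting the choice of prompt.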
5.2 Average Performance Metric – For LLM Developers
Use Case: Average prompt performance is useful for assessing model robustness to paraphrases. We believe this should be standard practice for LLM developers when presenting the performance of a new LLM on a range of tasks and prompt paraphrases (Le Scao et al., 2022), as it mitigates outliers in performance.
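Analogously (again our restatement of the metric the name implies):

```latex
\mathrm{AvgP}(M, T) \;=\; \frac{1}{|I_T|} \sum_{i \in I_T} \varepsilon(M, T, i)
```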
5.3 Combined Performance Score
Use Case: This metric is valuable for selecting a model for a suite of applications or a platform offering diverse tasks. For instance, when integrating an LLM into an application with user-visible prompts, such as a multi-functional chatbot, it is crucial for the model to be both effective (high MaxP) and robust (high Sat). CPS facilitates identifying models that strike a balance between top-tier performance and robust reliability across varying instruction templates.
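The excerpt refers to MaxP and a saturation term Sat without spelling out how they are combined. One plausible instantiation, stated here as an assumption rather than the authors' exact definition, rewards a high peak while penalizing the gap between peak and average performance:

```latex
\mathrm{Sat}(M, T) \;=\; 1 - \bigl(\mathrm{MaxP}(M, T) - \mathrm{AvgP}(M, T)\bigr),
\qquad
\mathrm{CPS}(M, T) \;=\; \mathrm{Sat}(M, T) \cdot \mathrm{MaxP}(M, T).
```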
6 Multi-Prompt Evaluation
In Figure 6 we evaluate all 16 of our models according to the metrics proposed in the previous section, on sample tasks from each of the three benchmarks (full results for all tasks are available in our repository). We report several interesting observations. First, all aggregate metrics diverge from the performance on the original instruction templates. For the vast majority of the tasks in our study, the top three models determined by the original instruction templates differed from those ranked highest according to the average and maximum metrics.
More broadly, model ranking depends on the metric used. For instance, see Figure 6 (top): in LMentry’s rhyming word task, Falcon-Instruct-7b and Vicuna-13b rank first according to MaxP (0.74, gray and yellow bars), but their average performances (AvgP) are only 0.17 and 0.15, respectively. Similarly, across all tasks in the LMentry benchmark, LLaMA-based models were competitive with T5-based models in terms of MaxP. However, in terms of AvgP, they tended to lag behind, due to extremely poor performance on a large number of paraphrases (see Figure 5 for the percentage of paraphrases that achieved at least 5% accuracy).
Finally, we found that noise stemming from automatic paraphrase generation has virtually no impact on metric-based model rankings. We compute Kendall’s τ to compare model rankings before and after the manual filtering of paraphrases. The results (Table 6) show near-perfect to perfect agreement in rankings across all tasks, except for the “ends with word” task in LMentry; upon examination, this exception seems to be mostly due to an error in LMentry’s evaluation script. These results suggest that it may be enough to compute our metrics over a range of automatically generated paraphrases, without manually verifying them.
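This check amounts to a rank correlation between two per-model score vectors for each task; the snippet below is an illustrative sketch with made-up numbers, not the authors' data.

```python
from scipy.stats import kendalltau

# Per-model metric values (e.g., AvgP on one task) computed from all automatic
# paraphrases vs. only the manually validated ones (numbers are illustrative).
avgp_all_paraphrases = [0.42, 0.31, 0.65, 0.58, 0.12]
avgp_validated_only = [0.44, 0.30, 0.63, 0.60, 0.11]

tau, p_value = kendalltau(avgp_all_paraphrases, avgp_validated_only)
# tau close to 1 indicates that manual filtering barely changes the model ranking
print(tau, p_value)
```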
7 Small-Scale Evaluation of OpenAI Models on Prompt Paraphrasing
In this section we perform a small-scale evaluation showing that API-based LLMs are also sensitive to instruction paraphrasing. Our evaluation focuses on four OpenAI models: davinci, text-davinci-002, text-davinci-003, and GPT-3.5-Turbo, on the LMentry benchmark.
Due to budget constraints, we do not replicate the full evaluation; instead, we show that the performance of these models diverges significantly between the benchmark’s original instruction templates and a selection of paraphrases, in terms of both the average and maximum metrics.
Estimating Average Performance.
To estimate the average performance of OpenAI models on a specific task, we adopt a randomized approach: for each task sample, we randomly select a paraphrase from our collection and score the model’s response, yielding a score over the entire set of task samples. To approximate average performance, this experiment is repeated 20 times, a number chosen based on the data from our 16 open-source models.
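A minimal sketch of this Monte-Carlo estimate is shown below; the `evaluate` callback and its signature are placeholders of ours, not the authors' code.

```python
import random
import statistics
from typing import Callable, Sequence

def estimate_avg_performance(
    model: str,
    samples: Sequence[dict],
    paraphrases: Sequence[str],
    evaluate: Callable[[str, str, dict], float],  # -> 0/1 score for one sample
    n_reps: int = 20,
) -> float:
    """Monte-Carlo estimate of average performance over instruction paraphrases.

    Each repetition pairs every task sample with a randomly drawn paraphrase
    and scores the whole sample set; the per-repetition scores are averaged.
    """
    run_scores = []
    for _ in range(n_reps):
        scores = [
            evaluate(model, random.choice(paraphrases), sample)
            for sample in samples
        ]
        run_scores.append(statistics.mean(scores))
    return statistics.mean(run_scores)
```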
Estimating Maximal Performance.
To estimate which of the roughly 175 instruction templates per task performs the best for each model, we implemented a simple greedy search. Initially, we evaluated all paraphrases on 10 task instances, then narrowed down to the top 100 instruction templates for another 10 instances. Finally, the top 10 instruction templates were evaluated on the remaining instances, and the template that performed the best was chosen to estimate the maximum performance.
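The sketch below mirrors the three-stage greedy search described above; the `evaluate` helper (mean score of a template over a set of samples) is our placeholder, not the authors' implementation.

```python
from typing import Callable, Sequence

def estimate_max_performance(
    model: str,
    templates: Sequence[str],
    samples: Sequence[dict],
    evaluate: Callable[[str, str, Sequence[dict]], float],  # mean score on subset
):
    """Greedy search for the best-performing instruction template.

    Stage 1: score every template on 10 instances.
    Stage 2: keep the top 100 templates and score them on 10 more instances.
    Stage 3: keep the top 10 templates, score them on the remaining instances,
             and return the best one as the estimate of maximum performance.
    """
    def score(tmpls, subset):
        return {t: evaluate(model, t, subset) for t in tmpls}

    stage1 = score(templates, samples[:10])
    top100 = sorted(stage1, key=stage1.get, reverse=True)[:100]

    stage2 = score(top100, samples[10:20])
    top10 = sorted(stage2, key=stage2.get, reverse=True)[:10]

    stage3 = score(top10, samples[20:])
    best = max(stage3, key=stage3.get)
    return best, stage3[best]
```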
7.1 Results
Below we summarize the results of our evaluation of OpenAI models. The full details appear in our repository.
OpenAI Models are Also Sensitive to Minor Prompt Variations.
Minor changes in instruction phrasing can lead to drastic performance changes for the OpenAI models, mirroring our findings for smaller-scale LLMs in Section 4.2. See representative examples in Table 7, where nearly identical instruction template pairs result in notable variations in performance.
| Change | Model | P1 | Acc. | P2 | Acc. | Diff. |
|---|---|---|---|---|---|---|
| {...} → “{...}” | td002 | Which word has a greater number of letters, {word1} or {word2}? | .50 | Which word has a greater number of letters, “{word1}” or “{word2}”? | .23 | −0.27 |
| {...} → “{...}” | td002 | Which of the words {word1} and {word2} is alphabetically first? | .54 | Which of the words “{word1}” and “{word2}” is alphabetically first? | .77 | +0.23 |
| {...} → “{...}” | td003 | Which word has a greater number of letters, {word1} or {word2}? | .60 | Which word has a greater number of letters, “{word1}” or “{word2}”? | .14 | −0.46 |
| {...} → “{...}” | td003 | Compare the length of {word1} and {word2} and tell me which one is shorter. | .39 | Compare the length of “{word1}” and “{word2}” and tell me which one is shorter. | .73 | +0.34 |
| {...} → “{...}” | cgpt | Which word has a greater number of letters, {word1} or {word2}? | .55 | Which word has a greater number of letters, “{word1}” or “{word2}”? | .24 | −0.31 |
| {...} → “{...}” | cgpt | Compare the length of {word1} and {word2}. Which one is longer? | .04 | Compare the length of “{word1}” and “{word2}”. Which one is longer? | .70 | +0.66 |
| ‘,’ → ‘:’ | td002 | Which word is a rhyme for “{query}”, “{word1}” or “{word2}”? | .08 | Which word is a rhyme for “{query}”: “{word1}” or “{word2}”? | .85 | +0.77 |
| ‘,’ → ‘:’ | td003 | Which word is a rhyme for “{query}”, “{word1}” or “{word2}”? | .48 | Which word is a rhyme for “{query}”: “{word1}” or “{word2}”? | .90 | +0.42 |
| ‘,’ → ‘-’ | td002 | Which word rhymes with “{query}”, “{word1}” or “{word2}”? | .06 | Which word rhymes with “{query}” - “{word1}” or “{word2}”? | .73 | +0.67 |
| ‘,’ → ‘-’ | td003 | Which word rhymes with “{query}”, “{word1}” or “{word2}”? | .17 | Which word rhymes with “{query}” - “{word1}” or “{word2}”? | .60 | +0.43 |
| the → a | td002 | What is the word that rhymes with “{query}” - “{word1}” or “{word2}”? | .03 | What is a word that rhymes with “{query}” - “{word1}” or “{word2}”? | .78 | +0.75 |
| which → what | td002 | Which word rhymes with “{query}” - “{word1}” or “{word2}”? | .73 | What word rhymes with “{query}” - “{word1}” or “{word2}”? | .82 | +0.09 |
| which → what | td003 | Which word rhymes with “{query}” - “{word1}” or “{word2}”? | .60 | What word rhymes with “{query}” - “{word1}” or “{word2}”? | .15 | −0.45 |
| word → term | td002 | Create a word that excludes the letter “{letter}”. | .54 | Create a term that excludes the letter “{letter}”. | .04 | −0.50 |
| word → term | td003 | Create a word that excludes the letter “{letter}”. | .96 | Create a term that excludes the letter “{letter}”. | .58 | −0.38 |
| word → term | cgpt | Create a word that excludes the letter “{letter}”. | .81 | Create a term that excludes the letter “{letter}”. | .42 | −0.39 |
Average Performance is Lower Than That Observed in the Original Benchmark Instructions.
In 72.5% of the cases, the performance of the original instructions was higher than the estimated average across all paraphrases. For the davinci model, the original prompts scored on average 21 accuracy points above the paraphrase average.
Original Prompt Performance Falls Below the Estimated Maximum over All Paraphrases.
Figure 7 depicts the maximum performance over the original instructions for four LMentry tasks in solid colors, with overlaid semi-transparent columns indicating the estimated maximum performance over all paraphrases. Notably, for text-davinci-002, we found paraphrases that raised its maximal accuracy above 90% for 8 out of 10 tasks. Across all four models, 26 out of 40 differences were statistically significant according to McNemar’s test.
Model Rankings Diverge Between the Different Metrics and Original Instruction Templates.
Similarly to our main evaluation, there were many mismatches between ranking on the original instruction templates and our metrics. Agreement was observed in only 5 out of 10 tasks for the average metric, and in 4 out of 10 tasks for the maximum metric.
8 Related Work
Our work is part of an emerging trend highlighting the many challenges standing in the way of meaningful, scalable, and reproducible evaluation of large language models.
Perlitz et al. (2023) focus on the rising cost of exhaustively evaluating LLMs on a large number of samples. They develop methods for choosing subsets of the test data that are expected to be representative of the whole. An interesting avenue for future work would be to extend Perlitz et al.’s (2023) approach to also cover various instruction templates, thus efficiently approximating our suggested evaluation methods.
Sclar et al. (2023) show that LLMs are sensitive to prompt formatting, i.e., minor prompt design choices such as the addition or omission of punctuation marks. They create a large pool of instruction paraphrases, ensuring that the paraphrases maintain the meaning of the original prompt. We notice a similar phenomenon, albeit more anecdotally, when our automatic paraphrasing techniques incidentally produce minor changes in formatting (Table 7). Voronov et al. (2024) show that LLMs are sensitive to the format of in-context examples: for example, they vary the manner in which each input-output pair is separated and test how such choices interact with the phrasing of the instruction template, the number of demonstrations, and the model size.
The works discussed above represent a distinct thread within the larger field of model robustness, which is typically defined as a measure of models’ ability to adapt to distribution shifts between training and inference (Wang et al., 2022), or to cope with adversarial examples (Wang et al., 2021, 2023). In contrast, these works do not change the underlying instance to be classified (e.g., the homophone pairs in our running example), but rather the task instruction. This challenge arises with the introduction of LLMs which take such instructions as part of the input, rather than through dedicated calibration in training or finetuning.
9 Conclusions
Our research highlights the sensitivity of large language models (LLMs) to prompt paraphrasing, challenging the adequacy of single-prompt evaluations. We propose alternative evaluation metrics that use a diverse set of instruction templates for each task, designed for more robust and meaningful LLM evaluation. For example, LLM developers may be interested in measuring the robustness of performance across multiple prompts, which we propose to evaluate as the average across a large collection of prompts. In contrast, when developing a downstream model, different models should be compared according to their corresponding top-performing prompt.
Evaluating based on these metrics underscores the necessity for nuanced evaluation methods, revealing notable differences in absolute performance and relative model rankings compared to traditional evaluations. We hope that our work will help spur more consistency and comparability in LLM evaluation which is strongly coupled to real-world LLM uses. We believe this shift is crucial for accurately understanding and leveraging the true capabilities of LLMs.
Acknowledgments
We thank the reviewers for their insightful comments. We further thank Asaf Yehudai and Oyvind Tafjord for engaging discussions, and the members of SLAB and Hyadata Lab at the Hebrew University of Jerusalem for their thoughtful remarks. This work was supported by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant no. 852686, SIAM) and was partially supported by the Israeli Ministry of Science and Technology (grant no. 2336).
Notes
Calculated as the number of models tested per task × the number of paraphrased instructions per task × 100 samples, across all tasks and benchmarks ≈ 240 × 16 × 100 × 10 (LMentry) + 175 × 11 × 100 × 15 (BBH).
References
Action Editor: Emiel Krahmer