Abstract
Previous work in prompt engineering for large language models has introduced different gradient-free, probability-based prompt selection methods that aim to choose the optimal prompt among the candidates for a given task, but it has failed to provide a comprehensive and fair comparison between them. In this paper, we propose a unified framework to interpret and evaluate the existing probability-based prompt selection methods by performing extensive experiments on 13 common and diverse NLP tasks. We find that each of the existing methods can be interpreted as some variant of the method that maximizes mutual information between the input and the predicted output (MI). Utilizing this finding, we develop several other combinatorial variants of MI and increase the effectiveness of prompt selection from 87.79% to 94.98%, measured as the ratio of the performance of the selected prompt to that of the optimal oracle prompt. Furthermore, considering that all the methods rely on the output probability distribution of the model, which might be biased, we propose a novel calibration method called Calibration by Marginalization (CBM) that is orthogonal to the existing methods and helps increase the prompt selection effectiveness of the best method to 96.85%, achieving 99.44% of the oracle prompt F1 without calibration.1
1 Introduction
Large Language Models (LLMs) have demonstrated remarkable performance in solving various natural language processing tasks through prompt-based learning without requiring additional task-specific training (Brown et al., 2020; Dong et al., 2023). However, the performance of LLMs can heavily fluctuate according to the choice of prompts (Zhao et al., 2021; Holtzman et al., 2021; Lu et al., 2022). While various prompt engineering approaches have been proposed to mitigate this issue, the nontrivial prerequisites of many of these methods, such as training an additional model and/or using an additional component, have been a bottleneck to their real application (Liu et al., 2023; Li and Liang, 2021; Jiang et al., 2020; Prasad et al., 2023; Liu et al., 2022; Rubin et al., 2022).
On the other hand, probability-based prompt selection methods do not require any additional parameter updates or additional components2 and thus provide a promising and easily applicable solution; these methods aim to select the prompt from a set of prompts that is expected to be most effective in helping a language model to make correct predictions solely based on the probability distribution of the model (Sorensen et al., 2022; Lu et al., 2022; Wu et al., 2023; Liao et al., 2022; Gonen et al., 2023). However, despite their ease of utilization, there has been a lack of comprehensive comparative evaluation between existing probability-based prompt selection methods, as each method was proposed in a different setup and evaluated on different datasets, evaluation instances, sets of prompts, and models. In this paper, we first carefully design a unified evaluation setup to facilitate a fair comparison between different prompt selection methods. Our unified evaluation reveals that no single method consistently outperforms the others across all datasets and that all existing probability-based prompt selection methods roughly correspond to a sub-term of the equation of Mutual Information (MI) (Sorensen et al., 2022). We utilize this discovery to propose several variants of MI that use different combinations of the components of existing methods, and the best combinational variant MIAGL increases the scaled F1 (F1 divided by that of the oracle prompt, showing the effectiveness of the prompt selection method) from 87.79% to 94.98% (MIAGL of Figure 1a).
Furthermore, we find the need for a better approximation of the LLM’s output probability distribution, considering that all probability-based prompt selection methods rely on probabilistic estimates from the model that might be biased. Therefore, by drawing a connection between the existing model output probability calibration methods (Zhao et al., 2021; Holtzman et al., 2021), we propose an enhanced calibration method, Calibration By Marginalization (CBM). CBM significantly improves the prompt selection performance of several methods when applied to calibrate the output probability of LLMs, increasing the best scaled F1 to 96.85% (MIA(PA) of Figure 1a), which achieves 99.44% of the oracle prompt F1 under the uncalibrated scenario. CBM also shows the most robust answer selection enhancement across multiple datasets compared to the existing calibration methods (Figure 1b).
2 Probability-based Prompt Selection
In this section, we perform a unified evaluation of existing probability-based prompt selection methods. First, we describe the task of probability-based prompt selection in Section 2.1. Next, we briefly introduce each of the existing methods in Section 2.2. Then, we describe our experimental setup for unified evaluation in Section 2.3 and present the evaluation results in Section 2.4.
2.1 Task Description
Probability-based prompt selection is the task of selecting one or more prompts from a list of prompts T that are expected to help the language model θ make the most accurate predictions on the evaluation dataset X, where the evaluation instances are drawn from the data distribution, x ∼ PX. The selection must rely only on the output probability distributions of the model on X,3 without knowing the ground truth labels and without using additional gradient-based updates or other trained components. The performance of a probability-based prompt selection method is evaluated by the score of the evaluation metric obtained with the selected prompt(s).
When one prompt is selected for the whole dataset, the performance is upper bounded by the performance obtained with the prompt with which the model achieves the best metric score; we call such a prompt the optimal oracle prompt.4 When one prompt is selected for each x ∼ PX, different t ∈ T can be chosen for each x; we call such a prompt selection approach instance-wise prompt selection.
Note that the definition of prompt can vary according to the setup for which prompt selection is performed. When prompt selection is applied to zero-shot learning, prompts are defined as various formats of text templates that are filled with evaluation instances x ∼ PX to facilitate prediction. On the other hand, for few-shot (in-context) learning, prompts are often defined as the demonstrations sampled from a training/development set or texts of permutations of such demonstrations. In our work, in order to enable comparison between all the methods, whether proposed in the zero-shot or the few-shot setup, we perform prompt selection in a zero-shot setup with the former definition of prompt.5
Concrete Example
Examples of prompts t ∈ T include “Which category does the following news article fall into? {text}”, “The following news article, {text}, covers the topic of”, and “{text} belongs in which category: Politics, Sports, Business, Science and Technology”. We say that x instantiates the prompt t when x is inserted into the placeholder {text} of the prompt template, and we let ι(x, t) denote the instantiated prompt. The answer categories represent the concepts of politics, sports, business, and science/technology, and use “Politics”, “Sports”, “Business”, and “Science and Technology” as their verbalizers (the actual text evaluated to score the answer choices), respectively.
For instance, given OPT 2.7B (Zhang et al., 2022a) as the language model, “King Charles III’s Coronation watched by more than 18 million viewers” as x, and the three prompts shown as examples in the previous paragraph, a prompt selection method should choose the prompt that is most likely to help OPT 2.7B correctly predict the answer y among the possible answer choices Y which represent the concepts of politics, sports, business, and science/technology. To select such a prompt, the method must rely solely on the output probability of the model given the instantiated prompts as input, e.g., p(“Politics”|“Which category … King …”).
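To make this concrete, below is a minimal sketch (not the paper's released code) of how such answer scores can be obtained from a causal LM with Hugging Face Transformers. The model name, the helper function, and the whitespace handling of the verbalizers are illustrative assumptions.

```python
# Hedged sketch: score each answer choice by the log-probability the LM assigns to its
# verbalizer appended after the instantiated prompt. Names here are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-2.7b"  # any causal LM works; chosen only for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def verbalizer_log_prob(prompt_text: str, verbalizer: str) -> float:
    """Sum of log p(token | context) over the verbalizer tokens appended to the prompt."""
    prompt_ids = tokenizer(prompt_text, return_tensors="pt").input_ids
    verb_ids = tokenizer(verbalizer, add_special_tokens=False, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, verb_ids], dim=1)
    with torch.no_grad():
        log_probs = model(input_ids).logits.log_softmax(dim=-1)
    n = verb_ids.shape[1]
    # logits at position i predict token i+1, so these positions predict the verbalizer tokens
    target = log_probs[0, -n - 1 : -1, :]
    return target.gather(1, verb_ids[0].unsqueeze(1)).sum().item()

prompt = ("Which category does the following news article fall into? "
          "King Charles III's Coronation watched by more than 18 million viewers")
choices = ["Politics", "Sports", "Business", "Science and Technology"]
# leading space so BPE tokenization matches how the word would appear mid-sentence
scores = {y: verbalizer_log_prob(prompt, " " + y) for y in choices}
print(max(scores, key=scores.get))
```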
2.2 Existing Approaches
| Existing Method | Abbr. | Selected Prompt (prompt maximizing the score below) |
| --- | --- | --- |
| Mutual Information (Sorensen et al., 2022) | MI | H(Y\|t) − H(Y\|X, t) |
| Entropy (Lu et al., 2022) | | |
| Global Entropy | GE | entropy of the mean one-hot prediction over X |
| Local Entropy | LE | mean of H(Y\|x, t) over X |
| Minimum Description Length (Wu et al., 2023) | MDL | −H(Y\|x, t), maximized per instance x |
| Zero-Label Prompt Selection (Liao et al., 2022) | | number of x with argmax_y p(y\|x, t) = argmax_y s(x, y) |
| Log-probability Mean | ZLP | s(x, y): mean log-probability over prompts |
| Probability Mean | ZPM | s(x, y): mean probability over prompts |
| Majority Vote | ZMV | s(x, y): majority vote over prompts |
| Perplexity (Gonen et al., 2023) | PPL | mean of p(x, t) over X, i.e., lowest average perplexity of ι(x, t) |
Mutual Information (MI)
Sorensen et al. (2022) propose to select the one prompt for the evaluation dataset that maximizes the mutual information between the evaluation instances X and their corresponding model predictions Y given prompt t, I(Y; X|t) = H(Y|t) − H(Y|X, t). Since they use the assumption that the inputs are uniformly distributed and independent of the prompt, i.e., p(x|t) = 1/|X|, the equation becomes as shown in the first row of Table 1: H(Y|t) is estimated as the entropy of the mean of p(y|x, t) over X, and H(Y|X, t) as the mean of H(Y|x, t) over X. The intuition of the method is to select the prompt that guides the model to make less biased predictions on average (high H(Y|t)) and confident predictions about the input data (low H(Y|X, t)).
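As an illustration, the following is a minimal sketch of this score, assuming we have already collected, for each prompt, a matrix of p(y|x, t) over the evaluation instances; the array and function names are our own.

```python
# Hedged sketch of the MI prompt selection score over a probability matrix of shape (|X|, |Y|).
import numpy as np

def entropy(p, axis=-1, eps=1e-12):
    return -(p * np.log(p + eps)).sum(axis=axis)

def pss_mi(probs: np.ndarray) -> float:
    """PSS of MI: H(Y|t) - H(Y|X, t), with H(Y|t) estimated from the mean of p(y|x, t)."""
    h_y_given_t = entropy(probs.mean(axis=0))     # entropy of the marginal prediction
    h_y_given_xt = entropy(probs, axis=1).mean()  # mean per-instance entropy
    return h_y_given_t - h_y_given_xt

rng = np.random.default_rng(0)
probs_per_prompt = [rng.dirichlet(np.ones(4), size=100) for _ in range(3)]  # 3 toy prompts
best = max(range(3), key=lambda t: pss_mi(probs_per_prompt[t]))
print("selected prompt index:", best)
```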
Entropy (GE, LE)
Lu et al. (2022) propose to select the prompt (finding the best ordering of few-shot demonstrations for in-context learning in their setup) using entropy-based metrics. While their proposed methods are intended specifically for in-context learning, viewing prompts as texts of permutations of demonstrations,6 we adopt the methods for our zero-shot setup of selecting among text template prompts and thus do not use an additional training set or construct a probing set. Global Entropy (GE) and Local Entropy (LE), shown in the second row of Table 1, are used to select a single prompt among the prompt candidates for the evaluation dataset.
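For comparison with the MI sketch above, the following is a minimal sketch of GE and LE as we adopt them, on the same assumed p(y|x, t) matrix; the helper names are ours.

```python
# Hedged sketch of GE (entropy of the mean one-hot prediction) and LE (mean per-instance entropy).
import numpy as np

def entropy(p, axis=-1, eps=1e-12):
    return -(p * np.log(p + eps)).sum(axis=axis)

def pss_ge(probs: np.ndarray) -> float:
    """Entropy of the label histogram of argmax predictions (one-hot before averaging)."""
    onehot = np.eye(probs.shape[1])[probs.argmax(axis=1)]
    return entropy(onehot.mean(axis=0))

def pss_le(probs: np.ndarray) -> float:
    """Mean per-instance entropy of p(y|x, t)."""
    return entropy(probs, axis=1).mean()

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(4), size=100)  # toy p(y|x, t) for one prompt
print(pss_ge(probs), pss_le(probs))
```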
Minimum Description Length (MDL)
Wu et al. (2023) propose to select the prompt (a permutation of few-shot demonstrations in their setup) that requires the minimum codelength to compress and transmit the testing label y given the testing input x. With several assumptions and approximations presented in Section 4.3 of the work of Wu et al. (2023), the equation boils down to finding, for each x ∈ X, the prompt t that maximizes −H(Y|x, t), i.e., performing instance-wise prompt selection. As their original setup for prompt selection is few-shot learning, they perform demonstration sampling as a set selection and then rank the texts of different permutations of the demonstrations. Here, we describe only the ranking part of their approach that we employ for our zero-shot learning setup.
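The instance-wise ranking we adopt can be sketched as follows, again on an assumed tensor of p(y|x, t) stacked over prompts; names and array layout are our own.

```python
# Hedged sketch of the MDL-style ranking: for each instance, pick the prompt whose
# prediction distribution has minimum entropy. probs has shape (num_prompts, |X|, |Y|).
import numpy as np

def entropy(p, axis=-1, eps=1e-12):
    return -(p * np.log(p + eps)).sum(axis=axis)

def select_prompts_mdl(probs: np.ndarray) -> np.ndarray:
    """Return, for each instance x, the index of the prompt minimizing H(Y|x, t)."""
    per_instance_entropy = entropy(probs, axis=2)  # shape (num_prompts, |X|)
    return per_instance_entropy.argmin(axis=0)     # shape (|X|,)

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(4), size=(5, 100))   # 5 toy prompts, 100 instances
print(select_prompts_mdl(probs)[:10])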
Zero-Label Prompt Selection (ZLP, ZPM, ZMV)
Liao et al. (2022) propose to make a pseudo-label for each x by ensembling the outputs for all prompts into a score s(x, y), and then to choose the one prompt t for the evaluation dataset whose predictions argmax_y p(y|x, t) agree with the pseudo-labels argmax_y s(x, y) for the largest number of instances. As shown in Table 1, they propose three ways to calculate s(x, y): using the ensemble of log-probability mean, probability mean, and majority vote. We refer to them as ZLP, ZPM, and ZMV, respectively. While the authors of the original work apply filtering of prompts, we observed from our preliminary experiments that filtering does not have a significant effect.
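The following is a minimal sketch of the three selection rules as we read them; the array layout, ensemble definitions, and function names are our assumptions rather than the authors' implementation.

```python
# Hedged sketch of zero-label prompt selection: build pseudo-labels by ensembling
# predictions across prompts, then pick the prompt that agrees with them most often.
# probs has shape (num_prompts, |X|, |Y|).
import numpy as np

def select_prompt_zero_label(probs: np.ndarray, mode: str = "zpm") -> int:
    if mode == "zlp":                     # ensemble of mean log-probabilities
        score = np.log(probs + 1e-12).mean(axis=0)
    elif mode == "zpm":                   # ensemble of mean probabilities
        score = probs.mean(axis=0)
    else:                                 # "zmv": majority vote over argmax predictions
        votes = probs.argmax(axis=2)      # (num_prompts, |X|)
        score = np.stack([(votes == y).sum(axis=0) for y in range(probs.shape[2])], axis=1)
    pseudo = score.argmax(axis=1)                              # pseudo-label for each x
    agreement = (probs.argmax(axis=2) == pseudo).sum(axis=1)   # per-prompt agreement counts
    return int(agreement.argmax())

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(3), size=(5, 200))
print(select_prompt_zero_label(probs, "zmv"))
```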
Perplexity (PPL)
Gonen et al. (2023) propose to select the one prompt for the evaluation dataset with which the language model exhibits the lowest average perplexity of the instantiated prompt ι(x, t), as shown in the last row of Table 1. p(x, t) is calculated as the geometric mean of the per-token probabilities, (Π_i p(ι(x, t)_i | ι(x, t)_<i))^(1/|ι(x, t)|), where ι(x, t)_i represents the i-th token of the instantiated prompt ι(x, t). We include the geometric mean in the definition of p(x, t) because the averaged probability is often used to approximate the probability of a sequence.
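A minimal sketch of this selection rule under the geometric-mean definition above is given below; it assumes a routine (hypothetical here, but analogous to the scoring sketch in Section 2.1) that returns the per-token log-probabilities of each instantiated prompt.

```python
# Hedged sketch of PPL-based selection: prefer the prompt with the highest geometric-mean
# token probability of its instantiated prompts, i.e., roughly the lowest average perplexity.
import numpy as np

def sequence_prob_geo_mean(token_log_probs: np.ndarray) -> float:
    """p(x, t) as the geometric mean of per-token probabilities, i.e., exp(mean log-prob)."""
    return float(np.exp(token_log_probs.mean()))

def select_prompt_ppl(log_probs_per_prompt: list[list[np.ndarray]]) -> int:
    """log_probs_per_prompt[t][i]: per-token log-probs of prompt t instantiated with x_i."""
    avg_p = [np.mean([sequence_prob_geo_mean(lp) for lp in per_x])
             for per_x in log_probs_per_prompt]
    return int(np.argmax(avg_p))

rng = np.random.default_rng(0)
toy = [[rng.normal(-3.0, 1.0, size=20) for _ in range(10)] for _ in range(4)]  # 4 prompts, 10 inputs
print(select_prompt_ppl(toy))
```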
2.3 Experimental Setup
Evaluation Datasets
Our dataset selection, aimed at fair measurement of various probability-based prompt selection methods, is guided by several factors. We favor the datasets previously used in research, those encompassing diverse domains, and datasets where prompt selection is meaningful. We exclude the datasets where all prompts underperform a random baseline or where a naive baseline of selecting the mode label could excel due to high imbalance. By excluding the datasets with high imbalance, we aim to avoid the false positive cases where a failed algorithm that collapses to select one label regardless of the input is evaluated as a competitive method by chance.
The selected datasets have diverse label types and distributions, and we categorize them based on their label distributions into balanced (label distribution is about 1:1), unbalanced (otherwise), and dynamic7 categories. The 13 datasets selected through this process are shown in Table 2.8
| Dataset | Full Name | Split | # Used (# Orig.) | Category | Label Ratio (0 / 1 / 2 / 3 / 4) |
| --- | --- | --- | --- | --- | --- |
| imdb | imdb | test | 1000 (25000) | balanced | 0.51 / 0.49 |
| g-sst2 | glue-sst2 | valid | 872 | balanced | 0.49 / 0.51 |
| agnews | ag_news | test | 1000 (7600) | balanced | 0.27 / 0.25 / 0.25 / 0.24 |
| g-rte | glue-rte | valid | 277 | balanced | 0.53 / 0.47 |
| newspop | newspop | train | 1000 (93239) | unbalanced | 0.36 / 0.23 / 0.33 / 0.09 |
| t-irony | tweet_eval-irony | valid | 955 | unbalanced | 0.60 / 0.40 |
| t-emo | tweet_eval-emotion | valid | 374 | unbalanced | 0.39 / 0.25 / 0.09 / 0.27 |
| sg-cb | super_glue-cb | valid | 56 | unbalanced | 0.41 / 0.50 / 0.09 |
| sst5 | SetFit/sst5 | test | 1000 (1101) | unbalanced | 0.13 / 0.29 / 0.18 / 0.23 / 0.18 |
| copa | super_glue-copa | valid | 100 | dynamic | 0.55 / 0.45 |
| piqa | piqa | valid | 1000 (1838) | dynamic | 0.49 / 0.51 |
| story | story_cloze-2016 | test | 1000 (1871) | dynamic | 0.51 / 0.49 |
| hella | Rowan/hellaswag | valid | 1000 (10003) | dynamic | 0.22 / 0.25 / 0.26 / 0.26 |
Prompts
We create a diverse range of 100 prompts for each of the 13 evaluation datasets, which results in 1,300 prompts in total. For each dataset, a few of the 100 prompts are taken from PromptSource (Bach et al., 2022), and the rest are generated using GPT 3.5 (OpenAI, 2023) to speed up the prompt generation process and then manually reviewed and corrected.9 The prompts are designed to encompass various formats, with the evaluation instance and sometimes the answer choices appearing at different positions within the prompt, to ensure that the prompt selection task is meaningful. Table 3 shows a few examples of the prompts. We use one-token words as the verbalizers for the answer choices in most prompts, except for the prompts for the datasets of the dynamic category.
| Dataset | Prompt | Verbalizers for Y |
| --- | --- | --- |
| imdb | From the following review, can you tell whether the sentiment is positive or negative? | negative, positive |
| agnews | Which category among Politics, Sports, Business, Science would this news article fall under? | Politics, Sports, Business, Science |
| g-rte | Given the statement “{{sentence1}}”, does it necessarily follow that “{{sentence2}}” is true? | yes, no |
| sg-cb | If the above statement is true, can we conclude that “{{hypothesis}}” is also true? Yes, no, or maybe? | Yes, no, maybe |
| sst5 | What is the sentiment expressed in the following sentence? It’s either terrible or negative or neutral or positive or excellent. “{{ text }}” | terrible, negative, neutral, positive, excellent |
| piqa | Your task is to achieve: {{goal}}\n\nWhich of the following options is the most appropriate?\n\n- {{sol1}}\n- {{sol2}}\n\nAnswer: | {{sol1}}, {{sol2}} |
Models
We conduct the majority of our experiments with ten different models of varying sizes ranging from 1.3B to 66B.10 However, to present the experimental results and analysis more clearly, we only display the results of OPT 2.7B throughout the paper since the overall trend remains mostly identical (shown in Section 5).
Evaluation Metrics
Prompt selection performance is assessed using macro F1 of the selected prompts. To compare the effectiveness of the prompt selection methods across different datasets or models, we normalize the value by the performance of the oracle prompt (upper bound) and present it as scaled F1.
Implementation Details
2.4 Experimental Results
We find that no single probability-based prompt selection method consistently outperforms the others across all 13 datasets and evaluation categories. While PPL and LE do not rank first in any dataset, every other method ranks first in a few datasets. Figure 2 illustrates the selected prompt performance averaged by category, along with the performance of the best (oracle) and worst prompts and the average performance of all prompts. In the balanced category, GE and MDL outperform the others, with MI closely following. In the unbalanced category, MI stands out, while in the dynamic category, GE, MDL, and ZLP perform the best. LE and PPL generally underperform across all of the datasets; their task average does not even exceed the average performance of all prompts.13 We conclude that no single existing approach is significantly better than the others, especially when dividing the evaluation dimensions into balanced, unbalanced, and dynamic labels.
3 Improving MI via Unified Analysis
In this section, we first derive a unified view of prompt selection methods in Section 3.1, showing that each method other than MI roughly corresponds to a sub-term of the equation of MI, and then revisit the previous experimental results for a unified analysis in Section 3.2. From the unified view and analysis, we identify the differences between the methods, particularly MI, GE, and MDL, and derive a few combinational variants by transferring design elements across methods, which improve the prompt selection performance of MI.
3.1 Unified View: Identifying Connections Between Methods
Prompt Selection Score (PSS)
Figure 3 offers a unified view of existing probability-based prompt selection methods, highlighting that each method except for MI approximately corresponds to a sub-term in the equation of MI. We denote the highlighted part of each method as its Prompt Selection Score (PSSmethod); the prompt with the maximum value of this score is the one chosen by the method.
MI vs. GE and LE
MI selects a prompt that maximizes the first term of PSSMI, H(Y|t), and minimizes the second term, H(Y|X, t). This means that MI favors prompts that provide balanced predictions without label bias (interpretation of the first term) and a sharp answer prediction distribution across all instances in the dataset (interpretation of the second term). These terms roughly correspond to PSSGE and −PSSLE, respectively. The difference between PSSGE and the first term of PSSMI is that the former converts p(y|x, t) to one-hot before taking the entropy of the mean. In sum, the prompts selected by GE and MI align, while those chosen by LE and MI tend to be opposite. Note that one expected caveat of GE is that it will be less effective when the dataset itself has a label imbalance.
MI vs. MDL
MDL is the only method among the presented probability-based prompt selection methods that selects a different prompt for each evaluation instance x, i.e., performs instance-wise prompt selection. Essentially, MDL is an instance-wise version of the second term of PSSMI, choosing prompts whose output probability distribution p(y|x, t) has the lowest entropy, and thus aligns with MI. Since MDL favors the prompt that makes the model output a sharp probability distribution, one expected caveat of MDL is that it will not work well when the model fails to solve the given task and collapses to a single prediction regardless of the input with overly high confidence.
MI vs. ZPM
MI vs. PPL
PSSPPL is the most dissimilar from PSSMI, along with PSSLE. Since p(x, t) = p(x|t)p(t), PSSPPL can be expressed in terms of both the probability of the input given the prompt and the probability of the prompt itself. It is clear that PSSPPL differs from PSSMI because it considers the probability of x and t, which PSSMI neglects. Applying the probabilistic assumption of MI, p(x|t) = 1/|X|, to PSSPPL reduces the score to a function of p(t) alone, causing PPL to select the prompt with the lowest perplexity irrespective of the input. Since Gonen et al. (2023) even restrict their prompt format so that the input x appears at the beginning and p(x, t) is calculated only in the form p(t|x)p(x), i.e., the probability of the prompt is always conditioned on x, the probabilistic assumption of MI is incompatible with the motivation of PPL.15
3.2 Unified Analysis: Revisiting Experimental Results
Revisiting the unified evaluation in Section 2.4, the results align with our analysis from Section 3.1. GE performs well in balanced datasets but poorly in unbalanced ones due to its preference for prompts that create balanced predictions. GE also performs well in dynamic datasets since their label distribution happens to be balanced (Table 2). MDL performs comparably to GE due to similar entropy calculations. LE’s performance, however, is less satisfactory, given that its optimization objective is the opposite of MDL’s. The underperformance of PPL compared to the results of Gonen et al. (2023) might be due to our use of diverse prompt formats.16
Note that in dynamic datasets, MI’s best, worst, and average prompt performances differ from those of the other methods due to its distinct calculation of p(y|x, t) that uses only the first token logits; for the other methods, p(y|x, t) is calculated using all tokens (Section 2.2).17 This leads to a question: Is the difference in the calculation of p(y|x, t) the reason that MI performs well in balanced and unbalanced cases but poorly in dynamic cases? In addition, despite GE and MDL maximizing only a sub-term of MI, they outperform MI in balanced datasets. This observation leads to another question: Is their higher performance due to their one-hot p(y|x, t) and instance-wise prompt selection?
In the following subsection, we show that the answers to both questions are yes, demonstrating that using all tokens to calculate p(y|x, t), one-hot p(y|x, t), and instance-wise prompt selection improves the prompt selection performance of MI.
3.3 Experimental Results: Transferring Design Choices from Unified Analysis
p(y|x, t) calculation using all tokens helps MI.
To investigate the difference between using only the first token probability and the mean/sum of all tokens to calculate PSSMI, we develop a variant of MI called MIA (A for All). Unlike MI and like the other methods, MIA calculates p(y|x, t) by taking the mean of all token logits for balanced and unbalanced datasets, and the sum for dynamic datasets. Since the balanced and unbalanced datasets in our experimental setup (Section 2.4) mostly use one-token verbalizers, for which MI and MIA give the same result, we utilize new sets of verbalizers of 1-2 tokens (1 ≤ |v| ≤ 2) or 2 tokens (|v| = 2) for all the prompts of our evaluation datasets and compare the two methods. Our results in Figure 4 show that using all tokens is more effective in all configurations except for the 1-2 token balanced tasks.
One-hot p(y|x, t) and instance-wise prompt selection benefits MI.
We create combinational variants of GE, MDL, and MI (outlined in Table 4) to study whether their differences contribute to MI’s lower performance in balanced datasets. For instance, PSSGEM is an MI-like version of GE employing p(y|x, t) without one-hot encoding, while PSSMDLM is an MI-like version of MDL using the average of H(Y|x, t) over all x to select a single prompt. Conversely, MIAG and MIAL are variants of MI, with the former emulating GE and the latter mirroring MDL, on top of MIA. MIAGL is another MI variant employing the sum of PSSGE and PSSMDL as its PSS, using one-hot p(y|x, t) for the first term and instance-wise selection.
| | A | G | L | Prompt Selection Score |
| --- | --- | --- | --- | --- |
| Existing Methods | | | | |
| GE | ✓ | ✓ | – | entropy of the mean one-hot p(y\|x, t) over X |
| MDL | ✓ | – | ✓ | −H(Y\|x, t) |
| MI | ✗ | ✗ | ✗ | GEM + MDLM |
| Explored Variants | | | | |
| GEM | ✓ | ✗ | – | entropy of the mean p(y\|x, t) over X |
| MDLM | ✓ | – | ✗ | mean of −H(Y\|x, t) over X |
| MIA | ✓ | ✗ | ✗ | GEM + MDLM |
| MIAG | ✓ | ✓ | ✗ | GE + MDLM |
| MIAL | ✓ | ✗ | ✓ | GEM + MDL |
| MIAGL | ✓ | ✓ | ✓ | GE + MDL |
Figure 5 compares these variants with existing methods. The variants that use instance-wise prompt selection (MIAGL, MIAL, MDL) perform better in balanced and unbalanced datasets but underperform in dynamic ones. Particularly in balanced datasets, MIAGL, MIAL, and MIA show significant improvement. While no method is consistently superior across all datasets (as observed in Section 2.4), MIAGL significantly improves scaled F1 to 94.98% (0.6454/0.6795) compared to that of the best existing method (GE), which is 87.79% (0.5965/0.6795).
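The combinational scores of Table 4 can be sketched compactly as follows, again over an assumed matrix of p(y|x, t) per prompt (computed over all tokens, i.e., the "A" choice); function and array names are ours.

```python
# Hedged sketch of the variant Prompt Selection Scores in Table 4.
# Dataset-level scores return a scalar; instance-wise variants return per-instance scores
# that are compared across prompts for each x.
import numpy as np

def entropy(p, axis=-1, eps=1e-12):
    return -(p * np.log(p + eps)).sum(axis=axis)

def pss_gem(probs):   # H of the mean of p(y|x, t), no one-hot
    return entropy(probs.mean(axis=0))

def pss_ge(probs):    # H of the mean of one-hot predictions
    onehot = np.eye(probs.shape[1])[probs.argmax(axis=1)]
    return entropy(onehot.mean(axis=0))

def pss_mdlm(probs):  # dataset-level: mean of -H(Y|x, t)
    return -entropy(probs, axis=1).mean()

def pss_mdl(probs):   # instance-wise: -H(Y|x, t) for each x
    return -entropy(probs, axis=1)

def pss_mia(probs):   return pss_gem(probs) + pss_mdlm(probs)  # GEM + MDLM
def pss_miag(probs):  return pss_ge(probs) + pss_mdlm(probs)   # GE + MDLM
def pss_mial(probs):  return pss_gem(probs) + pss_mdl(probs)   # GEM + MDL (per instance)
def pss_miagl(probs): return pss_ge(probs) + pss_mdl(probs)    # GE + MDL (per instance)

rng = np.random.default_rng(0)
probs_per_prompt = [rng.dirichlet(np.ones(4), size=100) for _ in range(3)]
scores = np.stack([pss_miagl(p) for p in probs_per_prompt])  # (num_prompts, |X|)
chosen = scores.argmax(axis=0)                               # instance-wise selected prompt
```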
4 Improving Prompt Selection Through Enhanced Probability Calibration
While the previous section enhances prompt selection performance using combinatorial variants, in this section, we explore an orthogonal approach to further improve prompt selection: model output probability calibration.
Since all the prompt selection methods except for PPL depend on the model output probability p(y|x, t) to calculate the Prompt Selection Score (PSS), the stability and reliability of p(y|x, t) affect their prompt selection performance. However, previous works have pointed out that p(y|x, t) is unstable without calibration.18 To address the issue, Zhao et al. (2021) suggest Contextual Calibration (CC), which reduces bias towards each answer choice by employing content-free inputs (“N/A”, “[MASK]”, “”), while Holtzman et al. (2021) present Domain Conditional Pointwise Mutual Information (PMIDC), which reweights each answer choice based on its task-specific prior likelihood. We summarize the two methods for answer selection in Table 5; ŷ = argmax_y q̃(y|x, t) is selected as the answer, where q̃(y|x, t) denotes the calibrated score.
| Existing Method | Equation for Answer Selection |
| --- | --- |
| Contextual Calibration (CC) (Zhao et al., 2021) | ŷ = argmax_y p(y\|x, t) / p(y\|x_cf, t), where p(y\|x_cf, t) is averaged over the content-free inputs “N/A”, “[MASK]”, “” |
| Domain Conditional PMI (PMIDC) (Holtzman et al., 2021) | ŷ = argmax_y p(y\|x, t) / p(y\|x_domain, t) |
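To make the two rules concrete, the following is a minimal sketch of the calibrations in Table 5 in their simplest (diagonal) form; the construction of the content-free and domain inputs, and all names, are our illustrative assumptions rather than the cited papers' exact implementations.

```python
# Hedged sketch of CC and PMI_DC answer-selection calibration as reweighting by a prior
# probability obtained from a content-free or domain input.
import numpy as np

def calibrate_cc(p_y_given_xt: np.ndarray, p_y_given_cf: np.ndarray) -> np.ndarray:
    """Contextual Calibration: divide by the (averaged) content-free probabilities."""
    return p_y_given_xt / (p_y_given_cf + 1e-12)

def calibrate_pmi_dc(p_y_given_xt: np.ndarray, p_y_given_domain: np.ndarray) -> np.ndarray:
    """Domain conditional PMI: divide by p(y | x_domain, t)."""
    return p_y_given_xt / (p_y_given_domain + 1e-12)

p = np.array([0.70, 0.20, 0.10])        # toy p(y|x, t)
prior = np.array([0.60, 0.25, 0.15])    # toy p(y|x_cf, t)
print(calibrate_cc(p, prior).argmax())  # answer = argmax of the calibrated score
```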
One might assume that these existing calibration methods would effectively calibrate p(y|x, t) for PSS. However, through the experiments described in Section 4.1, we reveal in Section 4.2 the results that these methods have limitations for prompt selection and even answer selection across numerous datasets. In response, we propose an enhanced calibration method, Calibration By Marginalization (CBM), in Section 4.3. Section 4.4 shows that CBM notably improves prompt selection for most methods, particularly MI and MDLM, enabling them to achieve the highest prompt selection performance compared to all other methods. Furthermore, CBM’s answer selection enhancement is the most robust across various datasets when compared to existing calibration methods.
4.1 Experimental Setup for Probability Calibration
We compare the prompt selection performance under four different calibration scenarios: without applying any calibration; (A) applying calibration only for Answer selection, computing the calibrated score q̃(y|x, t) and selecting ŷ = argmax_y q̃(y|x, t) as the answer; (P) applying calibration only for Prompt selection; and (PA) applying calibration for both Prompt selection and Answer selection.
Normalization of q̃(y|x, t) is not required for answer selection, as it does not affect the argmax of the scores. However, to obtain PSS, it is essential to normalize q̃(y|x, t) so that it sums to one over y, thereby preserving the original probabilistic motivation of the different methods. Consequently, we apply the softmax function to convert q̃(y|x, t) into a proper probability distribution q(y|x, t).19
4.2 Experimental Results: Underperformance of Existing Calibration Methods
We check the prompt selection performance of each method across the four calibration scenarios. Surprisingly, for both CC and PMIDC, we find that all three calibration scenarios show degraded performance compared to the scenario of no calibration. Not only does the prompt selection performance degrade, but the best, worst, and average prompt performance also drops in the case of A (only answer selection). This is unexpected, as CC and PMIDC have been reported to improve performance in slightly different setups (our results are in a zero-shot setting, while the main setup of Zhao et al. (2021) is few-shot, and the choice of xdomain differs for PMIDC).
To further investigate the subpar performance in case A, we analyze the proportion of prompts (out of 100) that exhibit improved performance after applying calibration for answer selection across ten different models and 13 datasets. Figure 1b displays the average ratio for all models. The figure indicates that the existing calibration methods do not result in better answer selection for the majority of our evaluation datasets. For instance, more than half of the prompts displayed decreased performance after applying CC in 7 out of 13 datasets. A similar pattern holds when applying PMIDC.
4.3 Enhanced Calibration Method: Calibration By Marginalization (CBM)
CBM calibrates the model output by marginalizing over the evaluation inputs: the label prior p(y|t) is estimated as the mean of p(y|x, t) over all x ∈ X, and each prediction is reweighted by this prior to obtain the calibrated score q̃(y|x, t) = p(y|x, t)/p(y|t). Since the calculation of p(y|x, t) for all t ∈ T and x ∈ X is already done to perform prompt selection, CBM does not introduce any additional computational cost for calibration, unlike CC or PMIDC that require inference on additional inputs such as “N/A”, “[MASK]”, “”, and xdomain.
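A minimal sketch of CBM as described above is given below, operating on the same assumed p(y|x, t) matrix used in the earlier sketches; names are ours, and the softmax step mirrors the normalization of Section 4.1.

```python
# Hedged sketch of CBM: estimate p(y|t) by marginalizing the already-computed p(y|x, t)
# over the evaluation inputs, reweight each prediction by this prior, and (when a proper
# distribution is needed for PSS) apply a row-wise softmax.
import numpy as np

def cbm_calibrate(probs: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """probs: (|X|, |Y|) matrix of p(y|x, t) for one prompt t.
    Returns the calibrated scores and their softmax-normalized version q(y|x, t)."""
    prior = probs.mean(axis=0, keepdims=True)   # p(y|t) via marginalization over x
    scores = probs / (prior + 1e-12)            # calibrated (unnormalized) scores
    q = np.exp(scores - scores.max(axis=1, keepdims=True))
    q = q / q.sum(axis=1, keepdims=True)        # proper distribution for computing PSS
    return scores, q

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(4), size=100)
scores, q = cbm_calibrate(probs)
answers = scores.argmax(axis=1)                 # calibrated answer selection (scenario A)
```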
4.4 Experimental Results: Improvement with CBM Calibration
Figure 6 presents the prompt selection performance of each probability-based prompt selection method across the four calibration scenarios of applying CBM. Applying CBM calibration for answer selection (A) enhances prompt selection performance across all methods. Scenarios involving calibration for prompt selection (PA, P) mostly result in unchanged or decreased prompt selection performance compared to the cases without calibration, and applying calibration solely for prompt selection (P) consistently results in diminished performance.
The methods displaying the most significant performance improvements in the PA scenario are MIAG, MIA, MI, and MDLM, particularly with the prompt selection performance of MIA(PA) and MDLM(PA) being the highest among different methods. On average, MIA(PA) increases the scaled F1 from 87.79% (0.5965/0.6795) to 99.44% (0.6757/0.6795) compared to the best existing method (GE) when the oracle prompt without calibration is used as the target of comparison. The scaled F1 of MIA(PA) calculated with respect to the oracle prompt with calibration is 96.85% (0.6757/0.6977).
Next, we assess the effectiveness of CBM calibration for answer selection by examining the proportion of prompts (out of 100) that show improved performance after applying calibration for answer selection. Figure 1b indicates that CBM is considerably more effective than CC and PMIDC in enhancing the performance of the prompts. The performance of more than half of the prompts increases after applying CBM in all 13 datasets. Additionally, the performance of nearly 100% of prompts improves with CBM calibration in 7 datasets. While CC and PMIDC improved almost none of the F1 of the prompts in story and hella, the performance of approximately 70% of the prompts increased with CBM calibration, possibly due to the more accurate calculation of p(y|t) as discussed in Section 4.3.
5 Discussion
In this section, we discuss various findings that are relevant to our main experiments.
Figure 7a shows that the effectiveness of a probability-based prompt selection method remains consistent across models of different types and numbers of parameters, justifying our choice of using a single model (OPT 2.7B) as the representative for all experiments. Figure 7b shows that the trend of correlation between Prompt Selection Score and performance of the selected prompt is also quite consistent between different models.
Figure 8 shows the mean and standard deviation of the result of prompt selection among five different subsets of 50 prompts randomly sampled from the full set of 100 prompts, using the mainly discussed methods. The result shows that the performance of instance-wise prompt selection methods (MIAGL, MIAL, MDL) is not stable, likely due to the noisy nature of selecting one prompt for each instance. However, the performance of MIA(PA) and MDLM(PA) still achieves the highest performance and also shows the lowest standard deviation, proving the effectiveness of CBM.
Through additional analysis, we find that (1) while strong performance in prompt selection does not consistently correlate with Prompt Selection Score, a broadly positive correlation is observed when averaged across most methods; (2) CBM improves the performance of MDLM by mitigating overconfidence; (3) MI, GE, and CBM methods face limitations when applied to dynamic datasets with extreme label imbalance; (4) top-performing prompt selection methods from the zero-shot setting, like MIA(PA) and MDLM(PA), retain their effectiveness in the few-shot setting, further validating their robustness across different conditions.
6 Related Works
Recent advances in LLMs have created the paradigm of prompt-based learning, which gives the benefit that a single pretrained LLM can be used to solve a great number of tasks with task-specific prompts. However, the performance of LLMs can heavily fluctuate according to the choice of prompts (Zhao et al., 2021; Holtzman et al., 2021; Lu et al., 2022). To mitigate this issue, prompt engineering attempts to find the prompt that results in the most effective performance on the downstream task (Liu et al., 2023).
Automatic prompt engineering methods can be largely divided into two groups: the methods that use discrete prompts where the prompts are human-understandable actual text strings, and the methods that optimize continuous prompts where the prompts lie in the embedding space of the model (Li and Liang, 2021; Shin et al., 2020). Probability-based prompt selection methods that we study in this work (Section 2.2) fall into the former group; most of the methods of the latter group require gradient-based training, while probability-based prompt selection does not perform any gradient-based update.
Prompt engineering methods using discrete prompts include prompt paraphrasing, prompt generation, and prompt selection. Among these, prompt paraphrasing or generation approaches can be used together with probability-based selection methods; prompt selection can be performed on the prompts generated through prompt paraphrasing or generation (Jiang et al., 2020; Mishra et al., 2022; Gao et al., 2021; Wang et al., 2023; Prasad et al., 2023; Kim et al., 2022; Deng et al., 2022). Among prompt selection methods other than the probability-based approaches, a large portion of the methods are not easily utilizable since they require training an additional model and/or the use of an additional component. Zhang et al. (2022b) use reinforcement learning for demonstration selection of in-context learning; Chang and Jia (2023) train a scorer and estimator for demonstration selection; Kumar and Talukdar (2021) and Xu et al. (2022) use a genetic algorithm; Liu et al. (2022), Lyu et al. (2023), and Rubin et al. (2022) use retrieval from a corpus to select the prompts.
On the other hand, probability-based prompt selection offers the advantage of prompt selection requiring only the output probabilities of the LLM. While the prerequisite is a set of candidate prompts to select from, this data is relatively small in size and can be easily obtained from the research community (Bach et al., 2022) or via machine generation (OpenAI, 2023). One limitation of these methods, though, is that one cannot use them for closed-source LLMs that are only available via proprietary LLM APIs that do not provide output probability distributions. Also, when the number of candidate prompts |T| and the size of the dataset used to select the prompt |X| are large, the calculation for prompt selection becomes computationally heavy; using a smaller subset X′ ⊂ X to choose the prompt for X can be helpful in such a case.
7 Conclusion
In this paper, we address the need for a comprehensive evaluation to compare the existing probability-based prompt selection methods, which have been proposed and evaluated under varying conditions and datasets. To achieve this, we introduce a unified evaluation setup to compare these methods, conduct a thorough evaluation, and develop a unified framework of the existing probability-based prompt selection methods. Our analysis within this unified framework has provided insights into the relationship among existing methods, enabling the development of several combinational variants that improve performance. Furthermore, our research on probability calibration has revealed the limitations of existing calibration methods and led to the proposal of an enhanced calibration method, Calibration By Marginalization (CBM). CBM not only significantly improves prompt selection performance but also demonstrates robust answer selection enhancement across multiple datasets. We hope that our unified setup provides a foundation for fair evaluation between various prompt selection methods and that our findings yield deeper insights into probability-based prompt selection.
Acknowledgments
The authors would like to extend their sincere gratitude to the anonymous reviewers and action editor for their highly detailed and insightful comments and feedback. The authors would also like to thank Sang-Woo Lee for valuable feedback and discussions on the project. This work was partly supported by KT grant (2021, A study on a conversational language model that uses long external text as a prompt, 80%) and Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No.2021-0-02068, Artificial Intelligence Innovation Hub, 20%).
Notes
The code and datasets used in our work are available at https://github.com/soheeyang/unified-prompt-selection.
Note that one can perform computation-efficient prompt selection or transfer of prompt selection by (1) selecting one prompt using a subset of X or a separate development set X′ and then (2) using the selected prompt for the target evaluation dataset X, instantiating it with all x ∼ PX. However, following the conventional setup of the previous studies and for comparison with instance-wise prompt selection methods, where such an approach is not applicable by design, we do not use a separate X′.
The number of oracle prompts can be greater than one, but we use the singular form for a more concise presentation.
We have performed additional experiments in a few-shot learning setup using the texts of permutations of varying numbers of in-context learning demonstrations as the prompts. However, we do not include these results in the paper due to space limitations; also, the overall trend of the results stays similar to that of the zero-shot learning setup.
They generate a probing set with demonstrations from the training set and use the probing set to find the best order.
The answer choices are sentences and vary dynamically for each evaluation instance. In these datasets, the label index is not connected to some concept, unlike the datasets with static choices (e.g., 0 is negative and 1 is positive in sst2), so the ratio of labels is not meaningful. However, all the datasets of dynamic categories that we use have balanced label distribution.
The generation, review, and correction are done by the first two authors of the paper.
Interpretations of these results are provided in Section 3.2.
One expected caveat of the zero-label prompt selection methods is that they might not work well when a large portion of the prompts fail to solve the given task. Therefore, Liao et al. (2022) propose a way to filter out low-quality prompts in advance, but the filtering algorithm does not benefit their proposed methods in our experimental setup.
Note that our experimental setup also differs from the setup of Gonen et al. (2023); we generated the prompts in an unrestricted manner such that x can appear anywhere in the prompt.
We allow the input x to appear anywhere in the prompt, unlike their restricted setup where x always comes at the beginning.
In balanced and unbalanced cases, the number of tokens of most verbalizers is 1, so the best, worst, and average prompt performances of the prompts whose performance is calculated using only the first token are identical to the other methods; on the other hand, the verbalizer is a sentence for dynamic datasets and makes the difference.
To calculate PMIDC, it is necessary to manually select xdomain for each prompt in every dataset. Nonetheless, our experiments involve a total of 1,300 unique prompts, making a manual determination of different xdomain for each prompt a tedious task. Therefore, we use the prompt instantiated with an empty input (xdomain = ι(“”, t)) for each prompt.
We can ignore the lack of because it does not change the result of .
References
Author notes
This project was initiated while the first author was a Master’s student at KAIST (Nov 2022 - Feb 2023).
Work done as an intern at KAIST.
Action Editor: Hermann Ney