Improving Probability-based Prompt Selection Through Unified Evaluation and Analysis

Previous work in prompt engineering for large language models has introduced different gradient-free probability-based prompt selection methods that aim to choose the optimal prompt among the candidates for a given task, but has failed to provide a comprehensive and fair comparison among them. In this paper, we propose a unified framework to interpret and evaluate the existing probability-based prompt selection methods by performing extensive experiments on 13 common and diverse NLP tasks. We find that each of the existing methods can be interpreted as some variant of the method that maximizes mutual information between the input and the predicted output (MI). Utilizing this finding, we develop several other combinatorial variants of MI and increase the effectiveness of the oracle prompt selection method from 87.79% to 94.98%, measured as the ratio of the performance of the selected prompt to that of the optimal oracle prompt. Furthermore, considering that all the methods rely on the output probability distribution of the model that might be biased, we propose a novel calibration method called Calibration by Marginalization (CBM) that is orthogonal to the existing methods and helps increase the prompt selection effectiveness of the best method to 96.85%, achieving 99.44% of the oracle prompt F1 without calibration.


Introduction
Large Language Models (LLMs) have demonstrated remarkable performance in solving various natural language processing tasks through prompt-based learning without requiring additional task-specific training (Brown et al., 2020; Dong et al., 2023). However, the performance of LLMs can heavily fluctuate according to the choice of prompts (Zhao et al., 2021; Holtzman et al., 2021; Lu et al., 2022). While various prompt engineering approaches have been proposed to mitigate this issue, the nontrivial prerequisites of many of these methods, such as training an additional model and/or using an additional component, have been a bottleneck to their real-world application (Liu et al., 2023; Li and Liang, 2021; Jiang et al., 2020; Prasad et al., 2023; Liu et al., 2022; Rubin et al., 2022).
On the other hand, probability-based prompt selection methods do not require any additional parameter updates or additional components and thus provide a promising and easily applicable solution; these methods aim to select the prompt from a set of prompts that is expected to be most effective in helping a language model to make correct predictions, solely based on the probability distribution of the model (Sorensen et al., 2022; Lu et al., 2022; Wu et al., 2023; Liao et al., 2022; Gonen et al., 2023). However, despite their ease of utilization, there has been a lack of comprehensive comparative evaluation between existing probability-based prompt selection methods, as each method is proposed in a different setup and evaluated on different datasets, evaluation instances, sets of prompts, and models.
In this paper, we first carefully design a unified evaluation setup to facilitate a fair comparison between different prompt selection methods. Our unified evaluation reveals that no single method consistently outperforms the others across all datasets and that all existing probability-based prompt selection methods roughly correspond to a sub-term of the equation of Mutual Information (MI) (Sorensen et al., 2022). We utilize this discovery to propose several variants of MI that use different combinations of the components of existing methods; the best combinational variant, MI_AGL, increases the scaled F1 (F1 divided by that of the oracle prompt, a measure of the effectiveness of the prompt selection method) from 87.79% to 94.98% (MI_AGL of Figure 1a).
Furthermore, we find the need for a better approximation of the LLM's output probability distribution, considering that all probability-based prompt selection methods rely on probabilistic estimates from the model that might be biased. Therefore, by drawing a connection between the existing model output probability calibration methods (Zhao et al., 2021; Holtzman et al., 2021), we propose an enhanced calibration method, Calibration By Marginalization (CBM). CBM significantly improves the prompt selection performance of several methods when applied to calibrate the output probability of LLMs, increasing the best scaled F1 to 96.85% (MI_A^(PA) of Figure 1a) and achieving 99.44% of the oracle prompt F1 under the uncalibrated scenario. CBM also shows the most robust answer selection enhancement across multiple datasets compared to the existing calibration methods (Figure 1b).

Probability-based Prompt Selection
In this section, we perform a unified evaluation of existing probability-based prompt selection methods. First, we describe the task of probability-based prompt selection in Section 2.1. Next, we briefly introduce each of the existing methods in Section 2.2. Then, we describe our experimental setup for unified evaluation in Section 2.3 and present the evaluation results in Section 2.4.

Task Description
Probability-based prompt selection is the task of selecting one or more prompts from a list of prompts T that are expected to help the language model θ make the most accurate prediction for the evaluation dataset X, where the evaluation instances are drawn from the data distribution, x ∼ P_X, utilizing only the output probability distributions of the model on X, without knowing the ground truth labels and using neither additional gradient-based updates nor other trained components.

[Figure 1(b) caption: Ratio of the prompts (out of 100) whose F1 on each dataset improves by applying probability calibration for answer selection, averaged across 10 models. Our proposed calibration method, CBM (Equation 1), is considerably more effective than CC and PMI_DC (Table 5) in enhancing the answer selection performance of the prompts.]

The performance of a probability-based prompt selection method is evaluated by how high a score on the evaluation metric is obtained with the selected prompt(s). When one prompt is selected for the whole dataset, the performance is upper bounded by the performance obtained with the prompt with which the model achieves the best metric score; we call such a prompt the optimal oracle prompt. When one prompt is selected for each x ∼ P_X, a different t ∈ T can be chosen for each x; we call such a prompt selection approach instance-wise prompt selection.
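As an illustrative sketch of this evaluation protocol, the scaled metric used throughout the paper (the performance of the selected prompt divided by that of the optimal oracle prompt) can be computed as follows; the F1 values and the selection index are hypothetical, used only to show the mechanics:

```python
import numpy as np

def scaled_f1(f1_per_prompt, selected):
    """Effectiveness of a prompt selection method: the F1 of the chosen
    prompt divided by the F1 of the optimal oracle prompt."""
    f1_per_prompt = np.asarray(f1_per_prompt)
    return float(f1_per_prompt[selected] / f1_per_prompt.max())

# Hypothetical F1 of 4 candidate prompts on one dataset; a method picked prompt 2.
print(scaled_f1([0.52, 0.61, 0.58, 0.47], selected=2))  # 0.58 / 0.61 ≈ 0.9508
```

Selecting the oracle prompt itself yields a scaled F1 of exactly 1.0, which is the upper bound mentioned above.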
Note that the definition of prompt can vary according to the setup for which prompt selection is performed. When prompt selection is applied to zero-shot learning, prompts are defined as various formats of text templates that are filled by evaluation instances x ∼ P_X to facilitate solving the task. On the other hand, for few-shot (in-context) learning, prompts are often defined as demonstrations sampled from a training/development set or texts of permutations of such demonstrations. In our work, in order to enable comparison between all the methods, proposed either in the zero-shot or the few-shot setup, we perform prompt selection in a zero-shot setup with the former definition of prompt.

(One could alternatively (1) select one prompt using a subset of X or a separate development set X′ and then (2) use the selected prompt for the target evaluation dataset X to instantiate all x ∼ P_X. However, following the conventional setup of the previous works, and for comparison with instance-wise prompt selection methods where such an approach is not applicable by design, we do not use a separate X′.)

[Table 1: Summary of the existing probability-based prompt selection methods and the quantity each maximizes. The recoverable rows include MDL (Wu et al., 2023), which selects arg max_t −H(Y|x, t), and Zero-Label Prompt Selection (Liao et al., 2022), which selects the prompt maximizing the number of instances where arg max_y p(y|x, t) = arg max_y s(x, y).]

Concrete Example  Examples of prompts t ∈ T include "Which category does the following news article fall into? {text}", "The following news article, {text}, covers the topic of", and "{text} belongs in which category: Politics, Sports, Business, Science and Technology". We say that x instantiates the prompt t when x is inserted into the placeholder {text} of the prompt template, and we let ι(x, t) denote the instantiated prompt. Each of the answer categories represents the concept of politics, sports, business, and science/technology, and uses "Politics," "Sports," "Business," and "Science and Technology" as the verbalizer (the actual text evaluated to score the answer choices), respectively.
For instance, given OPT 2.7B (Zhang et al., 2022a) as the language model, "King Charles III's Coronation watched by more than 18 million viewers" as x, and the three prompts shown as examples in the previous paragraph, a prompt selection method should choose the prompt that is most likely to help OPT 2.7B correctly predict the answer y among the possible answer choices Y, which represent the concepts of politics, sports, business, and science/technology. To select such a prompt, the method must rely solely on the output probability of the model given the instantiated prompts as input, e.g., p("Politics" | "Which category . . . King . . .").

(We have performed additional experiments in a few-shot learning setup using the texts of permutations of varying numbers of in-context learning demonstrations as the prompts. However, we do not include these results in the paper due to space limitations; the overall trend of the results stays similar to that of the zero-shot learning setup.)

Existing Approaches
Table 1 provides a summary of the existing approaches for probability-based prompt selection. In the equations, we use p(y|x, t) ∈ R^|Y| to express the output probability distribution of the model over the answer choices, P_θ(Y | X = x, T = t), when the instantiated prompt ι(x, t) is given as the input. The probability for each y ∈ Y is calculated as p(y|x, t) = exp(log p(y|x, t)) / Σ_{y′∈Y} exp(log p(y′|x, t)), where log p(y|x, t) is the unnormalized logit that the model outputs. When y's verbalizer is tokenized into more than one token, we calculate log p(y|x, t) as the mean of the log-probability over the tokens of the verbalizer for datasets with fixed answer choices, and as the sum of the log-probability for datasets with dynamically changing sentence-type answer choices, except for the method proposed by Sorensen et al. (2022), which explicitly specifies that the calculation of p(y|x, t) uses only the logits of the first token (dubbed One-Token Response (OTR) in their work). We use H(q(y)) to denote the entropy of an arbitrary probability distribution q(y) ∈ R^|Y|, −Σ_{y∈Y} q(y) log q(y). When q(y) = p(y|x, t), we use H(Y|x, t) to represent its entropy H(Y|X = x, T = t).
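The probability computation above can be sketched as follows; the token log-probabilities are invented for illustration, and the aggregation mode mirrors the mean/sum convention described in the text:

```python
import numpy as np

def softmax(logits):
    z = np.asarray(logits) - np.max(logits)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def verbalizer_logprob(token_logprobs, mode="mean"):
    """Aggregate per-token log-probabilities of a multi-token verbalizer:
    mean for fixed answer choices, sum for dynamic sentence-type choices."""
    agg = np.mean if mode == "mean" else np.sum
    return float(agg(token_logprobs))

def p_y_given_xt(per_choice_token_logprobs, mode="mean"):
    """p(y|x,t): softmax over the aggregated log-probability of each choice."""
    logits = [verbalizer_logprob(lp, mode) for lp in per_choice_token_logprobs]
    return softmax(logits)

# Hypothetical token log-probs for a 1-token verbalizer ("Politics") and a
# 3-token verbalizer ("Science and Technology") under some instantiated prompt.
p = p_y_given_xt([[-0.7], [-1.2, -0.9, -1.5]], mode="mean")
print(p.sum())  # 1.0
```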
Mutual Information (MI)  Sorensen et al. (2022) propose to select the one prompt for the evaluation dataset that maximizes the mutual information between the evaluation instances X and their corresponding model predictions Y given the prompt t.

Zero-Label Prompt Selection (ZLP, ZPM, ZMV)  Liao et al. (2022) propose to make a pseudo-label for each x by ensembling the outputs over all prompts into a score s(x, y), and then choosing the one prompt t for the evaluation dataset that maximizes the number of instances where arg max_{y∈Y} p(y|x, t) = arg max_{y∈Y} s(x, y). As shown in Table 1, they propose three ways to calculate s(x, y): using the ensemble of the log-probability mean, the probability mean, and the majority vote. We refer to these as ZLP, ZPM, and ZMV, respectively. While the authors of the original work applied filtering of prompts, we observed in our preliminary experiments that filtering does not have a significant effect.
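A minimal sketch of the zero-label selection rule under the definitions above; the array of p(y|x, t) values is fabricated for illustration, and the function names are ours, not the authors':

```python
import numpy as np

def zero_label_select(P, mode="ZPM"):
    """P: [T, X, Y] array of p(y|x,t).  Build the ensemble score s(x,y)
    (ZLP: mean log-prob; ZPM: mean prob; ZMV: majority vote over argmax
    predictions), take one-hot pseudo-labels, and return the prompt whose
    predictions agree with the pseudo-labels on the most instances."""
    if mode == "ZLP":
        s = np.log(P).mean(axis=0)                 # [X, Y]
    elif mode == "ZPM":
        s = P.mean(axis=0)
    else:                                          # "ZMV"
        preds = P.argmax(axis=2)                   # [T, X]
        s = np.stack([(preds == y).mean(axis=0)
                      for y in range(P.shape[2])], axis=1)
    pseudo = s.argmax(axis=1)                      # pseudo-label per instance
    agreement = (P.argmax(axis=2) == pseudo).sum(axis=1)
    return int(agreement.argmax())

# Illustrative probabilities for 2 prompts, 2 instances, 2 labels:
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.6, 0.4], [0.6, 0.4]]])
print(zero_label_select(P, mode="ZPM"))  # 0
```

Prompt 0 is chosen here because its per-instance predictions match the ensemble pseudo-labels on both instances, whereas prompt 1 matches on only one.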
Perplexity (PPL)  Gonen et al. (2023) propose to select the one prompt for the evaluation dataset with which the language model exhibits the lowest average perplexity of the instantiated prompt ι(x, t), as shown in the last row of Table 1, where ι(x, t)_i represents the i-th token of the instantiated prompt ι(x, t). We include the geometric mean in the definition of p(x, t) because the averaged probability is often used to approximate the probability of a sequence.
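A sketch of the PPL criterion under the geometric-mean formulation above; the per-token log-probabilities are invented for illustration:

```python
import numpy as np

def perplexity(token_logprobs):
    """Perplexity of an instantiated prompt ι(x,t): the exponential of the
    negative mean per-token log-probability (a geometric-mean formulation)."""
    return float(np.exp(-np.mean(token_logprobs)))

def select_by_ppl(logprobs_per_prompt):
    """logprobs_per_prompt[t][x] holds the token log-probs of ι(x,t);
    choose the prompt with the lowest perplexity averaged over instances."""
    avg = [np.mean([perplexity(lp) for lp in per_x])
           for per_x in logprobs_per_prompt]
    return int(np.argmin(avg))

# A prompt whose tokens all have probability 1/2 has perplexity 2.
print(perplexity([-np.log(2)] * 5))  # 2.0
```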

Experimental Setup
Evaluation Datasets  Our dataset selection, aimed at a fair measurement of various probability-based prompt selection methods, is guided by several factors. We favor datasets previously used in research, those encompassing diverse domains, and datasets where prompt selection is meaningful. We exclude datasets where all prompts underperform a random baseline or where a naive baseline of selecting the mode label could excel due to high imbalance. By excluding the datasets with high imbalance, we aim to avoid false positive cases where a failed algorithm that collapses to selecting one label regardless of the input is evaluated as a competitive method by chance. The selected datasets have diverse label types and distributions, and we categorize them based on their label distributions into balanced (label distribution is about 1:1), unbalanced (otherwise), and dynamic categories. The 13 datasets selected through this process are shown in Table 2.

Prompts  We create a diverse range of 100 prompts for each of the 13 evaluation datasets, which results in 1,300 prompts in total. For each dataset, a few of the 100 prompts are taken from PromptSource (Bach et al., 2022), and the rest are generated using GPT 3.5 (OpenAI, 2023) to speed up the prompt generation process and are then manually reviewed and corrected. The prompts are designed to encompass various formats, with the evaluation instance and sometimes the answer choices appearing at different positions within the prompt, to ensure that the prompt selection task is meaningful. Table 3 shows a few examples of the prompts. We use one-token words as the verbalizers for the answer choices in most prompts, except for the prompts for the datasets of the dynamic category.

Models
We conduct the majority of our experiments with ten different models of varying sizes ranging from 1.3B to 66B parameters. However, to present the experimental results and analysis more clearly, we report most results using a single representative model, OPT 2.7B (this choice is justified in the Discussion section). Inference is performed using one to four NVIDIA V100 32GB GPUs.

Experimental Results
We find that no single probability-based prompt selection method consistently outperforms the others across all 13 datasets and evaluation categories. While PPL and LE do not rank first on any dataset, every other method ranks first on a few datasets. Figure 2 illustrates the selected prompt performance averaged by category, along with the performance of the best (oracle) and worst prompts and the average performance of all prompts. In the balanced category, GE and MDL outperform the others, with MI closely following. In the unbalanced category, MI stands out, while in the dynamic category, GE, MDL, and ZLP perform best. LE and PPL generally underperform on all of the datasets; their task average does not even exceed the average performance of all prompts (interpretations of these results are provided in Section 3.2). We conclude that no single existing approach is significantly better than the others, especially when dividing the evaluation dimensions into balanced, unbalanced, and dynamic labels.

Improving MI via Unified Analysis
In this section, we first derive a unified view of prompt selection methods in Section 3.1, showing that each method other than MI roughly corresponds to a sub-term of the equation of MI, and revisit the previous experimental results for a unified analysis in Section 3.2. Then, from the unified view and analysis, we identify the differences between methods, particularly MI, GE, and MDL, and derive a few combinational variants by transferring design elements across methods, which improves the prompt selection performance of MI.

Unified View: Identifying Connections Between Methods
Prompt Selection Score (PSS)  Figure 3 offers a unified view of the existing probability-based prompt selection methods, highlighting that each method except for MI approximately corresponds to a sub-term in the equation of MI. We denote the highlighted part for each method as its Prompt Selection Score (PSS_method): the score whose maximizing prompt is chosen by the prompt selection method.
MI vs. GE and LE  MI selects a prompt that maximizes the first term of PSS_MI, H( (1/|X|) Σ_x p(y|x, t) ), and minimizes the second term, (1/|X|) Σ_x H(Y|x, t). This means that MI favors prompts that provide balanced predictions without label bias (interpretation of the first term) and a sharp answer prediction distribution across all instances in the dataset (interpretation of the second term). These terms roughly correspond to PSS_GE and −PSS_LE, respectively. The difference between PSS_GE and the first term of PSS_MI is that the former converts p(y|x, t) to one-hot before taking the entropy of the mean. In sum, the prompts selected by GE and MI align, while those chosen by LE and MI tend to be opposite. Note that one expected caveat of GE is that it will be less effective when the dataset itself has a label imbalance.

MI vs. MDL  MDL is the only method among the presented probability-based prompt selection methods that selects a different prompt for each evaluation instance x, i.e., performs instance-wise prompt selection. Essentially, MDL is an instance-wise version of the second term of PSS_MI, choosing prompts whose output probability distribution p(y|x, t) has the lowest entropy, and thus it aligns with MI. Since MDL favors the prompt that makes the model output a sharp probability distribution, one expected caveat of MDL is that it will not work well when the model fails to solve the given task and collapses to a single prediction regardless of the input with overly high confidence.
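Under these definitions, the scores can be sketched from a matrix of p(y|x, t) values for a single prompt (the matrix below is fabricated; MI is the dataset-level score, and MDL yields per-instance scores):

```python
import numpy as np

def entropy(q, axis=-1):
    return -(q * np.log(q)).sum(axis=axis)

def pss(P):
    """P: [X, Y] matrix of p(y|x,t) for a single prompt t.
    PSS_MI = H(mean_x p(y|x,t)) - mean_x H(Y|x,t); its first term roughly
    corresponds to PSS_GE (which one-hots p first) and its second term to
    PSS_LE; MDL's per-instance scores are the negated per-instance entropies."""
    first = entropy(P.mean(axis=0))                    # entropy of the mean prediction
    per_instance = entropy(P, axis=1)                  # H(Y|x,t) for each x
    onehot_mean = np.eye(P.shape[1])[P.argmax(axis=1)].mean(axis=0)
    ge = entropy(np.clip(onehot_mean, 1e-12, None))    # clip to avoid log(0)
    return {"MI": float(first - per_instance.mean()),
            "GE": float(ge),
            "LE": float(per_instance.mean()),
            "MDL": -per_instance}                      # instance-wise scores

# A prompt with balanced yet sharp predictions scores high on PSS_MI.
P = np.array([[0.95, 0.05],
              [0.05, 0.95]])
scores = pss(P)
print(scores["MI"] > 0)  # True
```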
MI vs. ZPM  Zero-label prompt selection methods ensemble the results of all prompts to calculate s(x, y), create pseudo-labels by converting s(x, y) to one-hot, and then choose the prompt with predictions most similar to the pseudo-labels. This view applies to PSS_ZPM under an assumption of [...].

MI vs. PPL  [...] Gonen et al. (2023) even restrict their prompt format so that the input x appears at the beginning, and p(x, t) is therefore calculated only in the form p(t|x)p(x); since the probability of the prompt is always conditioned on x, the probabilistic assumption of MI is incompatible with the motivation of PPL.

Unified Analysis: Revisiting Experimental Results
Revisiting the unified evaluation in Section 2.4, the results align with our analysis from Section 3.1. GE performs well on balanced datasets but poorly on unbalanced ones due to its preference for prompts that create balanced predictions. GE also performs well on dynamic datasets, since their label distribution happens to be balanced (Table 2). MDL performs comparably to GE due to their similar entropy calculations. LE's performance, however, is less satisfactory, given that its optimization contradicts MDL's. The underperformance of PPL compared to that reported by Gonen et al. (2023) might be due to our use of diverse prompt formats; note that our experimental setup also differs from that of Gonen et al. (2023) in that we generated the prompts in an unrestricted manner, such that x can appear anywhere in the prompt.

(One expected caveat of the zero-label prompt selection methods is that they might not work well when a large portion of the prompts fail to solve the given task. Therefore, Liao et al. (2022) propose a way to filter out low-quality prompts in advance, but the filtering algorithm does not benefit their proposed methods in our experimental setup.)
Note that on dynamic datasets, MI's best, worst, and average prompt performances differ from those of the other methods due to its distinct calculation of p(y|x, t), which uses only the first-token logits; for the other methods, p(y|x, t) is calculated using all tokens (Section 2.2). This leads to a question: is the difference in the calculation of p(y|x, t) the reason that MI performs well in the balanced and unbalanced cases but poorly in the dynamic case? In addition, despite GE and MDL maximizing sub-terms of MI, they outperform MI on balanced datasets. This observation leads to another question: is their higher performance due to their one-hot p(y|x, t) and instance-wise prompt selection?
In the following subsection, we show that the answers to both questions are yes, demonstrating that using all tokens to calculate p(y|x, t), one-hot p(y|x, t), and instance-wise prompt selection all improve the prompt selection performance of MI.

Experimental Results: Transferring Design Choices from Unified Analysis

p(y|x, t) calculation using all tokens helps MI.  To investigate the difference between using only the first-token probability and the mean/sum over all tokens to calculate PSS_MI, we develop a variant of MI called MI_A (A for All). Unlike MI and like the other methods, MI_A calculates p(y|x, t) by taking the mean of all token logits for balanced and unbalanced datasets, and the sum for dynamic datasets. Since the balanced and unbalanced datasets in our experimental setup (Section 2.4) mostly use one-token verbalizers, for which MI and MI_A give the same result, we utilize new sets of verbalizers of 1-2 tokens (1 ≤ |v| ≤ 2) or 2 tokens (|v| = 2) for all the prompts of our evaluation datasets and compare the two methods. Our results in Figure 4 show that using all tokens is more effective in all configurations except for the 1-2-token balanced tasks.
One-hot p(y|x, t) and instance-wise prompt selection benefit MI.  We create combinational variants of GE, MDL, and MI (outlined in Table 4) to study whether these differences contribute to MI's lower performance on balanced datasets. For instance, PSS [...]
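Table 4's exact variant definitions are not reproduced here, but the two transferred design elements, one-hot conversion of p(y|x, t) and instance-wise selection, can be sketched as follows (the arrays and the MDL-style score are illustrative assumptions, not the authors' exact formulation):

```python
import numpy as np

def one_hot_rows(P):
    """Replace each row of p(y|x,t) with a one-hot vector at its argmax,
    as GE does before taking the entropy of the mean."""
    return np.eye(P.shape[1])[P.argmax(axis=1)]

def instancewise_select(P_all):
    """Instance-wise selection (as in MDL): P_all is [T, X, Y]; for each x,
    pick the prompt t whose p(y|x,t) has the lowest entropy."""
    ent = -(P_all * np.log(P_all)).sum(axis=2)   # [T, X] per-instance entropies
    return ent.argmin(axis=0)                    # chosen prompt per instance

# Illustrative example: prompt 0 is confident on x0, prompt 1 on x1.
P_all = np.array([[[0.9, 0.1], [0.5, 0.5]],
                  [[0.6, 0.4], [0.1, 0.9]]])
print(instancewise_select(P_all))  # [0 1]
```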

Improving Prompt Selection Through Enhanced Probability Calibration
While the previous section enhances prompt selection performance using combinational variants, in this section we explore an orthogonal approach to further improve prompt selection: model output probability calibration. Since all the prompt selection methods except for PPL depend on the model output probability p(y|x, t) to calculate the Prompt Selection Score (PSS), the stability and reliability of p(y|x, t) affect their prompt selection performance. However, previous works have pointed out that p(y|x, t) is unstable without calibration. To address the issue, Zhao et al. (2021) suggest Contextual Calibration (CC), which reduces bias towards each answer choice by employing content-free inputs ("N/A", "[MASK]", ""), while Holtzman et al. (2021) present Domain Conditional Pointwise Mutual Information (PMI_DC), which reweights each answer choice based on its task-specific prior likelihood. We summarize the two methods for answer selection in Table 5; arg max_y q(y|x, t) is selected as the answer, where q(y|x, t) is the calibrated score.
One might assume that these existing calibration methods would effectively calibrate p(y|x, t) for PSS. However, through the experiments described in Section 4.1, we reveal in Section 4.2 that these methods have limitations for prompt selection, and even for answer selection, across numerous datasets. In response, we propose an enhanced calibration method, Calibration By Marginalization (CBM), in Section 4.3. Section 4.4 shows that CBM notably improves prompt selection for most methods, particularly MI and MDL_M, enabling them to achieve the highest prompt selection performance among all methods. Furthermore, CBM's answer selection enhancement is the most robust across various datasets when compared to the existing calibration methods.

Experimental Setup for Probability Calibration
We compare the prompt selection performance under four different calibration scenarios: applying no calibration; (A) applying calibration only for Answer selection, computing q(y|x, t) where arg max_y q(y|x, t) is selected as the answer; (P) applying calibration only for Prompt selection; and (PA) applying calibration for both Prompt selection and Answer selection. Normalization of q(y|x, t) is not required for answer selection, as it does not affect the arg max of the scores. However, to obtain PSS, it is essential to normalize q(y|x, t) so that it sums to one, thereby preserving the original probabilistic motivation of the different methods. Consequently, we apply the softmax function to convert q(y|x, t) into a proper probability distribution.

(Zhao et al. (2021) find that the probability in few-shot learning tends to favor certain answer choices that appear at the end of the prompt or are common in the pretraining data. Holtzman et al. (2021) note that ranking based on string probability can be problematic due to surface form competition.)

[Table 5: Existing calibration methods proposed for answer selection; arg max_y q(y|x, t) is selected as the answer for the prompt t instantiated by input instance x. The recoverable rows are Contextual Calibration (CC) (Zhao et al., 2021), which divides p(y|x, t) by the probability assigned to content-free inputs, and PMI_DC (Holtzman et al., 2021), with q(y|x, t) = log [p(y|x, t) / p(y|x_domain, t)]. Note that the actual calculation of CC in the official code uses p̄_cf, the mean-normalized p_cf; thus, we also use it in our experiments.]
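A sketch of the two calibrated answer-selection rules summarized in Table 5 (the probability vectors are illustrative; CC uses the mean-normalized content-free probability, as noted above):

```python
import numpy as np

def cc_answer(p, p_cf):
    """Contextual Calibration: divide p(y|x,t) by the mean-normalized
    probability of content-free inputs ("N/A", "[MASK]", "")."""
    p_cf = np.asarray(p_cf) / np.mean(p_cf)
    return int(np.argmax(np.asarray(p) / p_cf))

def pmi_dc_answer(p, p_domain):
    """PMI_DC: reweight each choice by its domain-conditional prior
    p(y|x_domain, t) via log [p(y|x,t) / p(y|x_domain,t)]."""
    return int(np.argmax(np.log(np.asarray(p) / np.asarray(p_domain))))

# The model favors choice 0, but so does its content-free/domain bias;
# calibration flips the selected answer to choice 1.
print(cc_answer([0.6, 0.4], p_cf=[0.8, 0.2]))           # 1
print(pmi_dc_answer([0.6, 0.4], p_domain=[0.8, 0.2]))   # 1
```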

Experimental Results: Underperformance of Existing Calibration Methods
We check the prompt selection performance of each method across the four calibration scenarios. Surprisingly, for both CC and PMI_DC, we find that all three calibration scenarios show degraded performance compared to the scenario of no calibration. Not only does the prompt selection performance degrade, but the best, worst, and average prompt performance also drops in the case of A (calibration only for answer selection). This is unexpected, as CC and PMI_DC have been reported to improve performance in slightly different setups (our results are in a zero-shot setting, while the main setup of Zhao et al. (2021) is few-shot, and the choice of x_domain differs for PMI_DC).
To further investigate the subpar performance in case A, we analyze the proportion of prompts (out of 100) that exhibit improved performance after applying calibration for answer selection across the ten models and 13 datasets. Figure 1b displays the average ratio over all models. The figure indicates that the existing calibration methods do not result in better answer selection for the majority of our evaluation datasets. For instance, more than half of the prompts displayed decreased performance after applying CC on 7 out of 13 datasets. A similar pattern holds when applying PMI_DC.

Enhanced Calibration Method: Calibration By Marginalization (CBM)

Table 5 shows that the equation for CC can be alternatively expressed as q(y|x, t) = log [p(y|x, t) / ((1/|C|) Σ_{c∈C} p(y|c, t))], which turns CC into a special case of PMI_DC, where p(y|x_domain, t) = (1/|C|) Σ_{c∈C} p(y|c, t). Additionally, upon revisiting the motivation of PMI_DC and considering the equation of pointwise mutual information, PMI(x, y) = log [p(y|x) / p(y)], it becomes evident that p(y|x_domain, t) approximates p(y|t). Therefore, the distinction between CC and PMI_DC lies solely in how they approximate p(y|t). However, since the approximation of CC relies on three inputs and that of PMI_DC on just one, both methods fall short of providing a stable approximation. This limitation naturally leads to the following question: could there be a way to approximate p(y|t) in a more stable manner?

Encouragingly, the answer to the question is yes. A better approximation of p(y|t) can be calculated using the law of marginal probability: p(y|t) = Σ_{x∈X} p(y, x|t) = Σ_{x∈X} p(y|x, t) p(x|t). With this more stable approximation of p(y|t) and the probabilistic assumption of MI that p(x|t) = 1/|X|, we introduce a new calibration method called Calibration By Marginalization (CBM) that employs the following equation for answer selection:

q(y|x, t) = log [p(y|x, t) / ((1/|X|) Σ_{x′∈X} p(y|x′, t))]. (1)

Since the calculation of p(y|x, t) for all t ∈ T and x ∈ X is already done to perform prompt selection, CBM does not introduce any additional computational cost for calibration, unlike CC or PMI_DC, which require inference on additional inputs such as "N/A", "[MASK]", "", and x_domain.

The methods displaying the most significant performance improvements in the PA scenario are MI_AG, MI_A, MI, and MDL_M, with the prompt selection performance of MI_A^(PA) and MDL_M^(PA) being the highest among the different methods. On average, MI_A^(PA) increases the scaled F1 from 87.79% (0.5965/0.6795) to 99.44% (0.6757/0.6795) compared to the best existing method (GE) when the oracle prompt without calibration is used as the target of comparison. The scaled F1 of MI_A^(PA) calculated with respect to the oracle prompt with calibration is 96.85% (0.6757/0.6977).
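Equation 1 can be sketched directly from the matrix of p(y|x, t) values already computed for prompt selection (the probabilities below are fabricated for illustration):

```python
import numpy as np

def cbm_scores(P):
    """Calibration By Marginalization: approximate p(y|t) by marginalizing
    over the evaluation set with p(x|t) = 1/|X| (Equation 1).
    P: [X, Y] matrix of p(y|x,t) for one prompt t."""
    p_y = P.mean(axis=0)             # p(y|t) = (1/|X|) sum_x p(y|x,t)
    return np.log(P) - np.log(p_y)   # q(y|x,t); argmax over Y picks the answer

# Both instances lean toward label 0 (a prompt-level bias); CBM removes the
# shared bias, so the instances' relative preferences decide the answers.
P = np.array([[0.6, 0.4],
              [0.7, 0.3]])
print(cbm_scores(P).argmax(axis=1))  # [1 0]
```

Note that no extra model inference is needed: p(y|t) is the column mean of probabilities that prompt selection already requires.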
Next, we assess the effectiveness of CBM calibration for answer selection by examining the proportion of prompts (out of 100) that show improved performance after applying calibration for answer selection. Figure 1b indicates that CBM is considerably more effective than CC and PMI_DC in enhancing the performance of the prompts. The performance of more than half of the prompts increases after applying CBM on all 13 datasets. Additionally, the performance of nearly 100% of the prompts improves with CBM calibration on 7 datasets. While CC and PMI_DC improved almost none of the prompts' F1 on story and hella, the performance of approximately 70% of the prompts increased with CBM calibration, possibly due to the more accurate calculation of p(y|t), as discussed in Section 4.3.

Discussion
In this section, we discuss various findings that are relevant to our main experiments.
Figure 7a shows that the effectiveness of a probability-based prompt selection method remains consistent across models of different types and numbers of parameters, justifying our choice of using a single model (OPT 2.7B) as the representative for all experiments. Figure 7b shows that the trend of correlation between the Prompt Selection Score and the performance of the selected prompt is also quite consistent across different models.
Figure 8 shows the mean and standard deviation of the prompt selection results over five different subsets of 50 prompts randomly sampled from the full set of 100 prompts, using the mainly discussed methods. The results show that the performance of the instance-wise prompt selection methods (MI_AGL, MI_AL, MDL) is not stable, likely due to the noisy nature of selecting one prompt per instance. However, MI_A^(PA) and MDL_M^(PA) still achieve the highest performance and also show the lowest standard deviation, proving the effectiveness of CBM.
Through additional analysis, we find that (1) while strong performance in prompt selection does not consistently correlate with the Prompt Selection Score, a broadly positive correlation is observed when averaged across most methods; (2) CBM improves the performance of MDL_M by mitigating overconfidence; (3) MI, GE, and CBM face limitations when applied to dynamic datasets with extreme label imbalance; and (4) the top-performing prompt selection methods from the zero-shot setting, like MI_A^(PA) and MDL_M^(PA), retain their effectiveness in the few-shot setting, further validating their robustness across different conditions.

Related Works
Recent advances in large language models (LLMs) have created the paradigm of prompt-based learning, which offers the benefit that a single pretrained LLM can be used to solve a great number of tasks with task-specific prompts. However, the performance of LLMs can heavily fluctuate according to the choice of prompts (Zhao et al., 2021; Holtzman et al., 2021; Lu et al., 2022). To mitigate this issue, prompt engineering attempts to find the prompt that results in the most effective performance on the downstream task (Liu et al., 2023).
Automatic prompt engineering methods can be largely divided into two groups: methods that use discrete prompts, where the prompts are human-understandable text strings, and methods that optimize continuous prompts, where the prompts lie in the embedding space of the model (Li and Liang, 2021; Shin et al., 2020). The probability-based prompt selection methods that we study in this work (Section 2.2) fall into the former group; most methods of the latter group require gradient-based training, while probability-based prompt selection does not perform any gradient-based update.
Prompt engineering methods using discrete prompts include prompt paraphrasing, prompt generation, and prompt selection. Among these, prompt paraphrasing or generation approaches can be used together with probability-based selection methods; prompt selection can be performed on the prompts produced through prompt paraphrasing or generation (Jiang et al., 2020; Mishra et al., 2022; Gao et al., 2021; Wang et al., 2023; Prasad et al., 2023; Kim et al., 2022; Deng et al., 2022). Among prompt selection methods other than the probability-based approaches, a large portion are not easily utilizable since they require training an additional model and/or the use of an additional component (Zhang et al., 2022b). On the other hand, probability-based prompt selection offers the advantage of requiring only the output probabilities of the LLM. While the prerequisite is a set of candidate prompts to select from, such a set is relatively small in size and can be easily obtained from the research community (Bach et al., 2022) or via machine generation (OpenAI, 2023). One limitation of these methods, though, is that they cannot be used with closed-source LLMs that are only available via proprietary APIs that do not expose output probability distributions. Also, when the number of candidate prompts |T| and the size of the dataset used to select the prompt |X| are large, the calculation for prompt selection becomes computationally heavy; using a smaller subset X′ ⊂ X to choose the prompt for X can be helpful in such a case.
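To make the cost argument concrete, here is a hedged sketch of probability-based selection that scores each candidate prompt on a random subset X′ of the inputs instead of the full X. The MI estimate follows the common entropy-based form, I(X; Y) ≈ H(mean prediction) − mean H(prediction); the function names and the subsampling interface are our own assumptions, not the paper's code:

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a probability vector (natural log)."""
    p = np.clip(p, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())

def mi_score(label_probs):
    """Entropy-based MI estimate for one prompt:
    H(average prediction) - average H(prediction).
    label_probs: (num_inputs, num_labels) rows of p(y|x, t)."""
    marginal = label_probs.mean(axis=0)
    cond = float(np.mean([entropy(row) for row in label_probs]))
    return entropy(marginal) - cond

def select_prompt(probs_per_prompt, subset_size=None, seed=0):
    """Return the index of the prompt with the highest MI score.
    probs_per_prompt: list of (num_inputs, num_labels) arrays, one per prompt.
    subset_size: if given, score on a random subset X' of the inputs,
    reducing the cost from |T| * |X| to |T| * |X'|."""
    rng = np.random.default_rng(seed)
    n = probs_per_prompt[0].shape[0]
    idx = np.arange(n)
    if subset_size is not None and subset_size < n:
        idx = rng.choice(n, size=subset_size, replace=False)
    scores = [mi_score(p[idx]) for p in probs_per_prompt]
    return int(np.argmax(scores)), scores

# Prompt 0 is confident and diverse across inputs; prompt 1 is uninformative.
probs = [np.array([[0.95, 0.05], [0.05, 0.95]]),
         np.array([[0.5, 0.5], [0.5, 0.5]])]
best, scores = select_prompt(probs)
```

A prompt whose predictions are confident yet diverse across inputs gets a high MI score, while a prompt that outputs the same distribution on every input scores near zero.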

Conclusion
In this paper, we address the need for a comprehensive evaluation to compare the existing probability-based prompt selection methods, which have been proposed and evaluated under varying conditions and datasets. To achieve this, we introduce a unified evaluation setup to compare these methods, conduct a thorough evaluation, and develop a unified framework of the existing probability-based prompt selection methods. Our analysis within this unified framework has provided insights into the relationship among existing methods, enabling the development of several combinatorial variants that improve performance. Furthermore, our research on probability calibration has revealed the limitations of existing calibration methods and led to the proposal of an enhanced calibration method, Calibration By Marginalization (CBM). CBM not only significantly improves prompt selection performance but also demonstrates robust answer selection enhancement across multiple datasets. We hope that our unified setup provides a foundation for fair evaluation between various prompt selection methods and that our findings yield deeper insights into probability-based prompt selection.

Figure 1: (a) F1 of the prompts selected by different probability-based prompt selection methods, averaged across 13 datasets. Per-dataset F1 and accuracy are shown in Figure 9. The methods without super/subscripts are the existing methods (Table 1), while those with super/subscripts are our proposed methods (Table 4 & Equation 1). (b) Ratio of the prompts (out of 100) whose F1 on each dataset improves by applying probability calibration for answer selection, averaged across 10 models. Our proposed calibration method, CBM (Equation 1), is considerably more effective than CC and PMI_DC (Table 5) in enhancing the answer selection performance of the prompts.

Figure 3: The highlighted parts of the equation are rough estimations of the Prompt Selection Score (PSS) of each method, i.e., the score whose maximum over the candidates determines the prompt chosen by the selection method. They show the connection between different probability-based prompt selection methods.

Figure 4: F1 of the prompts selected by MI_A and MI, averaged for each setup of a different number of verbalizer tokens and evaluation dataset category. |v| denotes the number of tokens of the verbalizers.

Figure 7: Scaled F1 of the selected prompts and the correlation between F1 and Prompt Selection Score for different probability-based prompt selection methods and models, averaged across 13 datasets.
Figure 8: Mean and standard deviation of prompt selection performance among five sets of 50 prompts, randomly sampled from the full set of 100 prompts.

Figure 9 :
Figure 9: F1 (top) and accuracy (bottom) of the prompts selected by the different probability-based prompt selection methods, shown for each dataset.
Zhang et al. (2022b) use reinforcement learning for demonstration selection of in-context learning; Chang and Jia (2023) train a scorer and estimator for demonstration selection; Kumar and Talukdar (2021) and Xu et al. (2022) use a genetic algorithm; Liu et al. (2022), Lyu et al. (2023), and Rubin et al. (2022) use retrieval from a corpus to select the prompts.


Table 1: Summary of the existing probability-based prompt selection methods. Notations used in the equations are explained at the end of Section 2.1.

Table 3: Examples of the created prompts. The prompts are written in Jinja for use with the PromptSource (Bach et al., 2022) APIs. We only display the results of OPT-2.7B throughout the paper since the overall trend remains mostly identical (shown in Section 5).
We use a modified version of the codebase of Sanh et al. (2022) and PromptSource (Bach et al., 2022) to run model inference and add custom prompts, respectively.
in an alternative form, it is clear that PSS_PPL differs from PSS_MI because it considers the probability of x and t that PSS_MI neglects. Applying the probabilistic assumption of MI, p(x|t) = p(x) = 1/|X|, to PSS_PPL converts the equation to Σ_x p(t)/|X|, causing PPL to select the prompt with the lowest perplexity irrespective of the input. Since Gonen et al. (…) … Σ_x p(x, t).
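Writing out the substitution described above as a sketch consistent with the surrounding text, with PSS_PPL taken in its joint-probability form:

```latex
\mathrm{PSS}_{\mathrm{PPL}}(t)
  = \sum_{x \in X} p(x, t)
  = \sum_{x \in X} p(x \mid t)\, p(t)
  \;\overset{p(x \mid t) = \frac{1}{|X|}}{\longrightarrow}\;
  \sum_{x \in X} \frac{p(t)}{|X|}
  = p(t),
```

so under MI's uniform-input assumption the score collapses to p(t): the prompt with the highest marginal probability, i.e., the lowest perplexity, is chosen regardless of the inputs.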
Bottom: new variations created by transferring design choices from existing probability-based prompt selection methods. A represents using All tokens for p(y|x, t), G represents a one-hot p(y|x, t) like GE, and L represents instance-wise selection (selecting a prompt for each x) like MDL.