True Few-Shot Learning with Prompts—A Real-World Perspective

Abstract Prompt-based approaches excel at few-shot learning. However, Perez et al. (2021) recently cast doubt on their performance as they had difficulty getting good results in a "true" few-shot setting in which prompts and hyperparameters cannot be tuned on a dev set. In view of this, we conduct an extensive study of PET, a method that combines textual instructions with example-based finetuning. We show that, if correctly configured, PET performs strongly in true few-shot settings without a dev set. Crucial for this strong performance is a number of design choices, including PET's ability to intelligently handle multiple prompts. We put our findings to a real-world test by running PET on RAFT, a benchmark of tasks taken from realistic NLP applications for which no labeled dev or test sets are available. PET achieves a new state of the art on RAFT and performs close to non-expert humans for 7 out of 11 tasks. These results demonstrate that prompt-based learners can successfully be applied in true few-shot settings and underpin our belief that learning from instructions will play an important role on the path towards human-like few-shot learning capabilities.


Introduction
With pretrained language models (LMs) getting ever larger (Radford et al., 2019; Raffel et al., 2020; Brown et al., 2020; Fedus et al., 2021), instruction-based learning has emerged as a powerful method for few-shot text classification (e.g., Jiang et al., 2020; Schick and Schütze, 2021a,c; Brown et al., 2020; Wei et al., 2021; Sanh et al., 2021). The key idea is to give an LM access to descriptive names for all possible outputs and to short prompts explaining the task to be solved. In settings where at most a few dozen examples are available, this simple idea leads to substantial improvements over various baselines (Schick and Schütze, 2021a,c; Gao et al., 2021; Tam et al., 2021). However, recent work has questioned the strong few-shot performance of instruction-based approaches, arguing in particular that the considered settings are often not true few-shot settings (Perez et al., 2021; Logan IV et al., 2021), mainly for two reasons: For one, some approaches (e.g., Xie et al., 2019; Zhang et al., 2020; Chen et al., 2020; Tam et al., 2021) make use of large development sets to optimize hyperparameters. Beyond that, it is argued that manually designed instructions require manual tuning on development sets to achieve strong performance (Perez et al., 2021; Logan IV et al., 2021). Indeed, performance can vary greatly, and in mostly unpredictable ways, across different instructions (Jiang et al., 2020; Schick and Schütze, 2021a); this issue persists even after finetuning a model on hundreds of instructions (Sanh et al., 2021). Even separate from this problem, the need for human involvement is generally seen as a major drawback of manually designed instructions (Shin et al., 2020; Lester et al., 2021). Thus, several recent works abandon them in favor of automatically generated prompts (Shin et al., 2020; Gao et al., 2021; Hambardzumyan et al., 2021; Li and Liang, 2021; Lester et al., 2021).

Figure 1: PET achieves near-human performance for 7 out of 11 tasks of the RAFT benchmark (Alex et al., 2021), for which labeled development and test sets are not available. This demonstrates that prompt-based learners like PET, if correctly configured, excel at true few-shot learning, i.e., without any tuning of instructions or hyperparameters on a development set.
Contrary to this trend, we argue that when correctly configured, prompt-based approaches achieve strong performance even in true few-shot settings and that there is no problem in using manually designed instructions per se. On the contrary, such instructions are often relatively easy to specify if one is familiar with the task to be solved, they provide an intuitive interface to convey task-specific knowledge, and if properly used, they consistently improve model performance in few-shot settings. To provide empirical support for these claims, we revisit PET (Schick and Schütze, 2021a), a method for combining instructions with example-based finetuning whose key feature is that it allows users to specify multiple instructions for a single task, and thoroughly examine its performance with human-made instructions in true few-shot settings. In order to simulate a real-world scenario as closely as possible, we proceed in two steps: First, we conduct an extensive study of PET using three English academic datasets to analyze its ability to perform true few-shot learning in a controlled environment and to derive best practices regarding the choice of instructions and other hyperparameters. We then put our findings to the test and evaluate PET on a large variety of different real-world tasks from the RAFT benchmark (Alex et al., 2021), for which no labeled development or test sets are available, enforcing a true few-shot setting (Perez et al., 2021). On average, PET clearly outperforms all baselines on this dataset and comes surprisingly close to the performance of non-expert humans (see Figure 1), demonstrating that instruction-based learning can successfully be applied to real-world tasks in true few-shot settings.
In summary, the main contributions of this work are as follows:
• We investigate the performance of PET for various models, tasks and training set sizes, its ability to cope with different instructions, and its robustness to hyperparameter choices in true few-shot settings.
• We show how PET can be used when no unlabeled data is available and propose a variant for efficient classification in scenarios with many different classes.
• We apply PET to RAFT (Alex et al., 2021), a benchmark of real-world tasks, where it obtains a new state of the art and achieves near-human performance for 7 out of 11 tasks in true few-shot settings.

Related Work
As a precursor to instruction-based learning, some works have investigated ways of informing classifiers about the meaning of different output classes both for text (Chang et al., 2008; Veeranna et al., 2016; Zhou et al., 2018) and image classification (Norouzi et al., 2014; Romera-Paredes and Torr, 2015); actually providing instructions in the form of short prompts was first proposed by Radford et al. (2019). This idea has since been applied to solve a wide range of different NLP tasks without any task-specific training data (Puri and Catanzaro, 2019; Opitz, 2019; Davison et al., 2019; Schick et al., 2021; Schick and Schütze, 2021; Wei et al., 2021; Sanh et al., 2021). While most approaches rephrase tasks as a language modeling problem, some use prompts to reformulate them as different tasks for which large amounts of training data are available (Levy et al., 2017; McCann et al., 2018; Yin et al., 2019; Sun et al., 2021; Sainz et al., 2021). Instruction-based learning has also been used in few-shot settings; popular variants include in-context learning, where the model's parameters are fixed and examples are provided as additional context (Brown et al., 2020; Lu et al., 2021; Kumar and Talukdar, 2021; Min et al., 2021), finetuning the entire model (Schick and Schütze, 2021a,c; Gao et al., 2021; Tam et al., 2021), and prompt tuning, where only the instruction itself is optimized (Shin et al., 2020; Hambardzumyan et al., 2021; Li and Liang, 2021; Lester et al., 2021). Several works investigating the limitations and drawbacks of instruction-based few-shot approaches find that current LMs are mostly unable to understand complex instructions that go beyond short prompts or simple questions (Efrat and Levy, 2020; Weller et al., 2020; Webson and Pavlick, 2021) and that they are highly sensitive to the exact wording of the instructions provided (Jiang et al., 2020; Schick and Schütze, 2021a; Elazar et al., 2021). In a similar vein, Perez et al. (2021) and Logan IV et al. (2021) argue that prior work overestimates few-shot performance as manual prompt tuning is required to achieve good performance. Accordingly, some works attempt to obtain either prompts (Shin et al., 2020; Gao et al., 2021; Li and Liang, 2021; Lester et al., 2021) or names for output classes (Schick et al., 2020; Gao et al., 2021) without any human involvement. Finally, many benchmarks have been proposed for comparing few-shot approaches in a standardized way (e.g., Mishra et al., 2021; Bragg et al., 2021; Xu et al., 2021; Ye et al., 2021; Alex et al., 2021). As our focus is on the real-world applicability of few-shot methods, we evaluate PET on the RAFT benchmark (Alex et al., 2021), which measures performance in applied settings.

Pattern-Exploiting Training
We briefly review pattern-exploiting training (PET) (Schick and Schütze, 2021a,c), the method we use for instruction-based text classification. At its core, PET combines textual instructions with regular finetuning using labeled examples. To that end, users must specify one or more patterns which convert an input example x into a cloze question so that it can readily be processed by a masked language model (MLM) (Devlin et al., 2019).1 These patterns can take on very different forms; some examples are shown in Figure 2. In addition, users must inform the model about the meaning of all output classes; this is done with a verbalizer that assigns a natural language expression to each output y (see Figure 2, right). We refer to the combination of a pattern and verbalizer as a pattern-verbalizer pair (PVP).
Given a single PVP, let p(y | x) be the probability that an MLM assigns to y's verbalization in the cloze question obtained by applying the pattern to x, normalized over all y. The MLM is finetuned on labeled examples (x, y) by minimizing the cross-entropy loss between p(y | x) and y.
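Spelled out in formulas (a restatement of the definition above in our notation; P denotes the pattern, v(y) the verbalization of label y, and s(t | z) the MLM's raw score for token t at the mask position of input z):

$$p(y \mid x) = \frac{\exp s\bigl(v(y) \mid P(x)\bigr)}{\sum_{y'} \exp s\bigl(v(y') \mid P(x)\bigr)}, \qquad \mathcal{L}(x, y) = -\log p(y \mid x).$$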
If a user specifies multiple PVPs, individual models are trained for each pair. Similar to knowledge distillation (Hinton et al., 2015), they are then used to annotate unlabeled examples for training a final classifier with a regular sequence classification head (Devlin et al., 2019). We use the weighted variant of PET without auxiliary language modeling; see Schick and Schütze (2021a) for details.

1 We use the term prompt to refer to a short sequence of tokens that typically contains some form of instruction; the term pattern is used to denote the function that adds a prompt to an input.
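For concreteness, the distillation step can be sketched as follows. This is a minimal sketch, not PET's actual implementation; pvp_probs is a hypothetical helper returning p(y | x) for one finetuned PVP model, and the weighting scheme is simplified:

```python
import torch
import torch.nn.functional as F

def ensemble_soft_labels(pvp_models, weights, unlabeled, pvp_probs):
    """Soft-label unlabeled inputs with a weighted ensemble of PVP models.

    pvp_probs(model, x) is a placeholder returning a tensor of shape
    (num_labels,) with p(y | x) for one finetuned PVP model.
    """
    w = torch.tensor(weights, dtype=torch.float)                     # (k,)
    targets = []
    for x in unlabeled:
        probs = torch.stack([pvp_probs(m, x) for m in pvp_models])   # (k, L)
        targets.append((w.unsqueeze(1) * probs).sum(0) / w.sum())    # (L,)
    return torch.stack(targets)

def distillation_loss(student_logits, soft_targets):
    # Cross-entropy against the ensemble's soft labels, as in
    # knowledge distillation (Hinton et al., 2015).
    return -(soft_targets * F.log_softmax(student_logits, dim=-1)).sum(-1).mean()
```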
Multi-Token Verbalizers When using encoder-only MLMs (Devlin et al., 2019; Liu et al., 2019; Lan et al., 2020), one limitation of PET is that the verbalization of each output class must correspond to a single token. Schick and Schütze (2021c) propose a fix for this that uses multiple mask tokens, but their approach is very inefficient. As discussed by Schick and Schütze (2021b), an alternative solution is to instead use an encoder-decoder LM (Lewis et al., 2020; Raffel et al., 2020). In Section 5, we propose yet another solution that allows us to stick with encoder-only MLMs while being much more efficient than the approach of Schick and Schütze (2021c).
Iterative PET We also experiment with iPET (Schick and Schütze, 2021a), an iterative variant of PET that employs self-training (e.g., Scudder, 1965; Yarowsky, 1995; Brin, 1999; McClosky et al., 2006) to train several generations of models on datasets of increasing size. To this end, an ensemble of MLMs is trained as in regular PET and then used to assign labels to unlabeled examples; a new ensemble is trained on the so-obtained data. This process is repeated for multiple iterations, where the number of annotated examples is increased in each iteration. We refer to Schick and Schütze (2021a) for more details.
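Schematically, the iPET loop looks as follows (a sketch under simplifying assumptions: train_model and annotate stand in for PET's finetuning and ensemble-labeling routines, and the growth factor per generation is a placeholder value):

```python
import random

def ipet(labeled, unlabeled, pvps, train_model, annotate, generations=3, growth=5):
    """Schematic iPET loop; train_model(pvp, data) and annotate(models, xs)
    are placeholders for PET's training and labeling routines."""
    train_sets = [list(labeled) for _ in pvps]
    models = []
    for g in range(generations):
        models = [train_model(p, d) for p, d in zip(pvps, train_sets)]
        # Each model's next-generation training set is pseudo-labeled by the
        # other models; the amount of annotated data grows each iteration.
        target = len(labeled) * growth ** (g + 1)
        new_sets = []
        for i in range(len(pvps)):
            others = models[:i] + models[i + 1:]
            sample = random.sample(unlabeled, min(len(unlabeled), target))
            new_sets.append(list(labeled) + annotate(others, sample))
        train_sets = new_sets
    return models
```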

True Few-Shot Learning with PET
We conduct a variety of experiments to answer six important research questions (Q1-Q6) regarding the extent to which true few-shot learning is possible with PET. Beyond that, the purpose of our experiments is to establish a set of best practices for our real-world experiments on RAFT (Alex et al., 2021). Before discussing individual experiments, we describe the underlying setup.
Tasks and Datasets While they are heavily used in prior work (e.g., Brown et al., 2020; Schick and Schütze, 2021c; Logan IV et al., 2021; Webson and Pavlick, 2021), we decide against tasks and datasets from the GLUE (Wang et al., 2018) and SuperGLUE benchmarks (Wang et al., 2019) as they are very different from what we expect to see in real-world applications. Instead, we experiment with AG's News, Yelp Reviews Full Star and Yahoo Questions (Zhang et al., 2015) as these datasets represent classification tasks in three different domains that resemble real-world settings. Further, it is relatively easy to come up with a variety of instructions for each of these tasks, making it more straightforward to experiment with a large number of different patterns.
We consider settings with n = 10 and n = 100 training examples. For each n, we generate five different training sets per task by randomly sampling examples from the original training sets while ensuring that the number of examples is about the same for each possible output class. In addition, for both n = 10 and n = 100, we sample 1,000 unlabeled examples from the original training sets. We repeat all of our experiments for all five training sets and, by default, report average performance.
PVPs We manually write a total of 23 patterns per task, all of which can be categorized into one of the following groups:2
• NULL: Following Logan IV et al. (2021), these patterns simply insert a mask token.
• PUNC: Similar to some patterns of Schick and Schütze (2021a), these patterns only add punctuation characters and a mask token.
• PROMPT: Patterns in this group add short prompts, typically consisting of no more than three words, to the input, similar to Radford et al. (2019) and Schick and Schütze (2021a).
• Q&A: These patterns rephrase the task as a question q and append "Question: q Answer: [MASK]." to the input, similar to Brown et al. (2020) and Schick et al. (2021).

2 The full set of PVPs can be found in Appendix A.
For all patterns, we use only a single verbalizer, which we adopt from Schick and Schütze (2021a). While varying the verbalizer may also lead to interesting insights, finding a large number of reasonable verbalizers is challenging for some tasks, as there is often a single natural choice (e.g., the actual category names for AG's News and Yahoo Questions).
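To make the four groups concrete, here is what such patterns might look like for AG's News. The wordings below are illustrative stand-ins, not the exact patterns from Appendix A:

```python
# Illustrative patterns for AG's News; [MASK] marks the token the MLM fills in.
def null_pattern(x):   return f"{x} [MASK]"
def punc_pattern(x):   return f"{x} ( [MASK] )"
def prompt_pattern(x): return f"News category: [MASK]. {x}"
def qa_pattern(x):     return f"{x} Question: What is the topic of this article? Answer: [MASK]."

# Single-token verbalizer mapping each class to a natural language name.
verbalizer = {0: "World", 1: "Sports", 2: "Business", 3: "Tech"}
```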
Hyperparameters We consider a setting similar to that of Schick and Schütze (2021a,c) and, unless otherwise specified, use the default settings of the PET library.3 As our experiments require training hundreds of models, we make a few changes to reduce environmental impact (Strubell et al., 2019) and computational cost: We use the base variant of RoBERTa (Liu et al., 2019) as underlying LM, we train only one model per PVP, and we reduce the training steps for all individual models and the final classifier to 100 and 1,000, respectively.
Monitoring Finetuning pretrained LMs can be unstable on small datasets (Devlin et al., 2019; Dodge et al., 2020), sometimes leading to very poor performance. Luckily, we can detect such finetuning issues to some extent even without a labeled test set using the following two checks:
• TRAIN SET UNDERFITTING: We check for training runs that result in less than 50% accuracy on the training set. As finetuning on up to 100 examples typically leads to perfect predictions on the training set, this is a clear indicator of a failed finetuning run.
• CONSTANT PREDICTIONS: Another strong indicator of unsuccessful training is when the finetuned model predicts the same output class for all inputs; we check this both on the training set and on the unlabeled set.
Whenever one of these events occurs, we restart training using a different seed for initializing all random number generators.
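Both checks need only model predictions plus the training labels, no held-out data. A minimal sketch, with predict standing in for the finetuned model's prediction function:

```python
def underfits_train_set(predict, train_examples, threshold=0.5):
    """TRAIN SET UNDERFITTING: accuracy on the training set below 50%."""
    correct = sum(predict(x) == y for x, y in train_examples)
    return correct / len(train_examples) < threshold

def constant_predictions(predict, inputs):
    """CONSTANT PREDICTIONS: the same class is predicted for every input."""
    return len({predict(x) for x in inputs}) == 1
```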
Q1: How can we find the best pattern, or do we even need to?
Even slightly different patterns can lead to very different performance (Jiang et al., 2020; Schick and Schütze, 2021a; Schick et al., 2021; Webson and Pavlick, 2021; Sanh et al., 2021, i.a.), and popular model selection criteria cannot reliably identify patterns that achieve similar performance to the best one (Perez et al., 2021). We thus investigate to what extent PET can eliminate the need to find the best instruction, even in extreme settings where there are dozens of candidates to choose from.
Setup Using our default setup, we train individual models for each PVP and a final PET model; we also train an iPET model for 3 iterations.
Results Performance of individual models for each pattern and of the distilled models obtained using PET and iPET is shown in Figure 3. Interestingly, sorting all pattern groups by their average performance gives the exact same order for each task and training set size: NULL patterns clearly perform worst, followed by PUNC and PROMPT; Q&A gives the best average results. Contrary to the findings of Logan IV et al. (2021), this shows that LMs can benefit greatly from manually written instructions even when combined with finetuning. Crucially, PET's performance is much higher than the average performance of individual patterns; further, it consistently outperforms even the best pattern, verifying that PET indeed removes the need to find the best pattern. While iPET gives clear improvements for n = 10, it performs worse than PET for n = 100. The reason for this may be that we use a much smaller set of unlabeled examples than prior work (Schick and Schütze, 2021a,c).

Q2: Does performance of different patterns transfer across models?
While our results for Q1 show a consistent order of pattern groups for different training set sizes and tasks, an important question for real-world applications is whether the same finding also holds for different model sizes, and even entirely different models.
Setup We consider BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019) and ALBERT (Lan et al., 2020) as underlying language models;4 we experiment with the base and large variants. For each model and size, we repeat the exact same experiment as for Q1.

Results Figure 4 shows the performance of each pattern group (i.e., the average performance of all individual patterns in this group) and PET; scores are normalized so that the best-performing approach for each task, training set size and model gets a score of 1.0. With very few exceptions, our findings from Q1 regarding the relative performance of pattern groups and PET (NULL < PUNC < PROMPT < Q&A < PET) also hold for different models and sizes. In general, the performance of individual patterns correlates strongly between different models and sizes (Spearman's ρ ≥ 0.7 except in one case).
Q3: Does PET still work if some PVPs are not well understood?
While Q1 and Q2 have shown that PET performs even better than the best PVP if a large set of high-quality PVPs is given, a potential concern is that performance may be much worse if the LM fails to understand a fair number of patterns and verbalizers (e.g., because they are in a very different style from the model's pretraining data). For real-world scenarios, it is thus relevant to know how such "bad" PVPs affect the performance of PET.
Setup It is difficult to obtain large quantities of bad instructions as they might occur in real-world scenarios. As a proxy, we resort to noise patterns that add random tokens to the input, serving as a lower bound for truly bad patterns. In concrete terms, we add up to three randomly sampled tokens before and after the input.5 We also create noise verbalizers by assigning uniformly selected tokens to each output class. Using this process, we obtain 20 different intentionally bad PVPs per task. For each task, we start with 3 randomly selected, high-quality patterns from our original set of manually designed instructions, add noise PVPs one by one, and investigate the effect on performance.
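Such noise PVPs can be constructed roughly as follows (a sketch; where exactly the mask token is placed relative to the noise tokens is an assumption, and vocab is a list of token strings):

```python
import random

def make_noise_pattern(vocab, max_noise=3):
    """Return a pattern adding 0-3 random tokens before and after the input."""
    before = random.choices(vocab, k=random.randint(0, max_noise))
    after = random.choices(vocab, k=random.randint(0, max_noise))
    def pattern(x):
        return " ".join(before + [x, "[MASK]"] + after)
    return pattern

def make_noise_verbalizer(vocab, labels):
    # Assign a uniformly selected token to each output class.
    return {y: random.choice(vocab) for y in labels}
```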

Results
The effect of adding noise PVPs is shown in Figure 5. Interestingly, for both n = 10 and n = 100, performance remains almost constant even if more than half of the used PVPs are noise PVPs, demonstrating that PET is very robust to "bad" instructions. Figure 5 also shows performance when using only noise PVPs; except for AG's News with n = 100, this leads to substantially worse performance than PET with a small number of manually designed instructions.

5 If there are multiple input texts, we shuffle their order and additionally add 0-3 tokens in between them.
Q4: How many patterns are required for good performance?
Orthogonal to Q3, it is also important to know the minimum number of high-quality prompts required to achieve satisfactory performance. This is of great practical importance because coming up with dozens of different PVPs for a task may take a significant amount of time that could otherwise be spent annotating further examples.
Setup We generate 10 random permutations of all 23 patterns per task. For each permutation and training set, we use the same setup as in Q1 to compute the average performance obtained with PET when using only the first i patterns, where i ranges from 1 to 5.
Results The average performance of PET trained with the first i patterns is shown in Figure 6, relative to the performance of PET trained with all 23 patterns. For all tasks and training set sizes, as few as four different patterns are already sufficient to achieve performance very close to that of PET trained with all 23 patterns. Surprisingly, even with i = 1, PET's performance is much higher than the average performance of a model trained on individual patterns. This indicates that the process of knowledge distillation using unlabeled data is also beneficial when using only a single instruction.
Q5: How important are the values of other hyperparameters?
For true few-shot settings, the same set of hyperparameter values should ideally achieve good performance across different tasks; this enables us to adopt these values for new tasks without requiring manual tuning on task-specific validation sets. We therefore investigate how different choices for common hyperparameters affect the performance of PET; in concrete terms, we consider learning rate, training steps and batch size.

Results
All results are shown in Figure 7. For training steps and batch size, performance is relatively stable across a wide range of different values, with more steps and larger batch sizes typically leading to slightly better performance (especially for n = 100). Learning rate clearly has the strongest impact on performance, but values of 10^-5 and 5 · 10^-5 consistently give the best results across all tasks considered; these are also the values typically used for finetuning in prior work (Devlin et al., 2019; Liu et al., 2019).
Q6: Do we really need large amounts of unlabeled data?
One remaining requirement of PET is access to unlabeled data, which may not be available in real-world settings; we therefore investigate whether synthetic examples can serve as a replacement.

Setup We use GPT2-XL (Radford et al., 2019) to generate synthetic unlabeled data. To generate a new example, we provide one or two randomly chosen training examples without labels as in-context examples (Brown et al., 2020) and let the model generate one additional example. For two inputs x_1 and x_2, the input given to the model is "x_1 \n\n x_2 \n\n", where \n\n denotes two consecutive line breaks. If an input consists of two texts, we simply concatenate them using the sequence +++ as a separator.
We generate 10,000 examples for n = 10 and 30,000 examples for n = 100 using top-p sampling (Holtzman et al., 2020) with p = 0.9.For each input, we stop the generation process as soon as the model generates two consecutive line breaks.
We discard all examples for which the model does not generate two consecutive line breaks within 128 tokens; for datasets with text pairs, we also discard examples where the model fails to generate the sequence separator (+++).
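The generation step can be sketched with the Hugging Face transformers library as follows; prompt construction follows the format described above, while the function name and remaining details are our own assumptions:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2-xl")
model = AutoModelForCausalLM.from_pretrained("gpt2-xl")

def generate_unlabeled(context_examples, max_new_tokens=128, top_p=0.9):
    """Generate one synthetic example from 1-2 unlabeled in-context
    examples, stopping at two consecutive line breaks."""
    prompt = "\n\n".join(context_examples) + "\n\n"
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(
        **inputs,
        do_sample=True,
        top_p=top_p,
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,
    )
    text = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:])
    # Discard generations without two consecutive line breaks within the
    # token budget (and, for text pairs, without the +++ separator).
    return text.split("\n\n")[0].strip() if "\n\n" in text else None
```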
As the datasets obtained with this method may be highly imbalanced regarding the distribution of (unknown) labels, we also experiment with a balanced variant: We use the ensemble of models trained on individual PVPs to assign labels to each example and keep only enough examples per label that the resulting dataset, which is used for training the final classifier, is balanced.
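A sketch of this balancing step; that the most confident examples per label are kept is an assumption, as the text above only specifies that the result must be balanced:

```python
from collections import defaultdict

def balance_by_predicted_label(examples, soft_labels):
    """Keep equally many examples per predicted label; soft_labels[i] is
    the ensemble's label distribution for examples[i]."""
    by_label = defaultdict(list)
    for x, probs in zip(examples, soft_labels):
        label = max(range(len(probs)), key=probs.__getitem__)
        by_label[label].append((x, probs))
    smallest = min(len(items) for items in by_label.values())
    balanced = []
    for label, items in by_label.items():
        # Assumption: prefer the examples the ensemble is most confident about.
        items.sort(key=lambda it: it[1][label], reverse=True)
        balanced.extend(x for x, _ in items[:smallest])
    return balanced
```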
Results Figure 8 shows the performance of individual patterns as well as PET and iPET, both with real and synthetic unlabeled data. Except for iPET on Yahoo Questions with n = 10, using synthetic data consistently achieves accuracies within one point of using real data, with our balanced version performing slightly better. Moreover, for n = 10, using synthetic data even improves accuracy in some cases. This shows that in the absence of unlabeled examples, synthetic data obtained from generative language models can serve as a drop-in replacement without substantially degrading performance.

PET for Real-World Tasks
We use our insights from Section 4 to apply PET to the RAFT benchmark, a collection of 11 diverse real-world tasks whose automated solution has inherent value to someone (Alex et al., 2021). These tasks pose various challenges for few-shot approaches as they require some amount of domain expertise and the ability to understand detailed instructions, to process long inputs and to handle a large number of output classes. Each task comes with 50 labeled training examples; in accordance with the RAFT rules, we additionally make use of the unlabeled data (ranging from 150 to 5,000 examples) for PET's distillation step.
PVPs Following our findings from Q1 and Q2, we use Q&A-style patterns for all tasks. To obtain the question q, we make minimal changes to the original instructions of Alex et al. (2021); we rephrase all binary classification tasks as yes/no questions to facilitate finding a suitable verbalizer.
For example, we rephrase the instruction "Label the sentence based on whether it is related to an adverse drug effect (ADE)." as the question "Is this sentence related to an adverse drug effect (ADE)?". Following our results from Q4, we specify 4 PVPs per task. In case of binary classification, we use two different patterns that either include or omit the full task specification of Alex et al. (2021)6 and combine them with both a yes/no verbalizer and a true/false verbalizer. The full set of PVPs for all tasks can be found in Appendix B.
Hyperparameters We mostly keep the default values for hyperparameters used throughout Section 4, as our experiments for Q5 show that these perform well for all tasks considered. However, we make the following changes:
• We replace RoBERTa (base) with ALBERT (xxlarge, v2). While being much slower to train, ALBERT was shown to outperform RoBERTa both in regular and few-shot settings (Lan et al., 2020; Schick and Schütze, 2021c; Logan IV et al., 2021).
• For tasks with more than 4,000 unlabeled examples, we finetune the distilled model for 2,000 steps instead of 1,000 steps; otherwise, some examples would not be seen at all during training.7
• Following Schick and Schütze (2021a,c), we train three individual models per PVP. This improves robustness, as performance can vary greatly between individual finetuning runs.
Handling Many Labels The B77 dataset contains online banking customer service queries annotated with one of 77 possible intents. This large number of different outputs leads to several issues for PET: First, it is impossible to specify a meaningful verbalizer that maps each intent to a single token. We initially experimented with the multi-mask version of PET (Schick and Schütze, 2021c), but found it to be too inefficient to get results in a reasonable amount of time. Therefore, we tried the following solution: We rephrase the task as binary classification, where for each pair of query x and intent y, the task is to decide whether y is the correct intent for x. For each original training example (x, y), we generate one example (x, y, True) and four examples (x, y', False) with randomly sampled, wrong intents y'. As this increases the amount of data fivefold, we finetune each individual model for 500 steps instead of 100 steps. While this approach solves our problem, it is still not particularly efficient: Reframing the task as binary classification means that for each input, 77 forward passes are required to find the correct intent. We thus train the final model as a regular classifier with 77 different output classes; for training this classifier on an input x, we set the target probability of each output y proportional to the probability of True being the correct output for (x, y) according to our ensemble of binary classifiers.
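A minimal sketch of this reformulation; p_true, standing for the binary ensemble's probability that an intent is correct, is a placeholder name:

```python
import random

def binarize(train_examples, intents, num_negatives=4):
    """One (x, y, True) example plus four (x, y', False) examples with
    randomly sampled wrong intents per original training example."""
    binary = []
    for x, y in train_examples:
        binary.append((x, y, True))
        wrong = random.sample([i for i in intents if i != y], num_negatives)
        binary.extend((x, w, False) for w in wrong)
    return binary

def soft_targets(x, intents, p_true):
    """Target distribution for the final 77-way classifier, proportional to
    the binary ensemble's score p_true(x, y) for each intent y."""
    scores = [p_true(x, y) for y in intents]
    total = sum(scores)
    return [s / total for s in scores]
```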
Finally, another issue is that with 50 labeled examples, at least 27 labels are not covered in the training set at all; this may bias a model to never predict any of these labels. To alleviate this issue, we train two generations of models using iPET. For training the second generation, we obtain training data covering all possible labels as follows: For each label, we pick the two examples from our set of unlabeled data for which this label is most likely according to the first generation; the same approach was used by Schick and Schütze (2021a) for applying iPET in zero-shot settings.
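The label-coverage step, as a sketch (again with p_true as a placeholder for the first generation's scoring function):

```python
def cover_all_labels(unlabeled, intents, p_true, per_label=2):
    """For each label, pick the two unlabeled examples to which the first
    generation assigns the highest probability for that label."""
    covered = []
    for y in intents:
        ranked = sorted(unlabeled, key=lambda x: p_true(x, y), reverse=True)
        covered.extend((x, y) for x in ranked[:per_label])
    return covered
```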
Of course, the nature of RAFT makes it impossible to measure the impact of any of these choices. While we could conduct experiments similar to those in Section 4, none of the datasets considered therein has a structure similar to B77; as our modifications affect only one out of 11 tasks, we decided not to perform any further analysis.
Monitoring We checked for TRAIN SET UNDERFITTING and CONSTANT PREDICTIONS as in Section 4 to detect finetuning issues. Unlike for our experiments in Section 4, on RAFT we encountered some issues that could not be resolved simply by retraining with a different seed:
• We observed TRAIN SET UNDERFITTING for the final classifier on B77. This may be due to the classification head for 77 classes introducing many new parameters; we tried training the final model for 5,000 steps instead of 2,000 steps, which fixed this issue.
• We observed CONSTANT PREDICTIONS for the ToS training set. Doubling the number of training steps resolved this problem.
• Finally, we also observed CONSTANT PREDICTIONS on the unlabeled data of SRI. Upon manually inspecting the training set, we observed that all but one out of 50 examples have the same label. As all models already classified the training set perfectly, we left the setup for our SRI submission unchanged.
Results For all 11 tasks, results of PET and various baselines are shown in Table 1.8 As can be seen, PET performs better than all other approaches on average, achieving near-human performance for 7 out of 11 tasks. Note, however, that non-expert humans perform worse than a majority baseline on SRI, so results on this task should be taken with a grain of salt. PET also clearly outperforms a GPT-3 model (Brown et al., 2020) by almost 7 points, despite the latter being larger by several orders of magnitude.9 While PET is particularly successful on ADE, B77 and OSE (where it outperforms GPT-3 by 13.6, 21.5 and 29.4 points, respectively), it performs comparatively poorly on datasets in the law (Over, ToS) and social media (TEH, TC) domains.
Our approach for handling many labels performs surprisingly well on B77 without any tuning of its parameters. Due to the nature of the RAFT benchmark, we cannot perform further analysis or ablation studies.

Discussion
Our experimental results in Sections 4 and 5 show that strong performance in few-shot settings is clearly possible without manual prompt tuning or hyperparameter optimization on large development sets; in other words, PET can successfully be applied in true few-shot settings. While we believe that it should be an important goal of future work to make LMs more robust to different instructions, even with current models it is relatively easy to successfully apply PET when following a few simple principles, such as rephrasing the task in a Q&A format, using simple vocabulary and single-token verbalizers where possible, and specifying at least a handful of different patterns. In light of these findings, we also hope that future work will not view human involvement in prompt design as a drawback of instruction-based approaches, but rather as an exciting possibility to communicate with models in ways other than exclusively through examples.
There are various limitations to our study. First, a major obstacle to actually applying PET in real-world applications is that we do not know a priori how well it performs for a given task; we therefore believe an important next step is to investigate methods for estimating performance without access to large test sets, for example through model calibration (Desai and Durrett, 2020; Jiang et al., 2021), in real-world settings. In addition, we did not fully explore the capabilities of PET; for example, we did not investigate domain-adaptive pretraining (Gururangan et al., 2020) and auxiliary language modeling (Chronopoulou et al., 2019), both of which were shown to be helpful by Schick and Schütze (2021a). We also did not quantify the impact of our decisions regarding B77 and the effectiveness of our monitoring, and we only considered English models and datasets. Finally, we did not examine PET's performance beyond aggregate scores. While this is not feasible on RAFT due to the nature of this dataset, performing such analysis either with other datasets or with methods such as the ones proposed by Ribeiro et al. (2020) would be relevant future work to understand the real-world capabilities of instruction-based approaches more comprehensively.

Conclusion
In light of recent work casting doubt on the performance of prompt-based approaches in true few-shot settings (Perez et al., 2021), we have conducted an extensive study of PET. In a controlled environment, we found that manually designed instructions consistently outperform null prompts, with Q&A-style prompts performing best (Q1, Q2). Across different tasks, models and training set sizes, PET consistently outperforms even the best individual prompt (Q1, Q2). We have also shown that PET is robust to uninformative prompts and to different choices of hyperparameters (Q3, Q5), that as few as four prompts are sufficient to reach good performance (Q4), and that synthetic examples can be used to replace large amounts of unlabeled data (Q6). On the basis of these insights, we applied PET to a benchmark of real-world tasks, where it achieves near-human performance for 7 out of 11 tasks without any tuning on a development set, demonstrating the power of instruction-based approaches in true few-shot settings.

B PVPs for RAFT
Below, we list the PVPs used for all tasks in the RAFT benchmark. For each task t, we make use of the original task description D_t that we copy verbatim from Alex et al. (2021).10 We use two vertical bars (||) to mark boundaries between text segments. If multiple patterns and verbalizers are specified, each verbalizer is used in combination with each pattern.

B.1 ADE
Task Description Label the sentence based on whether it is related to an adverse drug effect (ADE). Details are described below: Drugs: Names of drugs and chemicals that include brand names, trivial names, abbreviations and systematic names were annotated. Mentions of drugs or chemicals should strictly be in a therapeutic context. This category does not include the names of metabolites, reaction byproducts, or hospital chemicals (e.g. surgical equipment disinfectants). Adverse effect: Mentions of adverse effects include signs, symptoms, diseases, disorders, acquired abnormalities, deficiencies, organ damage or death that strictly occur as a consequence of drug intake.

Inputs
• x: The text to be classified.

Patterns
• D_t || x Question: Is this sentence related to an adverse drug effect (ADE)? Answer: [MASK].
• x Question: Is this sentence related to an adverse drug effect (ADE)? Answer: [MASK].

B.2 B77
Task Description The following is a banking customer service query.Classify the query into one of the 77 categories available.

Inputs
• x: The text to be classified.
• y: The correct intent for the given text.

B.6 SOT
Task Description The goal is to classify the institutions into one of three categories: "university", "company" or "research institute".

Inputs
• x_1: The title of the paper to be classified.
• x_2: The name of the organization to be classified.

Patterns
• D_t || Organization name: x_2 Paper title: x_1 Question: What is the category of this institution? Answer: [MASK].
• Organization name: x_2 Paper title: x_1 Question: What is the category of this institution? Answer: [MASK].

B.7 SRI
Task Description Identify whether this paper should be included in a meta-review which includes the findings of systematic reviews on interventions designed to promote charitable donations. Included reviews should describe monetary charitable donations, assess any population of participants in any context, and be peer reviewed and written in English. They should not report new data, be non-systematic reviews, consider cause-related marketing or other kinds of prosocial behaviour.

Inputs
• x_1: The title of the paper to be classified.
• x_2: The abstract of the paper to be classified.
• x_3: The journal of the paper to be classified.

Patterns
• D_t || Title: x_1 Abstract: x_2 Journal: x_3 Question: Should this paper be included in a meta-review which includes the findings of systematic reviews on interventions designed to promote charitable donations? Answer: [MASK].
• Title: x_1 Abstract: x_2 Journal: x_3 Question: Should this paper be included in a meta-review which includes the findings of systematic reviews on interventions designed to promote charitable donations? Answer: [MASK].

Verbalizers
• not included → No, included → Yes
• not included → False, included → True

B.8 TAI
Task Description Transformative AI (TAI) is defined as AI that precipitates a transition comparable to (or more significant than) the agricultural or industrial revolution. Label a paper as "TAI safety research" if: 1. The contents of the paper are directly motivated by, and substantively inform, the challenge of ensuring good outcomes for TAI, 2. There is substantive content on AI safety, not just AI capabilities, 3. The intended audience is the community of researchers, 4. It meets a subjective threshold of seriousness/quality, 5. Peer review is not required.

Inputs
• x_1: The title of the paper to be classified.
• x_2: The abstract of the paper to be classified.

Patterns
• D_t || Title: x_1 Abstract: x_2 Question: Is this paper a TAI safety research paper? Answer: [MASK].
• Title: x_1 Abstract: x_2 Question: Is this paper a TAI safety research paper? Answer: [MASK].

Verbalizers
• not TAI safety research → No, TAI safety research → Yes
• not TAI safety research → False, TAI safety research → True

B.9 ToS
Task Description Label the sentence from a Terms of Service based on whether it is potentially unfair. If it seems clearly unfair, mark it as potentially unfair. According to art. 3 of the Directive 93/13 on Unfair Terms in Consumer Contracts, a contractual term is unfair if: 1) it has not been individually negotiated; and 2) contrary to the requirement of good faith, it causes a significant imbalance in the parties rights and obligations, to the detriment of the consumer.

Inputs
• x: The text to be classified.

Patterns
• D_t || x Question: Is this sentence potentially unfair? Answer: [MASK].
• x Question: Is this sentence potentially unfair? Answer: [MASK].

Verbalizers
• not potentially unfair → No, potentially unfair → Yes
• not potentially unfair → False, potentially unfair → True

B.10 TEH
Task Description Label whether the following tweet contains hate speech against either immigrants or women. Hate Speech (HS) is commonly defined as any communication that disparages a person or a group on the basis of some characteristic such as race, color, ethnicity, gender, sexual orientation, nationality, religion, or other characteristics.

Inputs
• x: The text to be classified.

Patterns
• D_t || x Question: Does this tweet contain hate speech against either immigrants or women? Answer: [MASK].
• x Question: Does this tweet contain hate speech against either immigrants or women? Answer: [MASK].

Verbalizers
• not hate speech → No, hate speech → Yes
• not hate speech → False, hate speech → True

B.11 TC
Verbalizers
• no complaint → No, complaint → Yes
• no complaint → False, complaint → True
Figure 2: Different choices of patterns and corresponding verbalizers for classifying movie reviews as positive (+) or negative (−). The input is first converted into a cloze question using the pattern; classification is done by computing the output whose verbalization is the most likely substitute for the mask according to the MLM.

Figure 3: Performance of individual patterns, PET and iPET on all tasks considered. Accuracy is shown on the y-axis; the x-axis shows individual pattern ids, with color distinguishing the pattern categories (NULL, PUNC, PROMPT, Q&A). Small bullets correspond to individual training sets; large bullets correspond to average performance. Average performance across all patterns is shown as a dashed gray line.

Figure 4: Relative performance of individual pattern groups and PET for different models and sizes. Scores are normalized so that the best performance for each task, number of examples, model and size is 1.0.

Figure 5: Performance of PET with three randomly selected patterns when adding noise PVPs; the x-axis shows the number of noise PVPs added. We also show performance of using only noise PVPs with PET (NP+P) and their average performance (NP).

Figure 6: Relative performance of PET with only a subset of patterns compared to that achieved using all 23 manually designed patterns. The x-axis shows the number of patterns used.

Figure 7: Performance of PET (solid lines) and average performance of individual models (dotted lines) for different learning rates (LR), training steps (Steps), and batch sizes. For readability, the legend is shown in the top left plot only.


Table 1: Results of various baselines and PET on the RAFT benchmark (Alex et al., 2021); shown numbers are macro F1 scores multiplied by 100. Best model performance is shown in bold, best overall performance (including human annotators) is underlined. The final column shows average performance across all 11 tasks.