Abstract
Prompt-based approaches excel at few-shot learning. However, Perez et al. (2021) recently cast doubt on their performance as they had difficulty getting good results in a “true” few-shot setting in which prompts and hyperparameters cannot be tuned on a dev set. In view of this, we conduct an extensive study of Pet, a method that combines textual instructions with example-based finetuning. We show that, if correctly configured, Pet performs strongly in true few-shot settings without a dev set. Crucial for this strong performance is a number of design choices, including Pet’s ability to intelligently handle multiple prompts. We put our findings to a real-world test by running Pet on RAFT, a benchmark of tasks taken from realistic NLP applications for which no labeled dev or test sets are available. Pet achieves a new state of the art on RAFT and performs close to non-expert humans for 7 out of 11 tasks. These results demonstrate that prompt-based learners can successfully be applied in true few-shot settings and underpin our belief that learning from instructions will play an important role on the path towards human-like few-shot learning capabilities.
1 Introduction
With pretrained language models (LMs) getting ever larger (Radford et al., 2019; Raffel et al., 2020; Brown et al., 2020; Fedus et al., 2021), instruction-based learning has emerged as a powerful method for few-shot text classification (e.g., Schick and Schütze, 2020; Jiang et al., 2020; Schick and Schütze, 2021; Brown et al., 2020; Wei et al., 2022; Sanh et al., 2022). The key idea is to give an LM access to descriptive names for all possible outputs and to short prompts explaining the task to be solved. In settings where at most a few dozen examples are available, this simple idea leads to substantial improvements over other approaches (Schick and Schütze, 2020, 2021; Gao et al., 2021; Tam et al., 2021).
However, recent work has questioned the strong few-shot performance of instruction-based approaches, arguing that they are evaluated in scenarios that are not true few-shot settings (Perez et al., 2021; Logan IV et al., 2021), mainly for two reasons. First, some approaches (e.g., Xie et al., 2019; Zhang et al., 2020; Chen et al., 2020; Tam et al., 2021) make use of large development sets to optimize hyperparameters. Second, it is argued that manually designed instructions require manual tuning on development sets to achieve strong performance (Perez et al., 2021; Logan IV et al., 2021). Indeed, performance can vary greatly—and in mostly unpredictable ways—across different instructions (Jiang et al., 2020; Schick and Schütze, 2020); this issue even persists after finetuning on hundreds of instructions (Sanh et al., 2022). More generally, the need for human involvement is seen as a serious drawback of manually designed instructions (Shin et al., 2020; Lester et al., 2021). Thus, several recent studies have abandoned them in favor of automatically generated prompts (Shin et al., 2020; Gao et al., 2021; Hambardzumyan et al., 2021; Li and Liang, 2021; Lester et al., 2021).
Contrary to this trend, we argue that when correctly configured, prompt-based approaches achieve strong performance even in true few-shot settings and that there is no problem with using manually designed instructions. Quite the opposite: Such instructions are often easy to specify if one is familiar with the task to be solved, they provide an intuitive interface to convey task-specific knowledge, and, if properly used, they can considerably improve model performance in few-shot settings.
To provide empirical support for these claims, we revisit Pet (Schick and Schütze, 2020), a method for combining instructions with example-based finetuning, and thoroughly examine its performance with human-made instructions in true few-shot settings. We simulate a real-world scenario by proceeding in two steps: First, we conduct an extensive study of Pet using three academic datasets to analyze its ability to perform true few-shot learning in a controlled environment and derive best practices for the choice of instructions and hyperparameters. We then put our findings to the test and evaluate Pet on a large variety of real-world tasks from the RAFT benchmark (Alex et al., 2021), for which no labeled dev or test sets are available, enforcing a true few-shot setting (Perez et al., 2021). On average, Pet clearly outperforms all baselines on this dataset and comes surprisingly close to non-expert human performance (see Figure 1). This demonstrates that instruction-based learning can successfully be applied to real-world tasks in true few-shot settings.
Our main contributions are as follows:
We investigate the performance of Pet for various models, tasks, and training set sizes, its ability to cope with different instructions, and its robustness to hyperparameter choices in true few-shot settings.
We show how Pet can be used when no unlabeled data is available and propose a method for efficient classification in scenarios with many different classes, addressing two frequent real-world scenarios.
We apply Pet to RAFT (Alex et al., 2021), a benchmark of real-world tasks. Pet obtains a new state of the art and achieves near-human performance for 7 out of 11 tasks in a true few-shot setting.
2 Related Work
As a precursor to instruction-based learning, some studies have investigated ways of informing classifiers about the meaning of different output classes both for text (Chang et al., 2008; Veeranna et al., 2016; Zhou et al., 2018) and image classification (Norouzi et al., 2014; Romera-Paredes and Torr, 2015); providing instructions in the form of short prompts was first proposed by Radford et al. (2019). This idea has since been applied to solve a wide range of NLP tasks without any task-specific training data (Puri and Catanzaro, 2019; Opitz, 2019; Davison et al., 2019; Schick et al., 2021; Schick and Schütze, 2021; Wei et al., 2022; Sanh et al., 2022). While most approaches rephrase tasks as a language modeling problem, some use prompts to reformulate them as different tasks for which large amounts of training data are available (Levy et al., 2017; McCann et al., 2018; Yin et al., 2019; Sun et al., 2021; Sainz et al., 2021). Instruction-based learning has also been used in few-shot settings; popular variants include in-context learning, where the model’s parameters are fixed and examples are provided as additional context (Brown et al., 2020; Lu et al., 2021; Kumar and Talukdar, 2021; Zhao et al., 2021; Min et al., 2021), finetuning the entire model (Schick and Schütze, 2020, 2021; Gao et al., 2021; Tam et al., 2021), and prompt tuning, where only the instruction itself is optimized (Shin et al., 2020; Hambardzumyan et al., 2021; Li and Liang, 2021; Lester et al., 2021).
Several works investigating the limitations of instruction-based few-shot approaches find that current LMs are mostly unable to understand complex instructions that go beyond short prompts or simple questions (Efrat and Levy, 2020; Weller et al., 2020; Webson and Pavlick, 2021) and that they are highly sensitive to the exact wording of the instructions provided (Jiang et al., 2020; Schick and Schütze, 2020; Chu et al., 2021; Elazar et al., 2021). In a similar vein, Perez et al. (2021) and Logan IV et al. (2021) argue that prior work overestimates few-shot performance as manual prompt tuning is required to achieve good performance. Accordingly, some studies attempt to obtain either prompts (Shin et al., 2020; Gao et al., 2021; Li and Liang, 2021; Lester et al., 2021) or meaningful names for output classes (Schick et al., 2020; Gao et al., 2021) without human involvement.
Finally, many benchmarks have been proposed for comparing few-shot approaches in a standardized way (e.g., Mishra et al., 2021; Bragg et al., 2021; Xu et al., 2021; Ye et al., 2021; Alex et al., 2021). As our focus is on the real-world applicability of few-shot methods, we evaluate Pet on the RAFT benchmark (Alex et al., 2021), which measures performance in applied settings.
3 Pattern-Exploiting Training
We briefly review pattern-exploiting training (Pet) (Schick and Schütze, 2020, 2021), the method we use for instruction-based text classification. At its core, Pet combines textual instructions with regular finetuning using labeled examples. To that end, users must specify one or more patterns that convert an input example x into a cloze question so that it can readily be processed by a masked language model (MLM) (Devlin et al., 2019).1 These patterns can take on very different forms; some examples are shown in Figure 2. In addition, users must inform the model about the meaning of all output classes; this is done with a verbalizer that assigns a natural language expression to each output y (see Figure 2, right). We refer to the combination of a pattern and verbalizer as a pattern-verbalizer pair (PVP).
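As a minimal illustration of this abstraction, the following sketch shows what a single PVP could look like for AG's News; the wording and the choice of mask token are assumptions for illustration and are not taken verbatim from our pattern set.

```python
# A minimal, illustrative pattern-verbalizer pair (PVP) for AG's News.
# The pattern turns an input text into a cloze question with a mask token;
# the verbalizer maps each output class to a natural language expression.

MASK = "<mask>"  # RoBERTa-style mask token (assumed)

def pattern(text: str) -> str:
    """A simple prompt-style pattern: prepend a short cue and the mask."""
    return f"{MASK} News: {text}"

verbalizer = {
    "World": "World",
    "Sports": "Sports",
    "Business": "Business",
    "Sci/Tech": "Tech",
}

print(pattern("Wall St. rallies as oil prices fall."))
# -> "<mask> News: Wall St. rallies as oil prices fall."
```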
Given a single PVP, let p(y∣x) be the probability that an MLM assigns to y’s verbalization in the cloze question obtained by applying the pattern to x, normalized over all y. The MLM is finetuned on labeled examples (x,y) by minimizing the cross-entropy loss between p(y∣x) and a distribution that assigns a probability of 1.0 to y.
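The following sketch shows one way to compute p(y∣x) with a masked LM from the Hugging Face transformers library: the logits at the mask position are restricted to the verbalizer tokens and normalized with a softmax, and the cross-entropy loss is taken against the gold label. The model name, prompt wording, and single-token verbalizations are illustrative assumptions.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

# Illustrative single-token verbalizations (assumed, not the paper's exact choice).
verbalizer = {0: "World", 1: "Sports", 2: "Business", 3: "Tech"}
label_token_ids = [
    tokenizer.convert_tokens_to_ids(tokenizer.tokenize(" " + v))[0]
    for v in verbalizer.values()
]

def pvp_probs(text: str) -> torch.Tensor:
    """Return p(y|x): softmax over verbalizer tokens at the mask position."""
    prompt = f"{tokenizer.mask_token} News: {text}"
    inputs = tokenizer(prompt, return_tensors="pt")
    mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
    logits = model(**inputs).logits[0, mask_pos]            # vocabulary logits
    return torch.softmax(logits[label_token_ids], dim=-1)   # restrict + normalize

# Cross-entropy loss against the gold label, as used for PVP finetuning.
probs = pvp_probs("Wall St. rallies as oil prices fall.")
gold = torch.tensor(2)  # e.g., "Business"
loss = torch.nn.functional.nll_loss(torch.log(probs).unsqueeze(0), gold.unsqueeze(0))
```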
If a user specifies multiple PVPs, individual models are trained for each pair. Similar to knowledge distillation (Hinton et al., 2015), they are then used to annotate unlabeled examples for training a final classifier with a regular sequence classification head (Devlin et al., 2019). We use the weighted variant of Pet without auxiliary language modeling; see Schick and Schütze (2020) for details.
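A minimal sketch of this distillation step, assuming a simple weighting of PVP models (e.g., uniform or proportional to training-set accuracy): the ensemble's probabilities on unlabeled examples are combined into soft labels, and the final classifier is trained with a cross-entropy against this soft distribution.

```python
import torch

def soft_labels(prob_list, weights=None):
    """Combine per-PVP predictions p(y|x) on unlabeled data into soft labels.

    prob_list: list of tensors of shape (num_unlabeled, num_classes), one per PVP model.
    weights:   optional per-PVP weights (e.g., accuracy on the training set);
               the exact weighting scheme is an illustrative assumption.
    """
    probs = torch.stack(prob_list)                      # (num_pvps, N, C)
    if weights is None:
        weights = torch.ones(probs.size(0))
    weights = weights / weights.sum()
    return (weights[:, None, None] * probs).sum(dim=0)  # (N, C)

def distillation_loss(student_logits, soft_targets):
    """Cross-entropy between the final classifier's predictions and the soft labels."""
    log_probs = torch.log_softmax(student_logits, dim=-1)
    return -(soft_targets * log_probs).sum(dim=-1).mean()
```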
4 True Few-Shot Learning with Pet
After describing our experimental setup, we conduct experiments on academic datasets to answer 6 important research questions (Q1–Q6) on the extent to which true few-shot learning is possible with Pet. The purpose of our experiments is also to establish best practices for real-world scenarios and experiments on RAFT (Alex et al., 2021).
Tasks and Datasets
Although tasks and datasets from GLUE (Wang et al., 2018) and SuperGLUE (Wang et al., 2019) were heavily used in prior work (e.g., Brown et al., 2020; Schick and Schütze, 2021; Logan IV et al., 2021; Webson and Pavlick, 2021), we decide against them as they are different from what we expect to see in real-world applications. Instead, we experiment with AG's News, Yelp Reviews Full Star, and Yahoo Questions (Zhang et al., 2015), as these datasets represent classification tasks in three different domains that resemble real-world settings. We create a broad variety of instructions for each task to be able to experiment with a large number of different patterns.
We consider settings with n = 10 and n = 100 training examples. For each n, we generate five different training sets per task by randomly sampling examples from the original training set while ensuring that the number of examples is about the same for each possible output class. In addition, for both n = 10 and n = 100, we sample 1,000 unlabeled examples from the original training set. We repeat all of our experiments for all five training sets and, by default, report average performance.
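A minimal sketch of the label-balanced sampling procedure; how the remainder is distributed when n is not divisible by the number of classes is an assumption.

```python
import random
from collections import defaultdict

def sample_balanced(examples, n, seed=0):
    """Illustrative sketch: sample n (text, label) pairs with roughly
    equal counts per label."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))
    labels = sorted(by_label)
    per_label = n // len(labels)
    sample = []
    for label in labels:
        sample.extend(rng.sample(by_label[label], per_label))
    # Distribute the remainder (n mod #labels) over randomly chosen labels,
    # drawing examples that were not sampled yet.
    for label in rng.sample(labels, n - len(sample)):
        remaining = [ex for ex in by_label[label] if ex not in sample]
        sample.append(rng.choice(remaining))
    return sample
```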
PVPs
We manually write a total of 23 patterns per task, all of which can be categorized into one of the following groups:2
Null: Following Logan IV et al. (2021), these patterns simply insert a mask token.
Punc: Similar to some patterns of Schick and Schütze (2020), these patterns only add punctuation characters and a mask token.
Prompts: Patterns in this group add short prompts—typically consisting of no more than three words—to the input, similar to Radford et al. (2019) and Schick and Schütze (2020).
Q&A: These patterns rephrase the task as a question q and present the input in a question answering format, with the mask token standing in for the answer.
For all patterns, we use a single verbalizer, adopted from Schick and Schütze (2020). There is often a single natural choice for the verbalizer (e.g., the category names for AG's News and Yahoo Questions), so it is difficult to come up with many different good verbalizers.
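To illustrate the pattern groups, the sketch below shows one possible pattern per group for a review-classification input; the exact wording is assumed for illustration and does not reproduce our 23 patterns.

```python
MASK = "<mask>"  # illustrative mask token; the actual token depends on the LM

def null_pattern(text):
    """Null: only the input and a mask token."""
    return f"{text} {MASK}"

def punc_pattern(text):
    """Punc: only punctuation is added around the mask token."""
    return f"{text} ( {MASK} )"

def prompt_pattern(text):
    """Prompt: a short prompt of no more than a few words."""
    return f"{text} It was {MASK}."

def qa_pattern(text):
    """Q&A: the task is rephrased as a question; the mask is the answer."""
    return f"{text} Question: How did the customer rate this place? Answer: {MASK}."
```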
Hyperparameters
We consider a setting similar to that of Schick and Schütze (2020) and Schick and Schütze (2021) and, unless otherwise specified, use the default settings of the Pet library.3 As our experiments require training hundreds of models, we make a few changes to reduce environmental impact (Strubell et al., 2019) and computational cost: We use the base variant of RoBERTa (Liu et al., 2019) as underlying LM both for individual models and the final classifier, we train only one model per PVP, and we reduce the training steps for all individual models and the final classifier to 100 and 1,000, respectively.
Monitoring
Finetuning LMs on small datasets is unstable (Devlin et al., 2019; Dodge et al., 2020) and sometimes results in poor performance. We aim to detect failed finetuning without a labeled test set using the following two checks:
Train Set Underfitting: We check for training runs that result in less than 50% accuracy on the training set. As finetuning on up to 100 examples typically leads to perfect predictions on the training set, this is a clear indicator of failed finetuning.
Constant Predictions: We check for training runs that result in the same class being predicted for all inputs, both on the training set and the unlabeled set. Again, this is a clear indicator of failed finetuning.
Whenever one of these two events occurs, we restart training using a different seed.
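A minimal sketch of these two checks and the restart logic; the function names and the retry loop are illustrative, and only the 50% threshold and the two conditions are taken from the description above.

```python
def finetuning_failed(train_preds, train_labels, unlabeled_preds):
    """Detect failed finetuning runs without a labeled dev or test set."""
    train_acc = sum(p == y for p, y in zip(train_preds, train_labels)) / len(train_labels)
    underfitting = train_acc < 0.5                                # Train Set Underfitting
    constant = len(set(train_preds) | set(unlabeled_preds)) == 1  # Constant Predictions
    return underfitting or constant

def train_with_monitoring(train_fn, max_retries=5):
    """Illustrative retry loop: restart training with a new seed until both checks pass."""
    for seed in range(max_retries):
        model, train_preds, train_labels, unlabeled_preds = train_fn(seed=seed)
        if not finetuning_failed(train_preds, train_labels, unlabeled_preds):
            return model
    return model  # give up after max_retries and keep the last model
```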
Q1: How can we find the best pattern—or do we even need to?
Slightly different patterns can have very different performance (Jiang et al., 2020; Schick and Schütze, 2020; Schick et al., 2021; Webson and Pavlick, 2021; Sanh et al., 2022, inter alia) and popular model selection criteria cannot reliably identify the best-performing patterns in few-shot settings (Perez et al., 2021). We thus investigate to what extent Pet can eliminate the need to find the best instruction even in extreme settings where there are dozens of candidates to choose from.
Setup
Using our default setup, we train individual models for each PVP and a final Pet model; we also train models with iPet, an iterative variant of Pet introduced by Schick and Schütze (2020), using 3 iterations.
Results
The performance of individual models for each pattern and of the distilled models obtained using Pet and iPet is shown in Figure 3. Interestingly, sorting all pattern groups by their average performance gives exactly the same order for each task and training set size: Null patterns clearly perform worst, followed by Punc and Prompt; Q&A gives the best average results. Contrary to the findings of Logan IV et al. (2021), this shows that LMs can benefit considerably from manually written instructions even when combined with finetuning.
Crucially, Pet's performance is much higher than the average performance of individual patterns; further, it consistently outperforms even the best pattern, verifying that Pet indeed removes the need to find the best pattern. While iPet gives clear improvements for n = 10, it performs worse than Pet for n = 100. The reason for this may be that we use a much smaller set of unlabeled examples than prior work (Schick and Schütze, 2020, 2021).
Q2: Does performance of different patterns transfer across models?
While our results for Q1 show a consistent order of pattern groups for different training set sizes and tasks, an important question for real-world applications is whether the same finding also holds for different model sizes and entirely different models.
Setup
Results
Figure 4 shows the performance of each pattern group (i.e., average performance of all individual patterns in this group) and Pet; scores are normalized so that the best-performing approach for each task, training set size, and model gets a score of 1.0. With few exceptions, our findings from Q1 regarding the relative performance of pattern groups and Pet (Null < Punc < Prompt < Q&A < Pet) also hold for different models and sizes. The performance of individual patterns strongly correlates between different models and sizes (Spearman's ρ ≥ 0.7 except in one case).
Q3: Does Pet still work if some PVPs are not well understood?
Q1 and Q2 show that Pet performs even better than the best PVP for a large set of high-quality PVPs. But perhaps the performance is much worse if the LM fails to understand many patterns and verbalizers, for example, because they are in a style different from the model’s pretraining data? For real-world scenarios, we want to know how such “bad” PVPs affect the performance of Pet.
Setup
It is difficult to obtain large quantities of bad instructions of the kind that might occur in real-world scenarios. As a proxy, we resort to noise patterns that add random tokens to the input, serving as a lower bound for truly bad patterns. In concrete terms, we add up to three randomly sampled tokens before and after the input.5 We also create noise verbalizers by assigning uniformly selected tokens to each output class. Using this process, we obtain 20 different intentionally bad PVPs per task. For each task, we start with 3 randomly selected, high-quality patterns from our original set of manually designed instructions, add noise PVPs one by one, and investigate the effect on performance.
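A minimal sketch of how such noise PVPs can be constructed; the token vocabulary and the placement of the mask token are assumptions.

```python
import random

def make_noise_pvp(vocab, num_classes, rng, mask="<mask>"):
    """Illustrative sketch: create a noise pattern and a noise verbalizer
    from a token vocabulary."""
    def noise_pattern(text):
        # add up to three randomly sampled tokens before and after the input
        prefix = " ".join(rng.choices(vocab, k=rng.randint(0, 3)))
        suffix = " ".join(rng.choices(vocab, k=rng.randint(0, 3)))
        return f"{prefix} {text} {suffix} {mask}".strip()

    # assign a uniformly sampled token to each output class
    noise_verbalizer = {y: rng.choice(vocab) for y in range(num_classes)}
    return noise_pattern, noise_verbalizer

rng = random.Random(0)
vocab = ["apple", "quickly", "seven", "blue", "whereas"]  # stand-in vocabulary
pattern, verbalizer = make_noise_pvp(vocab, num_classes=4, rng=rng)
```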
Results
Q4: How many patterns are required for good performance?
Orthogonal to Q3, what is the minimum number of high-quality prompts required for good performance? This is important because we want to minimize the amount of time spent creating PVPs in a practical setting.
Setup
We generate, per task, 10 random permutations of the 23 patterns. For each permutation and training set, we use the same setup as in Q1 to compute the average performance obtained with Pet when using only the first i, 1 ≤ i ≤ 5, patterns.
Results
The average performance of Pet trained with the first i patterns is shown in Figure 6, relative to the performance of Pet trained with all 23 patterns. For all tasks and training set sizes, as few as four patterns are already sufficient to achieve performance close to that of Pet trained with all 23 patterns. Surprisingly, even with i = 1, Pet's performance is much higher than the average performance of models trained on individual patterns. This indicates that knowledge distillation using unlabeled data is also beneficial when using only a single instruction.
Q5: Are other hyperparameters important?
For true few-shot settings, we want the same set of hyperparameter values to perform well across different tasks; this enables us to adopt these values for new tasks without tuning on task-specific validation sets. We investigate how the hyperparameters learning rate, number of training steps, and batch size affect performance.
Setup
Based on previous work, we consider learning rates from 10^-4 to 10^-6, training steps from 10 to 1,000, and batch sizes from 1 to 32. Learning rate and batch size are changed for the individual models and the final classifier simultaneously; the number of training steps is varied only for individual models. We modify each hyperparameter independently, keeping all other parameters at their default values (i.e., a learning rate of 10^-5, 100 training steps, and a batch size of 4).
Results
Results are shown in Figure 7. For training steps and batch size, performance is relatively stable across a wide range of different values, with more steps and larger batch sizes typically leading to slightly better performance (especially for n = 100). The learning rate clearly has the strongest impact on performance, but values of 10^-5 and 5·10^-5 consistently give the best results across tasks; these are also the values typically used for finetuning in prior work (Devlin et al., 2019; Liu et al., 2019).
Q6: Do we really need unlabeled data?
In contrast to individual PVPs, Pet needs unlabeled data, which is not available in some real-world settings. Building on earlier work (Anaby-Tavor et al., 2020; Papanikolaou and Pierleoni, 2020; Yang et al., 2020; Mohapatra et al., 2020; Kumar et al., 2020; Schick and Schütze, 2021), we investigate whether synthetic examples can replace unlabeled data.
Setup
We generate 10,000 examples for n = 10 and 30,000 examples for n = 100 using top-p sampling (Holtzman et al., 2020) with p = 0.9. For each input, we stop the generation process as soon as the model generates two consecutive line breaks. We discard all examples for which the model does not generate two consecutive line breaks within 128 tokens; for datasets with text pairs, we also discard examples where the model fails to generate the sequence separator (+++).
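A minimal sketch of this generation loop with a generic causal LM from the transformers library; the choice of generator model and of the conditioning prompt are assumptions and not specified here.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")     # assumed generator model
model = AutoModelForCausalLM.from_pretrained("gpt2")

def generate_examples(prompt, num_examples):
    """Generate synthetic unlabeled examples with top-p sampling, keeping only
    those that contain two consecutive line breaks within 128 new tokens."""
    examples = []
    inputs = tokenizer(prompt, return_tensors="pt")
    while len(examples) < num_examples:
        output = model.generate(
            **inputs,
            do_sample=True,
            top_p=0.9,            # nucleus sampling (Holtzman et al., 2020)
            top_k=0,
            max_new_tokens=128,
            pad_token_id=tokenizer.eos_token_id,
        )
        text = tokenizer.decode(output[0, inputs.input_ids.size(1):])
        if "\n\n" in text:        # stop at two consecutive line breaks
            examples.append(text.split("\n\n")[0].strip())
        # otherwise discard: no double line break within 128 tokens
    return examples
```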
As the datasets obtained with this method may be highly imbalanced with respect to the distribution of (unknown) labels, we also experiment with a balanced variant: We use the ensemble of models trained on individual PVPs to assign labels to each example and keep only as many examples per label as needed so that the resulting dataset (which is used for training the final classifier) is balanced.
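A minimal sketch of the balanced variant, assuming that "balanced" means keeping the same number of examples for every predicted label (limited by the rarest one).

```python
from collections import defaultdict

def balance_by_predicted_label(examples, soft_labels):
    """Illustrative balancing rule: keep an equal number of examples per
    (predicted) label.

    examples:    list of synthetic texts
    soft_labels: list of per-class probability lists from the PVP ensemble
    """
    by_label = defaultdict(list)
    for ex, probs in zip(examples, soft_labels):
        predicted = max(range(len(probs)), key=lambda y: probs[y])
        by_label[predicted].append(ex)
    per_label = min(len(v) for v in by_label.values())
    return [ex for exs in by_label.values() for ex in exs[:per_label]]
```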
Results
Figure 8 shows the performance of individual patterns as well as Pet and iPet with real and synthetic unlabeled data. Except for iPet on Yahoo Questions with n = 10, accuracy with synthetic data is within one point of accuracy with real data, with our balanced variant performing slightly better. For n = 10, using synthetic data even improves accuracy in some cases. This shows that in the absence of unlabeled examples, synthetic data obtained from generative language models can serve as a drop-in replacement without substantially degrading performance.
5 Pet for Real-World Tasks
We use our insights from §4 to apply Pet to the RAFT benchmark, a collection of 11 diverse real-world tasks whose automated solution has inherent value to someone (Alex et al., 2021). These tasks are challenging for few-shot approaches: they require domain expertise, understanding of detailed instructions, processing of long inputs, and handling a large number of output classes.
Tasks and Datasets
The RAFT benchmark includes 11 tasks from different domains: ADE, B77, NIS, OSE, Over, SOT, SRI, TAI, ToS, TEH, and TC; see Alex et al. (2021) for a detailed overview. Each task comes with 50 labeled training examples; in accordance with the RAFT rules, we additionally make use of the unlabeled data (ranging from 150 to 5,000 examples) for Pet's distillation step. In the case of RAFT, the unlabeled set is identical to the test set, so unlike in §4, our final classifier is directly trained on (unlabeled) test examples.
PVPs
Based on Q1 and Q2, we only employ Q&A prompts. To obtain the question q, we make minimal changes to the original instructions of Alex et al. (2021); we rephrase all binary classification tasks as yes/no questions. For example, we rephrase the instruction “Label the sentence based on whether it is related to an adverse drug effect (ADE).” as “Is this sentence related to an adverse drug effect (ADE)?” Following our results from Q4, we specify 4 PVPs per task. For binary classification, we use two different patterns that either include or omit the full task specification of Alex et al. (2021) and combine them with both a yes/no verbalizer and a true/false verbalizer.6
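A minimal sketch of how the four PVPs for a binary RAFT task can be assembled from the rephrased question, crossing {with, without the task specification} with the {yes/no, true/false} verbalizers; the exact string layout is an assumption.

```python
MASK = "<mask>"  # illustrative mask token

def build_binary_pvps(task_spec, question):
    """Return 4 (pattern, verbalizer) pairs:
    {with, without task spec} x {yes/no, true/false}. Layout is illustrative."""
    def make_pattern(include_spec):
        def pattern(text):
            prefix = f"{task_spec} " if include_spec else ""
            return f"{prefix}{text} {question} {MASK}."
        return pattern

    verbalizers = [{True: "Yes", False: "No"}, {True: "True", False: "False"}]
    return [(make_pattern(spec), verb)
            for spec in (True, False) for verb in verbalizers]

pvps = build_binary_pvps(
    task_spec="Label the sentence based on whether it is related to an "
              "adverse drug effect (ADE).",
    question="Is this sentence related to an adverse drug effect (ADE)?",
)
```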
Hyperparameters
Based on the results of Q5, we mostly keep hyperparameter defaults from §4. However, we make the following changes:
We replace RoBERTa (base) with ALBERT (xxlarge, v2). While it is much slower to train, ALBERT has been shown to outperform RoBERTa both in regular and few-shot settings (Lan et al., 2020; Schick and Schütze, 2021; Logan IV et al., 2021).
Since 1,000 training steps cover only 4,000 examples at batch size 4, we finetune the distilled model for 2,000 steps for tasks with more than 4,000 unlabeled examples. This ensures that all examples are seen at least once.
Following Schick and Schütze (2020) and Schick and Schütze (2021), we train three individual models per PVP. This improves robustness, as performance can vary greatly between individual finetuning runs.
Handling Many Labels
The B77 dataset consists of banking customer service queries, each annotated with one of 77 possible intents. Such a large number of output classes leads to several issues for Pet: First, it is impossible to specify a meaningful verbalizer that maps each intent to a single token. We initially experimented with Pet's multi-mask version (Schick and Schütze, 2021), but it was too inefficient for experimentation. We instead proceed as follows. We rephrase the task as binary classification, where for each pair of query x and intent y, the task is to decide whether y is the correct intent for x. For each original training example (x,y), we generate one example (x,y,True) and four examples (x,y′,False) with randomly sampled, incorrect intents y′. As this increases the amount of data fivefold, we finetune each individual model for 500 instead of 100 steps.
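A minimal sketch of this binary reformulation with negative sampling; helper names are illustrative.

```python
import random

def binarize_b77(examples, all_intents, num_negatives=4, seed=0):
    """Illustrative sketch: turn (query, intent) pairs into binary examples
    (query, intent, label) with one positive and num_negatives negatives each."""
    rng = random.Random(seed)
    binary = []
    for query, intent in examples:
        binary.append((query, intent, True))
        negatives = rng.sample([i for i in all_intents if i != intent],
                               num_negatives)
        binary.extend((query, neg, False) for neg in negatives)
    return binary
```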
This approach is still not particularly efficient: Reframing the task as binary classification means that for each input, 77 forward passes are required to find the correct intent. We thus train the final model as a regular classifier with 77 different output classes; for training this classifier on input x, we set the target probability of output y proportional to the probability of True being the correct output for (x,y) according to our ensemble of binary classifiers.
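A minimal sketch of how the distillation targets for the 77-way classifier can be derived from the binary ensemble, assuming that "proportional" means normalizing the per-intent True probabilities into a distribution.

```python
import torch

def intent_targets(p_true_per_intent):
    """Turn binary 'True' probabilities for all 77 intents into a soft target
    distribution for the final multi-class classifier (illustrative reading
    of 'proportional': simple normalization).

    p_true_per_intent: tensor of shape (77,) with p(True | x, y) for each intent y.
    """
    return p_true_per_intent / p_true_per_intent.sum()

# The final classifier is then trained to match these soft targets,
# analogous to the regular Pet distillation step.
```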
Finally, another issue is that with 50 labeled examples, at least 27 labels are not covered in the training set; this may bias a model to never predict these labels. To alleviate this issue, we train two generations of models using iPet. For training the second generation, we obtain training data covering all possible labels similar to Schick and Schütze (2020): For each label, we pick the two examples from the unlabeled data for which this label is most likely according to the first generation.
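A minimal sketch of this selection step: for each label, the two unlabeled examples with the highest first-generation probability for that label are added to the second-generation training set; data structures are assumptions.

```python
def select_per_label(unlabeled, probs, num_labels, k=2):
    """Illustrative sketch: pick, for each label, the k unlabeled examples
    it is most likely for according to the first generation.

    unlabeled: list of texts
    probs:     list of per-label probability lists from the first generation
    """
    selected = []
    for label in range(num_labels):
        ranked = sorted(range(len(unlabeled)),
                        key=lambda i: probs[i][label], reverse=True)
        selected.extend((unlabeled[i], label) for i in ranked[:k])
    return selected
```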
The nature of RAFT makes it hard to measure the impact of any of these choices. While we could conduct experiments similar to those in §4, none of the datasets considered there has a structure similar to B77; as our modifications affect only one of 11 tasks, we leave further analysis for future work.
Monitoring
We checked for Train Set Underfitting and Constant Predictions (§4) to detect finetuning issues. Unlike in §4, on RAFT we encountered some issues that could not be resolved simply by retraining with a different seed:
We observed Train Set Underfitting for the final classifier on B77. This may be due to the classification head for 77 classes introducing many new parameters; we therefore trained the final model for 5,000 instead of 2,000 steps, which fixed this issue.
We observed Constant Predictions for the ToS training set. Doubling the number of training steps resolved this problem.
Finally, we also observed Constant Predictions on the unlabeled data of SRI. Upon manually inspecting the training set, we observed that all but one out of 50 examples have the same label. As all models already classified the training set perfectly, we left the setup for our SRI submission unchanged.
Results
For all 11 tasks, Table 1 shows results of Pet and baselines.7 As can be seen, Pet performs better than all other approaches on average, achieving near-human performance for 7 out of 11 tasks. Note, however, that non-expert humans perform worse than the majority baseline on SRI, so results on this task should be taken with a grain of salt. Pet also clearly outperforms a GPT-3 model (Brown et al., 2020) by almost 7 points, despite the latter being larger by several orders of magnitude. While Pet is particularly successful on ADE, B77, and OSE (where it outperforms GPT-3 by 13.6, 21.5, and 29.4 points, respectively), it performs comparatively poorly on datasets in the law (Over, ToS) and social media (TEH, TC) domains. Our approach for handling many labels performs surprisingly well on B77 without any tuning of its parameters. Due to the nature of RAFT, we cannot perform further analysis or ablation studies.
Table 1: Results of Pet and baselines on the 11 RAFT tasks.

| Method | ADE | B77 | NIS | OSE | Over | SOT | SRI | TAI | ToS | TEH | TC | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-2 | 60.0 | 12.1 | 56.1 | 24.5 | 49.8 | 38.0 | 49.2 | 61.2 | 49.8 | 31.1 | 72.3 | 45.8 |
| GPT-Neo | 45.2 | 14.9 | 40.8 | 34.3 | 68.1 | 40.6 | 49.3 | 60.5 | 56.5 | 55.4 | 63.6 | 48.1 |
| AdaBoost | 54.3 | 2.3 | 62.6 | 47.5 | 83.8 | 45.5 | 50.6 | 55.6 | 56.0 | 44.3 | 62.5 | 51.4 |
| snlt | 60.3 | 24.8 | 58.5 | 30.2 | 83.1 | 33.6 | 49.2 | 62.6 | 54.0 | 44.9 | 79.1 | 52.8 |
| GPT-3 | 68.6 | 29.9 | 67.9 | 43.1 | 93.7 | 76.9 | 51.6 | 65.6 | 57.4 | 52.6 | 82.1 | 62.7 |
| SetFit | 72.6 | 53.8 | 87.2 | 52.1 | 90.7 | 68.2 | 49.3 | 62.8 | 62.0 | 53.2 | 83.7 | 66.9 |
| Pet | 82.2 | 59.3 | 85.7 | 64.6 | 90.8 | 81.6 | 49.3 | 63.8 | 57.6 | 48.3 | 82.4 | 69.6 |
| Human | 83.0 | 60.7 | 85.7 | 64.6 | 91.7 | 90.8 | 46.8 | 60.9 | 62.7 | 72.2 | 89.7 | 73.5 |
6 Discussion
Our experimental results in §4 and §5 show that strong performance in few-shot settings is clearly possible without manual prompt tuning or hyperparameter optimization on large dev sets; in other words, Pet can successfully be applied in true few-shot settings. While we believe that it should be an important goal of future work to make LMs more robust to different instructions, even with current models it is relatively easy to successfully apply Pet when following a few simple principles—such as rephrasing the task in a Q&A format, using simple vocabulary and single-token verbalizers where possible, and specifying at least a handful of different patterns. In light of these findings, we also hope that future work will not view human involvement in prompt design as a drawback of instruction-based approaches, but rather as an exciting possibility to communicate with models in ways other than exclusively through examples.
Our study has limitations. First, a major obstacle to using Pet in real-world applications is that we do not know a priori how well it performs for a given task; we therefore believe an important next step is to investigate methods for estimating performance without access to large test sets—for example, through model calibration (Desai and Durrett, 2020; Jiang et al., 2021; Zhao et al., 2021)—in real-world settings. In addition, we did not fully explore the capabilities of Pet; for example, we did not investigate domain-adaptive pretraining (Gururangan et al., 2020) and auxiliary language modeling (Chronopoulou et al., 2019), both of which were shown to be helpful by Schick and Schütze (2020). We also did not quantify the impact of our decisions regarding B77 and the effectiveness of monitoring (§4) and only considered English models and datasets. Finally, we did not examine Pet's performance beyond aggregate scores. While this is not feasible on RAFT due to its nature, performing such analysis either with other datasets or with methods such as the ones proposed by Ribeiro et al. (2020) would be relevant future work to understand real-world capabilities of instruction-based approaches more comprehensively.
7 Conclusion
In light of recent work casting doubt on the performance of prompt-based approaches in true few-shot settings (Perez et al., 2021), we have conducted an extensive study of Pet. In a controlled environment, we found that manually designed instructions outperform null prompts, with Q&A-style prompts performing best (Q1, Q2). Across different tasks, models, and training set sizes, Pet consistently outperforms even the best individual prompt (Q1, Q2). We have also shown that Pet is robust to uninformative prompts and to different choices of hyperparameters (Q3, Q5), that as few as four prompts are sufficient to reach good performance (Q4), and that synthetic examples can be used to replace unlabeled data (Q6). Based on these insights, we applied Pet to a benchmark of real-world tasks, where it achieves near-human performance for 7 out of 11 tasks without any tuning on a dev set, demonstrating the potential of instruction-based approaches in true few-shot settings.
Acknowledgments
This work was funded by the European Research Council (ERC #740516). We thank the anonymous reviewers and the action editor for their helpful comments.
Notes
We use the term prompt to refer to a short sequence of tokens that typically contains some form of instruction; pattern is used to denote the function that adds a prompt to an input.
The full set of PVPs can be found at https://github.com/timoschick/pet/tree/master/true-fsl.
For Yahoo, we do not consider BERT as it uses a vocabulary that does not assign a single token to each verbalization.
If there are multiple input texts, we shuffle their order and additionally add 0–3 tokens in between them.
For a full list of all task specifications, see https://github.com/oughtinc/raft-baselines. The full set of PVPs can be found at https://github.com/timoschick/pet/tree/master/true-fsl.
All results are taken directly from the leaderboard at https://huggingface.co/spaces/ought/raft-leaderboard.