Abstract
Task semantics can be expressed by a set of input-output examples or a piece of textual instruction. Conventional machine learning approaches for natural language processing (NLP) mainly rely on the availability of large-scale sets of task-specific examples. Two issues arise: First, collecting task-specific labeled examples is infeasible when tasks are too complicated or costly to annotate, or when the system must handle a new task immediately; second, this is not user-friendly, since end-users are probably more willing to provide a task description than a set of examples before using the system. Therefore, the community has shown increasing interest in a new supervision-seeking paradigm for NLP: learning to follow task instructions, that is, instruction following. Despite its impressive progress, some research questions remain unsolved. This survey tries to summarize and provide insights into the current research on instruction following, particularly by answering the following questions: (i) What is task instruction, and what instruction types exist? (ii) How should we model instructions? (iii) What are popular instruction following datasets and evaluation metrics? (iv) What factors influence and explain the instructions' performance? (v) What challenges remain in instruction following? To our knowledge, this is the first comprehensive survey about instruction following.
1 Introduction
One goal of AI is to build a system that can universally understand and solve new tasks. Labeled examples (Figure 1a), as the mainstream task representation, are costly to obtain at scale and may not even exist in some cases. Is there any other task representation that can contribute to task comprehension? Textual instructions provide another dimension of supervision for expressing task semantics, often encoding more abstract and comprehensive knowledge of the target task than individual labeled examples. As shown in Figure 1b, with the availability of task instructions, systems can be quickly built to handle new tasks. Such efficiency is highly desirable in real-world applications, especially when task-specific annotations are scarce. More importantly, instruction following more closely resembles how humans learn new tasks—a child can easily solve a new mathematical task by learning from its instruction and a few examples (Fennema et al. 1996; Carpenter, Fennema, and Franke 1996). As a result, this new learning paradigm has recently attracted the attention of the machine learning and NLP communities (Wang et al. 2022b; Longpre et al. 2023).
When talking about "instruction," most of us will first think of "prompt"—using a brief template to convert a task input into a new format (e.g., cloze question) that caters to the language modeling objective of large language models (LLMs) (Brown et al. 2020). Despite the prevalence of prompts in text classification, machine translation, and so forth, we argue that prompts are merely a special case of instructions. This article takes a comprehensive and broader view of instruction-driven NLP research. Particularly, we try to answer the following questions: (i) What is task instruction, and what instruction types exist? (§ 4); (ii) Given a task instruction, how should we encode it to assist the model generalization on the target task? (§ 5); (iii) What are popular instruction following datasets and the mainstream evaluation metrics? (§ 6); (iv) What factors (e.g., model size, task numbers) impact the instruction-driven systems' performance? (§ 7); (v) What challenges exist in instruction following, and what are future directions? (§ 8).
To our knowledge, this is the first article that surveys instruction following. In contrast to some existing surveys that focus on a specific in-context instruction, such as prompts (Liu et al. 2023a), input-output demonstrations (Dong et al. 2023), or reasoning (Huang and Chang 2023; Qiao et al. 2023; Yu, Zhang, and Wang 2023), this work provides a more comprehensive overview of instruction following. Our contributions are 3-fold:
Going beyond prompts, we analyze the constraints of prompts through a user-centric lens, with a focus on the disparity between current instruction following research and real-world needs;
We interpret different task instructions from the unified perspective of indirect supervision, and summarize their advantages, limitations, and scope of applications;
We regard current ever-growing LLMs and instruction datasets as an effort of dual-track scaling; additionally, we point out notable open research issues and promising future directions.
2 Related Work
2.1 Instruction Following
As illustrated in Figure 1, unlike traditional example-driven supervised learning, the essence of instruction following is to train the LLMs to understand various instructions and produce the corresponding responses. Because this capacity can be extended to any unseen downstream tasks, instruction following has become an efficient learning paradigm for solving few/zero-shot tasks (Radford et al. 2019; Schick and Schütze 2021c; Yin, Li, and Xiong 2022; Li et al. 2023a; Gupta et al. 2023; Sun et al. 2024; Xie et al. 2024b, inter alia). However, the performance of instruction following highly relies on both model and task scale: A larger LLM (or pretraining with more tokens) tuned on more diverse tasks can achieve significantly better few/zero-shot performances on the downstream tasks (Chung et al. 2022; Iyer et al. 2022; Wang et al. 2023b, inter alia). As scaling model size is prohibitively costly for most research institutes, numerous recent studies worked on collecting high-quality instruction-tuning datasets, either using human workers (Khashabi et al. 2020; Ye, Lin, and Ren 2021; Sanh et al. 2022; Wang et al. 2022b; Longpre et al. 2023; Köpf et al. 2023) or distilling supervision from the powerful LLMs (Wang et al. 2023c; Honovich et al. 2023a; Taori et al. 2023; Peng et al. 2023; Xu et al. 2023a, b; Köksal et al. 2023; Kim et al. 2023; Ding et al. 2023; Yin et al. 2023a; Lou et al. 2024), for example, utilizing ChatGPT or GPT-4 to develop creative task instructions (OpenAI 2022, 2023).
Despite its popularity, current instruction following still faces challenges and has considerable room for evolution. This work not only surveys the extensive existing literature on instruction following but also goes beyond it: We trace the development of instruction following back to the early days of semantic-parsing-based machine learning, and frame our story from an indirect supervision perspective. We hope this survey can systematically introduce this popular yet challenging area.
2.2 Surveys on In-context Instructions
Several existing surveys (Dong et al. 2023; Huang and Chang 2023; Qiao et al. 2023; Yu, Zhang, and Wang 2023) share similar motivations with this work but focus on only a sub-area of instruction following, such as prompts, few-shot demonstrations, chain-of-thought reasoning, and so forth. For example, Liu et al. (2023a) provided a comprehensive overview of prompt learning and LLMs, where the prompt can be regarded as one specific type of textual instruction (as categorized in § 4). Some other studies surveying "soft instruction," namely, parameter-efficient fine-tuning methods (Lialin, Deshpande, and Rumshisky 2023), also differ from our scope of "textual instruction." Notably, Zhang et al. (2023) also proposed a survey on instruction tuning; however, they mostly focused on existing datasets and models, whereas we present a more complete and consistent story of instruction following, including instruction categories, modeling strategies, and so on, which previous works have not covered. To the best of our knowledge, this is the first work that provides a comprehensive and high-level story of instruction following.
3 Preliminary
For instruction following, the goal is to drive a system to produce the output corresponding to a given input by following the instruction. Thus, we assume that a dataset usually consists of three items:
Input (X): the input of an instance; it can be a single piece of text (e.g., sentiment classification) or a group of text pieces (textual entailment, question answering, etc.).
Output (Y): the output of an instance; in classification problems, it can be one or multiple predefined labels; in text generation tasks, it can be any open-form text.
Template (T): a textual template that either tries to express the task intent or is used for bridging X and Y. Note that T on its own may not yet be an instruction.
In § 4, we will elaborate that a task instruction I is actually a combination of T with X or Y, or T on its own in some cases.
4 What Is Task Instruction?—A Unified Perspective from Indirect Supervision
This section first summarizes three main instruction types constructed by different combinations of T, X, and Y (as illustrated in Figure 2), then presents our interpretation of them via an indirect supervision perspective.
4.1 Three Types of Instructions
4.1.1 NLI-oriented Instructions (i.e., I = T + Y)
A conventional scheme to handle the classification tasks is to convert the target labels into indices and let models decide which indices the inputs belong to. This paradigm only encodes the input semantics while losing the label semantics. To let systems recognize new labels without relying on massive labeled examples, Yin, Hay, and Roth (2019) proposed converting the target classification tasks into natural language inference (NLI) (Bowman et al. 2015) by building a hypothesis for each label—deriving the truth value of a label is then converted into determining the truth value of the hypothesis. As exemplified in Figure 2a, this approach builds instructions (I) by combining a template (T) with a label (Y) to explain the task semantics. Table 1 further provides more detailed examples for NLI-oriented instructions.
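To make the idea concrete, below is a minimal sketch of NLI-oriented instructions (I = T + Y): each candidate label Y is slotted into a hypothesis template T and scored for entailment by an off-the-shelf NLI model. The specific model name, template wording, and labels are illustrative choices, not prescribed by the works cited above.

```python
# NLI-oriented instruction sketch: label Y + template T -> hypothesis, scored against the input.
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

premise = "The team scored twice in the final minutes to win the championship."
labels = ["sports", "politics", "technology"]

result = classifier(
    premise,
    candidate_labels=labels,
    hypothesis_template="This text is about {}.",  # the template T; "{}" is filled with each label Y
)
print(result["labels"][0], result["scores"][0])  # label whose hypothesis is most entailed, and its score
```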
The advantages of NLI-oriented instruction learning are 4-fold: (i) it keeps the label semantics and makes it possible to encode the input-output relations; (ii) it unifies various classification problems into an NLI task; (iii) by making use of the indirect supervision from existing NLI datasets, a model trained on NLI tasks is expected to work on other tasks in a zero-shot manner; (iv) it extends the original close-set indices classification problem into an open-domain label recognition paradigm. Therefore, it has been widely used in a variety of few/zero-shot classification tasks (Xu et al. 2023d), such as classifying topics (Yin, Hay, and Roth 2019), sentiments (Zhong et al. 2021), stances (Xu, Vucetic, and Yin 2022), entity types (Li, Yin, and Chen 2022), entity relations (Murty, Koh, and Liang 2020; Xia et al. 2021; Sainz et al. 2021, 2022), and so on.
Despite the term “NLI-oriented,” this type of instruction indeed has a broad scope. Numerous NLP tasks can be formulated in a question-answering format (Khashabi et al. 2020; Wu et al. 2018; Zhang, Gutierrez, and Su 2023; Yin et al. 2023b), where the question-answering instances can also be further transformed to the NLI style by simply concatenating the questions and different possible answers (Yin, Hay, and Roth 2019).
4.1.2 LLM-oriented Instructions (i.e., prompts; I = T + X)
As shown in Figure 2b and Table 2, the prompt is a representative of the LLM-oriented instructions, which is usually a brief utterance prepended with the task input (prefix prompt), or a cloze-question template (cloze prompt). It is basically designed for querying the intermedia responses (that can be further converted into the final outputs) from the LLM. Since the prompted input conforms to the pre-training objectives of LLM (e.g., the cloze-style input satisfies the masked language modeling objective [Devlin et al. 2019]), it helps get rid of the reliance on the traditional supervised fine-tuning and greatly alleviates the cost of human annotations. Thus, prompt learning achieves impressive results on a multitude of previous few/zero-shot NLP tasks, like question answering (Radford et al. 2019; Lin et al. 2022), machine translation (Li et al. 2022b), sentiment analysis (Wu and Shi 2022), textual entailment (Schick and Schütze 2021a, b), entity recognition (Cui et al. 2021; Wang et al. 2022a), and so on.
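As a small illustration of the cloze-style case (I = T + X), the sketch below wraps a task input in a cloze template and lets a masked LM fill the blank; the verbalizer tokens are then mapped back to labels. The model choice and verbalizer words ("great"/"terrible") are assumptions made only for this example.

```python
# LLM-oriented (cloze prompt) sketch: template T wrapped around input X, answered via masked LM.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

review = "The plot was predictable and the acting felt flat."
prompt = f"{review} Overall, it was a [MASK] movie."   # cloze template T applied to the task input X

predictions = unmasker(prompt, targets=["great", "terrible"])  # restrict scoring to verbalizer tokens
for p in predictions:
    print(p["token_str"], p["score"])  # e.g., map "great" -> positive, "terrible" -> negative
```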
Despite the excellent performance of prompt techniques, there are still two obvious shortcomings with LLM-oriented instructions in real-world applications. First, it is not user-friendly. As the prompt is crafted for serving LLMs, it is encouraged to design the prompt in a “model’s language” (e.g., model-preferred incoherent words or internal embedding). However, this LLM-oriented style is hard to be understood by users and often violates human intuitions (Gao, Fisch, and Chen 2021; Li and Liang 2021; Qin and Eisner 2021; Khashabi et al. 2022). Meanwhile, the performance of prompts highly depends on labor-intensive prompt engineering (Bach et al. 2022), but most end-users are not LLM experts and usually lack sufficient knowledge to tune an effective prompt. Second, there are application constraints. The prompt is usually short and simplistic, whereas many tasks cannot be effectively formulated with solely a brief prompt, making prompt hard to deal with the diverse formats of real-world NLP tasks (Chen et al. 2022b; Zhang, Gutierrez, and Su 2023).
4.1.3 Human-oriented Instructions (i.e., I = T + optional )
Human-oriented instructions essentially denote the instructions used for crowd-sourcing on the human-annotation platforms (e.g., Amazon MTurk). Unlike LLM-oriented instructions, human-oriented instructions (Figure 2c) are usually some human-readable, descriptive, and paragraph-style information consisting of various components, such as “task title”, “category”, “definition”, and “things to avoid” (cf. Mishra et al. 2022b). Thus, human-oriented instructions are more user-friendly and can be ideally applied to almost any complex NLP task. Table 3 further shows some representative task examples.
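For intuition, the sketch below lays out the kinds of components a human-oriented, crowd-sourcing-style instruction may carry, loosely following the schema of Mishra et al. (2022b); the field names and example content are illustrative rather than an exact reproduction of any dataset.

```python
# Rough structure of a human-oriented instruction (field names are illustrative).
instruction = {
    "title": "Answerability Classification",
    "definition": "Given a question and a context passage, decide whether the question "
                  "can be answered from the passage. Output 'Yes' or 'No'.",
    "things_to_avoid": "Do not answer the question itself; only judge answerability.",
    "positive_examples": [
        {"input": "Question: Who wrote Hamlet? Passage: Hamlet is a tragedy by William Shakespeare.",
         "output": "Yes"},
    ],
    "negative_examples": [
        {"input": "Question: Who wrote Hamlet? Passage: Hamlet is a tragedy by William Shakespeare.",
         "output": "William Shakespeare",
         "explanation": "The task asks for answerability, not the answer."},
    ],
}

# At training/inference time these fields are typically rendered into one long textual
# instruction that precedes each task input.
```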
Accordingly, human-oriented instructions have attracted much more attention in recent years (Hu et al. 2022b; Gupta et al. 2022; Yin, Li, and Xiong 2022, inter alia). However, due to their complex nature, human-oriented instructions are more challenging for vanilla LLMs to encode. For example, off-the-shelf GPT-2 was found to work poorly at following MTurk instructions (Wolf et al. 2019; Efrat and Levy 2020). To ensure that LLMs better understand human-oriented instructions, follow-up works began to collect large-scale instruction datasets (Mishra et al. 2022b; Wang et al. 2022b). All previous results showed that, after fine-tuning with various task instructions, text-to-text LLMs, like T5 (Raffel et al. 2020), OPT (Zhang et al. 2022a), and Llama (Touvron et al. 2023), achieved remarkable few/zero-shot generalization by following these complex instructions (Wang et al. 2023b; Ivison et al. 2023b).
4.2 An Indirect Supervision Perspective
Table 4 further compares the aforementioned three instruction categories along different dimensions. Although we defined the three types of instructions based on the ultimate downstream use cases they target, they are not mutually exclusive. From a broad overview, they are essentially seeking the same thing—indirect supervision (Yin et al. 2023b)—to cope with target tasks that have limited annotations.
Specifically, NLI-oriented instructions transform target NLP problems into a source task—NLI—so that the rich supervision from existing NLI datasets can act as indirect supervision for those target problems. LLM-oriented instructions reformat target problems into the source task—language modeling, so that the rich generic-purpose knowledge in those LLMs can be directly utilized to get the output. Whether it is NLI-oriented instructions or LLM-oriented instructions, both try to solve unseen tasks with a generalizable system. However, both of them have limited application scope, for example, they cannot efficiently deal with some structured prediction tasks (Chen et al. 2022b; Zhang, Gutierrez, and Su 2023). Instead of seeking supervision from a single source task (NLI or language modeling), human-oriented instructions learn indirect supervision from a large set of training tasks; the resulting system, therefore, can ideally generalize to any unseen textual tasks.
5 How to Model Instructions?
Since both NLI-oriented instructions and LLM-oriented instructions are associated with either the input X or the output Y, these types of instructions do not require specific system designs to encode them. NLI-oriented instructions can be handled by regular systems for the NLI task, and LLM-oriented instructions are mostly fed to auto-regressive LLMs. In contrast, human-oriented instructions are the most challenging type since they are independent of any labeled instances.
Therefore, this section mainly presents several mainstream modeling strategies for the human-oriented instructions, as illustrated in Figure 3.
5.1 Semantic Parser
In the early days of machine learning, to help systems understand natural language instructions, a great number of works used semantic parsing to convert the instruction into a formal language (logical formula), which can be more easily executed by the systems (Goldwasser and Roth 2011, inter alia). As exemplified in Figure 3a, a game instruction "Move any top card to an empty free cell" can be processed into an executable formula: "card(x) ∧ freecell(y)".
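The toy sketch below illustrates the spirit of this strategy: an instruction is mapped to an executable logical form. Real systems learn this mapping from data; the regex rule and predicate names here are invented purely for illustration and extend the simplified formula above.

```python
# Toy illustration of the semantic-parser strategy: instruction -> logical form (hand-written rule).
import re

def parse(instruction: str) -> str:
    m = re.search(r"move any top (\w+) to an empty (\w+ ?\w*)", instruction.lower())
    if m:
        obj, dest = m.group(1), m.group(2).replace(" ", "")
        return f"move(x, y) ∧ {obj}(x) ∧ top(x) ∧ {dest}(y) ∧ empty(y)"
    raise ValueError("unparsable instruction")

print(parse("Move any top card to an empty free cell."))
# move(x, y) ∧ card(x) ∧ top(x) ∧ freecell(y) ∧ empty(y)
```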
Previous research spent extensive effort on this strategy, most of which targets human-computer interaction tasks, for example, playing soccer games (Kuhlmann et al. 2004). To alleviate laborious human annotation, follow-up work leveraged indirect or weak supervision from grounded environments (e.g., knowledge bases) to train the semantic parser (Kim and Mooney 2012).
Limitations
Semantic parser-based approaches mainly apply to individual tasks rather than universal cross-task generalization, because building a versatile semantic parser for all NLP tasks is overly challenging. By contrast, the approach introduced in the next subsection aims at cross-task generalization with limited supervision for the target tasks.
5.2 Flatten-and-Concatenation
In contrast to the semantic parser approach, which accounts for the structure of the instruction and the target problem, methods based on neural networks take a more brute-force approach: As illustrated in Figure 3b, instructions, regardless of their length, structure, task type, and so forth, are flattened into a long token sequence and concatenated with the input X to form a new input sequence for the model, which has been widely adopted by prior research (Wang et al. 2022b; Wei et al. 2023, inter alia). However, this naive strategy consistently results in unsatisfactory performance when using vanilla models (Weller et al. 2020), leading to its reliance on large-scale instruction fine-tuning, known as instruction tuning.
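A minimal sketch of this strategy is shown below: the instruction and the task input are simply joined into one token sequence and fed to an instruction-tuned seq2seq LM. The model name, prompt wording, and example task are illustrative assumptions.

```python
# Flatten-and-concatenation sketch: instruction ++ input -> one sequence -> generate output.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")

instruction = "Classify the sentiment of the following review as 'positive' or 'negative'."
task_input = "The battery lasts two days and the screen is gorgeous."

flattened = f"{instruction}\n\nInput: {task_input}\nOutput:"   # instruction and input flattened together
inputs = tokenizer(flattened, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=8)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```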
Limitations
(i) Flattening and concatenating everything into a long sequence tends to ignore some key information that humans can often capture in the instruction (Mishra et al. 2022a; Jang, Ye, and Seo 2022), such as negation (e.g., "do not generate outputs longer than 5 tokens"), warnings (e.g., "generate 'D' if the question is not answerable or you're not sure"), output constraints (e.g., "your answer should be in one of 'A', 'B', 'C', and 'D'"), and so on. (ii) To let models understand the instruction, a large number of training tasks have to be prepared. This is similar to what happened in the early years of deep learning in NLP: To improve the performance of deep neural networks on a particular task, we collected more labeled examples; in instruction following, the system's comprehension of the instruction, unfortunately, still exhibits a high degree of dependence on the scale of training tasks (Chung et al. 2022).
5.3 HyperNetwork
Unlike the conventional modeling strategy that encodes the input sequence into a dense representation (i.e., language-to-representation), the hypernetwork follows a language-to-parameter paradigm: As shown in Figure 3c, this scheme converts the textual instruction into a block of model parameters that can be further plugged into the underlying models (Ha, Dai, and Le 2017; Houlsby et al. 2019; Jin et al. 2020). As a result, hypernetwork-based instruction modeling can better leverage the structured input sequence by encoding the instruction and task input separately (i.e., instruction-to-parameter, input-to-representation), achieving stronger generalization compared with the flatten-and-concatenation approach (Ye and Ren 2021; Deb, Awadallah, and Zheng 2022). It can also significantly improve inference efficiency, as concluded by recent work (Ivison et al. 2023a).
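The schematic PyTorch sketch below conveys the language-to-parameter idea: an instruction embedding is mapped to the parameters of a small bottleneck adapter, which is then applied to the task-input representation. The dimensions and architecture are invented for illustration and do not correspond to any specific cited system.

```python
# Hypernetwork sketch: instruction embedding -> adapter parameters -> applied to input states.
import torch
import torch.nn as nn

class InstructionHyperNetwork(nn.Module):
    def __init__(self, instr_dim=256, hidden_dim=512, adapter_dim=64):
        super().__init__()
        self.hidden_dim, self.adapter_dim = hidden_dim, adapter_dim
        # generate the down- and up-projection weights of a bottleneck adapter
        self.to_down = nn.Linear(instr_dim, hidden_dim * adapter_dim)
        self.to_up = nn.Linear(instr_dim, adapter_dim * hidden_dim)

    def forward(self, instr_emb, hidden_states):
        # instr_emb: (instr_dim,)   hidden_states: (seq_len, hidden_dim)
        w_down = self.to_down(instr_emb).view(self.hidden_dim, self.adapter_dim)
        w_up = self.to_up(instr_emb).view(self.adapter_dim, self.hidden_dim)
        # instruction-specific adapter applied to the input representation (with residual connection)
        return hidden_states + torch.relu(hidden_states @ w_down) @ w_up

hyper = InstructionHyperNetwork()
instr_emb = torch.randn(256)          # e.g., a pooled encoding of the instruction
hidden_states = torch.randn(10, 512)  # e.g., encoder states of the task input
print(hyper(instr_emb, hidden_states).shape)  # torch.Size([10, 512])
```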
Limitations
5.4 Reinforcement Learning from Human Feedback
The loss function used for training LMs significantly impacts the resulting LMs' instruction-following performance (Tay et al. 2023). However, almost all the aforementioned modeling strategies (except the semantic-parser-based method) adopt a conventional next-token prediction loss (e.g., cross entropy) to train the models, which tries to capture human preference by simply comparing the model's generated text with the ground-truth reference. In order to directly optimize the LMs with the supervision of human preference, recent work utilized reinforcement learning from human feedback (RLHF) to train the LMs (Stiennon et al. 2020; Bai et al. 2022a; Ouyang et al. 2022).
Initial and Tuned LM
The first step of RLHF is to obtain an initial LM, which is usually trained with the flatten-and-concatenation modeling strategy—concatenate the instruction, input, and all other resources (if they exist) into one input sequence, and train the LM to generate the ground-truth output (as introduced before). With the initial LM as a starting point, we copy it into an independent set of parameters; this copy is the target LM that will be continually updated during RLHF (i.e., the tuned LM).
Prediction Shift Penalty
The first component of the reward penalizes the tuned LM for shifting its predictions too far from the initial LM:

r_KL = −KL( p_θ*(y* | I, x) ‖ p_θ(y | I, x) )

Here, θ and θ* represent the parameters of the initial and tuned LM, respectively; I and x denote the instruction and task input, while y and y* are the outputs of the initial and tuned LM. r_KL is the reward (penalty) for prediction shifting, and KL denotes the Kullback–Leibler (KL) divergence. KL divergence is a widely adopted measure of the difference between two distributions, which can be used as a part of the loss to penalize the tuned LM for shifting its output substantially away from the initial LM's generation. This prediction shift penalty prevents the tuned LM from fooling the reward function to get a high reward while losing the coherence of the generated text.
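A minimal PyTorch sketch of how such a penalty could be computed from the two models' token distributions is given below; the shapes, the per-sequence averaging, and the sign convention are simplifying assumptions (real RLHF implementations typically apply the penalty per generated token).

```python
# Prediction-shift penalty sketch: KL divergence between tuned (θ*) and initial (θ) LM distributions.
import torch
import torch.nn.functional as F

def kl_penalty(tuned_logits, init_logits):
    # logits: (seq_len, vocab_size) over the generated tokens
    tuned_logp = F.log_softmax(tuned_logits, dim=-1)
    init_logp = F.log_softmax(init_logits, dim=-1)
    # KL( p_tuned || p_init ), summed over the vocabulary, averaged over positions
    kl = (tuned_logp.exp() * (tuned_logp - init_logp)).sum(dim=-1).mean()
    return -kl  # negative: the larger the drift, the lower the reward

tuned_logits = torch.randn(12, 32000)
init_logits = torch.randn(12, 32000)
print(kl_penalty(tuned_logits, init_logits))
```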
Reward Function
In addition to the prediction shift penalty, the other part of the final reward comes from the reward function. A reward function directly measures how well the model's output aligns with human preference—a higher reward means the output aligns better with human preference. Concretely, the reward model r_θ assigns a scalar score r_θ(I, x, y*) to the tuned LM's output; it is usually a regression model and is trained on pairwise human preference comparisons with the loss

−E_(I, x, y+, y−) [ log σ( r_θ(I, x, y+) − r_θ(I, x, y−) ) ]

where σ is the activation function (typically the sigmoid) that scales the reward difference into (0, 1), y+ is the output preferred by humans, and y− the dispreferred one. By training on these pairwise preference comparison data, the reward model directly learns to capture human preference and provides the alignment reward estimate for RLHF.
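Below is a minimal PyTorch sketch of this pairwise reward-model loss; the scalar rewards are stand-in tensors here (in practice they come from a learned text-pair-to-scalar regressor), and only the loss computation itself is the point.

```python
# Pairwise preference loss sketch: push r(y+) above r(y-) via -log sigma(r(y+) - r(y-)).
import torch
import torch.nn.functional as F

def reward_model_loss(r_plus, r_minus):
    # r_plus / r_minus: (batch,) scalar rewards for preferred / dispreferred outputs
    return -F.logsigmoid(r_plus - r_minus).mean()

r_plus = torch.tensor([1.2, 0.3, 2.0])    # rewards assigned to human-preferred outputs y+
r_minus = torch.tensor([0.1, 0.5, -0.4])  # rewards assigned to dispreferred outputs y-
print(reward_model_loss(r_plus, r_minus))
```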
The Final Training Reward and Inference
The final training reward combines the two components, r = r_θ(I, x, y*) + λ · r_KL, where λ is a controlling factor that balances the alignment reward and the prediction shift penalty.
After training with the above reinforcement learning policy, the final tuned LM can better align with human preference. The inference procedure of the tuned LM is similar to that of the aforementioned flatten-and-concatenation modeling: it receives the instruction and input, and then generates the corresponding output.
Limitations
Compared with other modeling strategies, RLHF requires much more expensive human effort because of the need to collect preference data, especially when the preference comparison outputs are all written by humans (Ouyang et al. 2022). Meanwhile, the performance of RLHF highly relies on the quality of its human preference annotations. More importantly, in some cases, such as open-ended creative writing tasks, different humans often disagree strongly on preference decisions due to the lack of a ground-truth output.
6 Instruction Following Datasets and Evaluation
In this section, we shed light on an important topic related to instruction following: the instruction-following datasets and the evaluation settings for instruction-tuned models.
6.1 Datasets
The essence of instruction following is to train models to follow various task instructions and respond with the corresponding desired outputs. Therefore, the instruction-tuning dataset (high-quality instruction-output pairs) is the critical part (Wang et al. 2023b; Zhou et al. 2023a).
Current instruction-tuning datasets can be divided into two categories according to how they are annotated: (1) human-annotated datasets (§ 6.1.1); and (2) LLM-synthetic datasets (§ 6.1.2). We summarize all the datasets in Table 5 for a better overview.
6.1.1 Human-annotated Datasets
The conventional way to create instruction-tuning datasets is to use human annotators, especially for early-stage datasets, as shown in Table 5. For example, Public Pool of Prompts (P3) (Sanh et al. 2022) and Flan (Wei et al. 2022a) collected multi-task instruction-tuning datasets, where they utilized human expertise to design various prompts for each task. Mishra et al. (2022b) proposed Natural Instructions, in which they collected more than 60 NLP tasks with corresponding human-written instructions; Wang et al. (2022b) further extended this collection to 1.6k cross-lingual tasks contributed by 88 NLP experts, namely, Super-Natural Instructions. Xu, Shen, and Huang (2023) proposed the first multimodal instruction tuning benchmark (MultiInstruct) by leveraging existing open-source datasets and expert-written instructions.
Human-created datasets are mostly high-quality (with minimal annotation errors) but require labor-intensive human effort and considerable time. More importantly, humans suffer from limited diversity—it is very challenging for humans to brainstorm diverse and novel tasks; thus, the task scale of human-annotated datasets is usually limited by the annotators (e.g., their expertise level and collaboration scheme).
6.1.2 LLM-synthetic Datasets
Because LLMs have shown superior annotation quality on various NLP tasks (He et al. 2023; Pan et al. 2023), much recent work has tried to use LLMs (e.g., ChatGPT and GPT-4) instead of humans for instruction-tuning dataset curation. For instance, SELF-INSTRUCT (Wang et al. 2023c) and Unnatural Instructions (Honovich et al. 2023a) utilized human-annotated instructions as demonstrations to guide LLMs in devising novel tasks and increasing task diversity. WizardLM (Xu et al. 2023a) used an instruction evolution paradigm to increase instruction complexity. Dynosaur (Yin et al. 2023a) repurposed existing input-output pairs in NLP datasets to generate new instructions and reduce annotation costs. MUFFIN (Lou et al. 2024) prompted LLMs to gather different task instructions for the same input and obtained an impressive generalization capacity of the tuned smaller models. Besides single-turn instruction-output datasets, some works also collected multi-turn dialogue data from ShareGPT, where the instructions are created by humans (users of the OpenAI API), and the responses come from LLMs.
Though these LLM-synthetic datasets contain considerable noise (e.g., incoherent instructions and hallucinated outputs), their diverse task distributions and model-preferred output patterns still benefit smaller models' instruction following, achieving comparable or even better generalization performance compared with human-annotated datasets (Wang et al. 2023b, c).
In short, the choice between human-annotated and LLM-synthetic datasets can be regarded as a trade-off between data quality and diversity. Previous work has concluded that both factors affect the performance of the resulting models (Chung et al. 2022; Longpre et al. 2023), and mixing human and machine data can lead to better results (Wang et al. 2023c; Yin et al. 2023a); however, there is no concrete conclusion about which factor outweighs the other, as the answer depends heavily on the downstream tasks and application scenarios.
6.2 Evaluation
6.2.1 Different Evaluation Schemes
How to evaluate an instruction-tuned model is also a crucial topic. Most traditional NLP tasks have concrete criteria for the task objective, whereas for instruction following, the key objective is to tame the model to follow instructions—how well the model follows instructions is highly subjective and depends on various preferences. Therefore, different studies tend to utilize different evaluation strategies. In this section, we list several common evaluation settings.
Automatic Metrics
When testing a model's instruction-following performance on an evaluation dataset, if the dataset has "ground-truth" outputs, then a conventional criterion is to use automatic evaluation metrics, such as Exact-Match (Rajpurkar et al. 2016) and ROUGE (Lin 2004), which have been widely used for evaluating generation models (Mishra et al. 2022b; Wang et al. 2022b; Wei et al. 2022a; Sanh et al. 2022; Lou and Yin 2023; Yin, Li, and Xiong 2022). However, this naive evaluation strategy suffers from several drawbacks: (1) It has been widely acknowledged that automatic generation metrics are imperfect and have significant biases (e.g., the BLEU score has a text length bias). (2) All of these metrics show how well the model's prediction aligns with pre-annotated answers; however, most real-world user tasks are highly open-ended, and there are probably no official ground-truth labels with which to calculate the metrics. (3) The essence of instruction following is to follow the user's instructions and provide desired responses that appropriately address the user's requirements, while automatic metrics focus more on superficial textual patterns and fail to reflect how well the response satisfies the instruction.
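For concreteness, the sketch below shows the kind of reference-based scoring this setting relies on: exact match plus a simple token-overlap F1 (ROUGE or BLEU would be used analogously through standard libraries). The normalization rule is a simplifying assumption.

```python
# Reference-based automatic metrics sketch: exact match and token-level F1 against a gold answer.
import re
from collections import Counter

def normalize(text: str) -> list[str]:
    return re.sub(r"[^a-z0-9 ]", " ", text.lower()).split()

def exact_match(pred: str, ref: str) -> float:
    return float(normalize(pred) == normalize(ref))

def token_f1(pred: str, ref: str) -> float:
    p, r = Counter(normalize(pred)), Counter(normalize(ref))
    overlap = sum((p & r).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / sum(p.values()), overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower", "eiffel tower"))   # 0.0
print(token_f1("The Eiffel Tower", "eiffel tower"))      # 0.8
```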
Human Evaluation
A more reliable evaluation method is to use humans to decide whether a model’s response satisfies the instruction or not. For example, given a task instruction and a corresponding model output, the human evaluator should read the instruction and decide whether this model output is acceptable or not (reporting an acceptance ratio for the target model) (Wang et al. 2023b, c; Lou et al. 2024); or ask humans to compare two models’ outputs and decide which one better satisfies the instruction (pairwise comparison between two models) (Taori et al. 2023). Because instructions are mostly complicated and contain considerable explicit or implicit constraints, human evaluation is more flexible and accurate than automatic metrics in reflecting the instruction-following capacities of different models.
However, human evaluation is much more expensive and slower than automatic evaluation, and it is hard to reproduce. Thus, most studies only conduct human evaluation on a small subset of the whole evaluation benchmark. Meanwhile, human evaluation is mostly based on the evaluators' personal preferences and can result in high variance between different evaluators.
Leading LLMs as Evaluators
To address the aforementioned issues of human evaluation, recent studies have also tried to use LLMs (e.g., GPT-4) rather than humans to evaluate models' instruction following capacity, such as VicunaEval and AlpacaEval (Taori et al. 2023). Nevertheless, although LLMs are cheaper and faster, they were found to have serious preference biases toward superficial textual patterns or hallucinations; for example, GPT-4 prefers longer texts and responses with diverse tokens (Wang et al. 2023b). Meanwhile, a single final preference score is usually insufficient for a comprehensive evaluation.
In order to improve reliability, instead of letting LLMs simply provide a preference decision, other works ask LLMs to generate comprehensive analyses as well as the final decision, such as generating the error types, locations, and explanations before concluding with the final scores (Fernandes et al. 2023; Xu et al. 2023e). Some other studies also predefined several explainable criterion questions for the various evaluation tasks (e.g., for an instruction "Please generate at least 25 sentences", define a criterion "is the model's generation at least 25 sentences?"), which can be further verified by humans or LLMs easily (i.e., doing binary classification on these predefined criteria) (Liu et al. 2023e; Zhou et al. 2023b). Saha et al. (2023) also asked LLMs to first generate the criteria questions automatically according to the instructions and then evaluate the model's response.
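As a small illustration of the predefined-criteria idea, the sketch below turns the "at least 25 sentences" example above into a binary check that can be verified programmatically (or delegated to an LLM or human judge); the sentence-splitting heuristic is deliberately simple and is an assumption of this example.

```python
# Criterion-based evaluation sketch: one instruction constraint -> one binary check.
import re

def meets_min_sentences(response: str, min_sentences: int = 25) -> bool:
    sentences = [s for s in re.split(r"[.!?]+\s*", response) if s.strip()]
    return len(sentences) >= min_sentences

response = "First sentence. Second sentence. Third sentence."
print(meets_min_sentences(response, min_sentences=25))  # False: the constraint is violated
```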
6.2.2 Two Branches of Evaluation
Despite the various evaluation choices in the instruction following, they can be summarized into two branches from our view.
Task-centric Evaluation
Most evaluation datasets in this branch are based on conventional multi-task learning, where the evaluation tasks are mostly traditional NLP tasks, such as natural language inference (Wei et al. 2022a; Sanh et al. 2022). This branch aims to test LLMs’ instruction-following and problem-solving capacity, and the main criterion here is whether the models can correctly solve the given textual task. Therefore, most of the evaluation settings in this branch adopt conventional automatic metrics to reflect the task ground-truth label alignment. Representative benchmarks are MMLU (Hendrycks et al. 2021), BBH (Suzgun et al. 2023), SuperNI-Test (Wang et al. 2022b), T0-Eval (Sanh et al. 2022), InstructEval (Chia et al. 2023), and so forth.
Human-centric Evaluation
The evaluation instructions in this setting are user-oriented or dialogue-like user queries, mainly used to test how well the models’ responses align with human preference, especially for the safety and usefulness of the responses (e.g., harmlessness and honesty). Unlike the task-centric evaluation, human-centric evaluation cares less about the ground-truth labels since most user tasks are open-ended. Thus, this evaluation setting is more subjective and requires more high-level human or LLM efforts. Representative benchmarks are AlpacaFarm (Dubois et al. 2023), VicunaEval, and HHH (Bai et al. 2022b).
Since instruction following is a relatively broad topic that touches various downstream tasks and real-world scenarios, there is, to our knowledge, still no comprehensive evaluation setting that can be applied to all target scenarios. A more practical choice is to adopt different evaluation settings according to the objectives of different studies (i.e., task-centric or human-centric).
7 Factors that Influence Instruction Following Performance
Instruction following has proven effective on many few/zero-shot NLP tasks, but how can we explain the impressive performance of instructions? And which aspects make an instruction following procedure successful? We categorize the factors affecting instruction following performance into five dimensions: model, instruction, demonstration, model-instruction interaction, and dataset. Table 6 displays a roadmap for this section, where we also summarize the takeaways for easy reference.
7.1 Model-related Factors
7.1.1 Update Model or Not
As shown in Figure 1b, to drive LLMs to understand and follow task instructions more smoothly, a widely adopted practice is fine-tuning LLMs on multi-task datasets, where each task input is equipped with a task instruction. This procedure is also well known as “instruction tuning.” Numerous studies have demonstrated that instruction-tuned LLMs could better follow the instructions of unseen tasks compared with frozen LLMs (Wei et al. 2022a; Sanh et al. 2022).
Beyond the performance gains on unseen tasks, instruction tuning has many other benefits, such as learning faster on downstream tasks (Longpre et al. 2023; Gupta et al. 2023), being more robust to tiny instruction perturbations (e.g., paraphrasing) (Weller et al. 2020; Sanh et al. 2022; Gu et al. 2023), becoming more user-friendly (Chung et al. 2022), and being better at following soft instructions (Wei et al. 2022a).
7.1.2 Model Scale
Recent work has demonstrated that the model scale significantly impacts the generalization performance of instruction following (Chung et al. 2022; Longpre et al. 2023; Wang et al. 2023b, inter alia). As shown in Figure 4, the generalization performance of each model consistently increases when scaling up the model size. More interestingly, when the model scale is large enough, even vanilla LLMs can significantly outperform smaller LLMs tuned on extensive tasks (see Flan-PaLM; vanilla 540B > 8B + 1,836 tasks), which probably implies that the benefits of scaling up the model size can outweigh dataset scaling.
However, the super-large model scale is usually unaffordable for most research groups, and it also leads to enormous carbon emissions, making it unrealistic in most real-world scenarios (Strubell, Ganesh, and McCallum 2019; Schick and Schütze 2021c). Accordingly, recent studies began to investigate a more efficient way to address the model scale problem, for example, by parameter-efficient fine-tuning (Hu et al. 2022a; Liu et al. 2022a; Lialin, Deshpande, and Rumshisky 2023; Jang et al. 2023).
7.2 Instruction-related Factors
7.2.1 Instruction Engineering
A common problem in instruction following is that pre-trained models are usually sensitive to subtle modifications of the instruction (Weller et al. 2020; Efrat and Levy 2020; Bach et al. 2022; Mishra et al. 2022a; Gu et al. 2023)—even a minor edit to an instruction, such as paraphrasing or word replacement, can lead to huge performance variance. Therefore, modifying the wording of an instruction before usage, namely, instruction engineering, is critical for the models' performance.
One straightforward solution is to manually rewrite the instruction, that is, human instruction engineering. When humans perform instruction engineering, the rewriting criteria are based mostly on human intuition. For example, Mishra et al. (2022a) conducted error case analysis on GPT's instruction-following outputs. Accordingly, they designed several empirical rules for instruction writing and "reframed" the instructions. All of these proposed rules are based on human intuition, for example, itemizing instructions and task decomposition. In order to avoid the preference bias introduced by a small group of humans, Bach et al. (2022) proposed community-driven instruction engineering, where they collected instructions created by various NLP experts with different writing styles, diversifying the choices of instructions. However, human instruction engineering is time-consuming and expensive. Moreover, human intuition about instruction design can be subjective and is sometimes suboptimal for the models.
To this end, automatic instruction engineering tries to let the model figure out better instructions automatically. Prasad et al. (2023) proposed an editing-based method to automatically modify the instruction. In each iteration, they edited the instruction at the phrase level to generate multiple candidates, and then used the target model to score the different candidates with a small labeled set (i.e., calculating the ground-truth entropy and accuracy). In doing so, Prasad et al. (2023) achieved better performance compared with the manually reframed instructions (Mishra et al. 2022a). Besides using the ground-truth score, Gonen et al. (2023) utilized the model's prediction likelihood as feedback to select instruction candidates, which does not even require any labeled instances. Deng et al. (2022) further proposed a reinforcement learning framework to conduct instruction engineering. Despite the superior performance, the obvious drawback of automatic instruction engineering is its poor explainability: the resulting instructions often violate human intuition (e.g., containing task-irrelevant sentences) (Khashabi et al. 2022; Prasad et al. 2023), which is similar to soft instruction.
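The sketch below illustrates likelihood-based automatic instruction selection in the spirit of Gonen et al. (2023): each candidate instruction is scored by the LM's perplexity on the instruction plus input, and the lowest-perplexity candidate is kept. The model choice, candidate wordings, and scoring details are illustrative assumptions and differ from the original method's specifics.

```python
# Likelihood-based instruction selection sketch: lower perplexity -> preferred instruction.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean token-level negative log-likelihood
    return torch.exp(loss).item()

task_input = "Review: the soundtrack was haunting and beautiful. Sentiment:"
candidates = [
    "Decide whether the review is positive or negative.",
    "Output the sentiment polarity label of the given movie review text.",
]
best = min(candidates, key=lambda c: perplexity(f"{c}\n{task_input}"))
print(best)
```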
In short, instruction engineering is a trade-off: lower explainability is the tax paid for better performance. Meanwhile, instruction engineering is a highly empirical subject, and there are no gold-standard rules or methods for it; different models and tasks might require totally different instruction designs. Hence, we highly recommend that the community release accompanying instruction manuals when releasing instruction-tuned models, thus ensuring stable and expected model behaviors (e.g., OpenAI's cookbook).
7.2.2 Instruction Consistency
This factor considers the instructions across the training tasks and test tasks. Keeping the instruction paradigm (e.g., abstractiveness) consistent is crucial in instruction following. Wei et al. (2022a) first investigated the performance impact of changing the instruction paradigm. They found that LLMs tuned on short instructions (i.e., task names) cannot generalize to longer sentence-style instructions (short ⇏ long). Similarly, Gu et al. (2023) observed performance drops when changing paragraph-style instructions to shorter sentence-style instructions at test time (long ⇏ short), further indicating the importance of instruction consistency.
Besides discrete instruction, maintaining the instruction paradigm is also critical for soft instruction, that is, keeping the same-size prefix embedding when testing on unseen tasks (Xu et al. 2022). Interestingly, similar results were also found in the few-shot demonstrations (i.e., in-context learning), where the combination of input-output pairs or the number of demonstrations cannot be changed during training and evaluation (Min et al. 2022a, b; Iyer et al. 2022). These phenomena raise a concern: Although instruction-tuned LLMs are robust to tiny perturbations of instructions, they are vulnerable when facing more significant alterations, which is far behind human-level generalization.
7.2.3 Instruction Diversity
To further improve the robustness of LLMs, especially when facing significant alterations of instruction paradigms, people try to promote instruction diversity during the training phase—for the same training task, writing multiple instructions in different textual expressions (e.g., different wordings and lengths), then training LLMs on the mixture of diverse instructions. Notably, Sanh et al. (2022) showed that adopting instructions with diverse writing styles not only improved the model generalization but also compensated for the limited model scale to some extent.
Nevertheless, manually crafting diverse instructions is expensive and usually hard to achieve due to human annotation bias (Huynh, Bigham, and Eskenazi 2021; Parmar et al. 2023). Owing to the excellent annotation quality of LLMs (He et al. 2023; Pan et al. 2023), a considerable number of studies began to use models to compose novel instructions (Zhang et al. 2020, 2021; Honovich et al. 2023b). Although model-generated instructions have been shown to contain more noise, they benefit from diverse syntactic structures (Kitaev and Klein 2018) and can still complement human-written instructions (Wang et al. 2023c). More interestingly, Lou et al. (2024) proposed a new instruction-following dataset paradigm, where they used LLMs to synthesize diverse task instructions for each input. Benefiting from this paradigm, the tuned LMs were forced to focus more on the instruction than on the task input, achieving promising instruction-following performance. All of these results may imply the benefit of instruction diversity, even at the expense of instruction correctness.
7.2.4 Add Demonstrations or Not
Demonstrations, that is, a couple of input-output examples, have been shown to be critical for the expressiveness of task instructions. For example, existing work found that adding a few positive demonstrations to textual instructions could lead to a significant performance improvement on unseen tasks (Yin, Li, and Xiong 2022; Deb, Awadallah, and Zheng 2022), especially for tasks with complex output spaces (Mishra et al. 2022b; Wang et al. 2022b). Surprisingly, Gu et al. (2023) further found that models relied heavily on few-shot demonstrations and even abandoned other useful resources (e.g., the detailed task definition) when demonstrations were available. This prominence is perhaps because LLMs prefer to exploit the superficial patterns of the demonstrations rather than the other, more complex textual expressions (Min et al. 2022b). In other words, at present, a comprehensive framework for accurately encoding pure instructions in the absence of demonstrations or task scaling remains elusive (Lou and Yin 2023).
7.3 Demonstration-related Factors
Because few-shot demonstrations can considerably impact the model’s instruction following performance, recent studies investigated different factors in the demonstrations that can further enhance the model’s demonstration learning efficiency.
7.3.1 The Selection of Demonstrations
Given an unlabeled test instance (i.e., an input-only instance awaiting the model's answer) and a pool of labeled training instances (i.e., input-output pairs), how to select good demonstrations from this pool for the test instance is a fundamental question for in-context learning.
Liu et al. (2022b) proposed an unsupervised demonstration selection strategy, where they utilized kNN (k nearest neighbors) to retrieve the demonstrations with the closest embedding distance to the test instance. The key step in these clustering-based selection methods is the choice of distance metric, such as L2 distance, cosine similarity, or mutual information (Sorensen et al. 2022). In addition to the clustering-based methods, another branch of methods used the output score of models as the selection criterion (Gonen et al. 2023; Wu et al. 2023; Li and Qiu 2023). For example, Nguyen and Wong (2023) tried to select a subset A from the training pool as the demonstrations by measuring the model's average performance difference between A and its complement set.
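A minimal sketch of the kNN-style selection described above is shown below: embed the test input and the candidate demonstrations, then pick the k nearest by cosine similarity. The `embed` function is a hypothetical placeholder for any sentence encoder and simply returns random vectors here.

```python
# kNN demonstration selection sketch: cosine similarity between test input and candidate inputs.
import numpy as np

def embed(texts):                      # placeholder encoder: replace with a real sentence encoder
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), 384))

def select_demonstrations(test_input, pool, k=4):
    vectors = embed([test_input] + [d["input"] for d in pool])
    query, cands = vectors[0], vectors[1:]
    sims = cands @ query / (np.linalg.norm(cands, axis=1) * np.linalg.norm(query))
    top_k = np.argsort(-sims)[:k]      # indices of the k most similar candidate demonstrations
    return [pool[i] for i in top_k]

pool = [{"input": f"example {i}", "output": "..."} for i in range(100)]
print(select_demonstrations("a new test question", pool, k=4))
```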
Beyond the above unsupervised or weakly supervised selection strategies, some other studies also utilized supervised methods. Wang, Zhu, and Wang (2023) regarded LMs as implicit topic models, where the LMs can generate meaningful concept representations based on the few-shot demonstrations. By training the topic models, they selected demonstrations that could maximize the likelihood of the given concept. Meanwhile, Zhang, Feng, and Tan (2022) regarded demonstration selection as a Markov decision process (Bellman 1957) and proposed a reinforcement learning model via Q-learning (Jang et al. 2019).
7.3.2 The Order of Demonstrations
Even with the same set of demonstrations, differences in example order can also impact the model’s in-context learning performance. Zhao et al. (2021) emphasized that GPT-3 is sensitive to the order of the demonstrations, and they conjectured that this sensitivity potentially comes from recency bias—the tendency to repeat answers that appear towards the end of the prompt. Lu et al. (2022) further conducted comprehensive experiments and found that, along with GPT-3, various models suffer from order sensitivity.
To this end, recent work has proposed several methods to find a "suitable" example order for the LMs. For example, based on recency bias, Liu et al. (2022b) calculated the embedding similarity between the demonstrations and the target input; the more similar examples were placed closer to (i.e., right before) the input. Lu et al. (2022) proposed several entropy-based metrics to search for the best demonstration order.
7.3.3 Reasoning Step Augmentation
Beyond standard input-output demonstrations, augmenting in-context examples with reasoning steps has been found to help the model's performance, especially for super-large models.
Wei et al. (2022b) proposed chain-of-thoughts (CoT), where they inserted some human-written intermediate reasoning steps (i.e., rationale) between input and output of in-context demonstration. By doing so, when predicting the target output, the models can generate intermediate reasoning steps as well, thus enhancing the performance on reasoning tasks (e.g., math word problems) and the explainability of LMs. In addition to the human-written CoT, Xu et al. (2023c) also found that CoT synthesized by larger models can assist the smaller models. Based on the promising results of adopting CoT, more advanced variations were proposed for more accurate reasoning, such as program-of-thoughts (PoT) (Chen et al. 2022a), tree-of-thoughts (ToT) (Yao et al. 2023), graph-of-thoughts (GoT) (Besta et al. 2024), and CoT with self-consistency decoding augmentation (Wang et al. 2023a).
However, similar to the demonstration sensitivity, different CoT writing styles can also result in performance variance. Therefore, in contrast to the human-crafted CoT (i.e., few-shot CoT), Zhang et al. (2022b) proposed Auto-CoT (i.e., zero-shot CoT), where they added "Let's think step by step" to the prompt and let the models generate CoTs themselves. Afterwards, more and more variations of Auto-CoT were proposed to address more complicated reasoning tasks. For example, Self-Ask (Press et al. 2023) asks the model to first generate several questions regarding the input and then answer these questions itself—these self-generated contexts are further used as reasoning rationales to help answer the original input. Similarly, Least-to-Most (Zhou et al. 2022) asks the model to decompose an original complex input into several sub-questions and answer them in sequence, which can likewise serve as rationales.
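To make the zero-shot CoT trick concrete, the sketch below simply appends the "Let's think step by step" trigger so the model produces its own rationale before the final answer; `generate_text` stands in for any LLM text-generation call and is not a real API.

```python
# Zero-shot CoT prompt sketch: the trigger phrase elicits self-generated reasoning steps.
def build_zero_shot_cot_prompt(question: str) -> str:
    return f"Q: {question}\nA: Let's think step by step."

prompt = build_zero_shot_cot_prompt(
    "A shop sells pens at 3 dollars each. How much do 7 pens cost?"
)
# rationale = generate_text(prompt)                                  # hypothetical: model writes its steps
# answer = generate_text(prompt + rationale + "\nTherefore, the answer is")
print(prompt)
```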
7.3.4 Emphasizing Input-output Mappings
For in-context learning, the model usually cannot directly “learn” the input-output mapping from the given examples because there is no parameter update for the models. Therefore, one issue of in-context learning is that, when conducting instruction following, the demonstrations are not necessarily needed for the model to solve the task (i.e., even without the few-shot demonstrations, the model can still make predictions). Min et al. (2022b) also found that the model is more likely to “copy” the output candidate from the demonstrations, instead of truly learning the underlying mapping.
To this end, Wei et al. (2023) proposed symbol tuning. Different from conventional instruction following, which tunes the models to follow input-output demonstrations to complete the target input, symbol tuning uses unrelated symbols to replace the original outputs of the demonstrations. For example, the original output space of the demonstrations might be "positive" and "negative"; symbol tuning uses "Foo" and "Bar" instead. After losing the semantics of the output space, there are no prior label biases (Zhao et al. 2021) for the models to rely on to make the final prediction, so the models are forced to figure out the input-output mapping from the context.
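Below is a minimal sketch of the data transformation that symbol tuning applies to the in-context demonstrations; the symbols and the example task are arbitrary choices made for illustration.

```python
# Symbol-tuning data sketch: replace semantic labels with unrelated symbols in the demonstrations.
label_to_symbol = {"positive": "Foo", "negative": "Bar"}

demonstrations = [
    {"input": "I loved every minute of it.", "output": "positive"},
    {"input": "A complete waste of time.", "output": "negative"},
]

symbolized = [
    {"input": d["input"], "output": label_to_symbol[d["output"]]} for d in demonstrations
]
prompt = "\n".join(f"Input: {d['input']}\nOutput: {d['output']}" for d in symbolized)
prompt += "\nInput: The ending surprised me in the best way.\nOutput:"
print(prompt)  # the model must infer the mapping (Foo/Bar) from the examples themselves
```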
7.4 Model-Instruction Alignment
This factor refers to making the procedure of instruction following better conform to the preference of LLMs. One aspect is the training objective. Since the current instruction following paradigm mainly uses LLMs as the system backbone, one potential explanation for why LLM-oriented instructions (i.e., prompts) work is that the prompt aligns well with the pretraining objective—language modeling—and activates the task-specific knowledge of the LLMs. Some existing works demonstrated the importance of conforming to the pretraining objective of LLMs when doing instruction following (Schick and Schütze 2021c; Tay et al. 2023), such as recalling language modeling objectives in the fine-tuning phase (Iyer et al. 2022). Another aspect of model preference alignment is the way instructions are designed: that is, converting the instructions into model-oriented styles (Deng et al. 2022), for example, using soft instructions (i.e., continuous embeddings) instead of human-understandable discrete instructions (Lester, Al-Rfou, and Constant 2021; Liu et al. 2021; Ye et al. 2022a). This is consistent with empirical guidelines established in the field of prompt engineering, which emphasize the significance of model-oriented prompt design. Despite the performance profits, it is still controversial whether it is worthwhile to convert the original human-oriented instructions into an LLM-oriented style, because it always impairs the interpretability of instructions and is highly contrary to human intuition (Khashabi et al. 2022; Webson and Pavlick 2022; Prasad et al. 2023).
7.5 Data-wise Factor: Task Scale
The task scale often refers to the number of different training task categories in the dataset. Since the "data-wise factor" also includes the scale of training instances, Wang et al. (2022b) investigated the impact of both task and instance scales. They found that the instance scale (fixing the task number while increasing the number of instances per task) brings only a limited performance boost, while the task scale is the key factor for instruction following, in line with the observations of other studies (Wei et al. 2022a; Chung et al. 2022). As illustrated in Figure 4, the same-size model with more tuning tasks usually gains better performance. However, the performance improvement from scaling up tasks is unstable, especially when the model size is too small (e.g., 0.08B Flan-T5). This phenomenon aligns with the discussion in § 7.1; we can draw a similar conclusion here: the benefits of task scale are largely governed by the model scale.
7.6 Main Takeaway: Dual-Track Scaling
Among all the factors discussed in this section, scaling is arguably the core factor that leads to the success of instruction following. Prior to LLM-based instruction following, scaling mainly applied to deep learning models: from single-layer neural nets to multi-layer perceptrons, and from convolutional/recurrent neural networks to deep-layer transformers (Hochreiter and Schmidhuber 1997; LeCun et al. 1998; Vaswani et al. 2017; Devlin et al. 2019). Along with pretraining on massive raw text data, the ever-growing models are expected to have encoded a vast amount of generic-purpose knowledge (Zhou et al. 2023a). In the era of instruction following, where the community is more interested in cross-task generalization, merely scaling LLMs seems not enough. Thus, researchers pursue a parallel track of scaling: collecting more and more training tasks and labeled examples for each. We interpret this as a dual-track scaling. Overall, this dual-track scaling jointly seeks supervision to solve new tasks—the supervision comes either from LLMs' pretraining or from substantial training tasks. Despite its progress, some notable challenges remain in this area, which we will discuss in the next section.
8 Challenges and Future Directions
Despite all the aforementioned benefits of instructions, many under-explored challenges remain in this area. In this section, we list several challenges related to instruction following that are worthwhile for future research to investigate.
8.1 The Tax of Instruction Alignment
Instruction following aims at taming models to better assist humans in real-world tasks; therefore, in addition to pursuing ultimate performance, inference-time safety is also a crucial aspect of instruction-tuned models (i.e., instruction alignment). Ouyang et al. (2022) defined "alignment" with three criteria—Helpful, Honest, and Harmless (HHH), which has been widely considered by previous instruction tuning models and datasets (Bai et al. 2022b; Yin et al. 2023a; Wang et al. 2023c; Lou et al. 2024). However, alignment can also bring a "tax" to the instruction-tuned models. For example, Bekbayev et al. (2023) found that the well-aligned answers provided in instruction following datasets can considerably degrade the model's performance on various task benchmarks. This implies a trade-off between performance and safety for instruction following, which requires careful consideration.
8.2 Learning Negated Information
Negation is a common linguistic property and has been found to be crucial for various NLP tasks, for example, NLI (Naik et al. 2018; Kassner and Schütze 2020). Specific to instruction following, negation denotes any things-to-avoid information in in-context instructions, including negated requirements (e.g., "avoid using stop words") and negative demonstrations (i.e., some wrong examples). Although humans can learn a lot from negation (Dudschig and Kaup 2018), existing work has found that LLMs often fail to follow negated instructions; some negations can even degrade models' performance (Li et al. 2022a; Jang, Ye, and Seo 2022; Mishra et al. 2022a).
Because negation has increasingly become a challenge in instruction following, we provide several hints to inspire future work. One potential solution is unlikelihood training (Hosseini et al. 2021; Ye et al. 2022b), which trains LLMs to minimize the probability of the ground truth when conditioned on negated instructions. Additionally, Yin, Li, and Xiong (2022) proposed pretraining LMs on negative demonstrations with a maximum likelihood objective to exploit the useful information in the negation. Some other methods, such as contrast-consistent projection (Burns et al. 2023) and n-gram representations (Sun and Lu 2022), have also provided insights into tackling this problem.
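To make the unlikelihood idea more concrete, the sketch below mixes the standard likelihood objective with an unlikelihood term for examples conditioned on negated instructions. It is a minimal illustration under our own simplifying assumptions (tensor shapes, a per-example negation flag, and no padding masks), not the exact formulation of the cited works.

```python
import torch
import torch.nn.functional as F

def likelihood_unlikelihood_loss(logits, targets, is_negated):
    """Minimal sketch: MLE for ordinary examples, unlikelihood for examples
    whose negated instruction says the reference output must be avoided.

    logits:     (batch, seq_len, vocab) next-token logits from the LM
    targets:    (batch, seq_len)        reference output token ids
    is_negated: (batch,) bool           True if the instruction is negated
    """
    log_probs = F.log_softmax(logits, dim=-1)
    # log p(y_t | y_<t, instruction, x) for each reference token
    tgt_logp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)

    mle = -tgt_logp  # standard term: maximize p(y_t | ...)
    # unlikelihood term: maximize log(1 - p(y_t | ...)), i.e., push p(y_t) down
    unlikelihood = -torch.log1p(-tgt_logp.exp().clamp(max=1.0 - 1e-6))

    per_token = torch.where(is_negated.unsqueeze(-1), unlikelihood, mle)
    return per_token.mean()
```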
8.3 Adversarial Instruction Attacks
Though most instruction-tuned LLMs align well with human preferences and provide harmless responses, recent work found that they can easily be attacked—the model's response can be manipulated with simple prompting strategies. Kang et al. (2023) designed several prompts to trigger LLMs to generate malicious content. For example, instead of directly providing a malicious instruction with obviously harmful intentions, they split the instruction into several pieces (each piece on its own does not trigger the LLMs' defense mechanisms). In doing so, powerful preference-aligned LLMs, such as ChatGPT and InstructGPT, were successfully fooled into generating harmful content. Li et al. (2023b) also found that retrieval-augmented generation models can be easily attacked by injecting adversarial questions into the retrieved context. Beyond attacks on instruction-tuned LLMs at inference time, Wan et al. (2023) showed that LLMs can also be attacked during instruction tuning: starting from clean instances, they automatically created a few poisoned training examples and found that the resulting LLMs could be manipulated with certain trigger words.
Since instruction-tuned LLMs have been applied to various real-world scenarios, such as Web agents and search engines (Deng et al. 2023; Xie et al. 2024a), the safety of LLM generation is becoming increasingly urgent. Simply conducting preference alignment or content filtering seems insufficient, especially for highly capable LLMs. Thus, developing efficient defense methods is necessary for current instruction-tuned models. Meanwhile, deeper analyses of LLMs' vulnerabilities are also critical and may provide further insights for defense.
8.4 Explainability of Instruction Following
As mentioned in § 7, one critical factor in achieving promising cross-task performance is converting human-oriented instructions into LLM-oriented instructions, i.e., making the instructions conform to the model's preference. Numerous previous studies have verified the effectiveness of catering to the model's preference when designing instructions, for example, by using the model's perplexity to choose appropriate instructions (Gonen et al. 2023). Despite the performance gains, the resulting instructions frequently violate human intuition and raise reliability concerns, such as semantically incoherent, task-irrelevant, or even misleading instructions (Khashabi et al. 2022; Prasad et al. 2023). These results reveal a conflict between performance gains and the human interpretability of instructions, a trade-off that is tricky to balance.
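As a concrete illustration of preference-driven instruction selection, the sketch below scores a handful of candidate phrasings by the perplexity a causal LM assigns to them and keeps the lowest-perplexity one, in the spirit of Gonen et al. (2023). The checkpoint ("gpt2") and the candidate instructions are placeholders of our own choosing, not the original experimental setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
lm = AutoModelForCausalLM.from_pretrained(model_name).eval()

# Candidate paraphrases of the same task instruction (illustrative only).
candidates = [
    "Classify the sentiment of the review as positive or negative.",
    "Read the review and say whether the writer liked the product.",
    "Sentiment? positive/negative:",
]

def perplexity(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = lm(ids, labels=ids).loss  # mean token-level cross-entropy
    return torch.exp(loss).item()

best = min(candidates, key=perplexity)
print("Lowest-perplexity instruction:", best)
```

Even in this toy setup the tension described above is visible: the lowest-perplexity candidate is simply whichever phrasing the model finds most familiar, which need not be the phrasing a human would consider the clearest description of the task.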
Although Mishra et al. (2022a) demonstrated that it is possible to maintain both the faithfulness and the effectiveness of instructions, manual rewriting requires laborious human effort. Therefore, one future trend is to investigate how to automatically rephrase instructions so that they match both human and model preferences.
8.5 Learning to Follow Instructions rather than Merely Generating Y
Multi-task instruction tuning has become a fundamental practice in the current instruction following paradigm. However, there are two issues with this learning paradigm: (i) It relies on training on massive labeled examples to learn the instructions, which remains expensive and often impractical for large-scale LLMs; (ii) Although the ultimate goal of instruction following is to learn to follow instructions by observing various training tasks, the current training objective is still the conventional maximum likelihood over reference outputs. This implicit instruction following objective can lead to sub-optimal optimization (i.e., LLMs may learn to generate Y for X without truly understanding the meaning of the instruction I).
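For reference, the conventional objective mentioned in (ii) is standard maximum likelihood over the reference output, conditioned on both the instruction and the input. In the I/X/Y notation used above, and writing $\mathcal{D}$ as our shorthand for the multi-task training collection, it can be written as

\[
\mathcal{L}_{\mathrm{MLE}}(\theta) = -\sum_{(I, X, Y) \in \mathcal{D}} \sum_{t=1}^{|Y|} \log p_\theta\!\left(y_t \mid I, X, y_{<t}\right),
\]

where $y_t$ is the $t$-th token of the reference output $Y$. Nothing in this loss rewards the model for attending to $I$ beyond its usefulness for predicting $Y$, which is precisely the concern raised above.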
To this end, one desirable future direction is to develop a new learning objective that helps LLMs explicitly learn to follow instructions, which might alleviate the reliance on large-scale labeled instances. A more ambitious and challenging idea is to drive the system to follow instructions without additional tuning on labeled examples of any specific task (Ye et al. 2023; Lou and Yin 2023), which is somewhat similar to the semantic parser-based paradigm (§ 5).
8.6 Multi-Lingual Instruction Following
Intuitively, instruction following is a language-agnostic capability of language models, which means that multi-lingual language models should also be able to follow semantically equivalent instructions in different languages. For example, Kew, Schottmann, and Sennrich (2023) found that LLMs tuned on more than three languages exhibit a stronger instruction following capacity, implying the benefits of multi-lingual instruction tuning. Unfortunately, most current open-source instruction following datasets and foundation models are English-centric (as shown in Table 5). Therefore, the release of high-quality multi-lingual instruction tuning datasets (with paired translations) would be valuable for future research, as also mentioned by Peng et al. (2023).
9 Instruction-related Applications
In addition to the main body of this article, we also survey some popular instruction-related application directions, hoping to inspire broader utilization of instruction following in the future.
9.1 Human–Computer Interaction
Textual instructions can be naturally regarded as a human–computer interaction method. Numerous previous works have used natural language instructions to guide computers in performing various real-world tasks.
For non-NLP (multi-modal) tasks, most work focused on environment-grounded language learning, i.e., driving an agent to associate natural language instructions with the environment and react accordingly, such as selecting mentioned objects from an image/video (Matuszek et al. 2012; Krishnamurthy and Kollar 2013; Puig et al. 2018), following navigational instructions to move the agent (Tellex et al. 2011; Kim and Mooney 2012; Chen 2012; Artzi and Zettlemoyer 2013; Bisk, Yuret, and Marcu 2016), plotting corresponding traces on a map (Vogel and Jurafsky 2010; Chen and Mooney 2011), playing soccer/card games based on given rules (Kuhlmann et al. 2004; Eisenstein et al. 2009; Branavan, Silver, and Barzilay 2011; Babeş-Vroman et al. 2012; Goldwasser and Roth 2011), generating real-time sports broadcasts (Chen and Mooney 2008; Liang, Jordan, and Klein 2009), controlling software (Branavan, Zettlemoyer, and Barzilay 2010), and querying external databases (Clarke et al. 2010), among others. Meanwhile, instructions are also widely adopted to communicate with systems for solving NLP tasks, for example, following instructions to manipulate strings (Gaddy and Klein 2019), classifying e-mails based on given explanations (Srivastava, Labutov, and Mitchell 2017, 2018), and text-to-code generation (Acquaviva et al. 2022).
Recently, a growing body of research has tended to design the human–computer communication procedure in an iterative and modular manner (Dwivedi-Yu et al. 2022; Chakrabarty, Padmakumar, and He 2022). For example, Li, Mitchell, and Myers (2020) built a system to help users tackle everyday tasks (e.g., ordering coffee or requesting an Uber). Benefiting from a user-friendly graphical interface, the system can iteratively ask questions about the task, and users can continually refine their instructions to avoid unclear descriptions or vague concepts. As it is usually difficult for non-expert users to write sufficient instructions in one shot, adopting an iterative and modular paradigm when designing instruction-based AI systems can help guide users to enrich the task instruction step by step. This paradigm thus effectively reduces the cognitive demands on users and leads to a more user-oriented system (Mishra and Nouri 2023). Due to its practical value, we emphasize the importance of this branch of work in this article.
9.2 Data and Feature Augmentation
Task instructions can be regarded as an indirect supervision resource in which superficial but assertive rules are sometimes embedded. These rules are also known as labeling functions that can be directly applied for annotation.11 Therefore, some existing studies have also used instructions as distant supervision to perform data or feature augmentation (Srivastava, Labutov, and Mitchell 2018; Hancock et al. 2018; Ye et al. 2020). For instance, Srivastava, Labutov, and Mitchell (2017) used a semantic parser to convert natural language explanations into logical forms and applied them to all instances in the dataset to generate additional binary features. Wang et al. (2020) utilized label explanations to automatically annotate a raw corpus and trained a classifier on the resulting noisy data.
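To illustrate how an explanation can act as a labeling function, the toy sketch below hand-compiles a single natural language rule into a Python predicate and applies it to unlabeled text, either as a noisy label or as an extra binary feature. The rule, the regular expression, and the e-mail snippets are all hypothetical; they merely mimic the parser-produced logical forms used in the cited works.

```python
import re

# Hypothetical explanation: "label an e-mail as SPAM if it mentions a free prize".
def lf_free_prize(text: str) -> int:
    """Fires (returns 1) when the explanation's rule matches, else 0."""
    return int(bool(re.search(r"\bfree\b.*\bprize\b", text.lower())))

unlabeled = [
    "Claim your free prize today!",
    "Meeting moved to 3pm tomorrow.",
]

noisy_labels = [lf_free_prize(x) for x in unlabeled]      # distant supervision
extra_features = [[lf_free_prize(x)] for x in unlabeled]  # binary feature column
print(noisy_labels)     # [1, 0]
print(extra_features)   # [[1], [0]]
```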
Besides straightforward augmentation, Su et al. (2023) further used task instructions to enrich model representations and achieved strong cross-task generalization. Specifically, they trained an embedding model (a single encoder) on diverse instruction datasets with contrastive learning, and then used this model to produce task-specific representations, conditioned on the instruction, for unseen downstream tasks.
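A minimal sketch of this instruction-conditioned embedding idea follows: the instruction is simply prepended to the input before encoding, so the same text receives different, task-specific representations under different instructions. The checkpoint ("bert-base-uncased") and the mean pooling are our own illustrative choices, not the training recipe or architecture of Su et al. (2023).

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")   # placeholder encoder
enc = AutoModel.from_pretrained("bert-base-uncased").eval()

def embed(instruction: str, text: str) -> torch.Tensor:
    """Encode instruction + input together and mean-pool over tokens."""
    batch = tok(instruction + " " + text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = enc(**batch).last_hidden_state            # (1, seq, dim)
    mask = batch["attention_mask"].unsqueeze(-1)
    return (hidden * mask).sum(1) / mask.sum(1)

same_text = "The battery lasts two full days."
e_sent = embed("Represent the review for sentiment classification:", same_text)
e_dup = embed("Represent the review for duplicate detection:", same_text)
print(torch.cosine_similarity(e_sent, e_dup))  # same input, different task views
```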
9.3 Generalist Language Models
According to the definition of Artificial General Intelligence (AGI), a "generalist model" is usually a system that is competent across different tasks and scalable to changing contexts, going far beyond the initial anticipations of its creators (Wang and Goertzel 2007; Goertzel 2014). Specific to the NLP domain, a generalist language model is expected to be an excellent multi-task assistant, skilled at handling a variety of real-world NLP tasks and different languages in a completely zero-/few-shot manner (Arivazhagan et al. 2019; Pratap et al. 2020; Wei et al. 2022a). As numerous existing works have demonstrated the power of instructions for cross-task generalization (Wei et al. 2022a; Sanh et al. 2022; Mishra et al. 2022b; Wang et al. 2022b; Chung et al. 2022, inter alia), instructions are likely to be a breakthrough toward achieving this ultimate goal.
Notably, the recent remarkable applications of instructions, namely, InstructGPT, ChatGPT, and GPT-4, also mark a large step toward building generalist language models. For example, during the training of Llama-2, Touvron et al. (2023) used context distillation to instill instructions into LLMs, thereby addressing the inconsistency of instruction following in multi-turn dialogue. The OpenAI GPT series adopts RLHF to align the model's outputs with human instructions, where feedback supervision plays a major role. Although the answer to "Is it instruction or human feedback that contributes more to the performance of ChatGPT?" remains unclear and needs further investigation, we introduce some recent works highlighting the critical role of instruction following. For example, Chung et al. (2022) conducted extensive experiments to evaluate the human-preference alignment of PaLM (Chowdhery et al. 2023). They found that, even without any human feedback, instruction tuning significantly reduced toxicity, such as gender and occupation bias, in PaLM's open-ended generations. In addition, some other studies relied solely on creative instructions instead of human feedback and achieved notable cross-task results (Bai et al. 2022b; Honovich et al. 2023a; Wang et al. 2023c). Furthermore, as the knowledge conflict problem of LLMs has a significant impact on the applications of instruction-tuned models (Xie et al. 2024a), recent work has also leveraged the idea of instruction following to enhance retrieval-augmented language models and, vice versa, improved instructions by incorporating retrieved knowledge (Lin et al. 2023), making LLMs more generalist and useful in the real world.
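Since context distillation is only mentioned in passing above, the sketch below gives a rough, hedged picture of the general recipe: a teacher pass that sees the system instruction produces a target distribution, and the student is trained to match it without the instruction in its context, so the instruction's effect is baked into the weights. The single shared model, tensor handling, and KL formulation are simplifying assumptions on our part, not the Llama-2 training details.

```python
import torch
import torch.nn.functional as F

def context_distillation_loss(model, sys_ids, dialog_ids):
    """sys_ids: (1, s) tokenized system instruction; dialog_ids: (1, d) dialogue."""
    # Teacher pass: instruction + dialogue in context (no gradients).
    with torch.no_grad():
        teacher = model(torch.cat([sys_ids, dialog_ids], dim=1)).logits
        teacher = teacher[:, -dialog_ids.size(1):, :]   # keep dialogue positions only
    # Student pass: dialogue only; learn to behave as if the instruction were present.
    student = model(dialog_ids).logits
    return F.kl_div(
        F.log_softmax(student, dim=-1),
        F.softmax(teacher, dim=-1),
        reduction="batchmean",
    )
```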
10 Conclusion
This survey summarizes the existing literature on instruction following, providing a comprehensive overview of the field, including instruction taxonomies, modeling strategies, and key aspects of instruction utilization. It also addresses unique challenges and offers hints for future research. Unlike previous work, we go beyond the limited scope of modern instruction following—we trace studies of instruction following back to the early stages of machine learning and explore textual instructions as a form of indirect supervision for LLMs. To our knowledge, this is the first extensive survey on instruction following. Overall, we aim to offer valuable insights and inspire further in-depth research in this area.
Notes
The curated paper list can be found at: https://github.com/RenzeLou/awesome-instruction-learning.
A plain template connecting X and Y, e.g., "The input is […] The output is […]", with no task-specific semantics.
Prefix prompts are typically used for auto-regressive LMs, while cloze prompts are used for masked LMs (Liu et al. 2023a).
For example, if "a very fair price" is sentiment-positive, every sentence with an adjective-noun collocation similar to "fair price" will be positive as well.
References