Abstract
Natural language generation has witnessed significant advancements due to the training of large language models on vast internet-scale datasets. Despite these advancements, there exists a critical challenge: These models can inadvertently generate content that is toxic, inaccurate, and unhelpful, and existing automatic evaluation metrics often fall short of identifying these shortcomings. As models become more capable, human feedback is an invaluable signal for evaluating and improving models. This survey aims to provide an overview of recent research that has leveraged human feedback to improve natural language generation. First, we introduce a taxonomy distilled from existing research to categorize and organize the varied forms of feedback. Next, we discuss how feedback can be described by its format and objective, and cover the two approaches proposed to use feedback (either for training or decoding): directly using feedback or training feedback models. We also discuss existing datasets for human-feedback data collection, and concerns surrounding feedback collection. Finally, we provide an overview of the nascent field of AI feedback, which uses large language models to make judgments based on a set of principles and minimize the need for human intervention. We also release a website of this survey at feedback-gap-survey.info.
1 Introduction
For generation systems to be widely useful, they must generate text that is not only fluent and high-quality, but also well-aligned with human desires and specifications (Vamplew et al., 2018; Hendrycks et al., 2020; Kenton et al., 2021; Turner et al., 2022; Ngo, 2022). Achieving such ambitious goals requires large language models (LLMs) to evolve beyond traditional training methods. Recent improvements in this space have centered on incorporating human feedback (Bai et al., 2022b; Ouyang et al., 2022; OpenAI, 2023a), intended to serve as a guiding force toward the desired outcomes, much like feedback mechanisms in physical machines (Åström and Murray, 2021).
Typically, state-of-the-art language generation systems are obtained by training probabilistic, autoregressive LLMs on massive amounts of data using maximum likelihood estimation (MLE). However, the data used to train these models is generally scraped from the Internet, often containing noise, social biases, and errors (Bolukbasi et al., 2016; Dodge et al., 2021). This combination may result in a misspecification of target behavior (Kenton et al., 2021), and may lead to models that generate toxic, inaccurate, and unhelpful content (Sheng et al., 2019; Bender et al., 2021).
The evaluation challenge is compounded as these models are often assessed by automatic metrics that rely on superficial features such as word overlap with reference text. However, these metrics often fail to correlate with human-perceived text quality, particularly when models are overly optimized for these metrics (Schluter, 2017; Mathur et al., 2020; Gehrmann et al., 2022; Paulus et al., 2017; Amrhein and Sennrich, 2022).1 Considering human-perceived quality can help bridge the gap between machine-generated and human-generated text and better align the system with desired outcomes (Rosenblueth et al., 1943; Wiener, 1948).
Feedback, as a concept, encompasses a wide range of interpretations (Wiener, 1948); however, some universal characteristics can be identified, such as its format, its intended results, and the ways it is utilized as a part of the model development process. In this survey, we focus on the role of human feedback in improving language generation. We start by formalizing human feedback, creating a taxonomy of feedback types and uses (§2). We characterize feedback by its format and objective, relating to desired model behavior (§3). We explore direct feedback optimization strategies, such as reinforcement learning with human reward functions (§4), and indirect approaches utilizing trained feedback models as proxies (§5). We look at human-feedback datasets and their collection, discussing their influence on models (§6). Lastly, we cover recent work leveraging AI feedback from LLMs to reduce the need for human feedback (§7).
2 A Taxonomy for Leveraging (Human) Feedback for Generation
2.1 Background
Consider a model that, given an input x, outputs natural language text y. This formulation encompasses various NLG tasks, including Summarization (x: documents, y: summaries), Machine Translation (x: source language sentences, y: target language sentences), Dialogue Generation (x: dialogue histories, y: responses), and Image Captioning (x: images, y: captions).
These models are generally realized as a parameterized, conditional probability distribution Pθ(y|x), where θ are the model parameters. This distribution is often estimated autoregressively: The probability of a sentence y given input x is decomposed into the product of the probabilities of each token in y, conditioned on the previous tokens, Pθ(y|x) = ∏t Pθ(yt | y<t, x). These models are trained by finding the parameters θ★ that maximize the likelihood of some training data D, θ★ = argmaxθ ∑(x,y)∈D log Pθ(y|x). At inference time, given input x, an output ŷ is decoded from Pθ(y|x). This decoding can be done, for example, by approximating the most-likely sequence of tokens (ŷ ≈ argmaxy Pθ(y|x)) or by random sampling (ŷ ∼ Pθ(y|x)).
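To make the formulation above concrete, the following is a minimal sketch of autoregressive likelihood computation and of the two decoding strategies mentioned (greedy approximation of the most-likely sequence vs. random sampling). The toy vocabulary and the next_token_probs stub are illustrative stand-ins for a trained neural LM, not any particular system.

```python
import math
import random

VOCAB = ["the", "cat", "sat", "<eos>"]

def next_token_probs(x, prefix):
    """Hypothetical stand-in for Pθ(· | y<t, x); a real system uses a neural LM."""
    bias = min(0.9, 0.1 * len(prefix))        # dummy distribution that increasingly favors <eos>
    rest = (1.0 - bias) / (len(VOCAB) - 1)
    return {tok: (bias if tok == "<eos>" else rest) for tok in VOCAB}

def log_likelihood(x, y):
    """log Pθ(y|x) = Σ_t log Pθ(y_t | y<t, x); MLE training maximizes this over the data."""
    return sum(math.log(next_token_probs(x, y[:t])[tok]) for t, tok in enumerate(y))

def decode(x, greedy=True, max_len=10):
    """Greedy decoding approximates argmax_y Pθ(y|x); otherwise ancestral sampling."""
    y = []
    for _ in range(max_len):
        probs = next_token_probs(x, y)
        if greedy:
            tok = max(probs, key=probs.get)
        else:
            tok = random.choices(list(probs), weights=list(probs.values()))[0]
        y.append(tok)
        if tok == "<eos>":
            break
    return y

print(decode("a cat sits", greedy=True))
print(log_likelihood("a cat sits", ["the", "cat", "sat", "<eos>"]))
```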
Evaluating the quality of generated text can be challenging due to the complexity and subjectivity of natural language. Although numerous automated metrics have been proposed for different domains, they typically rely on n-gram matching or other rudimentary heuristics. These measures often overlook complex linguistic phenomena, such as paraphrasing or stylistic variations, ultimately failing to align with nuanced human judgments (Sai et al., 2022; Gehrmann et al., 2022). Therefore, for many of these tasks, human feedback is considered the gold standard for assessing the quality, and newer learned metrics often aim to approximate how humans provide feedback (see §5.1).
We note that this framing is a simplification of the real world: Often, different humans might provide different (potentially contradicting) feedback for the same outputs, and a single function may not be able to capture this variability (discussed further in §6). Finally, while our formalization is flexible, it excludes other approaches where models interact with humans to improve learning, such as active learning and other human-in-the-loop approaches.
2.2 Taxonomy
Having established a basic mathematical formulation, we now identify four key axes along which we can classify the uses of human feedback: format, objective, use, and modeling. Figure 1 shows this taxonomy in detail, along with example representative works and how they fit in it. In the next sections we will describe each axis in more detail.
3 Describing Feedback
3.1 Format
An important decision when improving language generation systems through human feedback is the format in which to collect this feedback. This choice affects the expressivity of the feedback, the ease of its collection, how it can be used to improve systems, and even the level of rationality of the feedback (Ghosal et al., 2023). Feedback types are summarized in Table 1 with examples.
Table 1: Examples of different feedback formats.

| Input | Output(s) | Feedback | Type |
| --- | --- | --- | --- |
| A melhor comida do mundo é a portuguesa. | The worst food in the world are Portuguese. | 0.7 | Score |
| | | ‘worst’: major/accuracy; ‘are’: minor/fluency | MQM |
| | | ‘worst’ → ‘best’, ‘are’ → ‘is’ | Post-Edition |
| Artificial intelligence has the potential to revolutionize industries (…) but ethical concerns need to be handled. | AI can change industries. | Fluency: 1; Relevance: 0.7 | Multi-Aspect |
| | | “Misses the ethical concerns.” | Natural Language |
| Explain the moon landing to a 6 year old | A: People went to the ... B: The moon is a satellite... | A > B | Ranking |
Numerical
Numerical feedback, which takes an input and output and returns a single score (f(x, y) ∈ ℝ), is one of the simplest feedback formats to collect and use. Kreutzer et al. (2018) studied using categorical feedback, in the form of 5 possible “stars” assigned to a translation, which are averaged to produce a score (f(x, y) ∈ [1, 5]) used to improve the model. Liu et al. (2018) and Shi et al. (2021) used even simpler feedback, asking humans to choose whether a given response is good or not (f(x, y) ∈ {0, 1}). Numerical feedback has also been widely used for evaluation, albeit not with the explicit goal of improving generation. For example, direct assessments (Graham et al., 2013) in machine translation ask humans to rate translations on a continuous scale. Some works have attempted to use this data to train feedback models (Sellam et al., 2020; Rei et al., 2020a) and improve generation (Freitag et al., 2022a; Fernandes et al., 2022).
Although easy to leverage, numerical feedback has limitations: Reducing feedback to a single score may be a hard and ill-defined task for humans, especially for complex tasks, leading to a costly collection process and problems of subjectivity and variance (see §6). Furthermore, it may not distinguish well between outputs of similar quality.
Ranking-based

Ranking-based feedback asks humans to compare two or more candidate outputs and order them by preference (e.g., A > B in Table 1). Because it avoids assigning absolute scores, it can be easier to provide consistently and can distinguish between outputs of similar quality, making it a common choice for training preference models (§5.1).
Natural Language
Both numerical and ranking-based feedback cannot capture detailed information about problems with the output, which can be crucial for improving generation systems. Natural language feedback typically provides more detailed information, often by suggesting specific shortcomings or revisions for the current output. For example, Li et al. (2017) asked humans to give natural language feedback to a dialogue question answering model, including positive or negative feedback, but also possibly providing the correct answer to the model or a hint. Tandon et al. (2022) and Madaan et al. (2022) gather natural language feedback on errors in model-generated graphs and on the model’s interpretation of a given instruction. Scheurer et al. (2022, 2023) improve summarization capabilities of language models by asking humans to provide natural language feedback on the model’s summaries. Li et al. (2022) collect natural language feedback (in addition to numerical feedback) for responses from a question answering system.
Others
Besides these feedback types, other (potentially domain-specific) types of feedback can be used to improve model behavior. Commonly, humans are asked to provide multi-aspect feedback, scoring an output or ranking multiple outputs with respect to multiple dimensions (Böhm et al., 2019; Glaese et al., 2022; Madaan et al., 2023; Nguyen et al., 2022). Post-editions ask humans to provide corrections to the output in the form of small edits (e.g., replace X by Y), and post-edition data has been used to directly improve models (Denkowski et al., 2014) or to train automatic post-editing systems that correct model mistakes (Pal et al., 2016; Mehta and Goldwasser, 2019; Madaan et al., 2021; Talmor et al., 2020; Elgohary et al., 2021). Other feedback types have not yet been fully leveraged to improve generation: e.g., Multidimensional Quality Metrics (MQM) (Lommel et al., 2014), the standard for evaluating translation quality, asks professional translators to identify error spans in a translation, alongside the severity and type of each error.
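For concreteness, the sketch below shows one way the feedback formats discussed in this section could be represented as data structures; the class and field names are illustrative and not taken from any existing dataset or toolkit.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class ScoreFeedback:            # numerical: a single score for an output
    score: float                # e.g., 0.7, or the average of several 1-5 star ratings

@dataclass
class RankingFeedback:          # ranking-based: an ordering over candidate outputs
    ranking: List[int]          # candidate indices, best first (e.g., [0, 1] encodes A > B)

@dataclass
class MultiAspectFeedback:      # one score per dimension
    aspects: Dict[str, float]   # e.g., {"fluency": 1.0, "relevance": 0.7}

@dataclass
class NaturalLanguageFeedback:  # free-form critique or suggested revision
    comment: str                # e.g., "Misses the ethical concerns."

@dataclass
class PostEditFeedback:         # small corrective edits to the output
    edits: List[Tuple[str, str]]        # e.g., [("worst", "best"), ("are", "is")]

@dataclass
class MQMFeedback:              # error spans with severity and category
    spans: List[Tuple[str, str, str]]   # e.g., [("worst", "major", "accuracy")]
```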
3.2 Objective
The purpose of collecting feedback is to align the model’s behavior with some (often ill-defined) goal behavior: For example, we might want our summarization model to generate summaries that contain all core information, even if it means they are longer. This alignment objective has been studied extensively in the AI safety and alignment literature (Bostrom, 2014; Amodei et al., 2016; Bommasani et al., 2021; Kenton et al., 2021), with Leike et al. (2018) proposing the use of feedback models to tackle the difficulty in specifying objectives.
Bai et al. (2022a) explicitly divided the problem of “aligning” a language model into improving its helpfulness and increasing its harmlessness. Most works implicitly consider feedback that targets either performance factors (such as overall performance on a task or the ability to follow instructions) or harmlessness factors (such as not producing toxic text or harmful information).3
Helpfulness
Most often, feedback is collected with some helpfulness objective in mind: A necessary (but not sufficient) condition for a helpful system is that it performs well, so feedback related to task performance generally falls under this umbrella. For example, most works in machine translation leverage feedback related to translation quality (Kreutzer et al., 2018; Fernandes et al., 2022), which is expected to be correlated with its helpfulness in downstream applications. Similarly, in summarization, most works leverage feedback related to aspects such as relevance, consistency, and accuracy (Ziegler et al., 2019; Stiennon et al., 2020). One particularly well-studied feedback objective is the ability to follow instructions (Ouyang et al., 2022), which encompasses a wide range of tasks.
Harmlessness
Another important alignment objective is harmlessness: we want our models not to produce certain types of output or violate certain norms. Feedback collected in Ouyang et al. (2022) considered aspects such as the toxicity of text (besides the overall ability to follow instructions). Bai et al. (2022a) explored the interaction between the helpfulness and harmlessness objectives, showing a trade-off between both. Thoppilan et al. (2022) collected feedback on whether their model violates a set of safety objectives and used it to finetune the model. Glaese et al. (2022) also asked humans to provide feedback on the harmlessness of their system, by defining a set of rules and asking humans if the outputs violate these rules. Bai et al. (2022b) showed that feedback produced by LLMs could increase harmlessness without reducing helpfulness.
4 Directly Leveraging Human Feedback
In an ideal scenario, we would directly leverage human feedback to improve generation, both during training and during decoding.
4.1 Optimizing for Human Feedback
An instance of this approach can be found in Li et al. (2017), in which the authors train a dialogue model by maximizing the likelihood of the model’s answers labeled correct by humans. Similarly, Kreutzer et al. (2018) trained a machine translation model on a set of positively labeled translations, and Glaese et al. (2022) performed supervised learning on the dialogues which complied with their pre-defined rules (concerning correctness, harmfulness, and helpfulness), according to humans. A slightly different approach was proposed by Hancock et al. (2019): deploying a chit-chat dialogue model and using the human utterances as targets to fine-tune the model. Scheurer et al. (2022, 2023) leverage the fact that LLMs can follow instructions: they first collect natural language human feedback about the model generations, which often describes what an improved text would look like, and then ask the LM to generate multiple refinements based on the input, the previous model generation, and the corresponding feedback. For each generation, the refinement most similar to the feedback is then used to fine-tune the LLM. OpenAI’s text-davinci-002 was trained with both human demonstrations and model outputs that received the highest possible rating, an approach dubbed FeedME (OpenAI, 2023b). A downside of these approaches is that they disregard generations that do not receive positive feedback, which may also contain useful information.
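The following is a minimal sketch of this kind of feedback-based imitation learning: keep only outputs that received positive human feedback and fine-tune on them with standard MLE. The model name, data, and score threshold are illustrative, and a more careful implementation would mask the loss on the input tokens.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # illustrative stand-in for the generation model being improved
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# (input, output, human score) triples; in practice these come from annotators.
feedback_data = [
    ("Translate: A melhor comida do mundo é a portuguesa.",
     "The best food in the world is Portuguese.", 0.9),
    ("Translate: A melhor comida do mundo é a portuguesa.",
     "The worst food in the world are Portuguese.", 0.2),
]

# Keep only the generations that received positive feedback.
positive = [(x, y) for x, y, score in feedback_data if score >= 0.5]

model.train()
for x, y in positive:
    enc = tokenizer(x + "\n" + y, return_tensors="pt")
    # Standard causal-LM (MLE) loss on the positively-rated output; a more careful
    # implementation would mask the loss on the input tokens.
    loss = model(**enc, labels=enc["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```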
Other works train the model to predict both the generations and the corresponding human feedback. Xu et al. (2022) proposed using the Director model introduced by Arora et al. (2022) to leverage human feedback. As this model has a unified decoder-classifier architecture, Xu et al. (2022) use positively labeled examples to train its language modeling head (similarly to feedback-based imitation learning) and both positively and negatively labeled examples to train a classifier head that steers the model away from generating undesirable sequences. Thoppilan et al. (2022) follow this approach to improve the model’s quality and safety: using collected dialogues between crowdworkers and the LaMDA model, annotated with the crowdworkers’ feedback on each response’s quality, LaMDA is fine-tuned to predict high-quality responses alongside each response’s quality attributes and safety.
Liu et al. (2023) proposed prompt-based fine-tuning, where they create prompts containing previous generations rated by humans, in order of preference, and insert language-based feedback (e.g., “… is worse than …”) into the prompt between the generations. The model is then fine-tuned on the preferred answers. In Section 5.2.1, we discuss scenarios where the feedback is sourced from a feedback model instead of humans.
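A small sketch of this prompt-based idea is given below, under the assumption of a simple template: human-rated generations are placed in the prompt in preference order, joined by language feedback, and the loss would be computed only on the preferred answer. The template and helper names are illustrative, not the exact format of Liu et al. (2023).

```python
from typing import Tuple

def build_prompt(instruction: str, worse: str, better: str) -> Tuple[str, str]:
    """Return (prompt, target); the target is the preferred generation."""
    prompt = (
        f"{instruction}\n"
        f"The following answer is worse: {worse}\n"
        "A better answer is:"
    )
    return prompt, " " + better

prompt, target = build_prompt(
    "Explain the moon landing to a 6 year old.",
    "The moon is a satellite...",
    "People went to the moon in a big rocket...",
)

# Fine-tuning would then maximize log P(target | prompt), e.g., by tokenizing
# prompt + target and masking the loss on the prompt tokens.
print(prompt + target)
```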
4.2 Decoding with Human Feedback
Directly adjusting model parameters might not always be feasible, especially for LLMs. Moreover, during the training phase, consistent and meaningful feedback may not always be readily available. In such settings, leveraging human feedback during decoding becomes crucial. There are two primary approaches in this realm:

1. Feedback Memory: This involves maintaining past feedback and incorporating relevant aspects when processing new inputs, guiding the model toward preferred outputs.
To illustrate, imagine a scenario where the model produces an output that is either biased or factually incorrect. Upon receiving feedback highlighting this flaw, a model without feedback memory capabilities would still be prone to making the same error on similar inputs. In contrast, a model equipped with a robust feedback memory mechanism can actively reference this feedback. When faced with a comparable input or context in the future, it can thus reduce the likelihood of reproducing the same error. This feedback memory can be conceptualized as a repository or “bank” where past feedback is stored. Depending on the implementation, this could be in the form of plain text entries (Madaan et al., 2022) or dense vector representations (Tandon et al., 2022). When processing new inputs, the model first probes this memory bank to identify if a similar input or context exists. If a match or a close approximation is found, the model retrieves the corresponding feedback. This feedback can then be factored in (e.g., by concatenating the feedback to the prompt) to produce a refined output; a code sketch combining this idea with iterative output refinement is given after the second approach below.
While the notion of learning from past experiences or feedback traces its roots to early cognitive theories and computational models (Riesbeck, 1981; Schank, 1983), its effectiveness in finetuning language models and few-shot learning settings has been shown in recent work (Weston et al., 2014; Wu et al., 2018; Tandon et al., 2022; Madaan et al., 2022).
2. Iterative Output Refinement: This method employs human feedback to refine the model’s output iteratively. Users can provide feedback on intermediate responses, enabling the model to adjust its output until it meets the user’s satisfaction. This process allows the model to better understand user preferences and produce more suitable outcomes (Reid and Neubig, 2022; Saunders et al., 2022; Schick et al., 2022; Nijkamp et al., 2022). Feedback can also be provided on model attributes such as the decoding strategy (Passali et al., 2021), rather than directly on its outputs.
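The following is a minimal sketch, not tied to any specific system, that combines the two approaches above: past natural language feedback is stored in a plain-text memory, retrieved for similar inputs, prepended to the prompt, and the output is refined until the user is satisfied. The generate stub and similarity threshold are illustrative placeholders for a real LLM call and retriever.

```python
from difflib import SequenceMatcher

feedback_memory = []  # (input, feedback) pairs collected over time

def retrieve_feedback(new_input, threshold=0.6):
    """Return feedback stored for inputs similar to the new one."""
    return [
        fb for past_input, fb in feedback_memory
        if SequenceMatcher(None, past_input, new_input).ratio() >= threshold
    ]

def generate(prompt):
    """Hypothetical LLM call; replace with a real model in practice."""
    return f"[model output for: {prompt[:60]}...]"

def respond(user_input, ask_user_feedback, max_rounds=3):
    # Approach 1 (feedback memory): condition generation on relevant past feedback.
    notes = retrieve_feedback(user_input)
    prompt = "".join(f"Past feedback: {n}\n" for n in notes) + user_input
    output = generate(prompt)
    # Approach 2 (iterative refinement): let the user critique intermediate outputs.
    for _ in range(max_rounds):
        feedback = ask_user_feedback(output)
        if feedback is None:  # user is satisfied
            return output
        feedback_memory.append((user_input, feedback))  # remember for future inputs
        prompt = f"{prompt}\nPrevious answer: {output}\nFeedback: {feedback}\nRevised answer:"
        output = generate(prompt)
    return output

# Example usage with a canned one-shot critique standing in for a real user.
critiques = iter(["Too vague; mention the rocket."])
print(respond("Explain the moon landing to a 6 year old.",
              lambda out: next(critiques, None)))
```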
5 Improving Generation using Human Feedback Models
Directly using human feedback to improve model behavior is not feasible in the general case: Asking humans to provide feedback for every model output is both expensive and time-consuming.
5.1 Learning Models of Human Feedback
An alternative to collecting human feedback directly is to develop models that can predict or approximate it. Although these models may not be perfect, they can provide feedback at a low cost after training, enabling feedback-dependent techniques at scale.
Feedback modeling has been studied extensively in the context of metric learning for NLP. In MT, Sellam et al. (2020) and Rei et al. (2020a) trained BLEURT and COMET, respectively, to regress on human translation quality assessments. For summarization, Zopf (2018) leveraged annotated pairwise preferences to train a preference model, and Peyrard et al. (2017) learned a summary-level metric from a set of human judgments from older summarization datasets (e.g., TAC-2008). These metrics have been shown to correlate much better with human judgments than widely used lexical metrics such as BLEU and ROUGE (Freitag et al., 2022b). Notably, these reward models were not trained with the intent of improving generation directly, though some were later used for that purpose (§5.2).
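As a concrete (and heavily simplified) illustration of this kind of metric learning, the sketch below trains a small regressor to predict human quality scores from representations of the source and hypothesis. The encoder, data, and scores are stand-ins, not the actual BLEURT or COMET implementations.

```python
import torch
import torch.nn as nn

EMB_DIM = 16

def encode(text: str) -> torch.Tensor:
    """Hypothetical sentence encoder; real metrics use a pretrained LM encoder."""
    torch.manual_seed(abs(hash(text)) % (2**31))  # deterministic per text, for illustration
    return torch.randn(EMB_DIM)

# Small regressor mapping (source, hypothesis) features to a predicted quality score.
regressor = nn.Sequential(nn.Linear(2 * EMB_DIM, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(regressor.parameters(), lr=1e-3)

# (source, hypothesis, human quality score) triples; scores are illustrative.
data = [
    ("A melhor comida do mundo é a portuguesa.",
     "The best food in the world is Portuguese.", 0.95),
    ("A melhor comida do mundo é a portuguesa.",
     "The worst food in the world are Portuguese.", 0.30),
]

for epoch in range(100):
    for src, hyp, score in data:
        feats = torch.cat([encode(src), encode(hyp)])
        pred = regressor(feats).squeeze()
        loss = nn.functional.mse_loss(pred, torch.tensor(score))  # regress on the human judgment
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```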
Recently, there has been interest in developing feedback models directly for improving generation (Böhm et al., 2019; Ziegler et al., 2019). Initialized from either the target LM to be improved or a smaller model from the same family, the feedback model is finetuned on (collected) human feedback. This data is typically collected by asking annotators to provide feedback on outputs from an earlier version of the model being improved. It is also possible to first finetune the feedback model on naturally occurring implicit feedback, such as user interactions on sites like Reddit or StackOverflow, which greatly increases data size at the cost of noisier data.
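When the collected feedback takes the form of rankings, such feedback models are usually preference models trained with a pairwise loss. The sketch below shows a generic Bradley-Terry-style objective over (preferred, dispreferred) pairs; the scoring network, encoder, and data are illustrative placeholders rather than any specific published model.

```python
import torch
import torch.nn as nn

EMB_DIM = 16

def encode(text: str) -> torch.Tensor:
    """Hypothetical text encoder; in practice the preference model is an LM with a scalar head."""
    torch.manual_seed(abs(hash(text)) % (2**31))
    return torch.randn(EMB_DIM)

preference_model = nn.Linear(2 * EMB_DIM, 1)  # maps (input, output) features to a scalar score
optimizer = torch.optim.Adam(preference_model.parameters(), lr=1e-3)

# (input, preferred output, dispreferred output) triples derived from human rankings.
comparisons = [
    ("Explain the moon landing to a 6 year old",
     "People went to the moon in a big rocket and walked on it.",
     "The moon is a natural satellite of the Earth."),
]

for step in range(100):
    for x, chosen, rejected in comparisons:
        s_chosen = preference_model(torch.cat([encode(x), encode(chosen)]))
        s_rejected = preference_model(torch.cat([encode(x), encode(rejected)]))
        # -log sigmoid(s_chosen - s_rejected): minimized when the preferred output scores higher.
        loss = -nn.functional.logsigmoid(s_chosen - s_rejected).mean()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```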
Nguyen et al. (2022) train a preference model based on rankings on three human-designed objectives: whether the summary has an appropriate topic, length, and quality, combining these three into a single objective using a distance-based ranking loss. Interestingly, automatic post-editing (APE) systems in MT (e.g., Simard et al., 2007; Correia and Martins, 2019) can also be seen as feedback models (albeit non-numerical). Their aim is to automatically correct the output of an MT system and they are trained on human post-editions.
5.2 Leveraging Feedback Models to Improve Generation
After training a feedback model, we can use it almost exactly as we would use human feedback: either by leveraging this feedback model during training of the generation model, or by incorporating the feedback model during decoding.
5.2.1 Optimizing for Feedback Models
In terms of the underlying RL algorithm, PPO (Schulman et al., 2017) is by far the most used, and the one for which the tricks needed for success are most widely known (see Zheng et al., 2023). However, it is unclear how important a role the particular algorithm plays in the success of RLHF, and some have proposed alternative RL algorithms that claim better performance (Donato et al., 2022).
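To make the overall recipe concrete, below is a heavily simplified sketch of the RLHF loop: sample an output from the policy, score it with a (here hypothetical) feedback model, subtract a KL penalty toward the original model, and take a policy-gradient step. It uses a plain REINFORCE-style update rather than PPO, omits batching and the many stabilization tricks real systems need, and the model names and reward function are illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # illustrative stand-in for the policy being tuned
tokenizer = AutoTokenizer.from_pretrained(model_name)
policy = AutoModelForCausalLM.from_pretrained(model_name)
reference = AutoModelForCausalLM.from_pretrained(model_name)  # frozen copy for the KL term
reference.eval()
optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-6)
beta = 0.1  # strength of the KL regularization toward the reference model

def feedback_model_score(text: str) -> float:
    """Hypothetical reward/preference model (§5.1); toy proxy for illustration only."""
    return len(set(text.split())) / 20.0

prompt = "Explain the moon landing to a 6 year old:"
inputs = tokenizer(prompt, return_tensors="pt")

# Sample an output from the current policy.
sample = policy.generate(**inputs, do_sample=True, max_new_tokens=30,
                         pad_token_id=tokenizer.eos_token_id)
text = tokenizer.decode(sample[0], skip_special_tokens=True)

# Average log-probability of the sampled continuation under policy and reference
# (prompt tokens are masked out of the loss with the -100 convention).
labels = sample.clone()
labels[:, : inputs["input_ids"].shape[1]] = -100
logp_policy = -policy(sample, labels=labels).loss
with torch.no_grad():
    logp_ref = -reference(sample, labels=labels).loss

# KL-regularized reward, followed by a REINFORCE-style surrogate loss
# (PPO would instead clip probability ratios over multiple epochs).
reward = feedback_model_score(text) - beta * (logp_policy.detach() - logp_ref)
loss = -reward * logp_policy
loss.backward()
optimizer.step()
optimizer.zero_grad()
```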
Casper et al. (2023) identify several intrinsic limitations of RLHF, including human evaluation inconsistencies, the potential for feedback manipulation, trade-offs between feedback depth and efficiency, difficulties in capturing diverse human values in reward functions, and risks of reward hacking and policy deployment shortcomings.
Joint-feedback modeling with feedback models was explored by Korbak et al. (2023), who study pretraining LLMs with a loss similar to Equation 6, based on feedback from a preference model trained on ranking-based feedback for toxicity. They showed that this leads to models that produce less toxic generations than pretraining with vanilla MLE. Note that this differs from §5.1: there, the focus was on training feedback models from real human feedback, not on using them to train the generation model.
5.2.2 Decoding with Feedback Models
In machine translation, Fernandes et al. (2022) and Freitag et al. (2022a) use trained feedback models at decoding time, in a two-stage process of candidate generation followed by scoring with quality metrics learned from human judgments (Rei et al., 2020a, b). Top-rated candidates are selected using reranking or MBR decoding (Kumar and Byrne, 2002). Similarly, Li et al. (2022) improve a QA system by gathering numerical and natural language feedback, then refining a pretrained model on this feedback to rerank predictions. Works like Bhattacharyya et al. (2022) also demonstrate the effectiveness of enhancing machine translation outputs via automatic post-editing systems.
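The sketch below illustrates the reranking variant of this idea: sample N candidates, score each with a learned quality model, and keep the best one. The model name and the toy quality_score function are placeholders for a real generation model and a learned metric such as COMET; MBR decoding would instead compare candidates against each other under the metric.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # illustrative stand-in for the generation model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def quality_score(source: str, hypothesis: str) -> float:
    """Hypothetical learned quality metric trained on human judgments (§5.1); toy proxy only."""
    return -abs(len(hypothesis.split()) - len(source.split()))

def rerank_generate(source: str, n_candidates: int = 8) -> str:
    inputs = tokenizer(source, return_tensors="pt")
    outputs = model.generate(
        **inputs, do_sample=True, num_return_sequences=n_candidates,
        max_new_tokens=40, pad_token_id=tokenizer.eos_token_id,
    )
    candidates = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
    # N-best reranking: keep the candidate preferred by the feedback model.
    return max(candidates, key=lambda hyp: quality_score(source, hyp))

print(rerank_generate("Translate to English: A melhor comida do mundo é a portuguesa."))
```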
Feedback Model Overoptimization
One problem that arises when optimizing a system with a feedback model is that the model is an imperfect proxy for human feedback. Systems may therefore be optimized to score well under the feedback model without actually improving according to humans. This is known as the overoptimization problem, and it is the main reason for the regularization term in Equation 11. Gao et al. (2022) study overoptimization of preference models, both when optimizing against them with RL (training) and when reranking outputs with them (decoding), finding that both lead to similar levels of overoptimization.
5.3 Comparing the Effectiveness of Approaches
While numerous approaches have been proposed for incorporating human feedback, it is difficult to directly compare their relative effectiveness, as dealing with feedback introduces many additional experimental variables (the quality of the original generation model/policy, the feedback collection process, etc). Nevertheless, a few studies compare a subset of these approaches within a consistent experimental setup, allowing for some high-level conclusions to be drawn:
Among approaches that leverage feedback to optimize the model, RLHF seems to be the predominant technique for current SotA LLMs (Glaese et al., 2022; OpenAI, 2023a; Touvron et al., 2023), yet some works claim that simpler approaches (such as joint-feedback modeling directly with human preferences) can lead to comparable or better performance (Yuan et al., 2023; Rafailov et al., 2023).
It is also not clear whether optimizing the model with the feedback (model) is necessary to obtain the best performance: simply using the feedback model to rerank the model’s outputs can lead to comparable results (Gao et al., 2022).
6 Collecting and Using Human Feedback
Collecting human feedback can be expensive and presents pitfalls for the inexperienced. We highlight existing datasets, their collection methods, and considerations for those creating their own preference datasets. Annotator variability remains largely unexplored (Plank, 2022; Gehrmann et al., 2023), though evidence suggests that well-constructed annotation guidelines are necessary (Ziegler et al., 2019; Parmar et al., 2023) to avoid systematically biasing annotations away from the intended task.
In general, there is little discussion in the literature of how choices made in the feedback collection process impact the final generalization ability of the model, as this has not been studied in a controlled setting. There is, however, an appreciation for the importance of data quality: researchers make efforts to select annotators based on their agreement with gold labels, as well as on inter-annotator agreement (Stiennon et al., 2020; Bai et al., 2022a). Bai et al. (2022a) note that judging data quality is difficult for more open-ended forms of feedback such as dialogue, yet they were able to achieve good results without detailed data quality controls. The impact of collection methods on final results may be a direction for future research; for now, based on previous studies, we present considerations along different axes that experimenters should keep in mind when collecting data:
Annotator expertise: Depending on the task and the training given (Snow et al., 2008; Sheng et al., 2008; Clark et al., 2021; Gillick and Liu, 2010; Freitag et al., 2021), annotators can range from domain experts to crowdworkers or even models. Expert feedback is usually more reliable but considerably more expensive (due to recruitment difficulty) (Kulkarni et al., 2014). In many tasks, such as general-purpose translation or summarization, crowdworkers and models can be sufficient and, given the right instructions, can even mimic expert opinions (Moore et al., 2020).6
Length of engagement: Involves one-time or long-term collaborations with annotators, with preference datasets often involving extended partnerships (Stiennon et al., 2020; Bai et al., 2022a; Freitag et al., 2021).
Type of feedback: Existing datasets generally use rankings or scores, but future work may investigate the value of more detailed feedback, which humans usually prefer to provide (Stumpf et al., 2007; Amershi et al., 2014; Ghai et al., 2021).
Collection method: Data can be gathered explicitly through experiments or implicitly from online sources and user interactions, with varying levels of noise (Kreutzer et al., 2018; Freitag et al., 2021). When annotating explicitly, the choice of feedback type generally influences the mode of collection: surveys are generally oriented toward numerical or ranking-based feedback, while user studies or interviews are used to collect more detailed, open-ended feedback.
Collection platform: Platforms include Amazon Mechanical Turk, Upwork, and Scale AI, and, when the collection is implicit, discussion platforms where human interactions and preferences emerge organically. Alternatively, researchers may collect their own feedback through online forms or interfaces and recruit participants from their own institution.
Annotator demographics: Annotator identities can influence labeling; in some cases, demographics are collected during data collection (Sap et al., 2022; Ding et al., 2022).
Note that some of these dimensions are shared more broadly across various tasks that involve humans in the loop, including human evaluation (Gehrmann et al., 2023; Liao and Varshney, 2021), interactive model debugging (Lertvittayakumjorn and Toni, 2021), and data collection (Suhr et al., 2021). For example, human evaluation of text generation can sometimes be viewed as a form of preference collection: hosted on crowdsourcing platforms, acquired from non-experts, and collected as ranking feedback (e.g., whether the text reads better than that of a baseline generator). In our enumeration above, we mostly focused on how these dimensions are implemented specifically in the context of feedback collection, and leave cross-comparison with other human-in-the-loop approaches to the reader (Wang et al., 2021). Table 2 shows some existing human feedback datasets.
Table 2: Examples of existing human feedback datasets.

| Task | Dataset | Collection method | Platform | Feedback type |
| --- | --- | --- | --- | --- |
| Language assistant | HH-RLHF (Bai et al., 2022a; Perez et al., 2022a) | Explicit | Upwork, MTurk | Ranking |
| Language assistant | SHP (Ethayarajh et al., 2023) | Implicit | Scraped from Reddit | Ranking/Score |
| Summarization | summarize-from-feedback (Stiennon et al., 2020) | Explicit | Upwork | Ranking |
| Question Answering | FeedbackQA (Li et al., 2022) | Explicit | MTurk | Score, NL |
| Question Answering | StackExchange (Lambert et al., 2023) | Implicit | StackOverflow | Ranking |
| Translation | WMT Metrics Shared Task (Freitag et al., 2022b) | Explicit | Pro translation workflow | MQM, DA |
Variance in Judgment
Consider K annotators with feedback functions f1, …, fK, each providing judgments on the same data. Inter-rater reliability metrics, such as Cohen’s Kappa, Fleiss’ Kappa, or Krippendorff’s alpha, can assess annotator agreement (Hayes and Krippendorff, 2007; Fleiss, 1971; Cohen, 1960). Low reliability may result from unclear tasks or evaluation criteria (Gehrmann et al., 2023; Thomson and Reiter, 2021), underqualified annotators, inherent subjectivity, or multiple plausible interpretations (Plank, 2022; Nie et al., 2020; Gordon et al., 2022). Mitigation strategies include viewing humans as making noisily rational choices (Ghosal et al., 2023), learning the reliability level of feedback from multiple humans (Yamagata et al., 2021), augmenting evaluation metrics like COMET with confidence intervals (Glushkova et al., 2021; Zerva et al., 2022), and requiring annotators to justify their rankings (Ziegler et al., 2019).
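As a small worked example of the agreement statistics mentioned above, the snippet below computes Cohen's kappa for two annotators who labeled the same eight outputs as acceptable (1) or not (0), both with scikit-learn and by hand; the labels are illustrative.

```python
from sklearn.metrics import cohen_kappa_score

annotator_a = [1, 0, 1, 1, 0, 1, 0, 0]  # acceptable (1) / not acceptable (0)
annotator_b = [1, 0, 1, 0, 0, 1, 1, 0]

print(cohen_kappa_score(annotator_a, annotator_b))  # library implementation

# The same quantity by hand: kappa = (p_o - p_e) / (1 - p_e), where p_o is the
# observed agreement and p_e the agreement expected by chance.
n = len(annotator_a)
p_o = sum(a == b for a, b in zip(annotator_a, annotator_b)) / n
p_e = sum((annotator_a.count(c) / n) * (annotator_b.count(c) / n) for c in {0, 1})
print((p_o - p_e) / (1 - p_e))  # ~0 means chance-level agreement, 1 is perfect
```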
(In)experienced Annotators
There is generally a trade-off between the effort needed to create the datasets and the reliability of the judgments collected. While some have claimed that a small number of crowdworkers can replace a domain expert in certain tasks such as affect recognition, recognizing textual entailment, or word-sense disambiguation (Snow et al., 2008; Sheng et al., 2008), this heavily depends on the task. Untrained crowdworkers may rely on surface heuristics to evaluate text (Clark et al., 2021; Gillick and Liu, 2010), and one comparison of MT model evaluations performed by expert translators and crowdworkers found that low agreement between the groups led to a completely different ranking of the models, with crowdworker evaluation being less reliable than automatic embedding-based metrics (Freitag et al., 2021). Generally, high-stakes applications or applications dependent on specific linguistic or specialized domain knowledge may need to rely on feedback from human experts, and extended partnerships with annotators can provide consistency of annotations. Crowdworkers or AI feedback may be acceptable substitutes in other situations; for general alignment with human preferences, it may instead be prudent to recruit a large and diverse group of annotators to avoid overfitting to the preferences of specific annotators or demographics. As the difficulty of tasks increases, it may become more difficult for non-experts to provide feedback, and evaluation of difficult tasks such as writing secure code may require designing feedback methods that incorporate human-AI teams, or rigorous criteria for evaluating feedback (Saunders et al., 2022; Perry et al., 2022; Bowman et al., 2022).
Subjectivity in Judgment
Some subjectivity in annotator judgment can arise from differences in cultural or social groups (Santurkar et al., 2023). Several works observe that tuning with human feedback increases the model’s alignment with US liberal views on controversial topics (Perez et al., 2022b; Hartmann et al., 2023). Annotators with different backgrounds may disagree on what qualifies as toxic content (Sap et al., 2022; Ding et al., 2022). This is pronounced when annotators are asked to make ethical judgments (Jiang et al., 2022; Talat et al., 2022). Some work has critiqued the idea of a unified human preference (Prabhakaran et al., 2021; Casper et al., 2023), suggesting that some variance in judgment is both expected and potentially useful signal.
Biases in Judgment
Presenting annotators with a single text in isolation can lead them to overlook superior alternatives and mistakenly mark the text as high-quality (Bansal et al., 2021). When annotators generate or edit text, anchoring bias from the suggestions or corrections they are shown can influence their writing style (Jakesch et al., 2023; Lehmann et al., 2022). Fundamentally, there may be a difference between what is correct and what humans want to hear, which may lead models to imitate persuasive behavior; humans may also rate an output more highly simply because it “feels familiar” (Hasher et al., 1977; Griffin et al., 2023). Mitigation strategies include ranking diverse outputs and defining explicit evaluation criteria.
Ethical Considerations
Prolonged content moderation tasks can be harmful (Steiger et al., 2021). Tasks involving toxicity classification and generation from open-ended inputs may particularly affect annotators (Shmueli et al., 2021). Media attention has focused on fair pay for annotators, with one TIME article7 describing annotators paid $2/hr or less to provide annotations of toxic content for RLHF datasets. Research on crowdsourcing (Shmueli et al., 2021; Rothschild et al., 2022; Soratana et al., 2022; Toxtli et al., 2021; Hornuf and Vrankar, 2022) cautions that inadequate pay, especially in lower-resourced regions, is a form of worker exploitation.
7 AI Feedback
Feedback models have been crucial in advancing generation techniques by effectively leveraging feedback. However, they are heavily reliant on human input: For example, Gao et al. (2022) found that, across various preference model sizes, using fewer than 1,000 comparisons resulted in only chance-level improvements. Moreover, static feedback can become inconsistent as training shifts the model’s output distribution. AI-generated feedback, an emerging research area, focuses on harnessing the LLM’s own abilities to enhance the model without human intervention. Two primary approaches have emerged in this domain:
Self AI Feedback
The first approach involves using the same model to provide feedback and improve its output. In this scenario, the model engages in a continuous self-improvement process, learning from its evaluations and refining its capabilities accordingly. Examples of this approach include prompting models to generate harmful responses and revising them for harmlessness (Bai et al., 2022b), or employing rule-based reward models for RLHF fine-tuning (OpenAI, 2023a). Techniques such as iterative output revision through few-shot prompting (Peng et al., 2023; Shinn et al., 2023; Chen et al., 2023; Paul et al., 2023; Madaan et al., 2023; Yang et al., 2022) have been explored using LLMs like GPT-3.5 (Ouyang et al., 2022) and GPT-4 (OpenAI, 2023a). Notably, these techniques demonstrate potential when applied to LLMs trained to adhere to human instructions and align outputs with human preferences. This suggests that incorporating human feedback during training equips AI models to comprehend task requirements better, align outputs with directives, and function as dependable feedback mechanisms, thereby minimizing human intervention. The capacity to offer valuable AI feedback may depend on the model being trained with human feedback.
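A minimal sketch of such a self-feedback loop is shown below, in the spirit of self-refinement and principle-based revision: the same model is prompted first to critique its own draft against a principle and then to revise it. The llm placeholder, the principle text, and the prompts are illustrative assumptions, not any system's actual implementation.

```python
def llm(prompt: str) -> str:
    """Placeholder for an instruction-following LLM (an API call or local model)."""
    # For illustration only: always approves; a real model would return critiques/revisions.
    return "OK" if "Point out" in prompt else "[model response]"

PRINCIPLE = "The response must be helpful and must not contain harmful content."

def self_refine(task: str, max_rounds: int = 2) -> str:
    draft = llm(task)
    for _ in range(max_rounds):
        critique = llm(
            f"Principle: {PRINCIPLE}\nTask: {task}\nDraft: {draft}\n"
            "Point out any way the draft violates the principle, or reply 'OK'."
        )
        if critique.strip() == "OK":
            break  # the model judges its own draft acceptable
        draft = llm(
            f"Task: {task}\nDraft: {draft}\nCritique: {critique}\n"
            "Rewrite the draft to address the critique."
        )
    return draft

print(self_refine("Explain the moon landing to a 6 year old."))
```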
External AI Feedback
This approach utilizes a separate feedback model to critique the task model’s outputs (Yasunaga and Liang, 2020; Madaan et al., 2021; Welleck et al., 2022; Bai et al., 2022b; Akyürek et al., 2023). A key advantage is that the feedback model need not be large or general-purpose, making smaller feedback models an appealing option when abundant feedback is available.
8 Conclusion
Recent developments in large language models have emphasized the need for human feedback to ensure that models have desirable behavior and generate helpful and harmless text. We provide an overview of a recent line of research on leveraging (human) feedback to improve natural language generation. Despite the relative infancy of this field, several important observations emerge when considering existing works:
Current models often underutilize more expressive forms of feedback such as natural language, favoring ranking-based or numerical feedback.
A trade-off exists between effort spent creating datasets and the reliability of judgments. Enlisting expert and diverse annotators can be beneficial for high-stakes applications.
The value of leveraging feedback lies primarily in the feedback itself rather than the specific method. While Reinforcement Learning from Human Feedback (RLHF) has been popular, other methods report notable improvements and might be simpler to apply (Gao et al., 2022; Rafailov et al., 2023; Zhou et al., 2023; Zhao et al., 2023). However, large-scale comparative analysis remains necessary.
Acknowledgments
This work was supported by EU’s Horizon Europe Research and Innovation Actions (UTTER, contract 101070631), by the projects DECOLLAGE (ERC-2022-CoG 101088763); Project Center for Responsible AI, grant 62, under PRR funded by NextGenerationEU (2022-C05i0102-02), and by a grant from the Singapore Defence Science and Technology Agency.
Notes
This is sometimes called Goodhart’s law: “when a measure becomes a target, it ceases to be a good measure” (Goodhart, 1984).
Although feedback can be provided independently of the input (for example for fluency), we assume some (potentially empty) input for simplicity of notation.
We mostly ignore the proposed honesty aspect, as none of these works tackle this directly.
We specify the feedback model with respect to the human feedback format, i.e., reward and preference model for numerical and ranking-based human feedback, respectively.
Note that this KL term is different from other algorithm-specific regularization terms, such as the KL terms in PPO (Schulman et al., 2017).
In some cases, when data is collected from user interaction or mined from existing data sources, it may not be possible to control for expertise of annotators.
References
Author notes
Action Editor: Minlie Huang