Abstract
While large language models (LLMs) have shown remarkable effectiveness in various NLP tasks, they are still prone to issues such as hallucination, unfaithful reasoning, and toxicity. A promising approach to rectify these flaws is correcting LLMs with feedback, where the LLM itself is prompted or guided with feedback to fix problems in its own output. Techniques leveraging automated feedback—either produced by the LLM itself (self-correction) or some external system—are of particular interest as they make LLM-based solutions more practical and deployable with minimal human intervention. This paper provides an exhaustive review of the recent advances in correcting LLMs with automated feedback, categorizing them into training-time, generation-time, and post-hoc approaches. We also identify potential challenges and future directions in this emerging field.
1 Introduction
Recent years have seen striking empirical successes of large language models (LLMs), as they consistently obtain impressive results across a diverse range of NLP benchmarks (Guo et al., 2023; Suzgun et al., 2023; Qin et al., 2023), while also showcasing surprising abilities of language understanding (Wei et al., 2022a; Begus et al., 2023), generation (Pu and Demberg, 2023; Lin and Chen, 2023; Lyu et al., 2023a), and reasoning (Wei et al., 2022b; Kojima et al., 2022; Dasgupta et al., 2022). However, these models are not without their flaws. LLMs are observed to intermittently display undesired and inconsistent behaviors such as producing seemingly convincing but inaccurate “hallucinations” (Lin et al., 2022; Zhang et al., 2023c; Min et al., 2023), conducting unfaithful reasoning (Golovneva et al., 2023; Lyu et al., 2023b; Wu et al., 2023b), generating inappropriate or harmful content (Gehman et al., 2020; Levy et al., 2021, 2022; Shaikh et al., 2023), and failing to faithfully follow rules and constraints (Zhuo et al., 2023; Wang et al., 2023a). Such flawed behaviors hamper trust in LLMs and pose hurdles to their real-world applications (OpenAI, 2023).
A prevailing strategy to rectify these undesired behaviors of LLMs is learning from feedback, mirroring a typical human learning strategy where individuals actively refine their behaviors through a cycle of trial, error, and correction. Humans, when making mistakes, often gather feedback either from others or through self-reflection (Boyd and Fales, 1983; Metcalfe, 2017; Ferretti et al., 2019; London et al., 2023; Bellhäuser et al., 2023). Such feedback offers valuable insights for humans to correct mistakes and modify their behavior accordingly. Inspired by this natural learning mechanism, extensive research (Huang et al., 2022; Madaan et al., 2023; Gero et al., 2023; Jiang et al., 2023) has been undertaken to improve LLMs through the paradigm of learning from both internal and external feedback.
One popular line of research involves the use of human feedback to evaluate and refine models, as encapsulated in the survey by Fernandes et al. (2023). These methods typically involve direct optimization of LLMs against human feedback on their outputs (Kreutzer et al., 2018; Glaese et al., 2022; Ouyang et al., 2022; Scheurer et al., 2023), where human evaluations of output quality serve as a reward signal to improve model performance. However, this approach has two primary drawbacks: It can be costly due to the manual labor involved, and it lacks real-time capabilities as humans cannot provide instant feedback.
To minimize the need for human intervention, another strategy is correcting LLMs with automated feedback. As illustrated by the conceptual framework in Figure 1, the language model (iteratively) learns from automatically generated feedback signals to understand the consequences of its actions and adapts its behaviors. The source of automated feedback can be multifaceted, spanning from the LLM itself acting as the feedback model (Madaan et al., 2023; Schick et al., 2023), a separately trained feedback model (Yang et al., 2022b; Paul et al., 2023), readily available external tools (Gou et al., 2023; Chen et al., 2023e), to external knowledge sources such as Wikipedia or the internet (Yu et al., 2023; Li et al., 2023b). Various strategies of correction have been proposed, including self-training (Huang et al., 2022; Bai et al., 2022b), generate-then-rank (He et al., 2023; Weng et al., 2023), feedback-guided decoding (Yang et al., 2022a; Xie et al., 2023), iterative post-hoc revision (Zhang et al., 2023a; Jiang et al., 2023), etc. Recently, the incorporation of such strategies has demonstrated their effectiveness across a myriad of tasks, from question answering (Peng et al., 2023) and reasoning (Pan et al., 2023) to code generation (Zhang et al., 2023b) and toxicity detection (Lu et al., 2022).
In light of these advancements, our paper aims to provide a comprehensive survey. We start by establishing the concept of correcting LLMs with automated feedback and creating a taxonomy of the different methods (§ 2). We then discuss the major techniques (§ 3), categorized as training-time, generation-time, and post-hoc correction. Finally, we discuss the connection to earlier works (§ 4) and five potential future directions (§ 5).
2 Conceptual Framework
For clean exposition, we first present a conceptual framework outlining the overall process of correcting LLMs with feedback in Figure 1, using an analogy of medical treatment in our daily life. Three parties are involved in this process:
Language Model (Patient). A language model ℳ performs a specific task by mapping an input x to an output text ŷ = ℳ(x). This formulation encompasses a wide range of NLP tasks: in summarization, x is a passage and ŷ is the generated summary; in question answering, x is a question and ŷ is the predicted answer. The initial generation ŷ may have problems such as hallucination and incorrect reasoning.
Critic Model (Doctor & Diagnosis). A critic model 𝒞 learns to generate feedback c = 𝒞(x, ŷ), where ŷ is the output or partial output of the language model, and c is feedback in some format, e.g., a scalar value or natural language. A simple example is binary feedback indicating whether the output is good or bad given the input (c ∈ {0, 1}).
Refine Model (Treatment). A refine model ℛ learns to repair an output based on the feedback: y_new = ℛ(x, ŷ, c), where y_new is the revised output. Some refine models directly repair the language model ℳ through fine-tuning or reinforcement learning.
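To make the three-party formulation concrete, the following sketch shows one pass of the generate–critique–refine loop. It is purely illustrative: the callables `generate`, `critique`, and `refine` stand in for the language model ℳ, critic model 𝒞, and refine model ℛ, and their signatures are assumptions rather than any specific system's API.

```python
from typing import Callable, Tuple

def correction_loop(
    x: str,
    generate: Callable[[str], str],                      # language model M: input -> output
    critique: Callable[[str, str], Tuple[float, str]],   # critic C: (input, output) -> (score, feedback)
    refine: Callable[[str, str, str], str],              # refine model R: (input, output, feedback) -> new output
    threshold: float = 0.9,
    max_iters: int = 3,
) -> str:
    """One illustrative instantiation of the M / C / R framework in Figure 1."""
    y = generate(x)                       # initial (possibly flawed) output
    for _ in range(max_iters):
        score, feedback = critique(x, y)  # automated feedback on the current output
        if score >= threshold:            # feedback deems the output good enough
            break
        y = refine(x, y, feedback)        # repair the output using the feedback
    return y
```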
Based on the above formulation, the specific model design in existing works varies along five crucial axes, elaborated in the following sections.
2.1 What Gets Corrected?
We summarize the three major error types of LLMs that are targeted for correction in existing works through automated feedback.
Hallucination. An open challenge for LLMs is that they often hallucinate by making up facts or citing sources that do not exist (Li et al., 2023a; Zhang et al., 2023c). This hallucinated content is often quite plausible-sounding, making it difficult even for humans to detect (Clark et al., 2021). To address this, several studies have proposed the collection of automated feedback on potential factual inaccuracies by cross-referencing the generated output with credible knowledge sources. The gathered feedback can then be utilized by a subsequent refinement model to correct hallucinations (Gao et al., 2023b; Peng et al., 2023).
Unfaithful Reasoning. A number of recent studies (Ribeiro et al., 2023; Lyu et al., 2023b; Golovneva et al., 2023) found that LLMs occasionally make unfaithful reasoning, i.e., the derived conclusion does not follow the previously generated reasoning chain. To address this, existing works have used automated feedback from external tools or models for guiding the reasoning process (Xie et al., 2023; Yao et al., 2023a), verifying the reasoning process and rectifying errors (He et al., 2023; Pan et al., 2023), or fine-tuning LLMs with process-based feedback (Huang et al., 2022; Lightman et al., 2023).
Toxic, Biased, and Harmful Content. LLMs have been observed to occasionally generate content that is toxic, biased, or harmful due to biases present in the training data (Shaikh et al., 2023). To rectify this, reinforcement learning from human feedback (RLHF) (Ouyang et al., 2022; Bai et al., 2022a) has been extensively employed to train LLMs to align more closely with human values, such as being helpful, honest, and harmless. However, RLHF is heavily dependent on high-quality human feedback, the collection of which can be resource-intensive. To alleviate this, recent work (Lu et al., 2022; Gou et al., 2023) has also explored collecting automated feedback to identify and correct potentially harmful outputs.
2.2 What Is the Source of the Feedback?
Feedback can be broadly divided into human feedback and automated feedback. Fernandes et al. (2023) provided a survey on integrating human feedback for language generation. In our survey, we focus on the emerging research area of automated feedback, which typically originates from two sources: self-feedback (i.e., the feedback originates from the LLM itself) and external feedback (i.e., the feedback is derived from external models, tools, or knowledge sources).
Self-Feedback. The LLM can act as its own feedback provider by iteratively assessing and refining its generated outputs until it meets a certain standard (Madaan et al., 2023; Shinn et al., 2023). This continuous self-improvement strategy has proven effective in multiple studies, especially when external feedback is unavailable or limited (Ye et al., 2023; Yan et al., 2023).
External Feedback for LLMs comes from other models (Yang et al., 2022b; Lightman et al., 2023), tools (Gou et al., 2023; Charalambous et al., 2023), knowledge sources (Gao et al., 2023b; Yu et al., 2023), and evaluation metrics (Jung et al., 2022; Welleck et al., 2023). External feedback provides a valuable outside perspective for identifying errors that the LLM cannot recognize on its own. For example, code interpreters are widely used in programming tasks to provide real-time error messages, while external knowledge sources are used to verify factual accuracy.
2.3 What Is the Format of the Feedback?
The selection of the feedback format requires considering its expressivity, ease of collection, and its potential to improve systems (Fernandes et al., 2023). Automated feedback is commonly either a scalar value or in natural language.
Scalar Value Feedback. In this scenario, the critic model maps the input and output to a single score (c ∈ ℝ). Scalar value feedback can be easily integrated into the training/decoding process of LLMs. For example, Xie et al. (2023) use real-valued feedback for each intermediate reasoning step to guide the model in performing a stochastic beam search for the optimal solution. Despite its flexibility, scalar feedback is less descriptive for detailed corrections.
Natural Language Feedback provides richer information that can highlight specific errors and offer nuanced suggestions for improvement. This is important for certain applications such as text editing and code generation. For example, Self-Debug (Chen et al., 2023e) uses LLMs to generate explanations for the produced code and utilizes both the explanation and the execution results as feedback to enhance coding solutions.
2.4 When to Correct the Model?
Depending on the timing of using automated feedback to correct the model, existing work can be divided into three major categories.
Training-time Correction. The ideal scenario is to rectify a flawed model during training, prior to its deployment. Once feedback has been collected, it is directly used to optimize the model parameters. Human feedback is typically used for training-time correction, as exemplified by the widely adopted RLHF approach (Ouyang et al., 2022). For leveraging automated feedback, a common strategy is self-training (Huang et al., 2022), where the model is trained on its own high-quality generated outputs, as selected by the critic model. However, the practical application of training-time correction may be hindered by the infeasibility of fine-tuning giant closed-source LLMs, such as GPT-4 (OpenAI, 2023), and the potential unavailability of feedback during model training.
Generation-time Correction. This approach utilizes automated feedback to guide the LLM in correcting errors during generation. For example, in proof generation, several studies use automated feedback on intermediate reasoning steps to guide the model to recover from incorrect generations and search for the optimal solution more efficiently (Yang et al., 2022a; Lightman et al., 2023).
Post-hoc Correction. It refines the model output after it has been generated, without updating the model parameters. This typically involves an iterative process of generating output, receiving feedback, and refining output. Post-hoc correction is more flexible as it does not require training the LLM or accessing its parameters. Furthermore, it facilitates the incorporation of more informative natural language feedback, offering a more transparent and explainable self-correction process.
2.5 How to Correct the Model with Feedback?
Various concrete strategies have been proposed to correct LLMs with automated feedback, which are tailored to the different dimensions we mentioned in previous sections. For example, self-training is often used for training-time correction. Generate-then-rank often comes with scalar value feedback. We will cover the comprehensive landscape of self-correction strategies in Section 3.
2.6 Summary of Existing Work
Building upon the taxonomy established in the preceding sections, we collate existing work in Table 1 and Table 2. We have three major selection criteria for a work to be included in this survey:
Method | Feedback Source | Feedback Format | Refinement Strategy | Learning | Application
---|---|---|---|---|---
Training-Time Correction | |||||
RLHF (Ouyang et al., 2022) | Reward Model | Scalar | RLHF | RL | Multiple Tasks |
Fine-Grained RLHF (Wu et al., 2023a) | Reward Model | Scalar | RLHF | RL | Detoxification, Long-form QA |
HH-RLHF (Bai et al., 2022a) | Reward Model | Scalar | RLHF | SL & RL | Helpfulness, Harmlessness |
Moral RLHF (Ganguli et al., 2023) | Reward Model | Scalar | RLHF | RL | Moral Correction |
Sparrow (Glaese et al., 2022) | Reward Model | NL | RLHF | SL & RL | Dialogue |
ILF (Scheurer et al., 2023) | Human Feedback | NL | Fine-tuning | SL | Summarization |
ILF-Code (Chen et al., 2023a) | Human Feedback | NL | Fine-tuning | SL | Code Generation |
SLT (Yuan et al., 2023) | Human Feedback | NL | Fine-tuning | SL | Response Generation |
Chain-of-Hindsight (Liu et al., 2023a) | Human Feedback | NL | Fine-tuning | SL | Multiple Tasks |
Crystal (Liu et al., 2023b) | Language Model | Scalar | Fine-Tuning | SL & RL | Commonsense Reasoning |
STaR (Zelikman et al., 2022) | Language Model | NL | Self-Training | SL | QA, Reasoning |
RLAIF (Bai et al., 2022b) | Language Model | NL | Self-Training | SL & RL | Dialogue |
SIRLC (Pang et al., 2023) | Language Model | NL | Self-Training | RL | Reasoning, Translation, Summary |
Self-Improve (Huang et al., 2022) | Language Model | NL | Self-Training | SL | QA, Reasoning, NLI |
AlpacaFarm (Dubois et al., 2023) | Language Model | NL | Self-Training | SL & RL | None (Intrinsic Evaluation) |
ReST (Gulcehre et al., 2023) | Language Model | NL | Self-Training | RL | Machine Translation |
Generation-Time Correction | |||||
Self-Verification (Weng et al., 2023) | Language Model | Scalar | Re-Ranking | ICL | Arithmetic Reasoning |
CodeT (Chen et al., 2023b) | Program Executor | Scalar | Re-Ranking | ICL | Code Generation |
LEVER (Ni et al., 2023) | Program Executor | Scalar | Re-Ranking | SL | Table QA, Math QA, Program |
RR (He et al., 2023) | External Knowledge | Scalar | Re-Ranking | — | Reasoning |
InstructScore (Xu et al., 2023) | Language Model | NL | Re-Ranking | SL | Generation Evaluation |
MBR Decoding (Freitag et al., 2022) | External Metrics | Scalar | Re-Ranking | SL | Machine Translation |
DIVERSE (Li et al., 2023d) | Trained Model | Scalar | Re-Ranking | SL | Arithmetic Reasoning |
PRM (Lightman et al., 2023) | Reward Model | Scalar | Feedback-guided | SL | Arithmetic Reasoning |
DiffusionLM (Li et al., 2022) | Trained Model | Scalar | Feedback-guided | SL | Controlled Text Generation |
Fudge (Yang and Klein, 2021) | Trained Model | Scalar | Feedback-guided | SL | Controlled Text Generation |
Entailer (Tafjord et al., 2022) | Trained Model | Scalar | Feedback-guided | SL | Proof Generation |
NLProofS (Yang et al., 2022a) | Trained Model | Scalar | Feedback-guided | SL | Proof Generation |
GRACE (Khalifa et al., 2023) | Trained Model | Scalar | Feedback-guided | SL | Arithmetic Reasoning |
CoRe (Zhu et al., 2023) | Trained Model | Scalar | Feedback-guided | SL | Arithmetic Reasoning |
Varshney et al. (2023) | External Knowledge | NL | Feedback-guided | ICL | Hallucination Detection |
MemPrompt (Madaan et al., 2022) | External Knowledge | NL | Feedback-guided | ICL | Lexical and Ethical Reasoning |
Maieutic Prompting (Jung et al., 2022) | External Metrics | Scalar | Feedback-guided | ICL | Commonsense Reasoning |
SI (Creswell and Shanahan, 2022) | Language Model | Scalar | Feedback-guided | ICL | Proof Generation |
RAP (Hao et al., 2023) | Language Model | Scalar | Feedback-guided | ICL | Planning, Reasoning |
SelfEval-Decoding (Xie et al., 2023) | Language Model | Scalar | Feedback-guided | ICL | Arithmetic / Symbolic Reasoning |
SelfCheck (Miao et al., 2023) | Language Model | NL | Feedback-guided | ICL | Arithmetic Reasoning |
Tree of Thoughts (Yao et al., 2023a) | Language Model | NL / Scalar | Feedback-guided | ICL | Games, Writing |
Method | Feedback Source | Feedback Format | Refinement Strategy | Learning | Iterative | Application
---|---|---|---|---|---|---
Post-hoc Correction | ||||||
Self-Refine (Madaan et al., 2023) | Language Model | NL | Self-Refine | ICL | ✓ | Multiple Tasks |
Clinical SV (Gero et al., 2023) | Language Model | NL | Self-Refine | ICL | ✗ | Information Extraction |
Reflexion (Shinn et al., 2023) | Language Model | NL | Self-Refine | RL | ✓ | QA, Code Generation |
IterRefinement (Chen et al., 2023d) | Language Model | NL | Self-Refine | ICL | ✓ | Machine Translation |
Auto-Post-Editing (Raunak et al., 2023) | Language Model | NL | Self-Refine | ICL | ✗ | Machine Translation |
RCI (Kim et al., 2023) | Language Model | NL | Self-Refine | ICL | ✓ | Computer Tasks |
SelFee (Ye et al., 2023) | Language Model | NL | Self-Refine | SL | ✓ | Dialogue |
SelfCheckGPT (Manakul et al., 2023) | Language Model | NL | Self-Refine | ICL | ✗ | Hallucination Detection |
LLM Self Defense (Helbling et al., 2023) | Language Model | NL | Self-Refine | ICL | ✗ | Harmful Text Correction |
Re3 (Yang et al., 2022b) | Trained Model | Scalar | External Feedback | SL & ICL | ✓ | Story Generation |
CodeRL (Le et al., 2022) | Trained Model | Scalar | External Feedback | RL | ✗ | Code Generation |
FLIRT (Mehrabi et al., 2023) | Trained Model | Scalar | External Feedback | ICL | ✓ | Adversarial Prompt Generation |
REFINER (Paul et al., 2023) | Trained Model | NL | External Feedback | SL & ICL | ✓ | Reasoning, Moral Story |
RL4F (Akyürek et al., 2023) | Trained Model | NL | External Feedback | SL & RL | ✓ | Planning, Summarization |
Yan et al. (2023) | Trained Model | NL | External Feedback | SL | ✓ | Semantic Parsing |
Baldur (First et al., 2023) | Trained Model | NL | External Feedback | ICL | ✓ | Proof Generation |
CRITIC (Gou et al., 2023) | External Tools | NL | External Feedback | ICL | ✓ | QA, Program, Toxicity |
FacTool (Chern et al., 2023) | External Tools | NL | External Feedback | ICL | ✓ | QA, Reasoning, Generation |
MAF (Nathani et al., 2023) | External Tools | NL | External Feedback | ICL | ✓ | QA, Reasoning |
RARR (Gao et al., 2023b) | External Knowledge | NL | External Feedback | ICL | ✗ | Open-Domain QA |
LLM-Augmenter (Peng et al., 2023) | External Knowledge | NL | External Feedback | RL | ✓ | Open-Domain QA |
Self-Checker (Li et al., 2023b) | External Knowledge | NL | External Feedback | ICL | ✗ | Fact-Checking |
REFEED (Yu et al., 2023) | External Knowledge | NL | External Feedback | ICL | ✗ | QA, Dialogue |
Olausson et al. (2023) | Program Executor | NL | External Feedback | ICL | ✓ | Code Generation |
Self-Edit (Zhang et al., 2023a) | Program Executor | NL | External Feedback | ICL | ✓ | Code Generation |
Self-Debug (Chen et al., 2023e) | Program Executor | NL | External Feedback | ICL | ✓ | Code Generation |
Self-Evolve (Jiang et al., 2023) | Program Executor | NL | External Feedback | ICL | ✓ | Code Generation |
Logic-LM (Pan et al., 2023) | Symbolic Solver | NL | External Feedback | ICL | ✓ | Logical Reasoning |
Self-Critique (Saunders et al., 2022) | LLMs + Human | NL | External Feedback | SL | ✗ | Summarization |
ALGO (Zhang et al., 2023b) | Oracle Verifier | Scalar | External Feedback | ICL | ✓ | Code Generation |
Charalambous et al. (2023) | BMC Tool | NL | External Feedback | ICL | ✗ | Software Verification |
Self-Correction (Welleck et al., 2023) | External Metrics | NL / Scalar | External Feedback | SL | ✓ | Reasoning, Generation, Toxicity |
Multiagent Debate (Du et al., 2023) | Language Model | NL | Model Debate | ICL | ✓ | Reasoning, Factuality |
LM vs LM (Cohen et al., 2023) | Language Model | NL | Model Debate | ICL | ✓ | Factual Error Detection |
ICL-AIF (Fu et al., 2023) | Language Model | NL | Model Debate | ICL | ✓ | Bargaining Game |
PRD (Li et al., 2023c) | Language Model | NL | Model Debate | ICL | ✓ | Open-ended QA |
MADRA (Wang et al., 2023b) | Language Model | NL | Model Debate | ICL | ✓ | QA, Fact-Checking |
ReConcile (Chen et al., 2023c) | Language Model | NL | Model Debate | ICL | ✓ | Reasoning |
1. Automated Feedback: Explicit feedback is involved to assess the quality of the model output. We focus on automated feedback that originates from external models, metrics, knowledge, etc. However, we will cover some representative works of human feedback for completeness.
2. Model Refinement: The feedback should act as a directive to enhance the LLM, either by: 1) updating model parameters, or 2) altering the model’s output during or after generation.
3. Large Language Model: We primarily focus on automated correction strategies in the era of modern large language models. Given this focus, we mainly emphasize very recent work from 2022 and 2023. However, it is important to acknowledge that the concept of automated correction is not new and has roots in early NLP research. To provide a complete historical perspective, we provide a succinct overview of these initial approaches to automated correction in Section 4.1.
These studies are categorized based on the three strategies introduced in Section 2.4. We also summarize key features of each study, including: 1) the source of feedback, 2) the format of feedback, 3) the strategy and learning method employed for the refinement, 4) whether the refinement process is iterative, and 5) the application of the method.
3 Methodologies
In this section, we delve into a detailed review of various correction methodologies. Depending on the time that the correction happens, we categorize them as Training-Time Correction, Generation-Time Correction, and Post-hoc Correction.
3.1 Training-Time Correction
Training-time correction rectifies model behavior during the training phase. We identify three typical strategies shown in Figure 2. Each strategy utilizes different forms of feedback to optimize the model during training: human feedback (a), a reward model (b), and automated feedback (c).
Direct Optimization with Human Feedback.
In an ideal scenario, we would directly leverage human feedback to optimize the model parameters, following the framework in Figure 2(a): 1) Candidate outputs are generated by LLMs, 2) Humans provide feedback or refinements on these outputs, and 3) LLMs are then directly optimized on the collected (output, feedback) pairs to better align with human preferences. A simple strategy is to fine-tune the model on the outputs that receive positive feedback from human raters (Glaese et al., 2022; Scheurer et al., 2023; Chen et al., 2023a). However, utilizing only positively rated data may constrain the model’s ability to identify and correct negative attributes or errors. To address this, Chain-of-Hindsight (Liu et al., 2023a) fine-tunes the LLM on model outputs paired with both positive and negative feedback. Beyond fine-tuning, other optimization methods have been explored as well. For example, Gao et al. (2023a) utilize human feedback as the reward signal and optimize the model with contextual bandit learning.
Reward Modeling and RLHF.
Direct optimization with human feedback may not always be practical, since collecting human feedback can be both labor-intensive and time-consuming. An efficient alternative is to train a reward model that emulates human feedback. Once trained, this reward model can provide consistent, real-time feedback for every model output, thereby circumventing the need for constant human involvement. A prominent example of this approach is RLHF (Ouyang et al., 2022), as illustrated in Figure 2(b). It first asks human annotators to label their preferences among different LLM outputs and then trains a reward model to predict these preferences. Afterward, reinforcement learning (RL) algorithms (e.g., Proximal Policy Optimization [Schulman et al., 2017]) are employed to optimize the model. RLHF and its variants have proven effective in correcting LLMs to become more beneficial and less harmful (Bai et al., 2022a), as well as instilling moral correctness (Ganguli et al., 2023).
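As a rough illustration of the reward-modeling step described above, the snippet below sketches the standard pairwise preference objective (a Bradley–Terry-style loss) used to fit a reward model on human comparisons. The `reward_model` callable and the input encodings are assumptions for illustration, not the exact formulation of any cited system.

```python
import torch.nn.functional as F

def preference_loss(reward_model, chosen_inputs, rejected_inputs):
    """Pairwise preference loss: the reward of the human-preferred output
    should exceed the reward of the rejected output.

    reward_model: assumed callable mapping a batch of encoded
                  (prompt + response) inputs to a scalar reward per
                  example, shape (batch,).
    """
    r_chosen = reward_model(chosen_inputs)      # rewards for preferred outputs
    r_rejected = reward_model(rejected_inputs)  # rewards for dispreferred outputs
    # -log sigmoid(r_chosen - r_rejected) is minimized when chosen > rejected.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

Once trained this way, the reward model supplies the scalar feedback that an RL algorithm such as PPO then maximizes.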
Self-Training with Automated Feedback.
Reward modeling still requires the collection of human feedback. To build a fully autonomous self-improving agent, recent work has adopted the self-training strategy, which improves the LLM by bootstrapping from its own outputs, as depicted in Figure 2(c). The language model itself is used to provide feedback on its own output. STaR (Zelikman et al., 2022) leverages the idea of chain-of-thought to prompt the LLM to generate answers with rationales. They found that the performance of the LLM can be improved by iteratively selecting rationales leading to the correct answer and using them to further fine-tune the LLM. Self-training has also been used to reduce the harmful responses of LLMs. For example, in RLAIF (Bai et al., 2022b), the initial toxic responses are critiqued and revised by the LLM itself following a set of human-defined principles. Afterward, the LLM is fine-tuned on the revised responses. AlpacaFarm (Dubois et al., 2023) further shows that LLMs can self-improve with RL: it designs LLM prompts to simulate human feedback in RLHF and shows that the simulated feedback is effective and greatly reduces cost.
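The following sketch captures the self-training recipe in spirit: sample rationale-augmented answers, keep only those that the automated check accepts (here, simple answer matching), and fine-tune on the retained examples. The interfaces `sample_with_rationale` and `finetune` are placeholders, not the actual STaR implementation.

```python
def self_training_round(model, finetune, sample_with_rationale, dataset):
    """One round of bootstrapped self-training (STaR-style sketch).

    model: current LLM.
    sample_with_rationale(model, question) -> (rationale, answer).
    finetune(model, examples) -> updated model.
    All three are assumed interfaces.
    """
    kept = []
    for question, gold_answer in dataset:
        rationale, answer = sample_with_rationale(model, question)
        # Automated feedback: keep only rationales that lead to the correct answer.
        if answer == gold_answer:
            kept.append((question, rationale, answer))
    return finetune(model, kept) if kept else model
```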
3.2 Generation-Time Correction
Correcting LLMs at training time is ideal but not always feasible, as it can be resource-intensive or even impractical for many LLMs, e.g., closed-source LLMs whose weights are inaccessible and colossal LLMs with billions of parameters. This necessitates generation-time correction methods, which correct LLM outputs as they are being generated. Two main strategies are Generate-then-Rank and Feedback-Guided Decoding.
Generate-then-Rank.
This strategy involves sampling a large number of candidate generations and subsequently selecting the best one based on the feedback provided by the critic model, as illustrated in Figure 3(a). This approach is often integrated with chain-of-thought prompting (Wei et al., 2022b) to tackle complex reasoning tasks, such as solving math word problems. Given an input problem x, the LLM initially generates multiple candidate solutions y1, ⋯, yn. Each solution yi = [zi, ai] comprises a reasoning path (explanation) zi leading to the predicted answer ai. Subsequently, the critic model assigns a plausibility score si to each candidate reasoning path zi. The best solution is selected from the scored set via either ranking or voting.
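A minimal sketch of generate-then-rank under assumed interfaces: `sample_solutions` returns candidate (reasoning path, answer) pairs and `critic_score` returns a plausibility score, both hypothetical names rather than any cited system's API.

```python
from collections import defaultdict

def generate_then_rank(question, sample_solutions, critic_score, n=20, vote=True):
    """sample_solutions(question, n) -> list of (reasoning_path, answer);
    critic_score(question, reasoning_path) -> plausibility score in [0, 1]."""
    candidates = sample_solutions(question, n)
    scored = [(critic_score(question, z), z, a) for z, a in candidates]
    if not vote:
        return max(scored)[2]            # answer of the single best-scored path
    votes = defaultdict(float)
    for score, _, answer in scored:      # score-weighted voting over answers
        votes[answer] += score
    return max(votes, key=votes.get)
```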
Various critic models have been used for LLM output verification. DIVERSE (Li et al., 2023d) trains a binary verifier based on DeBERTa (He et al., 2021) to rate each reasoning path. Weng et al. (2023) introduced a training-free critic model based on the consistency between forward and backward reasoning. In a different vein, RR (He et al., 2023) used a critic model to assess reasoning path faithfulness by retrieving supporting information from a knowledge base. In code generation, LEVER (Ni et al., 2023) uses a verifier trained on program execution results. CodeT (Chen et al., 2023b) similarly employs dual execution agreement to select the best code solution.
Feedback-Guided Decoding.
Despite its efficiency, the generate-then-rank strategy has several limitations: 1) The critic model provides only coarse-grained, output-level feedback, 2) The long length of the output can complicate its quality assessment, and 3) It requires the LLM to wait until the entire output is generated for any corrections.
The feedback-guided decoding strategy shown in Figure 3(b) overcomes the above limitations by using step-level feedback for fine-grained control during generation. Each output y is split into multiple reasoning steps y = [o1, o2, ⋯, on]. A critic model evaluates each step ot, guiding algorithms like beam search to explore the output space systematically and correct early mistakes. This strategy also helps alleviate the reasoning inconsistency problem (Zelikman et al., 2022; Creswell and Shanahan, 2022), i.e., incorrect reasoning that nonetheless leads to a correct final answer. This strategy has been adopted in recent works like Tree-of-Thought (Yao et al., 2023a), GRACE (Khalifa et al., 2023), and RAP (Hao et al., 2023), which vary mainly in the critic model they employ, categorized into methods involving human feedback, trained verifiers, external metrics, external knowledge, and self-evaluation.
Reward Model from Human Feedback: Studies like Uesato et al. (2022) and Lightman et al. (2023) collect human-annotated step-level feedback to train a more robust reward model, which improves reasoning faithfulness.
Trained Verifier: To reduce the cost of human annotations, some work (Yang et al., 2022a; Tafjord et al., 2022; Li et al., 2023d; Khalifa et al., 2023) uses automated methods to generate training data for obtaining a step-wise verifier. Positive examples are derived from ground-truth reasoning paths, while negative examples are synthesized by proposing an alignment algorithm (Khalifa et al., 2023) or by making text perturbations on positive samples (Yang et al., 2022a).
External Metric: Several studies also leverage external metrics to re-rank or guide text generation without additional model training, such as using minimum Bayes risk decoding (Freitag et al., 2022), attribute classifiers (Dathathri et al., 2020; Yang and Klein, 2021), and Gaussian denoising (Li et al., 2022).
External Knowledge: External knowledge sources have also been used to provide feedback. Varshney et al. (2023) use Wikipedia to validate and correct each generated sentence, which is then reinserted for further generation. Alternatively, MemPrompt (Madaan et al., 2022) utilizes a pool of prior user feedback to guide the text generation based on the current query’s intent.
Self-Evaluation: For better flexibility, methods such as Tree-of-Thought (Yao et al., 2023a) and Guided-decoding (Xie et al., 2023) use the LLM itself as the critic model by prompting it to evaluate each individual reasoning step, avoiding the need for fine-tuning task-specific verifier.
Different strategies are adopted to control the decoding process with the help of the step-level critic model. Tree-of-Thought uses breadth-first and depth-first searches, while GRACE (Khalifa et al., 2023) and Xie et al. (2023) employ beam search. CoRe (Zhu et al., 2023) and RAP (Hao et al., 2023) use Monte Carlo Tree Search for a balance between exploration and exploitation.
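The sketch below illustrates feedback-guided decoding as a step-level beam search, under assumed interfaces `propose_steps` (the LLM proposing candidate next reasoning steps) and `step_score` (the step-level critic). It is a simplified stand-in; actual systems such as GRACE, Tree-of-Thought, or RAP differ in their critics and search procedures.

```python
def feedback_guided_beam_search(question, propose_steps, step_score,
                                is_complete, beam_width=3, max_steps=8):
    """propose_steps(question, partial_steps) -> candidate next steps;
    step_score(question, partial_steps) -> score for a partial solution;
    is_complete(partial_steps) -> whether the solution is finished."""
    beams = [([], 0.0)]                          # (partial reasoning steps, cumulative score)
    for _ in range(max_steps):
        expanded = []
        for steps, score in beams:
            if is_complete(steps):
                expanded.append((steps, score))  # keep finished solutions as-is
                continue
            for step in propose_steps(question, steps):
                new_steps = steps + [step]
                # Step-level critic feedback steers the search away from bad steps early.
                expanded.append((new_steps, score + step_score(question, new_steps)))
        beams = sorted(expanded, key=lambda b: b[1], reverse=True)[:beam_width]
        if all(is_complete(steps) for steps, _ in beams):
            break
    return beams[0][0]                           # highest-scoring reasoning chain
```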
3.3 Post-hoc Correction
The effectiveness of generation-time correction hinges on the critic model’s ability to give precise feedback for intermediate outputs, a challenging task in holistic NLP evaluations like summarization. This motivates the post-hoc correction methods, where both critic and refinement models act only after generating the complete output. Post-hoc correction allows for more diverse natural language feedback, ranging from specific diagnostic reports to broader writing suggestions. As shown in Figure 4, we categorize the key post-hoc correction strategies into Self-Correction, Correction with External Feedback, and Multi-Agent Debate.
Self-Correction.
In “Self-Correction”, a single LLM both generates and refines its output. As shown in Figure 4(a), the LLM first produces an output and then acts as its critic for iterative refinements. This process continues until the output obtains an acceptable quality or a pre-specified number of iterations is reached. Self-Refine (Madaan et al., 2023) introduced an effective framework using one LLM guided by varied prompts for the roles of generation, critic, and refinement, respectively. Clinical Self-Verification (Gero et al., 2023) applies this to extract clinical data, refining by spotting missing elements and verifying data accuracy. Reflexion (Shinn et al., 2023) extends the method, adding a “long-term memory” to recall past errors and integrating diverse feedback forms.
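A minimal sketch of this Self-Refine-style loop, where a single `llm` callable (an assumption standing in for any prompt-to-text completion function) plays the generator, critic, and refiner roles via different prompts; the prompt wording and the stop phrase are illustrative only.

```python
def self_refine(llm, task_input, max_iters=4):
    """llm(prompt) -> text. One model generates, critiques, and refines its own output."""
    output = llm(f"Complete the task:\n{task_input}")
    for _ in range(max_iters):
        feedback = llm(
            f"Task:\n{task_input}\n\nDraft answer:\n{output}\n\n"
            "Point out concrete problems, or reply 'LOOKS GOOD' if there are none."
        )
        if "LOOKS GOOD" in feedback:       # self-generated stop signal
            break
        output = llm(
            f"Task:\n{task_input}\n\nDraft answer:\n{output}\n\n"
            f"Feedback:\n{feedback}\n\nRewrite the answer to address the feedback."
        )
    return output
```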
Though beneficial in many text-generation tasks, self-correction usually demands powerful, large-scale LLMs for effectiveness, which sacrifices efficiency. As observed by Madaan et al. (2023), smaller models often falter in refining, even with correct feedback. A possible solution involves explicitly training models for this self-correction process. SelFee (Ye et al., 2023) proposes training a model to emulate the self-correction process by generating output, feedback, and a refined solution in an auto-regressive manner. They use more powerful LLMs to provide feedback and refinement data, with data collection facilitated through ChatGPT.
Models/Tools as Feedback.
In self-correction, the quality of the feedback is constrained by the inherent limitations of LLMs, such as the inability to access up-to-date information, take actions, or perform precise mathematical reasoning. To enhance feedback quality, recent research leverages external tools, as shown in Figure 4(b). These tools, including trained models, code interpreters, and search engines, offer specialized feedback to address LLM constraints.
Code Interpreter. In code generation, models like Self-Edit (Zhang et al., 2023a) and Self-Evolve (Jiang et al., 2023) employ program executors to provide feedback from executed test cases. Others, like Self-Debug (Chen et al., 2023e) and ALGO (Zhang et al., 2023b), explore detailed feedback mechanisms using unit tests, program explanations, or comparisons with reference oracle programs. Charalambous et al. (2023) use Bounded Model Checking for software verification.
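The snippet below sketches how a program executor can serve as the critic: run the generated code together with its unit tests in a subprocess and return the error output as natural-language feedback for the refiner. It is a simplified stand-in for the execution feedback used by the systems above and omits the sandboxing a real deployment would need.

```python
import subprocess
import sys
import tempfile
import textwrap

def execution_feedback(generated_code: str, test_code: str, timeout: int = 10) -> str:
    """Run generated code plus tests; return '' on success, else the error text."""
    program = generated_code + "\n\n" + textwrap.dedent(test_code)
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return "Execution timed out."
    if result.returncode == 0:
        return ""                      # tests passed: no corrective feedback needed
    return result.stderr[-2000:]       # traceback becomes feedback for the refiner
```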
Logic Reasoner. Logic-LM (Pan et al., 2023) and Baldur (First et al., 2023) harness external logic reasoners and proof assistants to refine LLM outputs, using error messages as feedback for logical reasoning and theorem-proof generation.
External Knowledge is used to ensure factual accuracy of the output. Models like RARR (Gao et al., 2023b), REFEED (Yu et al., 2023), and LLM-Augmenter (Peng et al., 2023) prompt LLMs to question their outputs. An external retriever then searches for relevant evidence, which is used to refine outputs. FACTOOL (Chern et al., 2023) extends this approach to a wider range of tasks, including code generation, mathematical reasoning, and scientific literature review.
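A rough sketch of the retrieve-then-revise pattern behind these systems, under assumed `llm` and `retrieve` interfaces and illustrative prompt wording: the model questions its own claims, evidence is retrieved from an external source, and the answer is revised against that evidence.

```python
def retrieve_and_revise(llm, retrieve, question, answer, k=3):
    """llm(prompt) -> text; retrieve(query, k) -> list of evidence passages."""
    # 1. Ask the model what should be verified in its own answer.
    query = llm(f"Question: {question}\nAnswer: {answer}\n"
                "Write a search query to verify the factual claims in the answer.")
    # 2. Gather external evidence for those claims.
    evidence = "\n".join(retrieve(query, k))
    # 3. Revise the answer so that it is consistent with the evidence.
    return llm(f"Question: {question}\nDraft answer: {answer}\n"
               f"Evidence:\n{evidence}\n"
               "Revise the answer so every claim is supported by the evidence.")
```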
Trained Model. Research has fine-tuned specialized critic models to provide feedback for iterative refinement alongside more powerful language models. For example, CodeRL (Le et al., 2022) treats program synthesis as a reinforcement learning task and trains a critic model whose predictions are used to optimize the main model. REFINER (Paul et al., 2023) uses a critique model to provide feedback on an intermediate representation, suitable for refining larger models like ChatGPT. Similarly, RL4F (Akyürek et al., 2023) trains a critic with reinforcement learning via policy optimization, gauging its effectiveness by comparing the refined output’s accuracy to the ground truth. In adversarial contexts, feedback from content filters can guide the generation of better adversarial examples, as in FLIRT (Mehrabi et al., 2023), which leverages image classifier signals to guide LLMs in creating adversarial prompts for audit purposes.
Integrating Multiple Tools. Broadening the idea of tool-assisted feedback, CRITIC (Gou et al., 2023) unifies various tools, such as code interpreters, search engines, and LLM feedback, offering a multifaceted feedback approach.
3.4 Multi-Agent Debate
Besides integrating tools, recent research has also explored the debate approach among multiple LLMs, inspired by the idea that multiple perspectives can converge to an improved solution. Multiple LLM instances debate their individual answers over several rounds, aiming for a consensus.
Du et al. (2023) trialed this in arithmetic reasoning. Agents, or LLM duplicates, present individual solutions and justifications. In the debate phase, these responses are aggregated and used as context for each agent to revise its original answer. After several iterations, they typically reach a consensus, showing superior performance compared to self-correction. PRD (Li et al., 2023c) furthered this by introducing the peer rank algorithm to optimize the consensus process. It considers pairwise preferences between all possible answer pairs from individual LLMs and uses these preferences to generate a final ranking of models.
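A compact sketch of the debate procedure just described, with an assumed list of `agents` (each a prompt-to-text callable) and illustrative prompts: every agent answers, then repeatedly revises its answer given the other agents' responses until the answers agree or a round limit is reached; the exact-match consensus check is a simplification.

```python
def multiagent_debate(agents, question, rounds=3):
    """agents: list of callables, each mapping a prompt to a text answer."""
    answers = [agent(f"Question: {question}\nGive your answer and reasoning.")
               for agent in agents]
    for _ in range(rounds):
        if len(set(answers)) == 1:           # consensus reached (simplified check)
            break
        new_answers = []
        for i, agent in enumerate(agents):
            others = "\n\n".join(a for j, a in enumerate(answers) if j != i)
            new_answers.append(agent(
                f"Question: {question}\n"
                f"Other agents answered:\n{others}\n\n"
                f"Your previous answer:\n{answers[i]}\n"
                "Considering the other answers, give your updated answer and reasoning."
            ))
        answers = new_answers
    return max(set(answers), key=answers.count)   # majority answer after debate
```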
In addition to reasoning tasks, LM vs LM (Cohen et al., 2023) employed this debate approach for factual error detection, where a generating LLM makes a claim and an examining LLM checks for inaccuracies. Extending its applicability, Fu et al. (2023) mimicked real-world human interactions, like a buyer-seller scenario, showcasing the versatility of multi-agent debates.
4 Discussion
4.1 Prior Research on Automated Correction
In our survey, we primarily examine the automated correction strategies in the era of modern large language models. However, the idea of “correcting the model with automated feedback” has been a longstanding practice in diverse NLP tasks. Recognizing these early works provides a deeper historical insight into the evolution of self-correction methods within NLP. Next, we briefly discuss the NLP applications where automated correction has been effectively implemented, and we discuss how these early works link to the automated correction strategies defined in this survey.
Machine Translation.
The concept of post-hoc self-correction has deep roots in the field of machine translation (MT), where it is often called Automatic Post-Editing (APE) (do Carmo et al., 2021). A long line of prior work trains models to fix translation errors by learning either from human correction data (Alabau et al., 2014) or from synthetic training data (Lee et al., 2021). To minimize the cost of data collection, recent work (Chen et al., 2023d; Raunak et al., 2023) has leveraged the in-context learning ability of LLMs for post-editing translations. In addition to post-hoc methods, prior work has also adopted training-time correction (Unanue et al., 2021) and decoding-time correction (Freitag et al., 2022).
Summarization.
The idea of automated model correction has been commonly used in summarization to ensure the factuality of the generated summary. There are two mainstream methods: 1) training-time correction that imposes factuality constraints during training (Liu and Liu, 2021; Wan and Bansal, 2022; Scheurer et al., 2023), and 2) post-hoc correction that post-edits generated summaries to correct factual errors (Falke et al., 2019; Cao et al., 2020; Saunders et al., 2022). Recent work has investigated using RL to refine the model guided by automated feedback from either reward models (Akyürek et al., 2023) or language models (Pang et al., 2023).
Semantic Parsing.
The use of external feedback in semantic parsing, particularly for Text-to-SQL tasks, has shown significant effectiveness. Execution-guided semantic parsing is a notable approach where the feedback from executing partial SQL queries guides the search for plausible complete SQL programs. Additionally, earlier works also explored training separate discriminative models either to rerank the generated SQL queries (Bogin et al., 2019; Kelkar et al., 2020) or to predict specific SQL components (Xu et al., 2017; Yu et al., 2018; Lee, 2019). The effectiveness of these generation-time correction techniques is largely attributable to the ease of defining intermediate feedback in semantic parsing.
Proof Generation.
Automated correction has been well studied and implemented for proof generation (Saha et al., 2020; Tafjord et al., 2021). External feedback from natural language inference (NLI) models is commonly used to spot errors as a heuristic for correction and as a means to score quality (Yang et al., 2022a; Golovneva et al., 2023). However, there are open questions regarding the quality of NLI-based feedback (Srikanth and Rudinger, 2022; Saxon et al., 2023).
Open-Ended Generation.
Post-hoc correction is often adopted to improve the quality of open-ended text generation (Wang et al., 2017; Holtzman et al., 2018; Sagarkar et al., 2018), such as correcting toxic outputs, enhancing the narrative quality in story generation, and refining response generation in dialogues. For example, Holtzman et al. (2018) proposed a framework to refine generic, repetitive, and inconsistent texts by composing a committee of discriminators to provide multi-aspect feedback. Given the subjectivity involved in assessing such outputs, recent works have started to use detailed natural language feedback and utilize LLMs for iterative post-hoc refinement.
4.2 When Does Automated Correction Work?
Despite the relative infancy of this emerging field, recent studies have explored the efficacy of automated correction in LLMs. Notably, intrinsic self-correction—where the model corrects its initial output based solely on its inherent capabilities—has generally shown disappointing results (Huang et al., 2023; Stechly et al., 2023; Hong et al., 2023; Tyen et al., 2023; Valmeekam et al., 2023; Ke et al., 2023). Most findings indicate that LLMs struggle to rectify their initial mistakes, and their performance can even worsen after self-correction. This issue arises because the quality of the model’s self-generated feedback is bounded by its existing knowledge and abilities. Therefore, internal feedback may not offer any extra advantage for improving the results; it might even steer the model away from the correct answer. Preventing such misguidance is crucial for successful self-correction (Huang et al., 2023).
In contrast, the use of external feedback for automated correction has shown more promise. Numerous studies (Pan et al., 2023; Chen et al., 2023a; Gou et al., 2023; Huang et al., 2023) report positive outcomes when LLMs leverage high-quality feedback from external sources. However, high-quality external feedback is unavailable in many real-world applications. This constraint narrows down the scope of automated correction to only those tasks where precise and readily obtainable external feedback exists, such as arithmetic reasoning, semantic parsing, and code generation.
The empirical study by Huang et al. (2023) highlighted multi-agent debate as an effective method for automated correction in LLMs. However, the observed improvement primarily stems from the model-driven voting process among different LLMs, rather than from self-correction. This approach represents another successful instance of learning through external feedback, as each LLM benefits from the input provided by other LLMs in the debate.
5 Research Gaps and Future Directions
5.1 Theoretical Justifications
First of all, whether LLMs can self-correct without any external feedback is still an ongoing debate, with both positive and negative outcomes reported. Numerous studies have found that self-correction often brings negative effects (Huang et al., 2023; Tyen et al., 2023), while some research indicates that the effectiveness of self-repair is only seen in GPT-4 (Olausson et al., 2023). Although these empirical studies provide valuable insights, more fundamental theoretical research is needed to gain a mechanistic understanding of self-correction. Key research questions include: Can LLMs truly recognize their own errors without external feedback? What is the upper bound of intrinsic self-correction? Answers to these questions may be closely associated with LLMs’ capacity for metacognitive awareness, i.e., their understanding of their own knowledge and uncertainties (Kadavath et al., 2022). The concept of calibration—how well a model’s predictions match observed outcomes—is also crucial (Lin et al., 2023).
While language models demonstrate some capacity for self-feedback, achieving superior performance often necessitates incorporating external feedback. This ties into the alignment of language models, an area still not fully understood. For example, in RLHF, the choice of the metric to minimize between the reward model output and the final model output significantly impacts downstream task performance (Go et al., 2023), yet this aspect remains underexplored in many applications. Determining the best approach to auto-generate instructive prompts for tasks like output evaluation is also an open challenge.
5.2 Benchmarking Automated Correction
While LLM automated correction has seen empirical advancements across applications, there is a lack of solid quantitative metrics to evaluate this capability. Comprehensive evaluations comparing various strategies on criteria like effectiveness, complexity, and potential limits are still needed. Future studies could develop evaluation frameworks considering variables such as task complexity, degree of initial error, improvement in quality after automated correction, etc.
Setting benchmarks to diagnose automated correction is another potential research avenue. Diagnostic datasets would offer standardized evaluations of LLMs and their correction strategies, fostering the development of more precise models.
5.3 Continual Self-Improvement
Another promising yet under-explored area of LLM self-correction is the idea of continual, life-long self-improvement. As LLMs are integrated into varied and evolving scenarios, their capacity for sustained adaptability becomes crucial. This mirrors the notion of continual (life-long) learning (Wang et al., 2023c), suggesting LLMs should consistently assess outputs, rectify mistakes, update knowledge, and adjust decision-making.
While recent studies like Huang et al. (2022) and Zelikman et al. (2022) indicate that LLMs can enhance themselves through self-training on positively evaluated outputs, they often focus on a single, one-time correction process. The resilience of this self-training in continuous settings is not well understood. Continual learning poses challenges like catastrophic forgetting (Kirkpatrick et al., 2016), where new skills impair old ones. It is uncertain whether such issues could plague continually self-improving LLMs, e.g., correcting one behavior may unintentionally alter a previously corrected behavior. Combining various self-correction techniques for continual improvement also warrants exploration. Integrating immediate post-hoc corrections with long-cycle training-time corrections—using the former to gather data and the latter to periodically address recurrent problems—could be a promising approach.
5.4 Self-Correction with Model Editing
Recent advancements in model editing (Sinitsin et al., 2020; Cao et al., 2021; Yao et al., 2023b) aim to adjust the model’s behavior for examples within the editing scope while leaving its performance for out-of-scope examples unaltered. It has been applied to update LLMs’ outdated knowledge (Lee et al., 2022; Onoe et al., 2023) and address false associations (Murty et al., 2022; Tanno et al., 2022). Though effective in adjusting LLMs’ factual knowledge, challenges like limited generalization (Yao et al., 2023b) and unintended side effects persist (Hoelscher-Obermaier et al., 2023).
We believe model editing offers great potential for LLM self-correction. It enables accurate, fine-grained corrections without full-scale retraining. Analyzing the impact of these model edits could yield insights into self-correction. Techniques mitigating model editing’s side effects (Hoelscher-Obermaier et al., 2023) may also enhance self-correction. We anticipate future research to increasingly merge model editing with LLM self-correction, a relatively untouched domain.
5.5 Multi-modal Self-Correction
Self-correction strategies have been well-tested on the textual modality, where both the model outputs and the feedback are in textual form. The recent surge in multi-modal data usage, including image, audio, and video modalities, presents enticing opportunities for expansion. These include the exploration of self-correction capabilities within multi-modal LLMs, the incorporation of visual feedback, and improving vision-language tasks through self-correction.
6 Conclusion
In this paper, we present a comprehensive survey of self-correcting large language models with automated feedback. We categorize and analyze various self-correction strategies, including training-time, generation-time, and post-hoc corrections. We also connect recent work with prior research and discuss the applicable scenarios for automated correction. Finally, we outline five potential future directions and associated challenges in this field. Our goal with this paper is to provide a comprehensive and useful resource for readers interested in the development of this rapidly evolving domain. To aid in this effort, we maintain a continually updated reading list in a GitHub repository: https://github.com/teacherpeterpan/self-correction-llm-papers.
Acknowledgments
This work was supported by the National Science Foundation (award #2048122). The views expressed are those of the authors and do not reflect the official policy or position of the US government. Thanks to Xinyuan Lu for assisting with the GitHub reading list repository.
References