Abstract
Self-correction is an approach to improving responses from large language models (LLMs) by refining the responses using LLMs during inference. Prior work has proposed various self-correction frameworks using different sources of feedback, including self-evaluation and external feedback. However, there is still no consensus on the question of when LLMs can correct their own mistakes, as recent studies also report negative results. In this work, we critically survey a broad range of papers and discuss the conditions required for successful self-correction. We first find that prior studies often do not define their research questions in detail and involve impractical frameworks or unfair evaluations that over-evaluate self-correction. To tackle these issues, we categorize research questions in self-correction research and provide a checklist for designing appropriate experiments. Our critical survey based on the newly categorized research questions shows that (1) no prior work demonstrates successful self-correction with feedback from prompted LLMs, except for studies in tasks that are exceptionally suited for self-correction, (2) self-correction works well in tasks that can use reliable external feedback, and (3) large-scale fine-tuning enables self-correction.
1 Introduction
Self-correction is a popular approach to improve responses from large language models (LLMs) by refining them using LLMs during inference (Bai et al., 2022; Madaan et al., 2023). Extensive studies on self-correction have been conducted in various tasks, including arithmetic reasoning, code generation, and question answering (Gao et al., 2023; Shinn et al., 2023). The simplest self-correction approach prompts LLMs to provide feedback on their own responses and refine the responses using the feedback (Huang et al., 2024a), under the hypothesis that recognizing errors is easier than avoiding them (Saunders et al., 2022). As in Figure 1, self-correction has also been studied using additional information for improving feedback, including external tools such as code interpreters (Chen et al., 2024d; Gou et al., 2024), external knowledge retrieved via web search (Gao et al., 2023; Jiang et al., 2023b), or fine-tuning (Welleck et al., 2023; Ye et al., 2023). However, recent studies also report negative results indicating that LLMs cannot self-correct (Huang et al., 2024a; Gou et al., 2024; Li et al., 2024b) or even self-detect (Chen and Shu, 2024; Tyen et al., 2024; Hong et al., 2024; Jiang et al., 2024; Kamoi et al., 2024) their own mistakes, at least under certain conditions. These conflicting observations indicate that further analysis of self-correction is needed.
In this work, we provide a critical survey to investigate the conditions required for successful self-correction. First, our analysis finds that prior studies often do not define their research questions in detail. As a result, many papers fail to provide appropriate experiments to evaluate the research questions they implicitly target. To address this issue, we categorize research questions in self-correction research (§3.1) and discuss frameworks that should be used for verifying each research question (§3.2). Finally, we provide a checklist for designing appropriate experiments (§8).
Next, we analyze prior work to identify when LLMs can self-correct their mistakes, using the new definitions of the research questions. Our analysis highlights that the bottleneck is in the feedback generation (§7). Specifically, (1) no prior work shows successful self-correction with feedback from prompted LLMs in general tasks (§4), (2) self-correction works well in tasks where reliable external feedback is available (§5.1), (3) large-scale fine-tuning enables self-correction (§5.2), and (4) some tasks have properties exceptionally suitable for self-correction (§4). In summary, our analysis identifies the properties required for successful self-correction as follows:
[RQ1] When can LLMs self-correct based solely on the inherent capabilities of LLMs?
In general tasks, no prior work shows reliable evidence of successful self-correction with in-context learning. (§4)
In tasks with specific properties that are exceptionally favorable for self-correction (e.g., responses are decomposable), self-correction is effective even with in-context learning. (§4)
[RQ2] When can LLMs self-correct the best-possible initial responses with external information?
Self-correction works well in tasks where reliable external feedback (e.g., from external tools) is available (§5.1), and large-scale fine-tuning enables self-correction (§5.2).
[RQ3] When are the final outputs of self-correction better than other approaches?
Self-correction is often not compared with sufficiently strong baselines, and it is still unclear whether it is better than other approaches. (§6)
This survey is organized as follows. Section 2 provides an overview of self-correction. Section 3 introduces a new approach to classifying research questions and frameworks in self-correction research. Sections 4 and 5 analyze prior work on self-correction with in-context learning and with external information (external tools, external knowledge, fine-tuning), respectively. Section 6 describes related approaches that should be compared with self-correction as baselines. Section 7 summarizes our findings. Section 8 provides a checklist for self-correction research. Section 9 explains differences from other surveys. Section 10 discusses work related to self-correction. Section 11 outlines future directions.
2 Self-Correction of LLMs
The term “self-correction” is used in a wide range of scenarios, from a strict definition in which LLMs refine their own responses by themselves (Madaan et al., 2023; Huang et al., 2024a) to broader concepts that also involve feedback from external tools or knowledge (Shinn et al., 2023; Gou et al., 2024). In this work, we define self-correction as a framework that refines responses from LLMs using LLMs during inference, possibly with external tools or knowledge. As in Table 1, Figure 2, and Figure 3, self-correction has been studied in various frameworks with different sources of feedback.
2.1 Frameworks
Prior studies propose self-correction frameworks with a variety of architectures.
Explicit Feedback vs. Direct Refinement.
Self-correction often consists of the following three stages (Kim et al., 2023; Madaan et al., 2023; Shinn et al., 2023; Huang et al., 2024a); a minimal code sketch follows the list:
Initial response generation: an LLM generates initial responses for the given input.
Feedback: a feedback model generates feedback given the original input and the initial response. This stage may use external tools or knowledge.
Refinement: a refinement model generates a refined response given the input, the initial response, and the feedback.
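To make these stages concrete, the following is a minimal sketch of a post-hoc self-correction loop. It assumes a hypothetical `llm(prompt)` helper that wraps a single LLM call; the prompt templates and the stopping condition are illustrative and do not reproduce any specific paper's method.

```python
from typing import Callable

def self_correct(llm: Callable[[str], str], task_input: str, max_iterations: int = 3) -> str:
    """Minimal post-hoc self-correction loop: generate, give feedback, refine.

    `llm` is a hypothetical wrapper around an LLM API call (an assumption, not a real library).
    """
    # Stage 1: initial response generation.
    response = llm(f"Solve the following task.\n\nTask: {task_input}\n\nAnswer:")

    for _ in range(max_iterations):
        # Stage 2: feedback generation on the current response (intrinsic here;
        # external tools or retrieved knowledge could be added to this prompt).
        feedback = llm(
            "Review the answer below and point out any mistakes. "
            "If the answer is already correct, reply with exactly 'NO ISSUES'.\n\n"
            f"Task: {task_input}\nAnswer: {response}\nFeedback:"
        )
        if "NO ISSUES" in feedback:
            break  # self-declared stopping condition (no oracle information)

        # Stage 3: refinement conditioned on the input, response, and feedback.
        response = llm(
            f"Task: {task_input}\nPrevious answer: {response}\n"
            f"Feedback: {feedback}\n\nRevise the answer to fix the issues:"
        )
    return response
```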
Post-hoc vs. Generation-time.
Post-hoc correction refines responses after they are generated (Pan et al., 2024). Generation-time correction, or step-level correction (Paul et al., 2024; Jiang et al., 2023b), improves step-by-step reasoning by providing feedback on intermediate reasoning steps. Post-hoc correction is more flexible and applicable to a broader range of tasks, although generation-time correction is popular for reasoning tasks (Pan et al., 2024).
Same-model vs. Cross-model.
Cross-model correction generates feedback or refines responses using models different from the one that generates the initial responses. It has mostly been studied in settings where small fine-tuned models correct the mistakes of large proprietary LLMs (Welleck et al., 2023; Akyurek et al., 2023; Paul et al., 2024) or in multi-agent debate among multiple models with similar capabilities (Liang et al., 2023; Li et al., 2023; Cohen et al., 2023; Du et al., 2023; Zhang et al., 2023a; Chen et al., 2024b; Chan et al., 2024; Wang et al., 2024a).
2.2 Sources of Feedback
Intrinsic (§4).
Intrinsic self-correction prompts LLMs to generate feedback on their own responses. Prompting strategies include simple zero-shot or few-shot prompts (Madaan et al., 2023; Kim et al., 2023), decomposing the responses (Dhuliawala et al., 2023), and evaluating confidence (Varshney et al., 2023; Jiang et al., 2023b; Wu et al., 2024).
External Information (§5.1).
Self-correction often relies on external information: external tools such as code executors (Jiang et al., 2023a; Gou et al., 2024; Chen et al., 2024d; Stengel-Eskin et al., 2024), symbolic reasoners (Pan et al., 2023), proof assistants (First et al., 2023), or task-specific metrics (Xu et al., 2023); external knowledge from search engines (Jiang et al., 2023b; Gao et al., 2023; Zhao et al., 2023), Wikipedia (Yu et al., 2023; Zhao et al., 2023), or other corpora (Peng et al., 2023; Zhao et al., 2023); and oracle information such as ground-truth answers (Kim et al., 2023; Shinn et al., 2023), human feedback (Chen et al., 2024a), or stronger models (Zhang et al., 2024).
Fine-tuning (§5.2).
Fine-tuning LLMs to generate feedback or to refine responses is another source of improved feedback (Welleck et al., 2023; Ye et al., 2023), discussed in Section 5.2.
2.3 Tasks
Self-correction has been studied in various tasks, including Reasoning: arithmetic reasoning (Madaan et al., 2023; Nathani et al., 2023; Gou et al., 2024), code generation (Jiang et al., 2023a; Charalambous et al., 2023; Gou et al., 2024; Chen et al., 2024d; Olausson et al., 2024), proof generation (First et al., 2023), logical reasoning (Pan et al., 2023); Knowledge: closed-book QA (Shinn et al., 2023; Gao et al., 2023; Jiang et al., 2023b; Gou et al., 2024); Context-based Generation: dialogue generation (Madaan et al., 2023; Peng et al., 2023), text summarization (Saunders et al., 2022); Open-ended Generation: conditional text generation (Ye et al., 2023; Schick et al., 2023), story generation (Yang et al., 2022b), detoxification (Schick et al., 2021; Bai et al., 2022; Gou et al., 2024; Phute et al., 2024); Others: machine translation (Chen et al., 2023b; Raunak et al., 2023; Ki and Carpuat, 2024), information retrieval (Gero et al., 2023), vision language tasks (Yin et al., 2023; Ge et al., 2023; Zhou et al., 2024; Lee et al., 2024; Huang et al., 2024b; Liu et al., 2024), and prompt optimization (Pryzant et al., 2023; Mehrabi et al., 2023; Yang et al., 2024).
2.4 Differences from Related Approaches
In this work, we define self-consistency (Wang et al., 2023) or generate-and-rank (Shen et al., 2021; Weng et al., 2023) to be different from self-correction because these approaches do not refine responses and assume that LLMs generate correct answers with a reasonable probability. We discuss these methods in Section 6 as strong baselines that should be compared with self-correction.
3 Research Questions
We find that prior studies often do not define their research questions in detail and fail to use appropriate self-correction frameworks in their experiments. We propose a new approach to classify research questions and frameworks in self-correction.
3.1 RQs in Self-Correction Research
Prior studies often simply state their research questions as whether LLMs can self-correct their mistakes (e.g., Kim et al., 2023; Madaan et al., 2023). However, we claim that research questions in self-correction research should be defined in more detail. We identify the following research questions implicitly targeted in prior studies, as in Table 2.
| RQ | Self-Refine (2023) | Huang et al. (2024a) | RCI (2023, §3.1) | RCI (2023, §3.2) | CRITIC (2024, §4.2) | CRITIC (2024, §4.3) | RARR (2023) |
|---|---|---|---|---|---|---|---|
| RQ1 | ✓ | ✗ (§3,5) | ✓ | – | ✗ | ✗ | – |
| RQ2 | – | – | – | ✓ | ✓ | ✓ | – |
| RQ3 | – | ✗ (§4) | – | ✓ | – | ✓ | ✓ |
We define the best-possible initial responses as initial responses generated with best effort, using information that self-correction modules can access, such as external tools, knowledge, or fine-tuning.
Requirements for Verifying RQs.
Experiments for verifying these research questions need to satisfy different requirements, as shown in Table 3. External Information: RQ1 needs to be evaluated on frameworks that refine responses using the same model without additional information. RQ2 and RQ3 can be evaluated on frameworks that use external information. Initial Responses: RQ1 and RQ2 need to be evaluated on frameworks that use the best-possible initial responses. RQ3 concerns the final performance, so it is not necessary to start from strong initial responses. Evaluation: RQ1 and RQ2 only require showing that self-correction improves performance over the initial responses. RQ3 requires comparison with strong baselines (§6).
| RQ | Framework: Information Symmetricity | Framework: Best-possible Initial Responses | Framework: Realistic | Experiment: Comparison to Initial Responses | Experiment: Comparison to Strong Baselines |
|---|---|---|---|---|---|
| RQ1 | ✓ | ✓ | ✓ | ✓ | – |
| RQ2 | – | ✓ | ✓ | ✓ | – |
| RQ3 | – | – | ✓ | – | ✓ |
Confusion in Prior Work.
Some prior studies implicitly target different research questions in a single work without clearly distinguishing them. As in Table 2, Kim et al. (2023) target RQ1 for arithmetic reasoning by comparing self-corrected responses only with initial responses, but they target RQ3 for MiniWoB++ by comparing self-correction with baseline methods. Similarly, Gou et al. (2024) target RQ2 for arithmetic reasoning but target RQ3 for detoxification.
3.2 Frameworks for Verifying RQs
Prior work often categorizes self-correction frameworks by the approach used for generating feedback (§2). However, we point out that frameworks should also be categorized by the quality of their initial responses, because the frameworks needed to verify different research questions depend on whether they use the best-possible initial responses (§3.1).
We propose categories of (same-model) self-correction that correspond to different research questions (§3.1), as shown in Figure 2. Specifically, we propose to categorize the self-correction frameworks as follows.
- Realistic: Can be used in real-world applications.
  - Fair: Using best-possible initial responses
  - Unfair: Using sub-optimal initial responses
- Unrealistic: Using information that is not accessible in real-world applications.
In this work, we focus on categorizing self-correction frameworks that do not involve multiple language models with different architectures. Cross-model correction uses different models for initial response generation and self-correction, so it is unsuitable for evaluating whether LLMs can improve their own initial responses [RQ1, RQ2]. However, it can be used to evaluate [RQ3] whether the final responses from self-correction are better than other methods.
Realistic vs. Unrealistic.
Realistic frameworks use only information that is accessible in real-world applications. Unrealistic frameworks rely on oracle information, such as ground-truth answers, that is not available at inference time in practice (§4).
Fair vs. Unfair.
Realistic frameworks can be categorized by whether they use the best-possible initial responses. Fair self-correction represents frameworks that refine the best-possible initial responses. (1) Intrinsic self-correction (Huang et al., 2024a) uses the same model and information for initial response generation and self-correction. Intrinsic self-correction can be used to assess [RQ1] whether LLMs can self-correct based solely on their inherent capabilities. (2) Fair-asymmetric self-correction uses additional information for self-correction, but also uses information to improve initial response generation as much as possible. For example, self-correction with code interpreters (Chen et al., 2024d; Gou et al., 2024) is not intrinsic but fair because we cannot easily use code interpreters to directly improve the initial response generation. Fair-asymmetric self-correction can be used to evaluate [RQ2] whether LLMs can self-correct the best-possible initial responses using external information. Unfair self-correction (or unfair-asymmetric self-correction) represents frameworks that are practical but do not use the best-possible initial responses. For example, methods that use search engines only for self-correction (Gao et al., 2023; Yu et al., 2023) are unfair because they can use search engines to directly improve the initial response generation. Unfair self-correction can evaluate [RQ3] whether the final responses from self-correction outperform other methods but cannot evaluate [RQ2] whether self-correction can improve the best-possible initial responses.
4 Self-Correction with Prompting
[RQ1] Can LLMs self-correct their best-possible initial responses based solely on their inherent capabilities?
Several studies propose intrinsic self-correction methods, which refine responses from LLMs by prompting the models themselves to generate feedback and then revise the responses. Bai et al. (2022) propose prompting LLMs to correct their own harmful responses. Self-Refine (Madaan et al., 2023) and RCI Prompting (Kim et al., 2023) iteratively prompt LLMs to self-correct their own responses in tasks such as arithmetic reasoning.
Negative Results.
However, recent studies report that intrinsic self-correction does not improve, or even degrades, performance in tasks such as arithmetic reasoning, closed-book QA (Huang et al., 2024a; Gou et al., 2024), code generation (Gou et al., 2024; Olausson et al., 2024), plan generation (Valmeekam et al., 2023), and graph coloring (Stechly et al., 2023). Several studies argue that a bottleneck lies in feedback generation: it is difficult for LLMs to generate reliable feedback on their own responses solely by prompting themselves (Gou et al., 2024; Huang et al., 2024a; Olausson et al., 2024).
Unrealistic or Unfair Settings.
The conflicting positive and negative results motivate us to analyze when LLMs can self-correct only by prompting themselves. Specifically, we assess whether prior studies satisfy the requirements to verify that [RQ1] LLMs can self-correct their responses based solely on their inherent capabilities. As in Table 4, we find that many studies use either oracle information in the self-correction process (unrealistic frameworks) or weak prompts that could easily be improved for generating initial responses (unfair settings), both of which over-evaluate self-correction. Consequently, we conclude that no major work shows successful self-correction of LLM responses using feedback generated by prompting the models themselves under fair settings in general tasks.

Oracle Information: RCI Prompting (Kim et al., 2023) uses ground-truth answers and does not apply self-correction when the initial responses are correct, which unfairly ignores mistakes caused by incorrectly updating correct responses. Reflexion (Shinn et al., 2023) generates feedback by using an exact match between the generated and ground-truth answers, which cannot be accessed in real-world applications.

Weak Initial Responses: Detoxifying harmful responses is a popular task in self-correction research, but prior studies often evaluate settings in which the initial response generation is not instructed to produce harmless responses (Bai et al., 2022; Wang et al., 2024b). Although detecting harmful content using LLMs is a reasonable research topic, this setting does not evaluate self-correction from the best-possible initial responses, since the initial response generation process could be improved by instructing the model not to generate harmful responses. As a more obvious case of weak prompts, Self-Refine (Madaan et al., 2023) uses instructions or few-shot examples that do not correctly correspond to the target task only for initial response generation (e.g., providing wrong target labels in few-shot examples), while using appropriate instructions for self-correction, as shown in Tables 9 and 10. These settings evaluate improvement from weak initial responses, which over-evaluates the gains from self-correction.
| Paper | Task | Using Oracle Info for Feedback | Weak Prompt for Initial Responses | Comments |
|---|---|---|---|---|
| RCI (2023, §3.1) | Computer Tasks | ✓ (stop condition) | – | Uses ground-truth answers and does not update correct responses, which unfairly ignores false-positive corrections |
| Reflexion (2023, §4.2) | HotpotQA (Context) | ✓ (feedback) | – | Feedback is the exact match between the responses and ground-truth answers |
| CAI Revisions (2022) | Detoxification | – | ✓ | Initial generation is not prompted to remove harmful outputs |
| Self-Refine (2023) | Math, Coding, Dialogue | – | ✓ | Unfairly weak or wrong instructions or few-shot demonstrations for initial response generation |
Tasks in which Self-Correction is Exceptionally Effective.
Although our analysis of prior studies shows that intrinsic self-correction is difficult in general, some tasks have properties that make feedback generation easy and enable intrinsic self-correction. For example, CoVe (Dhuliawala et al., 2023) is an intrinsic self-correction method for tasks that require generating multiple answers, such as "Name some politicians who were born in NY, New York." A generated response includes multiple answers, but feedback generation can be decomposed into the easier sub-task of verifying each answer. Tasks with decomposable responses are one of the few groups of tasks for which verification is clearly easier than generation, which enables intrinsic self-correction. However, many real-world tasks do not satisfy this property.
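A minimal sketch of this decomposition is shown below, assuming a hypothetical `llm(prompt)` wrapper and a caller that has already parsed the initial response into individual candidate answers; the verification prompt is illustrative rather than the exact CoVe prompt.

```python
from typing import Callable, List

def verify_decomposed_answers(
    llm: Callable[[str], str],   # hypothetical LLM wrapper (assumption)
    question: str,
    candidate_answers: List[str],
) -> List[str]:
    """Decomposed feedback generation: verify each candidate answer independently
    and keep only those the model confirms, instead of judging the whole list at once."""
    verified = []
    for answer in candidate_answers:
        # Each sub-check is an easier yes/no question than evaluating the full response.
        verdict = llm(
            f"Question: {question}\n"
            f"Candidate answer: {answer}\n"
            "Is this candidate a correct answer to the question? Reply YES or NO:"
        )
        if verdict.strip().upper().startswith("YES"):
            verified.append(answer)
    return verified
```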
5 Self-Correction with External Information
[RQ2] Can LLMs self-correct their best-possible initial responses assisted by external information?
This section analyzes self-correction frameworks that make use of external tools, external knowledge, and fine-tuning.
5.1 Self-Correction with External Tools or Knowledge
Given the observation that feedback generation is a bottleneck of self-correction (§4), improving feedback using external tools or knowledge is a promising direction. External tools used for self-correction include code interpreters for code generation tasks (Chen et al., 2024d; Gou et al., 2024) and symbolic reasoners for logical reasoning tasks (Pan et al., 2023). A popular source of knowledge is search engines, which are often used with queries generated from initial responses to retrieve information for validating their correctness (Gao et al., 2023; Jiang et al., 2023b). These prior studies widely agree that self-correction can improve LLM responses when reliable external tools or knowledge suitable for improving feedback are available.
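As an illustration of execution-based feedback, the sketch below runs candidate code against tests and feeds any traceback back to the model. It follows the spirit of these methods without reproducing their exact prompts or harnesses; `llm(prompt)` is a hypothetical wrapper, and executing model-generated code should be sandboxed in practice.

```python
import subprocess
import sys
from typing import Callable

def run_tests(code: str, test_code: str, timeout: float = 5.0) -> str:
    """Run candidate code plus tests in a subprocess; return stderr on failure,
    or an empty string on success. (Sandbox this in any real deployment.)"""
    proc = subprocess.run(
        [sys.executable, "-c", code + "\n" + test_code],
        capture_output=True, text=True, timeout=timeout,
    )
    return "" if proc.returncode == 0 else proc.stderr[-2000:]

def self_debug(llm: Callable[[str], str], problem: str, test_code: str,
               max_rounds: int = 3) -> str:
    """Self-correction for code generation with an interpreter as the feedback source."""
    code = llm(f"Write a Python solution for the following problem.\n{problem}\nCode:")
    for _ in range(max_rounds):
        error = run_tests(code, test_code)  # external tool: the Python interpreter
        if not error:
            break
        # The execution error serves as reliable external feedback for refinement.
        code = llm(
            f"Problem: {problem}\nCurrent code:\n{code}\n"
            f"Execution error:\n{error}\nFix the code:"
        )
    return code
```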
Unfair Self-correction with External Information.
Although using external tools or knowledge is known to be effective in self-correction, we caution that the way external tools or knowledge are used influences which research questions can be verified (§3.1). As shown in Table 5, some prior studies (Gao et al., 2023; Yu et al., 2023; Zhao et al., 2023) use external knowledge only for self-correction, even though they could also use it directly to improve the initial response generation process. For example, RARR (Gao et al., 2023) uses external knowledge to detect mistakes in initial responses but does not use any external knowledge when generating the initial responses. These methods are reasonable when focusing only on [RQ3] the performance of final responses, but it is not fair to use them for evaluating [RQ2] whether self-correction can improve on the best-possible initial responses. In contrast, using code interpreters for self-correction (Gou et al., 2024; Chen et al., 2024d) can be regarded as starting from best-possible initial responses because there is no easy way to use the interpreter to directly improve initial response generation.
| Paper | Main Task | External Tools or Knowledge for Initial Response Generation | External Tools or Knowledge for Feedback Generation |
|---|---|---|---|
| Reflexion (2023, §4.1, 4.3) | Games, Coding | – | Game Envs, Code Interpreter |
| CRITIC (2024) | GSM8k, SVAMP | – | Python interpreter |
| Self-Debug (2024d) | Text-to-Code | – | Code Interpreter |
| CRITIC (2024) | HotpotQA | Web Search | Web Search |
| FLARE (2023b) | 2WikiMultihopQA, StrategyQA, ASQA | Web Search | Web Search |
| RARR (2023) | NQ, SQA, QReCC | – | Web Search |
| ReFeed (2023) | NQ, TriviaQA, HotpotQA | – | Wikipedia |
Verifiable Tasks.
Some tasks have a property that allows the correctness of responses to be verified easily, even without external information. For example, the constrained generation task evaluated in Self-Refine (Madaan et al., 2023) requires generating a sentence that includes five specified words; correctness can be checked simply by testing whether the five words appear in the generated sentence. Tree-of-thought (Yao et al., 2023) is a generate-and-rank method for verifiable tasks,¹ such as Game of 24, the task of obtaining 24 using basic arithmetic operations (+, −, ×, ÷) and four provided integers. For Game of 24, we can easily verify an answer by checking whether the generated expression evaluates to 24. We consider self-correction to work well in these tasks because they are effectively in the same situation as using strong external tools or oracle information to generate feedback.
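For illustration, verifiers for these two tasks can be written in a few lines; the sketch below assumes simple input formats (a raw sentence, a raw arithmetic expression) that may differ from the exact setups in the cited papers.

```python
import re

def check_constrained_generation(sentence: str, required_words: list[str]) -> bool:
    """Constrained generation: the sentence must contain all required words."""
    return all(
        re.search(rf"\b{re.escape(word)}\b", sentence, flags=re.IGNORECASE)
        for word in required_words
    )

def check_game_of_24(expression: str, numbers: list[int]) -> bool:
    """Game of 24: the expression must use exactly the given integers,
    only arithmetic operators, and evaluate to 24."""
    if not re.fullmatch(r"[\d+\-*/() ]+", expression):
        return False  # allow only digits, +, -, *, /, parentheses, and spaces
    if sorted(int(n) for n in re.findall(r"\d+", expression)) != sorted(numbers):
        return False  # must use exactly the provided integers
    try:
        # eval is acceptable here because the input was whitelisted above
        return abs(eval(expression) - 24) < 1e-6
    except (SyntaxError, ZeroDivisionError):
        return False
```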
5.2 Self-Correction with Fine-tuning
Prior work shows that fine-tuning LLMs to generate feedback or to refine responses improves self-correction capability. A common approach fine-tunes feedback models to generate reference feedback given initial responses and fine-tunes refinement models to generate reference answers given the initial responses and reference feedback (Ye et al., 2023; Lee et al., 2024; Saunders et al., 2022).

Frameworks: The first approach fine-tunes the same model to correct its own responses. In this approach, most methods fine-tune models for all stages: initial response generation, feedback, and refinement (Saunders et al., 2022; Ye et al., 2023; Lee et al., 2024). Another approach corrects responses from larger models using smaller fine-tuned models. This cross-model correction approach often instructs the larger models to refine their own responses using feedback from the smaller fine-tuned models (Yang et al., 2022b; Welleck et al., 2023; Akyurek et al., 2023; Paul et al., 2024), which can be viewed as using the small fine-tuned models as external tools.

Training Strategies: A popular approach is supervised fine-tuning, which trains self-correction modules on human-annotated feedback (Saunders et al., 2022), feedback from stronger models (Ye et al., 2023), or synthetic negative responses (Paul et al., 2024). To avoid the cost of collecting human feedback, self-corrective learning (Welleck et al., 2023) selects model-generated feedback that successfully refines responses as training data, and RL4F (Akyurek et al., 2023) uses reinforcement learning.

External Tools: Some works fine-tune models to refine responses given feedback from external tools. Self-Edit (Zhang et al., 2023b) uses the results of test cases evaluated by code executors for code generation, and Baldur (First et al., 2023) uses proof assistants to improve proof generation.
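As a rough illustration of how such supervised training data is typically laid out, the sketch below turns one annotated record into prompt/completion pairs for a feedback model and a refinement model; the field names and templates are assumptions, not a specific paper's format.

```python
def build_sft_examples(record: dict) -> list[dict]:
    """Build prompt/completion pairs for supervised fine-tuning of (1) a feedback
    model and (2) a refinement model from one annotated record containing
    'input', 'initial_response', 'reference_feedback', and 'reference_answer'."""
    feedback_example = {
        "prompt": (
            f"Input: {record['input']}\n"
            f"Response: {record['initial_response']}\n"
            "Feedback:"
        ),
        "completion": record["reference_feedback"],  # e.g., human- or teacher-written
    }
    refinement_example = {
        "prompt": (
            f"Input: {record['input']}\n"
            f"Response: {record['initial_response']}\n"
            f"Feedback: {record['reference_feedback']}\n"
            "Revised response:"
        ),
        "completion": record["reference_answer"],
    }
    return [feedback_example, refinement_example]
```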
Large Training Data for SFT of Feedback.
As shown in Table 6, many methods that use supervised fine-tuning for feedback generation rely on training data with more than 100K instances. These studies often use feedback generated by stronger models to simulate human annotation, but applying this approach to state-of-the-art models would require large-scale human annotation, since no stronger model is available to provide the feedback. We expect future research to explore approaches that do not require large-scale human annotations (§11).
Unfair Fine-tuning.
Some studies (Welleck et al., 2023) apply stronger fine-tuning to the self-correction models than to the initial response generation models, which means the initial responses are not the best possible given the available resources (§3.2). This approach can be used to evaluate [RQ3] the performance of the final responses compared with other methods but cannot be used to evaluate [RQ2] the improvement from the best-possible initial responses.
6 Strong Baselines
[RQ3] Are the final outputs from self-correction better than other methods?
Self-correction involves multiple LLM calls for generating feedback and refinement. Therefore, to claim that [RQ3] the final outputs from self-correction frameworks outperform other approaches, self-correction should be compared with sufficiently strong baselines that may also rely on additional LLM calls or computational cost. Many self-correction studies do not compare their methods with strong baselines, although some studies have pointed out this issue and compare self-correction with self-consistency (Gou et al., 2024; Huang et al., 2024a) or pass@k in code generation (Zhang et al., 2023b; Olausson et al., 2024). We encourage future research to compare self-correction with strong baselines, including self-consistency and generate-and-rank, to further explore RQ3.
Self-Consistency.
(Wang et al., 2023) is an approach that generates multiple responses for the same input and takes the majority vote of the final answers in reasoning tasks. The idea of selecting good responses using the consistency between multiple responses from the same model has also been extended to other tasks such as text generation (Manakul et al., 2023; Elaraby et al., 2023; Chen et al., 2024c) and code generation (Shi et al., 2022).
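The sketch below shows a minimal version of self-consistency for tasks with short final answers. It assumes a hypothetical `llm(prompt)` wrapper that samples with non-zero temperature and a task-specific `extract_answer` parser, neither of which comes from the cited papers.

```python
from collections import Counter
from typing import Callable

def self_consistency(
    llm: Callable[[str], str],             # hypothetical sampling wrapper (assumption)
    extract_answer: Callable[[str], str],  # task-specific answer parser (assumption)
    prompt: str,
    num_samples: int = 10,
) -> str:
    """Sample several reasoning paths and return the majority-vote final answer.
    No response is refined; selection replaces correction."""
    answers = [extract_answer(llm(prompt)) for _ in range(num_samples)]
    return Counter(answers).most_common(1)[0][0]
```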
Generate-and-Rank.
is an approach that generates multiple responses and selects the best response using verifiers. The post-hoc approach ranks responses using self-evaluation (Weng et al., 2023; Zhang et al., 2023d), confidence (Manakul et al., 2023), fine-tuned verifiers (Cobbe et al., 2021; Shen et al., 2021; Lightman et al., 2024), or verifiers with external tools (Shi et al., 2022; Chen et al., 2023a; Ni et al., 2023). Feedback-guided decoding generates multiple responses and selects the best response for each reasoning step using generation probability (Hao et al., 2023; Tyen et al., 2024), prompted self-evaluation (Jung et al., 2022; Creswell and Shanahan, 2022; Xie et al., 2023; Yao et al., 2023; Miao et al., 2024), or fine-tuned verifiers (Uesato et al., 2022; Tafjord et al., 2022; Yang et al., 2022a; Asai et al., 2024).
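A minimal post-hoc generate-and-rank sketch follows; `generate` and `score` are placeholder callables standing in for a sampling LLM and a verifier (prompted self-evaluation, a fine-tuned scorer, or an external tool).

```python
from typing import Callable, List

def generate_and_rank(
    generate: Callable[[str], str],      # samples one candidate response (placeholder)
    score: Callable[[str, str], float],  # verifier: higher means more likely correct
    prompt: str,
    num_candidates: int = 10,
) -> str:
    """Post-hoc generate-and-rank: sample candidates and return the best-scored one.
    Unlike self-correction, candidates are selected rather than refined."""
    candidates: List[str] = [generate(prompt) for _ in range(num_candidates)]
    return max(candidates, key=lambda response: score(prompt, response))
```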
7 Summary of Our Analysis
Bottleneck is in Feedback Generation.
Prior studies widely agree that LLMs can refine their responses given reliable feedback (§5). However, generating reliable feedback on their own responses remains challenging for LLMs without additional information (§4). In other words, for current LLMs, the hypothesis that recognizing errors is easier than avoiding them (Saunders et al., 2022) holds only for certain tasks whose verification is exceptionally easy, according to our analysis of the experiments in prior studies. We recommend that self-correction research analyze the quality of generated feedback in more detail, rather than only evaluating the downstream performance of the refined responses.
Tasks Suitable for Self-Correction.
Our analysis identifies the properties of tasks that are suitable for self-correction under different conditions.
- Intrinsic Self-Correction (§4)
  - Tasks whose verification tasks are much easier than the original tasks (e.g., tasks whose responses are decomposable)
- Self-Correction with External Information (§5.1)
  - Tasks for which external tools that provide reliable feedback exist (e.g., code generation)
  - Tasks for which responses can be utilized to obtain useful information that is difficult to obtain before generating initial responses (e.g., generating queries from responses to retrieve documents for verifying information)
- Self-Correction with Fine-tuning (§5.2)
  - Many tasks, when large training data for feedback generation is available
  - Tasks that can use reinforcement learning or self-corrective learning (Welleck et al., 2023), i.e., tasks whose responses can be easily evaluated given ground-truth answers
8 Checklist for Self-Correction Research
Our analysis shows that many studies do not clearly define their research questions and fail to conduct appropriate experiments (§3.1, 4). To tackle these issues, we provide a checklist for self-correction research, which lists requirements for designing appropriate experiments to verify the target RQs and recommended experiments for comprehensive analysis. Table 7 provides a checklist for verifying the different RQs identified in Section 3.1. Table 8 provides a checklist for reporting negative results.
9 Differences from Other Surveys
Pan et al. (2024) provide a comprehensive survey on broad topics related to self-correction, including training strategies. Our work specifically focuses on (inference-time) self-correction and provides a more detailed and critical analysis of prior work. Huang et al. (2024a) provide an analysis of problems in the evaluation settings of self-correction research, which motivates our work. They focus on analyzing a few papers on intrinsic self-correction in reasoning tasks. We provide a more comprehensive analysis of self-correction with in-context learning, external tools, and fine-tuning.
10 Related Work of Self-Correction
Self-Detection.
of mistakes in LLM responses using LLMs (possibly with external information) has been studied in various domains, including misinformation detection (Zhang et al., 2023c; Chern et al., 2023; Chen and Shu, 2024; Mishra et al., 2024), context-faithfulness (Wang et al., 2020; Durmus et al., 2020; Scialom et al., 2021), harmful content detection (Rauh et al., 2022), and bias detection (Blodgett et al., 2020; Feng et al., 2023). However, recent studies (Tyen et al., 2024; Kamoi et al., 2024) show that even strong LLMs often cannot detect their own mistakes in various tasks.
Editing Human-Written Text.
by using language models has been studied in various domains, including information update (Shah et al., 2020; Iv et al., 2022; Schick et al., 2023), grammatical error correction (Ng et al., 2014; Lichtarge et al., 2019), factual error correction (Cao et al., 2020; Thorne and Vlachos, 2021), and code repair (Gupta et al., 2017; Mesbah et al., 2019; Bader et al., 2019; Chen et al., 2021; Yasunaga and Liang, 2020, 2021).
Self-Training.
or self-improvement is an approach to train models using their own responses. Some studies use self-evaluation or self-correction for creating training data (Bai et al., 2022; Gulcehre et al., 2023) or use self-evaluation as training signals (Pang et al., 2024). Another approach improves the reasoning of LLMs using LLM-generated reasoning by selecting high-quality outputs using ground-truth final answers (Zelikman et al., 2022) or self-consistency (Huang et al., 2023). As another direction, Meng et al. (2022) use sentences generated by LLMs with high confidence for training classifiers.
11 Future Directions
Improving Feedback.
Prior studies indicate that it is difficult for LLMs to generate feedback on their own responses with in-context learning (§4, 7). However, most studies in intrinsic self-correction (Madaan et al., 2023; Huang et al., 2024a) use simple prompts for generating feedback, and there is room for improvement. A possible direction to improve feedback is to apply (reference-free and point-wise) LLM-based evaluation metrics. Recent approaches for improving the model-based evaluation include using human-written evaluation criteria (Chiang and Lee, 2023; Liu et al., 2023) and decomposing responses (Saha et al., 2024; Min et al., 2023). As another direction, recent studies in self-correction propose frameworks using the confidence in their responses, estimated by generation probabilities (Varshney et al., 2023; Jiang et al., 2023b), prompting (Li et al., 2024a), or generating new questions from their answers to evaluate logical consistency (Jung et al., 2022; Tafjord et al., 2022; Wu et al., 2024).
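As one example of a probability-based confidence signal, the sketch below computes a length-normalized sequence probability from per-token log-probabilities; this is a common estimator, not necessarily the exact one used in the cited works, and it assumes the API exposes token log-probabilities.

```python
import math
from typing import List

def sequence_confidence(token_logprobs: List[float]) -> float:
    """Length-normalized sequence probability of a response, usable as a
    confidence score to decide whether to trigger self-correction."""
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))
```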
Unexplored Tasks.
The difficulty of self-evaluation differs from task to task (§4), while many studies assume that verification is consistently easier than generation. We expect that there are unexplored tasks in which intrinsic self-correction works well, although self-correction research mostly focuses on reasoning tasks such as math reasoning and coding (Madaan et al., 2023; Gou et al., 2024; Huang et al., 2024a). For example, LLM-based evaluation is often studied in open-ended text generation, such as dialogue generation and text summarization (Fu et al., 2024; Liu et al., 2023), suggesting that reasonable model-based feedback is available for these tasks.
Fine-tuning on Small Training Data.
Fine-tuning of feedback generation often relies on large training data, which requires large-scale human annotations (§5.2). We expect future work to explore self-correction with smaller training data. Although reinforcement learning (Akyurek et al., 2023) or self-corrective learning (Welleck et al., 2023) do not require human feedback, they require reasonable reward functions for evaluating LLM responses, which are not available in many tasks. For example, RL4F (Akyurek et al., 2023) uses ROUGE as a reward function for text summarization and action planning, which is sub-optimal.
Pre-training for Improving Self-Correction.
Prior studies show that large-scale fine-tuning on reference feedback improves the self-correction capability of LLMs (§5.2). This observation suggests that current pre-training approaches or datasets are insufficient for LLMs to acquire self-correction capability. We expect future work to explore pre-training strategies that improve the intrinsic self-correction capability of LLMs.
12 Conclusion
We provide a critical survey of self-correction to identify in which conditions LLMs can self-correct their mistakes. Our analysis reveals that many studies fail to define their research questions clearly or design experiments appropriately. To tackle these issues, we categorize research questions and frameworks in self-correction research and provide a checklist for conducting appropriate experiments.
Acknowledgments
This work was supported by a Cisco Research Grant. We appreciate valuable suggestions from the action editor and anonymous reviewers.
Notes
1. Tree-of-thought is a generate-and-rank method and not a self-correction method in our definition.