Abstract
Digital education has gained popularity in the last decade, especially after the COVID-19 pandemic. With the improving capabilities of large language models to reason and communicate with users, envisioning intelligent tutoring systems that can facilitate self-learning is not very far-fetched. One integral component to fulfill this vision is the ability to give accurate and effective feedback via hints to scaffold the learning process. In this survey article, we present a comprehensive review of prior research on hint generation, aiming to bridge the gap between research in education and cognitive science, and research in AI and Natural Language Processing. Informed by our findings, we propose a formal definition of the hint generation task, and discuss the roadmap of building an effective hint generation system aligned with the formal definition, including open challenges, future directions and ethical considerations.
1 Introduction
Prior research has established a correlation between the student-teacher ratio and a student’s overall performance (Koc and Celik, 2015). However, private tutoring is not accessible to everyone, and finding expert tutors is often difficult and incurs considerable costs (Bray, 1999; Graesser et al., 2012). Intelligent tutoring systems (ITSs) hold the key to addressing these educational challenges, notably the need for personalized learning in a system often reliant on instructional teaching and standardized testing (Anderson et al., 1985).
The hallmark of intelligent tutoring systems is their ability to provide step-by-step guidance to students while they work on problems, and hints play a critical role in their ability to provide this help (Biswas et al., 2014). Hints are a tool for providing scaffolded support to learners, and can be traced back to the Zone of Proximal Development in Vygotsky’s socio-cultural theory, referring to “the gap between what a learner can do without assistance and what a learner can do with adult guidance or in collaboration with more capable peers” (Vygotsky, 1978).
Within learning science, hints refer to the clues, prompts, questions, or suggestions provided to learners to aid them in solving problems, answering questions, or completing tasks, thereby encouraging critical thinking, problem-solving skills, and independent learning. In Figure 1, we provide an example of a hint generation system capable of the reasoning required to answer the question, acknowledging the wrong attempt by the learner and providing informative hints linked to the learner’s existing knowledge. This framework of scaffolding is well established in education (Van de Pol et al., 2010), and expert tutors are guided to incorporate it in their teaching practices (Belland, 2017).
Figure 1: An example of a hint generation system capable of acknowledging the learner’s wrong answer, and scaffolding them in the correct direction.
With the aim of developing hint generation systems with capabilities such as the ones showcased in Figure 1, we consolidate the dispersed efforts on hint generation, bridging the gap between research in education and cognitive sciences on the one hand, and research in AI and Natural Language Processing (NLP) on the other. Grounded in the findings from our literature review, we provide a roadmap for future research on automatic hint generation. We summarize the key characteristics of a successful hint as observed by research with human tutors in Section 2 and review the automated hint generation systems in Section 3. We identify the gaps and propose a roadmap for future research in hint generation in Section 4. We provide a rethinking of the formal definition of the hint generation task (Section 4.1), a brief review of research areas that can inform the design of a hint generation system that aligns with the formal definition (Section 4.2), open challenges for future directions for effective automatic hint generation systems (Section 4.3), and ethical considerations (Section 4.4). Our major contributions include:
A literature review on hint generation that bridges the gap between research in education and cognitive science on the one hand and research in AI and NLP on the other.
A formal definition of the hint generation task, grounded in the cognitive theories on learning and findings from qualitative research.
A roadmap for research on automatic hint generation, outlining challenges, promising future directions, and ethical considerations for the field.
2 Anatomy of a Hint
In this section, we draw on research from education and cognitive sciences to describe the key characteristics of an effective hint formulation process. We start by describing the pragmatics of a hint (‘context’), covering some prominent traits exhibited by expert tutors and educators while generating hints, and then dive deeper into the anatomy of a hint by discussing the semantic (‘what to say’) and the stylistic (‘how to say it’) aspects of a hint.
2.1 Pragmatics of a Hint
Expert tutors are able to provide high-quality support to students because they are aware of each student’s individual learning style, strengths, and weaknesses. These tutors often exhibit a contextual awareness about their students, and below we describe two such common practices adopted by educators when supporting students.
Scaffolding Support: Structuring hints in a scaffolded manner, with incremental steps leading to the solution, helps learners systematically build their understanding. These “just-in-time interventions” (Wood et al., 1976) allow students to build their understanding step by step, starting with foundational concepts and progressing toward more advanced aspects without being overwhelmed by the information or task complexity (Zurek et al., 2014; Lin et al., 2012; Hammond, 2001).
Personalization and Learner Feedback: Every learner is unique and has different needs and preferences when it comes to learning (Bulger, 2016). Prior studies point towards a learner-centered pedagogical system, where personalization and individualization of learning have a significant role in the students’ overall learning process and strengthen their sense of self and individuality (Radovic-Markovic and Markovic, 2012). To generate effective hints, it is important to recognize and cater to these individual needs by considering learners’ strengths, challenges, cultural sensitivity, and preferences (Chamberlain, 2005; Ibrahim and Hussein, 2016; Suaib, 2017). A good learning environment also incorporates a feedback loop, where hints are accompanied by opportunities for learners to provide feedback, promoting active engagement. This two-way communication allows tutors to gauge the effectiveness of their guidance and adjust their learning plans (Boud and Molloy, 2013).
2.2 Semantics of a Hint
The semantics of a hint refers to the information conveyed by the hint, which includes explaining the key concepts and ideas required to scaffold the learning process. We observed the following properties of an effective hint’s semantics.
Relevance to the Learning Objective: Learning objectives serve as a measure of achievable goals that articulate what learners should know or be able to do by the end of a learning experience. Learning objectives can broadly be categorized across three domains: cognitive, affective, and psychomotor objectives (Hoque, 2016; Sönmez, 2017). Each domain has different expectations and goals to assess the effectiveness of a hint. A hint generation system should model these objectives to create successful high-quality hints.
Link to Prior Knowledge: A successful hint would act as a bridge between a learner’s existing knowledge base and the current learning step to foster continuity in learning. Studies have shown that building on prior knowledge helps students bridge gaps, clear misconceptions, and reinforces the relevance of new information (Hailikari et al., 2008, 2007; Dong et al., 2020).
Conceptual Depth: Many learning sessions focus on teaching learners how to harness latent cognitive abilities and mold them into deep conceptual thinkers with the ability to discuss and question more, seeking to understand rather than only memorize (Rillero, 2016). It is important to balance a hint’s complexity so that it piques a student’s interest without overwhelming them.
2.3 Style of a Hint
Expert human tutors adopt diverse techniques to convey information to learners. These strategies vary from non-verbal cues such as body language, facial expressions, and vocal tone (Bambaeeroo and Shokrpour, 2017; Wahyuni, 2018) to adopting multimedia content to teach complex topics (e.g., using animations and maps to teach the geographical concept of “folded mountains”) (Kapi et al., 2017). We cover the most relevant aesthetic aspects of hints that might be useful in building better hint-generation systems.
Clarity and Simplicity: Hints should be expressed in clear and simple language to ensure that learners easily grasp the underlying concept or problem-solving strategy. Avoiding unnecessary complexity enhances the usefulness of the hint and is usually well received by the learners. This is a well-established practice within the learning sciences community known as direct instruction (Kozloff et al., 1999; Kim and Axelrod, 2005; Rosenshine, 2008).
Encouragement and Positive Tone: The role of encouragement and positive attitude has been extensively investigated in several human studies in classroom settings, and all unanimously align with the significance of motivating learners towards better performance, increased participation, and improved self-confidence (Ducca, 2014; Yuan et al., 2019; Li, 2021; Lalić, 2005). A hint generation system could benefit from incorporating a positive, encouraging tone (as demonstrated in Figure 1).
Adopting Creative and Multi-modal Elements: In order to encourage active participation and retain learners’ interest, human tutors often adopt several creative and multi-modal elements to facilitate better understanding and information retention. These creative elements include interactive literary devices like analogy (Richland and Simms, 2015; Gray and Holyoak, 2021; Nichter and Nichter, 2003; Thagard, 1992), questions (Hume et al., 1996; Chi, 1996), and metaphors (Low, 2008; Sfard, 2012; Guilherme and Souza de Freitas, 2018). We can also expand beyond text, and incorporate information from other modalities such as maps (Winn, 1991), diagrams (Winn, 1991; Swidan and Naftaliev, 2019; Tippett, 2016), and multimedia content (Abdulrahaman et al., 2020; Collins et al., 2002; Kapi et al., 2017) to effectively complement the learning experience. A good hint can take inspiration from some of these creative elements for a successful transfer of knowledge.
A good instructor typically takes the general guidelines into consideration and uses a mixture of the aforementioned semantic and stylistic features to create effective hints based on their prior tutoring experiences. For instance, to develop a hint that uses the literary device of analogy, the tutor must understand the prior knowledge of the learner to create successful hints (Gray and Holyoak, 2021).
3 Survey of Computational Approaches
In this section, we provide a comprehensive overview of recent advancements in computational approaches for automatic hint generation. We first describe the extensively studied hint generation techniques for computer programming, which focus on revealing code snippets to help learners program. Next, we dive into the relatively under-explored natural language hint generation, exploring strategies for diverse domains like mathematics, language acquisition, and factual entity-based questions. We conclude this section by describing some limitations of today’s automatic hint generation systems, and propose a roadmap for future research in the field in Section 4.
3.1 Hint Generation for Computer Programming
A vast majority of computational approaches for hint generation have focused on the specific domain of computer programming, owing to the more objective nature of the task and abundance of data. We briefly discuss the approaches, datasets, and evaluation metrics adopted in the field. For a more comprehensive review in this specific domain, we refer the readers to the surveys written by Le et al. (2013), Crow et al. (2018), McBroom et al. (2021), and Mahdaoui et al. (2022).
Datasets.
Two widely popular datasets in the programming hint generation space are iSnap (Price et al., 2017) and ITAP (Rivers and Koedinger, 2017). Both datasets consist of detailed logs collected from several students working on multiple programming tasks, including the complete traces of the code and records of when hints were requested. iSnap (Price et al., 2017) is based on Snap!, a block-based educational graphical programming language, while ITAP (Rivers and Koedinger, 2017) is a Python dataset collected from two introductory programming courses taught at Carnegie Mellon University. In Table 1, we describe an example from the ITAP dataset, where the goal is to write a program that determines whether a given day is a weekend day. Given a student’s code that fails to pass the pre-determined test cases (e.g., isWeekend("Sunday") will return False), the aim of a hint generation system is to provide hints that help them successfully solve the problem (e.g., replacing the lowercase ‘saturday’ with the uppercase ‘Saturday’).
Table 1: Hint generation examples from selected prior research discussed in Section 3.

| Data Source | Input | Expected or Generated Outputs |
|---|---|---|
| ITAP | Question: Write a program that determines whether a given day is a weekend day. Learner's response: `def isWeekend(day): return bool(day=='sunday' or day=='saturday')` | Hint generated by Rivers and Koedinger (2017): Type: Replace; Old expression: "saturday"; New expression: "Saturday" |
| ReMath | Question: Mike has 4 cookies and he eats 3 cookies. So Mike has ____ cookies left? Learner's response: He has 10 cookies left. | Expected output: Error type: guess; Response strategy: provide a solution strategy; Response intention: help student understand the lesson's topic or solution strategy; Response: "Great try! Let's try to draw a picture. Let's start with 4 cookies and erase the 3 that Mike eats." |
| SQuAD | Question: Who became the most respected entrepreneur in the world according to Financial Times in 2003? Expected answer: Bill Gates | Hint generated by Jatowt et al. (2023): The searched person held the position of chief executive officer. |
| TriviaQA | Question: In which city are the headquarters of the International Monetary Fund? Expected answer: Washington D.C. | Hints generated by Mozafari et al. (2024b): The city is known for its neoclassical architecture. The city is located on the Potomac River. The city is the capital of the USA located on the east coast. |
Approaches.
Most of the recent efforts in programming hint generation adopt a data-driven deterministic approach (Barnes and Stamper, 2008; Rivers and Koedinger, 2017; Obermüller et al., 2021; Jin et al., 2012; Zimmerman and Rupakheti, 2015; Paaßen et al., 2018; Price et al., 2016; Rolim et al., 2017) that comprises three key components: a corpus of diverse candidate solutions (usually obtained via past student attempts), a matching algorithm to select the best candidate response for an ongoing attempt based on similarity, and graph-based solution path construction to synthesize hints (Figure 2 describes one approach in detail). We found abstract syntax trees (ASTs) (McCarthy, 1964; Knuth, 1968) to be the most popular choice of graph representation for hint synthesis, due to their vast literature and language-agnostic nature. Similarly, McBroom et al. (2021) provide a detailed generalization of hint generation techniques for the programming domain called HINTS (Hint Iteration by Narrow-down and Transformation Steps). Although effective, programming hints are rarely natural language responses and therefore cannot incorporate the stylistic aspects of hints that improve the learner’s experience (Section 2.3). We believe an amalgamation of NLP technologies and current hint generation systems could address this limitation.
Figure 2: Illustration of the path construction algorithm (Rivers and Koedinger, 2014), which generates programming hints for an ongoing student attempt (current state), given reference solution(s) (goal states) and a test case-based scoring function.
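To make this data-driven recipe concrete, below is a minimal sketch of next-step hint selection. It is not the authors’ implementation: it substitutes a cheap string comparison over dumped Python ASTs for the tree-edit distances used by systems like ITAP, and the function names and data layout are illustrative.

```python
import ast
from difflib import SequenceMatcher

def state_similarity(code_a: str, code_b: str) -> float:
    # Crude stand-in for AST similarity: compare dumped syntax trees as strings.
    # Real systems use tree-edit distances over the ASTs themselves.
    try:
        repr_a, repr_b = ast.dump(ast.parse(code_a)), ast.dump(ast.parse(code_b))
    except SyntaxError:
        repr_a, repr_b = code_a, code_b  # fall back to raw text for unparsable code
    return SequenceMatcher(None, repr_a, repr_b).ratio()

def next_step_hint(current_state: str, solution_paths: list[list[str]]) -> str:
    """Match the learner's code against every non-goal state on the stored
    solution paths (harvested from past student attempts) and reveal the
    successor of the best match. Each path is assumed to contain at least a
    start state and a goal state."""
    best_path, best_idx, best_sim = None, -1, -1.0
    for path in solution_paths:
        for i, state in enumerate(path[:-1]):
            sim = state_similarity(current_state, state)
            if sim > best_sim:
                best_path, best_idx, best_sim = path, i, sim
    if best_path is None:
        raise ValueError("need at least one solution path with two states")
    return best_path[best_idx + 1]
```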
Evaluation Metrics.
Given that the vast majority of work adopts a data-driven hint strategy, programming hints are designed to reveal some aspect of the program not yet written by the learner. Within our survey, all the evaluation metrics adopted a code-based similarity measure to gauge the quality of the hint, and we found two prominent evaluation paradigms, which we categorize into reference-based and reference-free evaluation metrics. Reference-based evaluation metrics (Price et al., 2019) assume the availability of reference hints developed by expert tutors. For instance, QualityScore (Price et al., 2019) is a reference-based evaluation metric that uses an abstract syntax tree-based similarity measure to evaluate the quality of generated programming hints with respect to expert tutor-written hints.
Reference-free evaluation metrics (Rivers and Koedinger, 2017; Paaßen et al., 2018; Obermüller et al., 2021; Zimmerman and Rupakheti, 2015) focus on measuring the impact of a hint on the learner’s response. For example, Paaßen et al. (2018) proposed root-mean-square error over two distance measures: (1) the distance between the predicted post-hint state and the true next state, and (2) the distance between the predicted post-hint state and the learner’s true final state. Rivers and Koedinger (2017), on the other hand, adopted a learner-agnostic approach to evaluate hint quality. They used two measures of a successful chain of hints: (1) the ability of the hint sequence to reach the correct solution state, and (2) the length of the chain.
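Concretely, the first of these distance measures can be written as a root-mean-square error; in our notation (not necessarily that of Paaßen et al.), with d the chosen code distance, ŝᵢ the predicted post-hint state, and sᵢ the true next state over N hint requests:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} d\left(\hat{s}_i, s_i\right)^2}$$

The second measure is analogous, with sᵢ replaced by the learner’s true final state.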
Reference-based evaluation metrics are able to compare the quality of generated hints with the expert tutor’s feedback capabilities, but can be difficult to scale. Reference-free evaluation metrics, on the other hand, are capable of evaluating previously unseen problems but require human evaluations to obtain performance signals for the metrics. Each of these paradigms has pros and cons, but currently, there is no holistic measure of the quality of programming hints that emphasizes the pragmatics, semantics and style characteristics of the generated hints.
3.2 Natural Language Hint Generation
Natural language hint generation has risen in popularity in the last few years as a consequence of recent advancements in NLP, particularly large language models capable of generating fluent and coherent text (Min et al., 2023; Naveed et al., 2023). We found the question answering format to be the most prevalent setup for natural language hint generation systems, where the learner attempts to answer a question to recall and concretize their understanding of a concept.
Datasets.
ReMath (Wang et al., 2023) is a benchmark co-developed with math teachers (i.e., experts) for evaluating and tutoring students in the mathematics domain. ReMath provides a systematic breakdown of the human-tutoring process into three steps: (1) identifying the error type, (2) determining a response strategy and intention, and (3) generating a feedback response that adheres to this strategy (example in Table 1). Each of these steps is manually annotated by an expert math teacher. Wang et al. (2023) also provide a set of error types (e.g., guess, careless, misinterpret, right-idea), response strategies (e.g., explain a concept, ask a question), and intentions (e.g., motivate the student, get the student to elaborate the answer) to facilitate the feedback generation process. ReMath is a great example of high-quality data collection with human experts towards building better hint generation systems.
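The annotation schema lends itself to a simple record type; the sketch below is purely illustrative of the three-step breakdown, not the dataset’s actual storage format.

```python
from dataclasses import dataclass

@dataclass
class ReMathAnnotation:
    error_type: str          # e.g., "guess", "careless", "misinterpret", "right-idea"
    response_strategy: str   # e.g., "explain a concept", "ask a question"
    response_intention: str  # e.g., "motivate the student", "get the student to elaborate"
    response: str            # the tutor's natural-language feedback to the learner
```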
TriviaHG (Mozafari et al., 2024b) is another hint generation dataset, developed by extending the TriviaQA dataset (Joshi et al., 2017). Mozafari et al. (2024b) utilize Microsoft Copilot to generate hints, owing to its retrieval augmented generation approach that produces more reliable responses grounded in internet-retrieved documents. TriviaHG includes 10 generated hints for each question that Copilot is capable of answering. In follow-up work, Mozafari et al. (2024a) propose HintQA, which utilizes these hints as concise context, improving the QA capabilities of LLMs over other context-retrieval and context-generation baselines.
Approaches.
For open-ended hint generation on factoid questions, Jatowt et al. (2023) proposed a Wikipedia-based retrieval framework for “Who?”, “Where?”, and “When?” type questions. For the “When” question type, they propose a popularity-based framework, where the popularity of an event in the answer year is measured by the count of Wikipedia hyperlinks directing to the event’s website; for the “Who” and “Where” question types, they use a hand-curated template approach (example hint described in Table 1). Wang et al. (2023), on the other hand, benchmarked the ReMath dataset by instruction fine-tuning (Wei et al., 2021) language models like Flan-T5 (Chung et al., 2022) and GODEL (Peng et al., 2022), and by using in-context learning (Dong et al., 2022) prompts for gpt-3.5-turbo and gpt-4 (Achiam et al., 2023). Tack and Piech (2022) and Wang et al. (2023) found the direct use of LLMs to fall short in comparison to human expert responses.
Pal Chowdhury et al. (2024) propose a hint generation framework for middle-school math word problems, termed MWPTutor. MWPTutor provides a hint by formulating a question around the next operation to be performed in the state space. This hint is obtained by matching the ongoing response against all possible decomposed solutions, which are produced by using a language model to break reference solutions into atomic mathematical operation steps.
Current question answering hint generation systems do not personalize the hints to learners’ preferences, learning objectives, or their prior knowledge (Sections 2.1 and 2.2). We discuss how we can improve these hint generation systems to aid the learning process in Section 4.
Evaluation Metrics.
All the discussed approaches for hint generation have used human evaluation to assess the quality of the system’s output. Jatowt et al. (2023) conducted a between-subjects study to evaluate their proposed hint generation strategies across different experimental groups. Tack and Piech (2022) proposed the “AI Teacher Test”, comparing the generated responses against teacher responses across three dimensions: “speak like a teacher”, “understand a student”, and “help a student”. They identified that LLMs are good at conversational uptake (i.e., the first two requirements) but are quantifiably worse than real teachers on several pedagogical dimensions, especially helpfulness to a student. Wang et al. (2023) evaluate error type identification and feedback response strategy selection as multi-class classification tasks, utilizing exact match and Cohen’s kappa to measure accuracy, and entropy to measure output diversity. They also conducted human evaluations for the response generation task and found that all models constrained by knowledge of the ground-truth error type and response strategy outperformed their unconstrained counterparts.
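For reference, chance-corrected agreement for such classification-style evaluation can be computed as follows (a standard textbook implementation, not Wang et al.’s code):

```python
from collections import Counter

def cohens_kappa(a: list[str], b: list[str]) -> float:
    """Agreement between two labelings (e.g., predicted vs. annotated error
    types), corrected for the agreement expected by chance."""
    assert len(a) == len(b) and a, "labelings must be non-empty and aligned"
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # degenerate case: both labelings are a single constant label
    return (observed - expected) / (1 - expected)
```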
For evaluation of hint generation systems in quantitative user studies, Pal Chowdhury et al. (2024) extend the idea of success rate (S: fraction of correctly answered questions) and telling rate (T: fraction of conversations where the answer was revealed) proposed by Macina et al. (2023). Pal Chowdhury et al. (2024) suggested an adjusted success rate (S − T) to prevent an overly revealing framework from achieving high performance, and the harmonic mean of the success rate and the adjusted success rate, 2S(S − T)/(2S − T), as the overall tutoring score.
Mozafari et al. (2024b), on the other hand, proposed two learner-agnostic automatic evaluation metrics: convergence, which measures the ability of a hint to eliminate wrong candidate answers, and familiarity, which measures the recognizability of answer entities. To measure the convergence of a hint, they adopt a three-step process: (i) generating candidate answers using LLMs, (ii) validating the entailment of each candidate answer given the hint, and (iii) computing an aggregate score for the hint across all candidate answers. For familiarity, they utilize the page views of the Wikipedia articles corresponding to the named entities present in the hint as a measure of global familiarity with the hint, normalized across all questions in the TriviaHG corpus (Mozafari et al., 2024b).
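A rough sketch of this convergence computation, with `generate_candidates` and `entails` as assumed stand-ins for the LLM-backed components described above (they are not part of any published API), might look as follows:

```python
def convergence(question: str, hint: str, answer: str,
                generate_candidates, entails) -> float:
    """Fraction of wrong candidate answers that the hint rules out.
    `generate_candidates(question) -> list[str]` and
    `entails(hint, candidate) -> bool` are assumed LLM-backed callables."""
    candidates = generate_candidates(question)                 # step (i)
    wrong = [c for c in candidates if c != answer]
    if not wrong:
        return 0.0
    eliminated = [c for c in wrong if not entails(hint, c)]    # step (ii)
    return len(eliminated) / len(wrong)                        # step (iii)
```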
4 Roadmap for Future Research in Hint Generation
Great progress has been made in automatic hint generation over the last two decades; however, there is still room for improvement. Existing hint generation frameworks do not personalize the hints to the learner’s prior knowledge (Section 2.1), and are only evaluated in short-term studies. We still do not know the effects of long-term exposure to these interventions on the learners. These frameworks are also limited to certain domains and have not been widely explored in other domains, including different branches of science and social science. To improve upon these factors, we discuss a roadmap for future efforts in automatic hint generation. We propose a computational hint generation framework that draws on education and cognitive sciences (Figure 3).
4.1 Formal Definition
Jatowt et al. (2023) proposed a formal definition for hint generation as follows: Given a question q and its correct answer a, the task is to generate a hint h, such that P(a|q, h) −P(a|q) > ϵ, where P(a|q, h) denotes the probability of a user answering q if the hint h is given, P(a|q) is the probability of a user answering q without the hint, and ϵ is a threshold parameter (ϵ > 0). This definition emphasizes the hint’s ability to help answer a question and does not incorporate any pedagogical aspects or principles. It also does not take individual preferences into consideration.
Below, we provide a comprehensive definition that draws inspiration from widely adopted cognitive frameworks on human learning, such as Anderson’s adaptive control of thought-rational (ACT-R) theory (Anderson, 2013) and Ausubel’s theory of meaningful learning (Ausubel, 1963, 1962, 2012). We also incorporate the findings from the qualitative research discussed in Section 2, explicitly integrating the alignment of hints with students’ learning objectives and their prior knowledge (Section 2.2), and incorporating the pragmatics of a hint (Section 2.1) by accounting for learners’ preferences. We formulate the hint generation task within a tutoring framework whose goal is to correctly answer a question. We later explain how the definition can be extended to generate hints for tasks beyond question answering.
Refined Formal Definition.
Given a learner l attempting to answer a question q, a hint generation system H (Figure 3) generates a hint h ∈ ℋ by mapping H: 𝒳 → h, where 𝒳 = ⟨q, a, 𝒦⟨q,a⟩, 𝒟, ℒl, f_l^o, f_l^p⟩ is the input to the hint generation system, with the following elements:
q: question, (1)
a: correct answer, (2)
𝒦⟨q,a⟩: supporting knowledge for the question-answer pair ⟨q, a⟩,
𝒟 = ⟨(â1, h1), …, (ât, ht)⟩: ongoing dialogue, where âi and hi are, respectively, the learner’s past attempts and hints related to q,
ℒl: learner l’s past learning history,
f_l^o: a function to measure the learner l’s learning objective(s), and (3)
f_l^p: a measure of learner l’s preference of and familiarity to a hint. (4)
Both Anderson’s ACT-R theory and Ausubel’s meaningful learning theory emphasize the significance of connecting the new knowledge required to solve a problem to the existing concepts in the learner’s knowledge base (entailed in Equation (4)). We also account for the diversity of learners’ motivations to study and incorporate the notion of improving an individual’s learning objective (Equation (3)). The hint generation strategy for a learner aiming to improve their answer accuracy would greatly differ from that for someone aiming to maximize the diversity of acquired knowledge in a learning session.
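To make the definition concrete, the input 𝒳 can be bundled into a single record; the field names below are illustrative stand-ins for the symbols above, and the selector is deliberately trivial:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class HintInput:
    """The input X to a hint generation system H (names are illustrative)."""
    question: str                               # q (1)
    answer: str                                 # a, the correct answer (2)
    knowledge: list[str]                        # supporting knowledge for <q, a>
    dialogue: list[tuple[str, str]]             # (past attempt, past hint) pairs
    history: list[str]                          # learner l's past learning history
    learning_objective: Callable[[str], float]  # f_l^o (3): scores progress toward a goal
    preference: Callable[[str], float]          # f_l^p (4): preference/familiarity score

def generate_hint(x: HintInput, candidate_hints: list[str]) -> str:
    # A deliberately trivial H: among externally produced candidate hints,
    # return the one the learner's preference function scores highest.
    # Assumes a non-empty candidate list.
    return max(candidate_hints, key=x.preference)
```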
Extending the Formal Definition.
The given definition assumes a question-answering setup for the hint generation task, with the additional assumption that objective answers are available. We can modify this definition to accommodate other hint generation settings, as described below.
For subjective questions, we should replace the correct answer a with an evaluation rubric.
For a writing assistance task, we need to replace the question q with the writing task description, the ongoing dialogue 𝒟 with an interactive sequence of the learner’s writings and past hints, and the answer a with a target rubric for writing evaluation.
For a multi-modal hint generation system, we can assume the atomic instances of the dialogue (namely q, a, h, and the learner’s attempts â) are constituted of different modalities, depending on the task specifications.
4.2 Components of an Effective Automatic Hint Generation System
In this section, we briefly review some NLP research areas that can inform the design of an automatic hint generation system (Figure 3) that aligns with the proposed formal definition introduced in Section 4.1.
Question Answering.
Current hint generation systems are framed as scaffolding tools within a question answering (QA) setup, where learners are assessed on their ability to answer questions on a relevant topic. These systems can benefit from research in question answering in two ways.
Firstly, hint generation models should be equipped with reasoning abilities to answer questions. Anderson et al. (1995) emphasize the significance of an underlying production-rule model that helps break down a learning goal into achievable subgoals in their cognitive tutoring framework, and we posit that today’s QA systems can form the foundation of this production-rule model. Datasets like StrategyQA (Geva et al., 2021) and strategies like Self-Ask (Press et al., 2022) and Socratic Questioning (Qi et al., 2023) are great illustrations of the goal decomposition ability described in the ACT-R cognitive learning theory framework (Anderson et al., 1995). We can adapt these computational question answering frameworks, coupled with the answer assessment module (described below), to learn the mapping H: 𝒳 → h.
Secondly, question answering systems are quite diverse and can solve complex questions spread across multiple modalities such as knowledge bases (Lan et al., 2022), tables (Jin et al., 2022), images (Srivastava et al., 2021; de Faria et al., 2023), and videos (Zhong et al., 2022), as well as explain the reasoning for the answer (Danilevsky et al., 2020; Schwalbe and Finzel, 2023). These question answering models can form the basis of hint generation systems that take into account incomplete solutions and the necessary reasoning steps to answer the question.
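A decomposition loop loosely inspired by Self-Ask (Press et al., 2022), sketched here with an assumed `llm` text-completion callable and a hypothetical prompt, illustrates how sub-questions could seed a sequence of scaffolded hints:

```python
def decompose_into_subgoals(question: str, llm, max_steps: int = 5) -> list[str]:
    """Repeatedly ask the model for a prerequisite sub-question until it
    signals none is needed; the resulting chain can seed a hint sequence.
    `llm(prompt) -> str` is an assumed callable, not a specific library API."""
    subgoals, context = [], question
    for _ in range(max_steps):
        follow_up = llm(
            f"Question: {context}\n"
            "If an intermediate question must be answered first, state it; "
            "otherwise reply 'NONE'."
        ).strip()
        if follow_up.upper() == "NONE":
            break
        subgoals.append(follow_up)
        context = follow_up  # recurse on the newest sub-question
    return subgoals
```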
Answer Assessment.
Answer assessment plays a crucial role in evaluating and understanding learner responses, enabling the system to provide targeted feedback and adaptive guidance. Accurate answer assessment is foundational for standardizing the grading process, identifying misconceptions, tracking individual progress, and tailoring subsequent hints to address specific learning needs. We can utilize an answer assessment module to identify the missing (or wrong) components of an ongoing answer (â) compared against a reference answer (a), and provide appropriate hints to scaffold learners towards a correct solution. We can also utilize the answer assessment module to aggregate the mistakes made by a learner over time to track their progress.
Prior work includes several answer evaluation strategies, varying from naive exact match and token overlap approaches to BERT-based semantic strategies for short-form answers (Bulian et al., 2022), explainable systems trained on LLM-distilled rationales (Li et al., 2023a), semantic grouping-based systems for batch grading (Chang et al., 2022), and multi-modal assessment frameworks to evaluate oral presentations (Liu et al., 2020). For a detailed overview of answer assessment strategies, we point the readers to the survey written by Das et al. (2021).
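The naive end of this spectrum is easy to make concrete; a minimal token-overlap F1 for short answers (a SQuAD-style baseline against which semantic scoring can be compared) looks like this:

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a learner's short answer and the reference;
    a naive baseline next to BERT-based semantic scoring."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    common = Counter(pred) & Counter(ref)  # multiset intersection of tokens
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```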
User Modeling.
User modeling is extensively explored within the recommendation systems research area, with the aim of adapting and customizing a service to a user’s specific needs. User modeling holds the key to identifying the learner’s interests and preferences within the hint generation framework (Equation (4)), helping align the hints to the learner’s prior knowledge (Section 2.2) and personalize the feedback (Section 2.1). Using a learner’s past interaction data (ℒl), we can learn a preference function (f_l^p) based on the success of previous hints in helping the learner achieve their learning goal. We can measure this learning goal (f_l^o) using direct or indirect learner feedback from the user interaction. For instance, we can obtain learners’ self-reported data about the effectiveness of a hint, or develop alternate indicators of success by observing the learner’s behavior across several dimensions, such as the number of interactions, response initiation time, degree of change in response, and frequency of hints requested.
Liu et al. (2022), for instance, utilize long short-term memory (LSTM) cells (Hochreiter and Schmidhuber, 1997) for knowledge estimation (Corbett and Anderson, 1994) of a student’s current understanding based on their past responses for open-ended program synthesis. They utilize these time-varying student knowledge states to predict students’ responses to programming problems, monitoring and analyzing their progress. We can adopt similar user modeling techniques to measure progress, adapt hints to the learner’s preferences, and design a curriculum (or suggest areas) for further improvement; a toy preference update is sketched below. We direct the readers to the comprehensive survey authored by He et al. (2023) for an extensive overview of user modeling.
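As a toy illustration of learning f_l^p from interaction signals, the sketch below keeps an exponential-moving-average preference score per hint style; it is a deliberately simple stand-in for the LSTM-based knowledge estimation discussed above, and all names are hypothetical:

```python
def update_preference(scores: dict[str, float], hint_style: str,
                      helped: bool, lr: float = 0.1) -> dict[str, float]:
    """Exponential-moving-average update of a per-style preference score
    from a binary success signal (did the hint help the learner progress?).
    `hint_style` might be, e.g., "analogy", "question", or "worked-step"."""
    prior = scores.get(hint_style, 0.5)  # uninformative prior
    scores[hint_style] = (1 - lr) * prior + lr * float(helped)
    return scores
```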
Dialogue Generation.
Intelligent tutoring systems are mediated by dialogue systems (or conversational agents), providing an interface between ITS and learners. Recent developments in retrieval augmented generation (Gao et al., 2023; Cai et al., 2022; Li et al., 2022), emotion-aware dialogue systems (Ma et al., 2020), and enhanced language understanding capabilities in multi-turn dialogues (Zhang and Zhao, 2021) can help improve the hint generation models, empowering them with abilities to better understand the learner’s queries, understand and respond to complex domains, generate affective responses, and keep track of long conversation sessions for adaptive tutoring that would help link the generated hints to learner’s prior knowledge (Section 2.2). For further exploration, we invite readers to read the surveys written by Ni et al. (2023), Deriu et al. (2021), and Ma et al. (2020).
Question Generation.
Questions play a pivotal role in education, serving to help recall knowledge, test comprehension, and foster critical thinking. Al Faraby et al. (2023) classify neural question generation for educational purposes into three broad categories: (1) Question Generation from Reading Materials, (2) Word Problem Generation, and (3) Conversation Question Generation. We posit a similar use case for question generation to generate hints within conversations to clarify not just the learner’s response but also their understanding and ability to present their ideas. We point the readers to surveys written by Kurdi et al. (2020), Zhang et al. (2021), Das et al. (2021), and Pan et al. (2019) for a comprehensive overview.
Modular Structure of Proposed Approach.
Although an ideal hint generation system would satisfy all the criteria described in our formal definition (Equations (1) to (4)), providing hints that are personalized to individual learners’ preferences and learning objectives, we acknowledge the limitations of technology at the time of this survey (Section 4.3). Keeping that in mind, our proposed formal definition (Section 4.1) and components of a hint generation system (Section 4.2) suggest a modular structure, comprising some essential components (such as Equations (1) and (2), which form the crux of a hint generation system, or the question answering module that develops the reasoning behind the hints) and some complementary modules (such as Equations (3) and (4), which help personalize the hints to individual learners, or the question generation module that adapts particular hint-giving strategies) to further enhance the quality of generated hints. This paper thus proposes a long-term roadmap for automatic hint generation systems that can showcase intermediate progress along the way.
4.3 Open Challenges for Effective Automatic Hint Generation Systems
In this section, we outline challenges and future directions for building more effective automatic hint generation systems that align with our formal definition.
Privacy-preserving Self-evolving Frameworks.
Current hint generation frameworks limit their applications to a fixed dataset or a pre-defined set of problems, which does not guarantee high-quality performance in real-world applications. In order to make these frameworks more effective for learners, we need to adopt self-evolving frameworks that can incorporate the learner’s feedback and implicit preferences. These self-evolving frameworks should be capable of identifying user feedback from an ongoing conversation (𝒟), gathering implicit preferences (f_l^p) from past learning interactions (ℒl), and adapting their hint generation process to an individual learner while respecting their right to privacy.
Although there is prior work on incorporating human feedback to improve generation quality (e.g., prompt optimization (Chang et al., 2024; Sahoo et al., 2024) and reinforcement learning from human feedback (Kaufmann et al., 2023; Casper et al., 2023)), incorporating these user-modeling aspects in a privacy-preserving manner remains an active area of research (Miranda et al., 2024). Differential privacy (Zhao and Chen, 2022; Yang et al., 2023) and federated learning (Li et al., 2021) could be relevant research avenues for creating such self-evolving frameworks responsibly. For an in-depth review of the state of privacy preservation in LLMs, readers are encouraged to consult the survey by Miranda et al. (2024).
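For a flavor of the differential-privacy toolkit, the classic Laplace mechanism releases aggregate learner statistics with calibrated noise; the sketch below is a minimal illustration, not a complete privacy-preserving training pipeline:

```python
import random

def dp_count(true_count: int, epsilon: float = 1.0) -> float:
    """Laplace mechanism for a counting query (sensitivity 1), e.g., how many
    learners found a hint style helpful. The difference of two independent
    Exponential(epsilon) draws is Laplace noise with scale 1/epsilon, so no
    individual learner's feedback can be confidently inferred from the output."""
    noise = random.expovariate(epsilon) - random.expovariate(epsilon)
    return true_count + noise
```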
Diverse Domain Exploration.
Most efforts within the hint generation space are limited to programming, language acquisition, or mathematics, mirroring the domain trends in intelligent tutoring systems (Mousavinasab et al., 2021). However, education and tutoring often involve other subjects within the natural sciences (e.g., physics, chemistry, biology, earth sciences) and social sciences (e.g., history, civics, geography, law), as well as learning beyond educational institutions (e.g., educating patients about their health conditions effectively (Gupta et al., 2020)). Tackling the hint generation problem in these subjects raises several challenges, such as the evaluation of subjective long-form answers and the need for domain knowledge and enhanced reasoning capabilities (Lyu et al., 2024; Huang and Chang, 2022; Zhang et al., 2023). However, it simultaneously opens up new opportunities to expand the capabilities of hint generation systems, including the development of smaller models that combine pre-trained language models with adapted knowledge models (e.g., COMET (Bosselut et al., 2019)).
Efforts in question answering can also provide seed datasets and data annotation strategies to expand hint generation resources. The literature is rich in education-related datasets, ranging from generic datasets like RACE (Lai et al., 2017), LearningQ (Chen et al., 2018), and TQA (Kembhavi et al., 2017) to subject-specific datasets like SciQ (Welbl et al., 2017), ScienceQA (Lu et al., 2022), SituatedQA (Zhang and Choi, 2021), GSM8K (Cobbe et al., 2021), BioASQ (Tsatsaronis et al., 2015), CORD-19 (Wang et al., 2020), and PubMedQA (Jin et al., 2019). To extend these datasets, we envision a collaborative approach between ML practitioners and expert tutors to explore the utilization of hints in classroom or one-on-one tutoring settings across different domains.
Multi-lingual and Multi-cultural Aspects.
Prior studies have found a correlation between linguistic and cultural diversity and capabilities for innovation (Hofstra et al., 2020; Evans and Levinson, 2009). However, we found that the majority of research at the intersection of natural language processing and the learning sciences is limited to the English language, either as the mode of education or as a subject for language acquisition. Although utilizing multi-lingual large language models (mLLMs) for hint generation can potentially help incorporate the linguistic aspects of the learning experience, providing culturally aware hints remains a challenging task. Liu et al. (2024) reveal that while mLLMs are aware of cultural proverbs, they struggle to reason with figurative proverbs and sayings. Building benchmarks to evaluate the cultural awareness of generation models, such as MAPS (Liu et al., 2024), PARIKSHA (Liu et al., 2024), CultureAtlas (Fung et al., 2024), and CUNIT (Li et al., 2024), is an essential first step towards identifying the shortcomings of hint generation frameworks. Moving forward, we need to explore the hint generation capabilities of mLLMs and develop culturally aware systems that accommodate education for non-English learners and sustain diverse cultural and linguistic heritage (Bernard, 1992; Soto et al., 1999).
Multi-modal Elements.
Prior qualitative research has established gains from complementing education with additional modalities like maps, diagrams, and multimedia content (Winn, 1991; Swidan and Naftaliev, 2019; Tippett, 2016; Kapi et al., 2017; Abdulrahaman et al., 2020). Research in intelligent tutoring systems has also explored incorporating gamifying elements like badges, leaderboards, narratives, and virtual currency to keep learners more engaged and motivated (González et al., 2014; Ramadhan et al., 2024). We believe incorporating these cues into a hint generation system would improve students’ memory, understanding, and overall learning experience, helping create a holistic educational tool (Pourkamali et al., 2021).
Affective Systems.
Affective aspects are often neglected when building intelligent tutoring systems and hint generation systems (Hasan et al., 2020). However, incorporating them into the education pipeline is key for personality development, encouragement, and improving self-motivation (Jiménez et al., 2018). Jiménez et al. (2018) show that using affective feedback has a positive impact on students facing learning challenges. Thus, affective hint generation systems that take the emotional state of learners into consideration when providing feedback are an important area of study.
In recent years, there has been increased interest in building dialogue agents that can adapt to the emotional state of the user in an ongoing dialogue setting. Haydarov et al. (2023) develop a large-scale multi-modal benchmark for visually grounded emotional reasoning-based conversations that use visual stimuli to stir a conversation to test out the emotional reasoning capabilities of multi-modal systems. Li et al. (2023b), on the other hand, proposed a future emotion state prediction framework in spoken dialogue systems to predict the future affective reactions of users based on the ongoing conversation. Resources and findings from the ongoing research in affective dialogue systems could be leveraged and advanced to develop more adaptive and emotion-aware tutoring systems. We point the readers to Ma et al. (2020), Raamkumar and Yang (2022), and Zhang et al. (2024) for a comprehensive review of affective dialogue systems.
Accessible Systems.
Another fruitful and challenging area of research is to develop accessible hint generation systems. For people with neurodevelopmental disorders (e.g., attention-deficit/hyperactivity disorder and autism spectrum disorder) and learning disabilities (e.g., dyslexia and dyscalculia), one can modify hint generation systems to have: (1) simplified text: using plain language and avoiding jargon and complex terminology (Štajner, 2021); (2) multi-sensory supports: leveraging a combination of visual, auditory, and kinesthetic modalities to present hints in multiple formats (Vezzoli et al., 2017; Gori and Facoetti, 2014); (3) interactive elements: incorporating interactive elements that engage learners to explore concepts in a hands-on manner (García-Carrión et al., 2018); and (4) predictable routine: establishing a consistent routine for delivering hints, which helps learners feel more comfortable and confident (Love et al., 2012).
Evaluation Metrics.
Evaluating generated hints is a non-trivial task, as multiple factors determine the quality and success of a hint. Prior work determines the success of a hint generation system by the learner’s ability to produce the reference solution (Equation (2)) (Jatowt et al., 2023). However, this evaluation framework does not penalize generated hints for making the problem too simple, and it does not take into account the individual learner’s preferences and learning objectives. Therefore, we need to build human-centered evaluation frameworks (Lee et al., 2022) that can help measure factors beyond the learner’s answering capabilities, such as the learner’s ownership over the learning process, long-term retention of the acquired knowledge, and the motivation and enjoyment they receive during their interactions.
4.4 Ethics Considerations
The integration of NLP within educational settings raises distinct concerns, such as the impact on pedagogical approaches, the dynamics of teacher-student interaction, and learner agency (Holstein et al., 2019). The adoption of NLP technologies in the classroom implements a particular theory of teaching and learning, and these values must be made explicit (Blodgett and Madaio, 2021). How does introducing a new tool reconfigure the dynamics of the teacher-student relationship? Here, it would be crucial to avoid the solutionism trap, define the boundaries of where the system is useful, and ensure that the intention is to augment educators’ workflows instead of substituting them (Remian, 2019). Researchers and practitioners must also attend to the longer-term impacts of engaging with young individuals during a formative period (Holmes et al., 2021). Below, we outline various ethics considerations, including data privacy and consent, bias and fairness, and effects on language variation (Schneider, 2022), and offer strategies to address these concerns.
One of the biggest sets of ethical considerations relates to the use of student and teacher information (Nguyen et al., 2023). Given the sensitive nature of educational data, it is important to set up privacy measures and enable informed consent. Students’ information beyond individual responses to questions may need to be tracked to provide an effective learning experience (Kerr, 2020). However, this opens up the possibility of surveillance and misuse, jeopardizing learners’ trust and autonomy (Regan and Steeves, 2019). It is critical to promote data literacy among educators and learners (e.g., through workshops) to enable them to minimize the risks of their participation (Kerr, 2020). The issue of data ownership raises questions about who holds control over the information collected through education platforms.
We must collectively explore the broader implications of integrating NLP in education on representativeness and equity, including the risk of exacerbating systemic inequalities (Weidinger et al., 2022). There is a risk that the hints generated by NLP models may not adequately reflect the diverse backgrounds and lived experiences of students (Dixon-Román et al., 2020) and may perpetuate harmful stereotypes about different identities (Dev et al., 2021). Prior work has demonstrated various forms of ‘bias’ in NLP systems (Blodgett et al., 2020), which may contribute to the construction of language hierarchies and limit language variation (Schneider, 2022). To promote inclusive design and mitigate these ethical concerns, it is essential to understand how power, privilege, and resources are redistributed as a result of introducing AI in the classroom. Is there a possibility of diminishing the quality of education for marginalized and under-resourced groups (Remian, 2019)? We must take a community-collaborative approach to understand how to design justice-oriented and accountable systems (Madaio et al., 2022) where learners can truly benefit from hint generation systems.
5 Conclusion
In this paper, we consolidate prior research on hint generation, bridging the gap between research in education and cognitive science, and research in AI and natural language processing. Based on our findings, we propose a roadmap for future research on hint generation, in which we provide a rethinking of the formal task definition, a brief review of research areas that can inform the design of future systems, open challenges for effective hint generation systems, and ethical considerations. Although hint generation has a long history dating back over three decades (Hume et al., 1996), recent advances in natural language processing could prove useful for future hint generation systems. Beyond education, hint generation is also an excellent atomic task for measuring a system’s ability to personalize content to user needs and requirements. We invite researchers to foster a community, develop new benchmarks, and create shared tasks and workshops for automatic hint generation.
Limitations
Due to the rich and diverse literature on intelligent tutoring systems, we limit our survey to research directly relevant to hint generation and do not cover other types of learning-related feedback.
Acknowledgments
We would like to thank Shivani Kapania for identifying ethics considerations and proofreading this work.