Abstract
A patient portal allows discharged patients to access their personalized discharge instructions in electronic health records (EHRs). However, many patients have difficulty understanding or memorizing their discharge instructions (Zhao et al., 2017). In this paper, we present PaniniQA, a patient-centric interactive question answering system designed to help patients understand their discharge instructions. PaniniQA first identifies important clinical content from patients’ discharge instructions and then formulates patient-specific educational questions. In addition, PaniniQA is equipped with answer verification functionality to provide timely feedback and correct patients’ misunderstandings. Our comprehensive automatic and human evaluation results demonstrate that PaniniQA is capable of improving patients’ mastery of their medical instructions through effective interactions.1
1 Introduction
Limited patient understanding of their medical conditions can lead to poor self-care at home. Upon hospital discharge, physicians often provide discharge instructions to aid in patients’ recovery and disease self-management (Federman et al., 2018). However, some patients may have difficulty understanding and memorizing instructions due to low health literacy, limited memory, or an absence of supervision. For example, research shows that patients only retain a minimal amount of information from discharge instructions, with an immediate forgetting rate of up to 80% (Kessels, 2003; Richard et al., 2017). Further, when instructions are misinterpreted by patients, there is often a lack of corrective intervention. Limitations in a patient’s understanding of their medical conditions hinder their prospects of recovery. It is imperative to investigate new methods of patient education to enhance health outcomes.
In this study, we explore a novel method inspired by Dialogic Reading (Whitehurst, 2002) to educate patients through interactive question-answering. Dialogic Reading actively involves patients in the learning process by following the P.E.E.R. sequence: Prompt, Evaluate, Expand, and Repeat, which enables patients to engage in a meaningful dialogue, further strengthening their understanding and retention of the material. As illustrated in Figure 1, our dialog agent asks questions about key aspects of the discharge instructions and encourages patients to read the instructions thoroughly and understand them in order to provide accurate answers.
Crafting questions that effectively meet educational objectives is challenging (Boyd-Graber and Börschinger, 2020; Dugan et al., 2022). A suitable question should be based on the patient’s discharge instruction and aim to improve their understanding of health conditions, such as “What was the probable cause of your chest pain?”. Conversely, the question “How does cardiac catheterization help treat a heart attack?” illustrated in Figure 1, may exceed the education scope, as it is unanswerable or requires knowledge beyond the provided discharge instruction. Such questions are considered unsuitable for patient education.
We introduce new question-generation methods that draw on the advancements of large language models (LLMs) (Brown et al., 2020; Ouyang et al., 2022; OpenAI, 2023). Utilizing OpenAI’s GPT-3.5 model, we generate informative questions from discharge instructions. Further, we combine LLMs with medical event and relation extraction to constrain the model, producing questions that target salient medical events identified in the discharge instructions. We create a new dataset with expert-annotated medical events and relations for discharge instructions from the MIMIC-III (Johnson et al., 2016a) database. While earlier efforts have annotated events that physicians would discuss during patient handoff (Pampari et al., 2018; Lehman et al., 2022), our focus is on identifying pairs of medical events with correlational or causal relationships. By posing questions about one event, we guide patients toward the other as potential answers.
Our system further incorporates an answer verification module to provide instant patient feedback. When patients give correct answers, the bot confirms them, reinforcing their understanding. If answers are incorrect or partially correct, the bot clarifies misunderstandings and provides additional information. Extensive automatic and human evaluations demonstrate the efficacy of our question-generation methods and show that PaniniQA holds great promise for promoting patient education. To summarize, our research contributions are as follows.
- ◇
We explore a new way of educating patients regarding their health conditions through interactive question-answering. Our approach aligns with the P.E.E.R. dialogic reading theory that promotes patients’ active participation in comprehending medical events.
- ◇
We compare questions generated using OpenAI’s GPT-3.5 model, our enhanced method with medical event extraction, and human-written questions tailored for patient education. We meticulously evaluate all questions, answers, and patients’ educational outcomes.
- ◇
Through comprehensive human evaluations, we demonstrate that PaniniQA holds promise for patient education. Future work includes controlling the difficulty of questions, prioritizing questions given patients’ health literacy, and enabling interactive learning of medical concepts.
2 Related Work
There is a growing need to improve patients’ understanding regarding their hospital experiences (Federman et al., 2018; Weerahandi et al., 2018; Kwon et al., 2022). Lack of understanding can result in non-adherence to discharge instructions and readmission to the hospital due to poor self-care at home. Previous research has attempted to generate hospital course summaries for patients using lay language (Di Eugenio et al., 2014; Acharya et al., 2018; Adams et al., 2021; Cai et al., 2022a; Hartman and Campion, 2022; Adams et al., 2022). This paper goes a step further by utilizing interactive question answering to communicate essential medical events from discharge instructions to patients, thus enhancing their understanding and retention of the material.
Our proposed method differs from existing clinical question-answering studies in several aspects. Most clinical QAs are designed to satisfy individuals’ information needs, with questions modeled after those that can be asked by physicians (Pampari et al., 2018; Jin et al., 2019; Raghavan et al., 2021; Lehman et al., 2022). These systems focus on improving the accuracy of their answers (Soni and Roberts, 2020; Rawat et al., 2020; Yue et al., 2020a, b). In contrast, our goal is to educate patients and prompt them with questions that will enhance patients’ understanding of their doctors’ visits. A successful QA system should be comprehensive and exhaustive, asking all relevant questions and prioritizing them based on the patient’s medical history and health literacy.
Successful patient education requires effective questioning (Pylman and Ward, 2020). Particularly, question generation has been studied using template-based (Heilman and Smith, 2010; Chali and Hasan, 2015; Fabbri et al., 2020) and neural seq2seq models (Du and Cardie, 2017; Duan et al., 2017; Kim et al., 2018; Sultan et al., 2020; Shwartz et al., 2020). Instruction-tuned LLMs have demonstrated exceptional abilities in conversing with humans (Brown et al., 2020; Sanh et al., 2021; Ouyang et al., 2022; Chowdhery et al., 2022; Longpre et al., 2023). However, most research has been conducted using CommonCrawl, Wikipedia, and other generic texts. Considering the factuality issues of neural language models (Maynez et al., 2020; Pagnoni et al., 2021), question generation in the medical domain remains challenging.
Learning through conversation can improve education outcomes (Golinkoff et al., 2019; Zhang et al., 2020; Cai et al., 2022b; Yao et al., 2022a; Xu et al., 2022). Dialogic Reading (Whitehurst, 2002; Mol et al., 2008; Lever and Sénéchal, 2011) has demonstrated that engaging children in a guided conversation with parents while reading storybooks can significantly enhance their learning outcomes. While engaging physicians in high-quality conversations may not always be feasible, the use of question answering facilitated by a chatbot could be a valuable means of helping patients acquire a deeper understanding of their health conditions.
3 Question Answering in the GPT Era
LLMs such as ChatGPT have led to significant advancements in generative AI (Brown et al., 2020; Sanh et al., 2021; Chowdhery et al., 2022; Longpre et al., 2023; OpenAI, 2023; Wang et al., 2023a). Fine-tuning neural models on specific tasks often yields superior results. Furthermore, LLMs acquire emergent abilities through instruction tuning and reinforcement learning using human feedback (Ouyang et al., 2022). This allows them to generalize to new tasks effectively. Common human–LLM interactions include (a) zero-shot prompting, where users provide a prompt for the LLM to complete, and (b) in-context learning, where users give task examples and ask the LLM to solve a new case, potentially involving a multi-step reasoning process (Wei et al., 2022). In this study, we focus on zero-shot prompting to assess the LLM’s ability to comprehend discharge instructions.
LLMs possess vast world knowledge, and their performance on knowledge-intensive tasks correlates with training data and model size (Bommasani et al., 2022). However, it remains unclear whether LLMs have enough domain knowledge to facilitate patient education. For example, GPT-3, with its 175 billion parameters, is trained on general data sources such as Common Crawl, WebText2, Books, and Wikipedia (Brown et al., 2020). Yet, such models still produce factually inconsistent content in their output. Our study presents an initial evaluation of GPT models’ potential in interactive patient education. Following the P.E.E.R. framework of dialogic reading, we employ GPT models to perform the following tasks:
Question Generation.
We use OpenAI’s GPT-3.5 model (text-davinci-003) to generate informative questions from a discharge instruction. The questions aim at helping patients understand crucial medical events. Our prompt is “Generate N questions to help the patient understand crucial medical events in the above discharge instruction.” Similar to a teacher designing exam questions, we anticipate the GPT model to produce a set of questions all at once rather than incrementally. The questions must collectively cover the salient events identified in the discharge instruction while minimizing redundancy.
Answer Verification.
Useful feedback is essential for improving patient comprehension of the material. To perform this task, we prompt the GPT model with “As a physician, your goal in the conversation is to help your patient better understand the discharge instructions before they leave the hospital.” Utilizing OpenAI’s API, we also provide the original discharge instruction, interaction history, and current question-answer pair as key-value pairs for the model. We then instruct the model to “verify if the patient’s answer is correct, incorrect, or partially correct, and generate a suitable response to improve the patient’s comprehension of this question.” We empirically compared two GPT models, text-davinci-003 and gpt-3.5-turbo (ChatGPT), and selected ChatGPT for answer verification as it is optimized for chat and generally produces higher quality responses.
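The sketch below illustrates one way this answer-verification prompt might be assembled with the chat completions API; how the discharge instruction, interaction history, and current question-answer pair are packed into messages is an assumption about one reasonable layout, not the exact implementation.

```python
# Minimal sketch (assumption: message layout for instruction, history, and Q-A pair).
import openai

SYSTEM_PROMPT = (
    "As a physician, your goal in the conversation is to help your patient better "
    "understand the discharge instructions before they leave the hospital."
)

def verify_answer(instruction: str, history: list[dict], question: str, answer: str) -> str:
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "system", "content": f"Discharge instruction:\n{instruction}"},
        *history,  # earlier turns as {"role": ..., "content": ...} dicts
        {"role": "user", "content": (
            f"Question: {question}\nPatient's answer: {answer}\n"
            "Verify if the patient's answer is correct, incorrect, or partially correct, "
            "and generate a suitable response to improve the patient's comprehension "
            "of this question."
        )},
    ]
    response = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=messages)
    return response["choices"][0]["message"]["content"]
```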
4 Extracting Salient Medical Events
In this section, we present our question-answering system that emphasizes identifying salient medical events and their relations. We generate targeted questions from these events and relations and apply the same answer verification module described previously.
A typical discharge instruction includes a Visit Recap, which summarizes the patient’s clinical visit, including symptoms, diagnoses, treatments, and test results. Patients are expected to understand the relationships among these medical events, such as how the treatment ERCP relates to cholangitis as illustrated in Table 2 (top). The Detailed Instructions contain medication and aftercare instructions (bottom). They may be easy to understand but include trivial details that patients may overlook, potentially hindering their self-care at home. We propose automatically extracting key medical events and relations from them (§4.1). Given their unique characteristics, we apply two distinct information extraction and question generation strategies to Visit Recap and Detailed Instructions to produce targeted questions (§4.2).
4.1 Event and Relation Identification
Key event and relation identification are conducted on the Visit Recap. Event identification is framed as a sequence labeling task, where we assign a label to each token of the discharge note, representing its event type. We define 11 event types in this study, detailed in Table 3, including symptoms, diseases, complications, tests, test goals/results/implications, procedures, medicines, treatment goals, and results. We fine-tune pre-trained sequence labeling models on our dataset, optimizing the cross-entropy loss against gold-standard labels.
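To make the framing concrete, here is a minimal sketch of the token-classification setup; the checkpoint name, the BIO labeling scheme, and the exact event-type label strings are illustrative reconstructions of the Table 3 schema, and the classification head would subsequently be fine-tuned on the annotated data with cross-entropy loss.

```python
# Minimal sketch of sequence labeling over a Visit Recap sentence.
# Assumptions: checkpoint name, BIO scheme, illustrative label strings.
from transformers import AutoTokenizer, AutoModelForTokenClassification

EVENT_TYPES = ["Symptom", "Disease", "Complication", "Test", "TestGoal",
               "TestResult", "TestImplication", "Procedure", "Medicine",
               "TreatmentGoal", "TreatmentResult"]
LABELS = ["O"] + [f"{prefix}-{t}" for t in EVENT_TYPES for prefix in ("B", "I")]

tokenizer = AutoTokenizer.from_pretrained("bert-large-cased")  # or a biomedical checkpoint
model = AutoModelForTokenClassification.from_pretrained(
    "bert-large-cased", num_labels=len(LABELS))

sentence = "You were admitted for diverticulitis and treated with antibiotics"
encoding = tokenizer(sentence, return_tensors="pt")
logits = model(**encoding).logits        # shape: (1, seq_len, num_labels)
pred_ids = logits.argmax(dim=-1)[0]      # one label id per sub-word token
print([LABELS[i] for i in pred_ids])     # untrained head, so labels are arbitrary here
```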
Relation identification is framed as a sequence classification task. We focus on binary relations consisting of two medical events. We evaluate all pairwise combinations of identified medical events as candidates, provided their event types align with the six event relations defined in Table 1. Special tokens are inserted before and after each identified event to indicate both its position and event type.2 The sequence, enhanced with special tokens, is fed into a sequence classification model to predict a binary label, where 1 indicates a relation between the two events, and 0 otherwise. We fine-tune pre-trained sequence classification models on our dataset (§5) by optimizing the cross-entropy loss for gold-standard labels.
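A minimal sketch of this input format is shown below; the marker token names follow the example in footnote 2, while the checkpoint and the decision to register the markers as special tokens are assumptions.

```python
# Minimal sketch of binary relation classification with event-type markers.
# Assumptions: checkpoint name; markers registered as special tokens.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-large-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-large-cased", num_labels=2)  # 1 = relation holds, 0 = no relation

# Keep the markers from being split into sub-words.
markers = ["<dsyn>", "</dsyn>", "<medi>", "</medi>"]
tokenizer.add_special_tokens({"additional_special_tokens": markers})
model.resize_token_embeddings(len(tokenizer))

text = ("You were admitted for <dsyn> diverticulitis </dsyn> "
        "and treated with <medi> antibiotics </medi>")
encoding = tokenizer(text, return_tensors="pt")
probs = model(**encoding).logits.softmax(dim=-1)
print(probs)  # class probabilities; meaningful only after fine-tuning
```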
We perform key event identification on Detailed Instructions using a different tool, as they contain medication and aftercare specifics that patients might overlook. We use an existing high-performing medical NER system to extract medical entities.3 This model was pre-trained on the MACCROBAT dataset (Caufield et al., 2019) and can identify 84 biomedical entities within clinical narratives. We limit the model to identify 7 entity types: Medicine Dosage, Medicine Frequency, Medicine Duration, Medication Name, Sign & Symptom, Diagnostic Procedure, Upcoming Appointment. Relation identification is not performed on detailed instructions.
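A minimal sketch of this entity-filtering step is shown below; the checkpoint path is a placeholder rather than the specific system cited above, and the label strings in KEPT_TYPES are illustrative, since the actual tag names depend on that model's label set.

```python
# Minimal sketch of off-the-shelf biomedical NER with entity-type filtering.
# Assumptions: placeholder checkpoint; illustrative label names.
from transformers import pipeline

ner = pipeline(
    "ner",
    model="path/to/maccrobat-pretrained-ner",  # hypothetical checkpoint path
    aggregation_strategy="simple",
)

KEPT_TYPES = {"Dosage", "Frequency", "Duration", "Medication",
              "Sign_symptom", "Diagnostic_procedure", "Date"}  # illustrative

def extract_entities(detailed_instruction: str) -> list[dict]:
    entities = ner(detailed_instruction)
    return [e for e in entities if e["entity_group"] in KEPT_TYPES]
```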
4.2 Question Generation
Visit Recap.
We generate a question from each identified binary relation. Different relation types are mapped to specific questions using templates provided by physicians according to their domain knowledge (see Table 1). Using a template-based approach allows us to create questions targeting salient medical events. By asking questions about one event, we guide patients towards the other as potential answers.
Detailed Instructions.
We generate a question for each identified medical entity by creating a fill-in-the-blank question, which is then converted into a natural language question using the GPT model. An example is shown in Table 2. Although cloze-style questions can serve educational purposes, we want to prevent patients from using string matching to find answers. Instead, natural language questions require patients to have a deeper understanding of the discharge note, thus fulfilling our education objective. When selecting medical entities as triggers, we prioritize four categories: Medicine Dosage, Medicine Frequency, Medicine Duration, and Upcoming Appointment, as they are informative and better guide patient comprehension. To convert a cloze-style question into a natural question, we provide this prompt to the GPT model: [Fill-in-the-Blank Sentence] Generate a simple question targeting the blank in the above sentence.
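The sketch below illustrates this two-step conversion for Detailed Instructions: blank out the trigger entity to form a cloze question, then rewrite it as a natural question with the prompt quoted above; the decoding parameters and the helper name are assumptions.

```python
# Minimal sketch: cloze construction followed by GPT rewriting.
# Assumptions: decoding parameters; helper name.
import openai

def cloze_to_question(sentence: str, trigger_entity: str) -> str:
    cloze = sentence.replace(trigger_entity, "_____")
    prompt = (f"{cloze}\n"
              "Generate a simple question targeting the blank in the above sentence.")
    response = openai.Completion.create(
        model="text-davinci-003", prompt=prompt, max_tokens=64)
    return response["choices"][0]["text"].strip()

# Illustrative example: blanking the duration in "Take Prednisone 40mg daily for 5 days"
# may yield a question such as "How long do you need to take Prednisone?"
```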
5 Data Annotation
We seek to annotate discharge instructions from the MIMIC-III database (v1.4) (Johnson et al., 2016a) with key medical events that are important for patients to understand. MIMIC-III is a publicly available repository of de-identified health records of over 40,000 patients collected from the Beth Israel Deaconess Medical Center in Massachusetts. Our aim is to identify text snippets in discharge instructions that correspond to significant medical events, including symptoms, diseases, test results, and treatments. We annotate not only individual events but also their relationships. They are organized into a hierarchy as outlined in the schema shown in Table 3. Consistent with Lehman et al.’s (2022) approach, we utilize events and their relationships as triggers that prompt the generation of questions.
We recruited five medical experts to create a sizable dataset. They are MD students at UMass Chan Medical School with a high level of language proficiency. Each expert is given 150 discharge notes to annotate and may skip notes of low text quality. Annotators are also given detailed instructions and examples. We developed a web-based interface to facilitate the annotation process, which was iteratively improved to meet the needs of this study. Due to budget constraints, we assign one annotator to each discharge note. In total, we completed medical event annotations for 458 discharge notes.
Our annotation consists of two phases. In the first phase, an expert selects text snippets from the discharge instruction corresponding to medical events that the patient needs to understand. Each snippet is assigned a coarse event category, such as medical issue, laboratory test, or treatment. The expert further refines it by assigning a fine-grained event type, resulting in a schema with 11 event types (Table 3). In the second phase, the expert identifies relationships between medical events using a set of 6 pre-defined relationships, such as “[Symptom] …caused by [Disease].” We show the distribution of medical events in Figure 2.
A key distinction between our work and earlier dataset curation efforts (Pampari et al., 2018; Yue et al., 2020b; Lehman et al., 2022) is that the earlier efforts aim to annotate questions that physicians would ask during patient hand-off, which may be informal and unanswerable based on the discharge instruction. In contrast, our focus is on annotating salient medical events that are essential to patients’ understanding of their medical conditions.
We split our annotated data into train / validation / test sets containing 338 / 60 / 60 discharge instructions, respectively. For relation identification, we use the event pairs from the human-annotated relations as positive relations and all other medical event pairs of compliant types (e.g., the event pair types in Table 3) as negative relations. We collect all such negative event pairs4 as negative cases. Overall, our medical relation dataset contains 2530 / 399 / 332 instances in the train / validation / test sets, respectively; 28.7% of the instances are positive relations.
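The sketch below shows one way such negative instances might be enumerated; the type-compatibility pairs and the data structures are illustrative assumptions, not the paper's exact schema.

```python
# Minimal sketch of negative relation-instance construction.
# Assumptions: illustrative type-compatibility pairs and data structures.
from itertools import permutations

ALLOWED_TYPE_PAIRS = {  # illustrative (head type, tail type) combinations
    ("Symptom", "Disease"), ("Disease", "Procedure"), ("Disease", "Medicine"),
    ("Test", "TestResult"), ("Test", "TestGoal"), ("Procedure", "TreatmentResult"),
}

def build_relation_instances(events, gold_relations):
    """events: list of (event_id, event_type); gold_relations: set of (head_id, tail_id)."""
    instances = []
    for (h_id, h_type), (t_id, t_type) in permutations(events, 2):
        if (h_type, t_type) not in ALLOWED_TYPE_PAIRS:
            continue  # incompatible types: not a candidate
        label = 1 if (h_id, t_id) in gold_relations else 0  # 0 = negative case
        instances.append(((h_id, t_id), label))
    return instances
```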
6 Evaluating Information Extraction
To improve LLMs’ ability to generate educationally effective questions, we designed an Information Extraction (IE) module (medical event/relation identification) to guide question generation. We report automatic evaluation results for different IE methods in this section.
6.1 IE Evaluation Settings
We fine-tune four pre-trained language models on our annotated dataset (Section 5) for key medical event and relation identification. These models are obtained from HuggingFace: (1) BERT-large (Devlin et al., 2019); (2) BioBERT (Lee et al., 2020); (3) PubmedBERT (Gu et al., 2020); and (4) ClinicalRoBERTa (Lewis et al., 2020). All four pre-trained models are of the same scale (345 million parameters). The latter three were pre-trained on biomedical or clinical corpora and thus transfer better to our patient education task owing to their built-in medical knowledge (Sung et al., 2021; Yao et al., 2022b, 2022c). The models are trained on a single RTX 6000 GPU with 24 GB of memory. The average training time for the relation identification model is around 20 minutes.5 For evaluation metrics, we report micro-averaged precision, recall, and F1 scores.
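For reference, the micro-averaged scores can be computed as in the sketch below, where predicted and gold event spans are compared by exact match over the test set; the span representation is an assumption.

```python
# Minimal sketch of micro-averaged precision/recall/F1 over event spans.
# Assumption: spans are (start, end, event_type) triples compared by exact match.
def micro_prf(gold_spans: set, pred_spans: set) -> tuple[float, float, float]:
    tp = len(gold_spans & pred_spans)
    precision = tp / len(pred_spans) if pred_spans else 0.0
    recall = tp / len(gold_spans) if gold_spans else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = {(4, 6, "Disease"), (10, 12, "Medicine")}
pred = {(4, 6, "Disease"), (15, 16, "Test")}
print(micro_prf(gold, pred))  # (0.5, 0.5, 0.5)
```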
6.2 IE Evaluation Results
The performance of the four evaluated models is reported in Table 4. The results suggest that the models pre-trained on biomedical or clinical corpora outperform the vanilla BERT model. For both tasks, ClinicalRoBERTa achieves the best performance, so we report only this model’s performance in the following category-wise analysis.
| Task | Pretrained Model | P (%) | R (%) | F1 (%) |
|---|---|---|---|---|
| Medical Events | BERT | 31.38 | 44.58 | 36.83 |
| | BioBERT | 40.43 | 51.63 | 45.35 |
| | PubmedBERT | 42.70 | 50.12 | 46.11 |
| | ClinicalRoBERTa | 44.28 | 54.03 | 48.67 |
| Event Relations | BERT | 57.48 | 75.31 | 65.21 |
| | BioBERT | 73.41 | 80.37 | 76.73 |
| | PubmedBERT | 72.56 | 75.31 | 73.91 |
| | ClinicalRoBERTa | 74.28 | 82.27 | 78.07 |
We further report fine-grained results of medical event extraction per category in Table 5; the Symptom, Disease, Test, Procedure, and Medicine categories generally achieve better performance, which we suspect is due to more abundant training data. Table 6 shows the fine-grained performance of event-relation identification per category. The F1 scores of most relations are around 80%, implying fair performance. The relation Test goal achieves 100% precision because our test set contains only eight Test goal instances.
To explore the generalization ability of the model, we compare its performance on medical events seen and unseen during training. Specifically, seen events appear in the training set, while unseen events do not; 15.21% of the test instances involve unseen medical events. For the medical-event extraction task, the F1 score is 49.36% for seen events and 44.82% for unseen events. For the event-relation identification task, when both events in a pair are seen, the model achieves an F1 score of 78.72%; otherwise the performance drops to 74.50%. This indicates the model shows only a slight drop in performance when encountering medical events unseen during training.
7 Evaluating Patient Education
To comprehensively evaluate patient education outcomes, we conducted human evaluations from both the patient’s and the physician’s perspectives, as well as a GPT-4-powered automatic evaluation. These evaluations focus on two main aspects: (1) the quality of questions generated by different sources (GPT, GPT+IE, and human ground truth); (2) preferences among different interaction designs (no support, raising questions only, and raising questions plus verifying answers).
7.1 Human Evaluation Settings
The goal of the physician evaluation is to have human domain experts judge whether the machine-generated questions are comparable to human-crafted questions. To do so, we recruited 3 medical practitioners6 whose task is to read the discharge instructions and provide qualitative feedback on whether the machine-generated questions are educationally effective for patients and, if not, how they should be improved.
The goal of the patient evaluation is to have general public users interact with, and provide ratings on, different combinations of question-generation models and interaction designs. We also designed a post-experiment evaluation task (i.e., a Cloze Test) to quantitatively measure their learning outcomes. We recruited 30 human evaluators to participate in our patient education experiment. All evaluators have bachelor’s degrees but no medical education background. We present a screenshot of our patient evaluation user interface in Figure 3.
In our study, we have the following three options for the user interaction experience design:
Condition None: The evaluator only sees the discharge instruction, with no question-answer interaction. This is today’s baseline.
Condition Q: The evaluator reads the discharge instruction and interacts with the chatbot, which can only ask questions and does not provide feedback on users’ answers.
Condition QA: The evaluator reads the discharge instruction and interacts with the chatbot, which can both ask questions and provide answer feedback to the user.
The questions asked by the chatbots can come from the following three sources:
Human: Expert-written questions based on discharge instructions. We ask an MD student to read each discharge instruction and write down all questions she would ask a patient about this discharge instruction for patient-education purposes.
GPT: We utilize the GPT-3 model to generate a series of questions (at least four) directly from the discharge instruction. Specifically, we use the following prompt: [Discharge Instruction] Generate at least four questions to help the patient understand crucial medical events in the above discharge instruction.7
GPT+IE: Our question generation model enhanced by the information extraction technique described in Section 4.
The average numbers of questions from the Human / GPT / GPT+IE approaches are 7.5 / 6.17 / 6.1, respectively. Combining the interaction designs and question-generation methods yields five conditions: (1) None; (2) Q (Human); (3) QA (GPT); (4) QA (GPT+IE); (5) QA (Human). We use a within-subject experiment setup, where each of the 30 human evaluators experiences all five conditions with different discharge instructions. In total, we have 150 data points (30 per condition). The order of the five conditions is shuffled so that each condition appears six times at each of the five positions.
7.2 Patient Evaluation Measurements
We use two measurements to evaluate patients’ educational outcomes and preferences.
1) Cloze Test:
We recruited an MD student to identify 5–7 important medical events that she thinks the patient should be aware of and replace them with blanks. We use these cloze tests as a post-study evaluation, asking each participant to fill in the blanks from memory to the best of their ability. The more blanks they fill in correctly, the better the patient’s education outcome. We report each participant’s accuracy rate as the primary evaluation outcome.
2) Preference Ranking:
We ask evaluators to rank their experience using the following four questionnaire items (Evaluators are allowed to rank two conditions as tied):
Coverage: Does the conversation cover the cloze test in the evaluation?
Appropriateness: Are the questions properly raised, and appropriate for patient education?
Education Outcome: How do you think the learning experience improves your understanding of discharge instructions?
Overall: How do you like the general learning experience considering the above aspects?
We report the Mean Reciprocal Rank (MRR) (Radev et al., 2002) of each model’s final ranking. Generally, a higher MRR value indicates that evaluators prefer that approach more.
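For clarity, MRR averages the reciprocal of the rank each evaluator assigns to a condition, as in the short sketch below.

```python
# Minimal sketch of Mean Reciprocal Rank over evaluator rankings.
def mean_reciprocal_rank(ranks: list[int]) -> float:
    """ranks: the rank (1 = best) each evaluator gave to one condition."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# Example: a condition ranked 1st, 2nd, and 1st by three evaluators.
print(mean_reciprocal_rank([1, 2, 1]))  # 0.833...
```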
7.3 GPT-4’s Automatic Evaluation Settings
Following the recent practice of applying LLMs to evaluate dialogue tasks (Liu et al., 2023), we utilize GPT-4 as the evaluation model to automatically measure the quality of AI-generated questions and feedback. Similar to the patient evaluation in Section 7.2, we evaluate the quality of generated questions from four perspectives (i.e., Coverage, Question Appropriateness, Education Outcome, and Overall). Additionally, we evaluate the quality of the AI model’s feedback from two perspectives, i.e., Correctness and Education Potential. Our prompt to the evaluation model is shown in Table 7. We collect the evaluation model’s responses and report the average score for each perspective.
You are a physician who wants to evaluate how helpful an AI model is for educating patients. The model asks the patient questions, then verifies the patient’s answers, in order to help patients memorize their discharge instructions.
Four evaluation aspects for AI model’s question quality includes:
Coverage: Does the conversation cover the cloze test in the evaluation?
Question Appropriateness: Are the answers to the questions contained in the discharge instruction?
Education Outcome: Do you think the chatbot helps patients understand their discharge instructions?
Overall: How do you like the general experience with the chatbot considering the above aspects?
Two evaluation aspects of the AI model’s feedback includes:
Correctness: Are the responses from the AI model factually correct?
Education Potential: Do the AI model’s responses provide helpful information for educating patients?
5-point Likert scale:
1: very low rating
2: low rating
3: neutral or medium rating
4: higher rating
5: very highly rating
The patient’s discharge instructions: [The Patient’s Discharge Instruction]
The conversation between the patient and the AI model: [The Conversation History]
Give the 5-point Likert scale of the AI model’s question quality (four aspects) and answer feedback (two aspects) one by one.
Return the scores as dictionary objects, adhering to the following structure: “Coverage”: …, “Question Appropriateness”: ....
Please provide your response solely in the dictionary format without including any additional text.
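The sketch below shows one way the above prompt might be sent to GPT-4 and its dictionary response parsed; the message layout, temperature, and json-based parsing are assumptions, and gpt4_evaluate is a hypothetical helper name.

```python
# Minimal sketch of the GPT-4 automatic evaluation call.
# Assumptions: message layout, temperature, json parsing of the returned dictionary.
import json
import openai

def gpt4_evaluate(prompt_template: str, instruction: str, conversation: str) -> dict:
    prompt = (prompt_template
              .replace("[The Patient's Discharge Instruction]", instruction)
              .replace("[The Conversation History]", conversation))
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    # The prompt requests a dictionary such as {"Coverage": 4, "Question Appropriateness": 5, ...}
    return json.loads(response["choices"][0]["message"]["content"])
```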
7.4 Synthesized Dataset for Evaluation
Directly presenting real health records to LLMs or participants can lead to data privacy violations.8 Thus, we created 30 synthesized discharge instructions for our human evaluation study. We randomly sampled 30 hospital course notes (a part of EHR data) from the MIMIC-III database and converted them into synthetic discharge instructions following the neural abstractive summarization method proposed by Cai et al. (2022a). Our physician collaborators reviewed these synthesized discharge instructions to ensure content validity and anonymity.
We then apply the various approaches (Human, GPT, GPT+IE) to create question-answer pairs for these anonymized synthesized data. We show sampled discharge instructions and the corresponding generated questions in Table 8.
7.5 Physician Evaluation Results
We interview three physician participants with the following questions: (1) Do you think the questions are effective for patients to understand the important info in the discharge instruction? If not, what questions would you ask? (2) How do you like the questions generated from GPT and GPT+IE?
Physician participants all believe that GPT-generated questions tend to target content that patients do not need to be aware of (e.g., asking why a heart attack could cause chest pain requires domain-specific medical knowledge unsuitable for patient education). Sometimes the answers to GPT-generated questions do not even exist in the discharge instruction. Take example 1 in Table 8: the question asks what the patient should expect in their follow-up visits, but this information is not mentioned in the discharge instruction. These qualitative findings may explain why GPT-generated questions are rated by patient participants as having a low accuracy score on the Cloze Test metric, as well as being ranked lower in Coverage, Appropriateness, and Education Outcome in Section 7.6.
It is worth noting that in some cases where the answers are not in the discharge instructions, physician participants actually believe those questions could be useful for patient education. In example 2 in Table 8, although the discharge instruction does not contain information on how to maintain the stent, physicians still think it is a question they would ask their patients, as it would motivate patients toward better self-managed recovery.
For questions generated by GPT+IE, most were perceived by the physicians as appropriate (e.g., example 3). However, GPT+IE may still generate improper questions due to errors in medical event-relation identification. As shown in example 4, the information extraction model identifies the symptom “swelling in your throat” as a disease, which leads to an improper question.
Physician participants also noted that some GPT+IE-generated questions lack language fluency. As shown in example 5, the generated question seems redundant and could be better rephrased as “How long do you need to take Prednisone?”
7.6 Patient Evaluation Results
We summarize the patient evaluation results in Figure 4. From the (a) Cloze Test chart, we observe that having a chatbot interact with patient participants (whether with Q only or with both QA) indeed improves their performance over the baseline condition None, suggesting our proposed interactive question-answering design is promising for patient education. Regarding whether answer feedback is helpful, the 92.7% accuracy of QA (Human) significantly outperforms the 80.6% accuracy of Q (Human); this implies the importance of validating patients’ answers and presenting feedback, so we always include answer feedback when further comparing the GPT vs. GPT+IE question generation algorithms. The results show that QA (GPT+IE), at 88.3%, achieves higher accuracy than QA (GPT), at 74.1%. This demonstrates the improvement gained by applying IE enhancements to LLMs for patient education purposes.
The evaluator ranking results (plots (b, c, d, e) in Figure 4) show: (1) Considering the Overall ranking of the three sets of questions under the QA interactive approach, human questions perform better than AI-generated questions, suggesting machine-generated questions are still not comparable to human ones. (2) Comparing the three interactive approaches, we observe QA (Human) >> Q (Human) > None, which is in line with the Cloze Test findings. (3) In terms of Appropriateness and Education Outcome, GPT achieves the lowest ranking. According to our observations, many GPT-generated questions ask evaluators about content that does not exist in the discharge instruction; as a result, evaluators consider the questions inappropriate and unhelpful for patient education. (4) QA (GPT+IE) ranks higher in Coverage than QA (GPT). This result is consistent with other recent work that incorporates a copying mechanism into LMs or LLMs by modifying the model structure, loss function, or prompting (Wang et al., 2023b; Chang et al., 2023; Eremeev et al., 2023). QA (Human) ranks higher in Coverage than Q (Human), even though they use the same questions, suggesting that the answer feedback interaction provides substantial benefit to patients.
7.7 GPT-4’s Automatic Evaluation Results
In terms of question quality (Figure 5), GPT-4’s evaluation scores generally follow the same pattern as the patient evaluation results, with questions from QA (GPT+IE) deemed better than those from QA (GPT). In addition, the scores of all approaches are close to or above 4, implying that GPT-4 judges the generated questions to be of good quality from all four perspectives. In terms of answer verification, as all interactive conditions share the same verification method, we present only the average Correctness and Education Potential scores. Specifically, GPT-4 gives 4.14 for Correctness and 4.01 for Education Potential. Both scores are above four, indicating that GPT-4 judges the feedback from our AI agent to be of high quality.
7.8 Heuristic Evaluation of Conversation Log
We further conducted a heuristic evaluation to explore the deficiencies of AI-generated responses and potential improvements. Specifically, we asked an MD student to evaluate the conversation logs of all patient participants.9 Overall, we collected 192 responses from 30 conversations between participants mimicking patients and the AI model.
We ask our MD-background evaluator to grade each piece of the AI model’s answer feedback, applying the same evaluation metrics, i.e., Correctness and Education Potential, introduced in Section 7.3. We apply binary coding, i.e., the evaluator judges each response as positive or negative. The positive rate for Correctness is 86.4%, and the positive rate for Education Potential is 74.1%. This suggests that most responses are factually correct and provide helpful information to patients.
Table 9 shows examples of the chatbot’s answer feedback, and we offer the following design suggestions for future research to improve its quality: (1) Most responses are helpful for patients in reviewing their discharge instructions (example 1), but some are factually incorrect and may confuse patients. The AI model may state that the patient’s answer is incorrect or partially correct (example 3) when the patient’s response is actually completely correct. (2) While the responses are generally helpful, they still fall short of the sufficient, attentive responses a human physician would provide when educating patients. As shown in example 4, a physician would provide more information about the distinctions between the two medications, including the specific diseases for which they are prescribed.
8 Limitations and Ethical Considerations
This study offers valuable insights, but it has a few limitations that we would like to note.
Biases.
Large language models trained on vast amounts of text data can pick up biases present in the data. For example, they may prefer certain questions related to aspirin or associate certain health conditions with specific groups of people. They may also perpetuate misinformation and provide incorrect information. In addition, the people who participated in our evaluation have different levels of language proficiency and medical background. These biases may be mitigated by better aligning the model with each individual’s background and health literacy level.
Broader Impacts.
We have performed a preliminary study to educate patients on discharge instructions using interactive question answering. Although we evaluated our system using the MIMIC III dataset, which represents an intensive care unit setting, the system should be generalizable to other settings, including perioperative care (from preparation before the surgery to recovery after the surgery), cancer treatment, and chronic condition management. Our system may help patients receive customized information that is tailored to their individual needs and preferences.
Social Influence.
Our system has two pillars. First, it is grounded in discharge notes, from which we identify important medical events and their relationships that patients should know. Second, it serves an education purpose. For that, we explore the P.E.E.R. sequence to prompt the patient, evaluate, expand, and ask them to repeat the answer to reinforce their understanding. Additionally, social influence strategies such as small talk, empathy, and persuasion can be explored in the future to shape, reinforce, or change a patient’s behavior and promote engagement.
Privacy Implications.
LLMs can present privacy concerns in patient education when health records are used, potentially violating HIPAA regulations. However, in this study, we handle data usage with great care. We conduct all experiments on openly available real patient data and present an approach to creating synthetic patient discharge notes. Each synthetic discharge note used in this study has been reviewed by physicians to ensure its validity. We strictly limit our API usage to synthetic data.
9 Conclusion
In this study, we present PaniniQA, a patient-centric interactive question answering system designed to help patients understand and memorize their discharge instructions. PaniniQA generates educational questions from discharge instructions after identifying salient medical events and event relations. LLMs with prompting are promising for question-answer generation but sometimes hallucinate. Extensive evaluations highlight the importance of providing answer feedback.
Acknowledgments
The authors would like to express sincere gratitude to Center for Biomedical and Health Research in Data Sciences, UMass Lowell, which made this research possible. Hong Yu is supported in part by NIH R01DA056470 and 1R01AG080670, NSF IIS 2124126, and HSR&D 1I01HX003711-01A1. Fei Liu is supported in part by National Science Foundation grant IIS-2303678. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health, National Science Foundation, and Health Services Research & Development.
Notes
Our data and code are released at https://github.com/pengshancai/PaniniQA.
E.g., the sentence “You were admitted for diverticulitis and treated with antibiotics” is modified to “You were admitted for <dsyn> diverticulitis </dsyn> and treated with <medi> antibiotics </medi>”, where the special tokens <dsyn> and </dsyn> indicate the start and end positions of this event, and dsyn reflects that the event belongs to the category Disease.
That is, no relationship exists between the event pair; in addition, the types of the two events are restricted by Table 1.
Due to data sparsity, when training both the medical event and relation identification models, we first search for the optimal hyper-parameter set using the validation set. We then merge the validation set into the training set to train our final models.
Two licensed physicians and one medical student with hospital internship experience.
We tried a collection of prompts for a similar purpose and did not observe significant differences in the quality of the generated questions. We used the chosen prompt as it is straightforward to understand and leads to more succinct questions. Specifically, we instruct GPT-3 to generate at least four questions to benchmark against the smallest number of questions from the human annotator.
The conversation logs are re-used from the patient evaluation described in Section 7.6.
References
Author notes
Indicates equal contribution.
Action Editor: Nitin Madnani