Abstract
Automatic text simplification (TS) aims to automate the process of rewriting text to make it easier for people to read. A prerequisite for TS to be useful is that it should convey information that is consistent with the meaning of the original text. However, current TS evaluation protocols assess system outputs for simplicity and meaning preservation without regard for the document context in which output sentences occur or for how people understand them. In this work, we introduce a human evaluation framework to assess whether simplified texts preserve meaning using reading comprehension questions. With this framework, we conduct a thorough human evaluation of texts simplified by humans and by nine automatic systems. Supervised systems that leverage pre-training knowledge achieve the highest scores on the reading comprehension tasks among the automatic controllable TS systems. However, even the best-performing supervised system struggles with at least 14% of the questions, which participants mark as “unanswerable” based on the simplified content. We further investigate how existing TS evaluation metrics and automatic question-answering systems approximate the human judgments we obtained.
1 Introduction
Rewriting text so that it is easier to understand has the potential to help a wide range of audiences, including non-native speakers (Petersen and Ostendorf, 2007; Allen, 2009; Crossley et al., 2014), children (Watanabe et al., 2009), and people with reading or cognitive disabilities (Alonzo et al., 2020), access information more easily (Chandrasekar et al., 1996; Stajner, 2021). Online resources such as Newsela (Newsela Inc, 2023), OneStopEnglish (Macmillan Education, 2023), and the Cochrane systematic reviews (Cochrane Collaboration, 2023) provide text articles simplified by human editors so that they are easier to understand by K-12 students, English speakers with limited proficiency, and lay people seeking to understand medical literature, respectively. This has motivated a wealth of Natural Language Processing research on text simplification, framed as the task of rewriting an input text into a simpler version while preserving the core meaning of the original (Chandrasekar and Srinivas, 1997). The task has been addressed with approaches ranging from dedicated supervised systems (Specia, 2010; Zhang and Lapata, 2017; Scarton and Specia, 2018; Martin et al., 2020; Jiang et al., 2020; Devaraj et al., 2021; Sheang and Saggion, 2021; Agrawal and Carpuat, 2022; Martin et al., 2022) to prompting black-box pre-trained models (Feng et al., 2023; Kew et al., 2023).
However, texts that are easier to read are not helpful if they mislead readers by providing information that is not consistent with the original document. This can happen with automatic text simplification (TS) outputs where deletions or inaccurate rewrites can change how a text is understood (Devaraj et al., 2022). Assessing to what extent the meaning of the original text is preserved should therefore be a critical dimension of TS evaluation (Stajner, 2021), and a pre-requisite to determining whether and how TS can be used in practice. Additionally, evaluating individual sentences out of context may not be sufficient to establish whether model-generated texts preserve meaning, as human simplifications often occur at the document or the passage level (Devaraj et al., 2022). Yet, TS outputs are primarily evaluated intrinsically, with automatic metrics that compare system outputs with human-written reference simplifications and/or the original source (Papineni et al., 2002; Xu et al., 2016; Maddela et al., 2023), or with generic human assessments of simplicity and meaning preservation of individual sentences outside of a context of use (Schwarzer and Kauchak, 2018). While these evaluations can guide model development, they do not address whether readers get information from the simplified text that is consistent with the original content.
In this work, we conduct a human evaluation of the ability of state-of-the-art TS systems to preserve the meaning of the original text by measuring how well people can answer questions about key facts from the original text after reading a simplified version. We design reading comprehension (RC) tasks to directly assess meaning preservation in TS, different from prior uses of reading comprehension to assess people’s reading efficiency (Angrosh et al., 2014; Laban et al., 2021). This framework lets us conduct a controlled comparison of simplified texts, whether written by humans or by TS systems: We compare people’s ability to answer questions about the original text, a simplified version written by humans, and nine TS-generated versions that represent a diverse set of supervised and unsupervised approaches from the recent TS literature.1
We first discuss relevant literature for TS evaluation and the use of RC exercises to assess simplified or other model-generated texts in Section 2. Next, Section 3 elaborates on our RC-centered human evaluation framework, and Section 4 delves into the various design choices we made. Section 5 demonstrates the robustness of our evaluation and presents the main results. As we will see, supervised systems that utilize pre-training knowledge achieve the highest level of accuracy in RC tasks compared to other automatic controllable TS systems (§5.2). However, at least 14% of the questions remain unanswerable even for the best-performing system due to the errors introduced by these systems (§5.3). In Section 6, we shift our focus towards a meta-evaluation of existing automatic TS evaluation metrics which indicates that the 3-way comparison used in SARI makes it a reliable metric for system-level evaluation at the paragraph level. Finally, we include a preliminary discussion and analysis of the potential for automating the RC-based evaluation through the application of model-based question-answering techniques in Section 7.
2 Background
How to design human and automatic evaluation protocols for TS is a research question unto itself. While automatic metrics are key to system development, commonly used metrics like BLEU (Papineni et al., 2002), SARI (Xu et al., 2016), or the Flesch-Kincaid Grade Level (Flesch, 2007) have low correlation with human judgments of simplicity (Sulem et al., 2018; Alva-Manchego et al., 2021; Maddela et al., 2023; Tanprasert and Kauchak, 2021). This suggests that these metrics can fail to capture meaningful differences between simplified texts. Furthermore, there is no standardized framework for measuring the adequacy of simplified outputs (Stajner, 2021; Grabar and Saggion, 2022), where adequacy refers to the degree to which the generated text accurately conveys the meaning from the original text (Blatz et al., 2004).2
Prior work highlights the importance of manually evaluating TS systems. For instance, Maddela et al. (2023) introduce RANK & RATE, a human evaluation framework that rates simplifications from several models at the sentence level by leveraging automatically annotated edit operations that are verified by annotators. These edit operations are then used to rate the output texts on a scale of 0 to 100. However, this rating is meant to jointly account for meaning preservation, simplicity, and fluency. Devaraj et al. (2022) show that factual errors often appear in both human- and automatically generated simplified texts (at the sentence level), and define an error taxonomy to account for both the nature and severity of errors. Yet, these intrinsic evaluations do not directly tell us whether people correctly understand key facts conveyed in the original after reading a simplified text. Moreover, the evaluation is only performed at the sentence level, without accounting for the context in which sentences appear, which can impact the overall assessments, as noted by Devaraj et al. (2022).
Reading comprehension tests are standard tools used by educators to assess readers’ understanding of text materials, and thus provide an assessment of TS that is more in line with its intended use. They have been used to show that human-simplified texts are easier to comprehend by L2 learners (Long and Ross, 1993; Tweissi, 1998; OH, 2001; Crossley et al., 2014; Rets and Rogaten, 2021), as well as secondary and post-secondary students (Heydari et al., 2013). For instance, Long and Ross (1993) conduct a reading comprehension study with 483 Japanese students with varying English language proficiency levels and found that participants who had access to linguistically simplified or elaborated texts scored higher on the RC tasks than those who read the original text. Similarly, Crossley et al. (2014) showed a linear effect of text complexity on comprehension even when accounting for individual language and reading proficiency differences as well as their background knowledge. Rets and Rogaten’s (2021) study with 37 adult English L2 users showed better comprehension and faster recall for participants with low English proficiency levels with simplified texts. Finally, in a within-subject study with four original and four simplified texts involving 103 participants with varying levels of English proficiency (beginner to native), Temnikova and Maneva (2013) show that utilizing Controlled Language (Temnikova et al., 2012) for TS improves reading comprehension. All these studies are conducted at the paragraph or the document level based on human-written simplifications which are implicitly assumed to be correct.
The use of reading comprehension for the evaluation of automatically simplified texts has been more limited. Angrosh et al. (2014) first used reading comprehension to evaluate automatically simplified texts from multiple TS models, with non-native readers of English. They conduct a multiple-choice test using five news summaries chosen from the Breaking News English website, originally at reading level 6 (hard) and simplified manually or via automatic TS systems. Their study found no significant differences between the comprehension accuracy of different user groups when reading automatically generated simplifications. However, they note that the drop in comprehension scores for some of these systems could be attributed to content removal, which can make some questions unanswerable. Hence, it is not clear whether the differences are non-significant due to user understanding, errors introduced by TS systems, or the effectiveness of simplifications. Laban et al. (2021) also conduct a reading comprehension study to evaluate the usefulness of automatic TS outputs, with automatically generated questions that can be answered from the original text and human-written references. They found that shorter passages generated by automatic TS systems lead to a speed-up in RC task completion time regardless of simplicity. However, the automatic generation of questions mostly limited them to factoid questions, thus limiting the scope of the comprehension tests, and it is unclear whether TS errors could render the RC questions unanswerable.
Evaluating text generation via automatic question answering (QA) has also received much attention, including for machine translation (Han et al., 2022), and for text summarization, where it has been used to assess the factuality (Wang et al., 2020) or faithfulness (Durmus et al., 2020) of model-generated summaries. For summarization evaluation, questions have been automatically created based on key information from the model-generated summary, such as important nouns or entities. An automatic QA system is then employed to generate answers to these questions using the original document as a reference. The quality of the generated summary is determined by comparing these answers using metrics that measure semantic similarity or exact matches. However, unlike summarization, where the primary goal is to condense a text (either in an extractive or abstractive fashion), text simplification also involves making structural and linguistic changes so that the text is easier to comprehend which the existing automatic QA-based evaluations are not equipped to assess.
In this work, we design a reading comprehension task to assess the ability of TS systems to preserve meaning via carefully constructed multiple-choice questions targeting language comprehension and use it to conduct a thorough controlled evaluation of a diverse set of state-of-the-art TS systems. We conduct our human evaluation at the paragraph level as humans naturally tend to simplify complex text at this granularity, and utilizing complete texts for measuring RC would yield more accurate results compared to relying on individual sentences (Leroy et al., 2022).
3 A Reading Comprehension-based Human Evaluation Framework
Overview
Our human evaluation is based on the following task: Participants are presented with text and then are asked questions to test their understanding of some of the information conveyed in the text, as illustrated in Figure 1. We seek to measure whether participants who read simplified versions of the original paragraph can answer questions as well as those who read the original. However, our goal is not to assess the participants but the TS systems: When working with participants who are proficient in the language tested, we assume that differences in reading comprehension accuracy indicate differences between the quality of TS systems that produce the different simplifications.
OneStopQA
Within this simple framework, the design of the RC questions and answers is critical to directly evaluate the correctness of automatic TS systems. We build on the OneStopQA reading comprehension exercises created using the STARC (Structured Annotations for Reading Comprehension) annotation framework (Berzak et al., 2020), which is well suited to our task since it targets the real-world need of supporting readers with low English proficiency, and there is already evidence that it is a sound instrument to capture differences in reading comprehension from human-written text.
Specifically, OneStopQA is based on texts from the onestopenglish.com English language learning portal (Vajjala and Lučić, 2018), which are drawn from The Guardian newspaper. Questions are designed to assess language comprehension rather than numerical reasoning or extensive external knowledge. More importantly, these questions cannot be answered with simple string-matching and guessing strategies. Furthermore, the answer options under the STARC annotation framework follow a structured format that reflects four fundamental types of responses, ordered by miscomprehension severity: A indicates correct comprehension, B shows the ability to identify essential information but not fully comprehend it, C reflects some attention to the passage’s content, and D shows no evidence of text comprehension (Berzak et al., 2020). Participants are presented with the answer options in a randomized order to minimize any potential bias or pattern recognition. The correct answer typically is not present verbatim in the critical span, a text span from the passage upon which the question is formulated.3 We note that the questions only target a subset of the information conveyed in a passage, and hence, our evaluation framework does not provide a measure of completeness. In other words, correctly answering the RC questions does not require understanding every piece of salient information from the original.4
Further, prior work suggests that OneStopEnglish texts and OneStopQA questions provide a sound basis for evaluating automatic TS, as they can capture differences in reading comprehension from manually simplified text: Gooding et al. (2021) found a statistically significant relationship between users’ scrolling interactions and the text difficulty level in a 518-participant study, and Vajjala and Lucic (2019) showed that the nature of the reading comprehension questions can impact text understanding.
Targeting Answerability
We augment the OneStopQA answer candidates with a fifth option motivated by the failure modes of automatic TS. For each question, participants have the option to pick “unanswerable” (UA), which they are instructed to select when “The questions or the answer options are not supported by the passage.” This lets us directly measure how often readers judge that there is no support for answering the question based on the input text, a problem that is more salient when presenting participants with automatic rather than human-written simplifications. The resulting reading comprehension problems are illustrated in Figure 1.
Text Granularity
Participants are presented with a paragraph of text before answering each question, thus moving away from the prior focus on evaluating TS at the sentence level. In real-world settings, people are unlikely to use text simplification on isolated sentences and might be able to understand important information by making inferences from the context. Thus, evaluating text simplification outputs at the paragraph level strikes a good balance between providing a realistic amount of context to readers and keeping the task manageable in length.
Measures
For each condition, we report two measures: accuracy (Acc), the percentage of questions for which participants select the correct answer, and answerability (Ans), the percentage of questions that participants do not mark as unanswerable (UA).
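For concreteness, the minimal Python sketch below shows how both measures can be computed from a list of collected answer labels; the data format is a simplification for illustration rather than the released annotation schema.

```python
from collections import Counter

def rc_scores(responses):
    """Compute Acc and Ans for one evaluation condition.

    `responses` is a hypothetical list of answer labels, where "A" is the
    correct option, "B"-"D" are distractors of decreasing comprehension,
    and "UA" marks a question judged unanswerable.
    """
    counts = Counter(responses)
    total = len(responses)
    acc = 100.0 * counts["A"] / total              # % of questions answered correctly
    ans = 100.0 * (total - counts["UA"]) / total   # % of questions not marked unanswerable
    return acc, ans

# Toy usage with 180 illustrative annotations for one condition:
example = ["A"] * 140 + ["B"] * 11 + ["C"] * 4 + ["D"] * 2 + ["UA"] * 23
print(rc_scores(example))  # -> (77.8, 87.2) approximately
```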
4 Experimental Setup
First, we describe experiment details including data, participant selection, and study design. Then, we outline the selected TS systems for evaluation.
4.1 Study Design
Data
The OneStopQA dataset includes 30 articles containing 162 paragraphs in total at three difficulty levels: Elementary, Intermediate, and Advanced. Each passage is accompanied by three multiple-choice questions that can be answered at all levels of difficulty. The simplified versions include common text simplification operations such as text removal, sentence splitting, and text rewriting. We select the first two paragraphs from each of the 30 articles and the associated questions, resulting in 60 unique passages and 180 questions in total. Unlike prior studies that evaluate the impact of human-generated simplifications on various target audiences using only a limited number of articles (typically 1-5) and questions (around 3-5) (Long and Ross, 1993; Tweissi, 1998; OH, 2001; Crossley et al., 2014; Rets and Rogaten, 2021), our evaluation is on a larger scale (180 diverse passage-question pairs), which provides more statistical support to rank different systems.
Participants
The task is conducted on the Prolific crowd-sourcing platform.5 We recruit 112 native speakers of English between the ages of 18 and 60, identified by their first language and with an approval rating of at least 80%, to evaluate the correctness of TS systems. Participants are paid directly through the platform at an average rate of USD 15/hour.
Task Design
Each participant is provided with the following instruction: “In this study, you will be presented with 6 short excerpts of English text, accompanied by three multiple-choice questions. You are asked to answer the questions based on the information presented in the text.” A participant is presented with a random subset of 6 texts from one of the 11 conditions: original, simplified by humans, or simplified by one of the nine TS systems. Each passage-question pair is annotated by one native English speaker, resulting in 1,980 annotations. Collected annotations were manually spot-checked for straightlining (a pattern where participants consistently select the same response option) and for completion times to ensure that participants were paying attention to the RC task.
4.2 Models for Evaluation
We generate simplified outputs for the selected passages at “Advanced” difficulty, i.e., the Original text, using the systems described below as they are representative of the variety of architectures and learning algorithms (supervised, unsupervised, black-box) proposed in the TS literature:
Keep-it-simple (KIS) (Laban et al., 2021) is an unsupervised TS system trained using a reinforcement learning framework to enforce the generation of simple, adequate, and fluent outputs at the paragraph level.6
MUSS (Martin et al., 2022) finetunes a BART-large (Lewis et al., 2020) model with control tokens (Martin et al., 2020) extracted on paired text simplification datasets and/or mined paraphrases to train both supervised and unsupervised TS systems.7 We use the suggested hyperparameters from the original paper to set the control tokens during simplification generation.8
ControlT5-Wiki (Sheang and Saggion, 2021) is a supervised controllable sentence simplification model that finetunes a T5-base model with control tokens. Again, we use the suggested hyperparameters from the original paper.9
ControlSup (Scarton and Specia, 2018) is a controllable supervised TS model that trains a transformer-based sequence-to-sequence model with the U.S. target grade as a side-constraint to generate audience-specific simplified outputs. We generate simplified outputs corresponding to Grades 7 and 5 to match the target complexity of the human-written Elementary simplified texts and to assess the impact of the degree of simplification on correctness.
EditingCurriculum (EditCL), proposed by Agrawal and Carpuat (2022), trains a supervised edit-based non-autoregressive model that generates a simplified output for a desired target U.S. grade level through a sequence of edit operations, such as deletions and insertions, applied to the complex input text. We generate simplified outputs corresponding to Grades 7 and 5.10
ChatGPT We generate paragraph-level simplified outputs by prompting the black-box ChatGPT model with the following prompt:11
{Text}
Rewrite the above text so that it can be easily understood by a non-native speaker of English:
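A minimal sketch of issuing this prompt through the openai Python client is shown below; the model identifier and decoding settings are illustrative assumptions rather than the exact configuration used in this study.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def simplify(paragraph: str, model: str = "gpt-3.5-turbo") -> str:
    """Ask a chat model to rewrite a paragraph for non-native readers of English."""
    prompt = (
        f"{paragraph}\n\n"
        "Rewrite the above text so that it can be easily understood "
        "by a non-native speaker of English:"
    )
    response = client.chat.completions.create(
        model=model,     # illustrative; not necessarily the checkpoint behind ChatGPT at study time
        messages=[{"role": "user", "content": prompt}],
        temperature=0,   # favor stable outputs for evaluation
    )
    return response.choices[0].message.content.strip()
```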
We also include the Elementary version of the text from the OneStopEnglish corpus to compare the reading comprehension of the original and model-generated simplified texts against a ground truth reference as a control condition.
Statistics for the human-written and automatically generated passages, as well as a summary of the models, are presented in Table 1. Automatically generated or manually written simplified texts are shorter and include more sentences (due to sentence splitting) than the original unmodified text. Systems that use pre-trained knowledge (MUSS, T5, ChatGPT) receive a higher simplicity (SARI) score than models trained from scratch (ControlSup, EditCL), except KIS, which achieves low simplicity and adequacy scores according to automatic metrics.12 Both ControlSup and EditCL generate simplified outputs at a higher complexity level than intended (the average FKGL for Grades 7 and 5 corresponds to Grades 9 and 7, respectively). Furthermore, the outputs span a wide range of adequacy and simplicity scores: some systems trade off adequacy for simplicity, with low BERTScore and high SARI values (e.g., ChatGPT, EditCL-Grade5), and vice-versa (e.g., ControlSup-Grade7). While the range of BERTScore values appears small, differences of >0.005 are statistically significant, suggesting that the 0.4+ wide range includes meaningful differences within this set of systems.
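The paragraph-level metrics in Table 1 can be approximated with standard open-source packages; the sketch below (using easse, textstat, bert_score, and python-Levenshtein) illustrates the metric calls rather than reproducing our exact evaluation scripts.

```python
import textstat
import Levenshtein
from easse.sari import corpus_sari
from bert_score import score as bert_score

def paragraph_metrics(originals, outputs, references):
    """Compute paragraph-level TS metrics for aligned lists of texts."""
    sari = corpus_sari(orig_sents=originals, sys_sents=outputs,
                       refs_sents=[references])  # one reference (Elementary) per paragraph
    fkgl = sum(textstat.flesch_kincaid_grade(o) for o in outputs) / len(outputs)
    # BERTScore of the output against the Original (src) and the Elementary reference (ref)
    _, _, f1_src = bert_score(outputs, originals, lang="en")
    _, _, f1_ref = bert_score(outputs, references, lang="en")
    lev_src = sum(Levenshtein.distance(o, s)
                  for o, s in zip(outputs, originals)) / len(outputs)
    return {"SARI": sari, "FKGL": fkgl,
            "BERTScore-src": f1_src.mean().item(),
            "BERTScore-ref": f1_ref.mean().item(),
            "LevDist-src": lev_src}
```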
5 Results
We first analyze the results to show the validity of the evaluation set-up, before comparing TS systems on the accuracy and answerability metrics.
5.1 Validity of the Human Evaluation
Results on Human-written Texts Align with the Literature.
As can be seen in Table 2, human-written texts (Original, Elementary) achieve the highest accuracy scores of approximately 78%. This is consistent with a study by Berzak et al. (2020) who report that Prolific crowd workers achieve a score of 80.7% when tested on all 162 passages from OnestopQA. As expected, even with human written texts, participants do not answer all questions perfectly, reflecting individual differences in reading proficiency, background knowledge, and familiarity with the topic (Young, 1999; Rets and Rogaten, 2021) as well as the difficulty of the questions. These results thus provide an upper bound to contextualize the scores obtained by TS systems. Furthermore, the scores obtained on the Original and Elementary versions are very close, as expected when working with native speakers.
Table 2: Reading comprehension results by condition: percentage of questions answered correctly (% Correct), percentage of responses for distractor options B, C, and D, and resulting system rank.

| Type | MODEL | Pre-trained | % Correct | B | C | D | Rank |
|---|---|---|---|---|---|---|---|
| Human | Original | – | 78.33 | 6.11 | 2.22 | 1.11 | 1 |
| Human | Elementary | – | 77.22 | 5.56 | 2.78 | 0.00 | 2 |
| Supervised | MUSS-Sup | ✓ | 76.11 | 6.67 | 1.67 | 1.67 | 3 |
| Supervised | ControlT5-Wiki | ✓ | 74.44 | 6.11 | 2.78 | 1.67 | 4* |
| Supervised | ControlSup-Grade7 | ✗ | 70.56 | 3.89 | 2.78 | 2.78 | 7 |
| Supervised | EditCL-Grade7 | ✗ | 69.44 | 10.56 | 2.22 | 0.56 | 8** |
| Supervised | EditCL-Grade5 | ✗ | 69.44 | 10.00 | 2.22 | 0.00 | 8** |
| Supervised | ControlSup-Grade5 | ✗ | 67.78 | 11.11 | 3.89 | 0.00 | 10 |
| Black Box | ChatGPT | ✓ | 74.44 | 9.44 | 1.11 | 0.00 | 4* |
| Unsupervised | MUSS-Unsup | ✓ | 73.33 | 6.67 | 2.78 | 1.11 | 6 |
| Unsupervised | KIS | ✓ | 20.50 | 7.22 | 3.89 | 3.89 | 11 |
Inter-annotator Agreement (IAA).
We collect a second set of annotations for a subset of 6 passages, covering all 11 conditions, and compute the IAA using Cohen’s kappa (McHugh, 2012). The IAA score for selecting the correct answer indicates moderate agreement (0.437) despite the high subjectivity (individual comprehension differences) and complexity (5 answer options) of the task.
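The agreement statistic can be computed directly from the paired labels; a minimal sketch, assuming the two annotation passes are stored as parallel lists of answer options, is shown below.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical paired labels from the two annotation passes over the IAA subset;
# each element is one of the options "A"-"D" or "UA".
annotator_1 = ["A", "A", "UA", "B", "A", "C", "A", "D"]
annotator_2 = ["A", "B", "UA", "B", "A", "A", "A", "C"]

print(f"Cohen's kappa: {cohen_kappa_score(annotator_1, annotator_2):.3f}")
```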
System Rankings are Stable.
Sampling 50 random subsets of k passages for k ∈ {5, 10, 20, 30, 40, 50, 60}, we aggregate the mean accuracy score for each subset size and show the resulting system rankings in Figure 2. Using more than 40 unique passages per condition, i.e., approximately 120 questions, stabilizes the rankings among the 11 systems, with the (ChatGPT, ControlT5-Wiki) and (EditCL-Grade7, EditCL-Grade5) system pairs tied at the same rank.
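The subsampling procedure is straightforward to reproduce; the sketch below assumes per-passage accuracies are available as arrays keyed by system name, which is an illustrative data layout rather than our exact analysis code.

```python
import numpy as np
from scipy.stats import rankdata

def ranking_stability(per_passage_acc, k_values=(5, 10, 20, 30, 40, 50, 60),
                      n_subsets=50, seed=0):
    """Rank systems by mean accuracy over random subsets of passages.

    `per_passage_acc` maps each system name to an array holding, for every
    passage, the fraction of its three questions answered correctly.
    """
    rng = np.random.default_rng(seed)
    systems = sorted(per_passage_acc)
    n_passages = len(next(iter(per_passage_acc.values())))
    rankings = {}
    for k in k_values:
        mean_acc = np.zeros(len(systems))
        for _ in range(n_subsets):
            idx = rng.choice(n_passages, size=k, replace=False)
            mean_acc += np.array([per_passage_acc[s][idx].mean() for s in systems])
        # Higher mean accuracy -> better (smaller) rank; ties share the same rank.
        rankings[k] = dict(zip(systems, rankdata(-mean_acc / n_subsets, method="min")))
    return rankings
```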
Taken together, these findings suggest that the evaluation framework is sound and provides a valid instrument to evaluate and compare systems.
5.2 TS Adequacy Findings
Table 2 shows the Acc scores for human-written texts (Original, Elementary) and automatic simplifications generated by supervised (edit- and non-edit-based), unsupervised, and black-box LLM systems. Systems achieve a wide range of scores, from as low as 20% to within 1% of the accuracy achieved on human-written text. We discuss the main findings below.
Results first show that systems based on unsupervised pre-training yield more correct answers. This is the case for MUSS-Sup, which achieves the highest accuracy among all systems. ChatGPT attains a score similar to that of ControlT5-Wiki, a supervised sentence-level TS model, showing the benefits of large-scale pre-training and of reinforcement learning with human feedback, even though it is unknown whether ChatGPT was trained on TS or related tasks. Overall, the scores show that the best-performing TS systems rewrite content so that people understand the information tested as well as they do from human-written text. This suggests that those systems are worth including in usability testing in future work, asking not only whether rewrites are adequate, as we do here, but also whether they are useful to readers who need simplified text.
At the other end of the spectrum, the texts simplified by KIS lead to only 20% of questions being answered correctly. This is consistent with the low BERTScore for this system in Table 1 and with manual inspection, which suggests that KIS is prone to deletions and hallucinations that do not preserve the meaning of the original. We study the impact of deletions in more depth in the next section.
In the middle of the pack, among systems for grade-specific TS, edit-based models outperform autoregressive models. The autoregressive model ControlSup exhibits a 3% decrease in accuracy due to more aggressive deletion (Table 1) when simplifying to Grade 5, whereas edit-based models like EditCL maintain their accuracy even when generating simpler outputs at both grade levels 7 and 5. However, these edit-based models also lead to miscomprehension, as suggested by the relatively high percentage of questions marked with option B by the human participants. Note that option B represents a plausible misunderstanding of the critical span upon which the question is based (Section 3). We hypothesize that this could be due to the reduced fluency of simplifications generated by edit-based models.
5.3 TS Answerability Findings
We show the answerability score, Ans, for all evaluation conditions in Figure 5.
Human-written text does not achieve perfect scores. Using the STARC annotation framework should ideally yield answerable questions, yet in practice, participants still mark 12%–14% of questions with UA. Manual inspection shows that these questions require making complex inferences and hypotheses about the plausibility of the various options. As a result, when given the UA option, participants are more conservative in selecting the four other alternatives.
Most systems achieve 83%–86% answerability for questions, except for KIS, which scores the lowest at 35.56%. On the subset of questions answerable by both Original and Elementary texts, scores range from 53%–92%. This indicates that errors in model-generated texts hinder question answerability beyond individual comprehension differences. Models, except KIS, achieve similar scores but make different errors, as shown in Figure 3, where no passage-question pairs are correctly answered by all models.
Building upon the finding of Devaraj et al. (2022), who show the prevalence of deletion errors in TS system outputs and our own manual inspection, we hypothesize that over-deletion is the key culprit that makes questions unanswerable. To test this hypothesis automatically, we examine how the unigram overlap (after stop word removal) between the question and the passage (Support(Q)) and the answer options and the passage (Support(A)) influence question answerability when model-generated outputs are used (Sugawara et al., 2018). While, in most cases, the correct answer does not appear as is in the critical span, we expect the unigram overlap to still provide a useful signal as the rephrased version often shares at least some unigrams with the critical span.
Figure 4 shows that Support(A) is a more reliable predictor of UA, with a true positive rate (TPR) of 0.675 at a false positive rate (FPR) of 0.406, than Support(Q) (TPR: 0.565, FPR: 0.612). Answers that appear verbatim in the passage (Support(A) = 1.0) are correctly answered 93% of the time. However, when the question lacks support in the passage, the unigram overlap with just the answer becomes an insufficient signal. Therefore, we also report the distribution of the product of Support(A) and Support(Q) in the same figure to directly capture the support for both the question and the answer options in the passage, i.e., the UA option. The resulting TPR for predicting UA is 0.605 at an FPR of 0.353, indicating that the incorrect deletion of partial or complete phrases by the systems affects the support for both the question and the answer options, making RC questions unanswerable.
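A simple instantiation of these support scores and of the TPR/FPR analysis is sketched below; the toy stop-word list and the use of the correct answer for Support(A) are simplifying assumptions made for illustration.

```python
from sklearn.metrics import roc_curve

STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "was", "for", "on"}  # toy list

def support(text: str, passage: str) -> float:
    """Fraction of content unigrams in `text` that also occur in the passage."""
    toks = {t for t in text.lower().split() if t not in STOP_WORDS}
    passage_toks = {t for t in passage.lower().split() if t not in STOP_WORDS}
    return len(toks & passage_toks) / max(len(toks), 1)

def ua_roc(items):
    """`items` holds hypothetical (question, correct_answer, passage, marked_ua) tuples."""
    labels = [int(marked_ua) for _, _, _, marked_ua in items]
    # Low support should predict UA, so negate the combined signal for roc_curve.
    scores = [-(support(q, p) * support(a, p)) for q, a, p, _ in items]
    return roc_curve(labels, scores)  # returns (fpr, tpr, thresholds)
```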
These results temper the adequacy results, suggesting that even the best-performing systems delete content. Taken together, these findings call for more research on calibrating the deletion tendencies of TS systems, and for human subject studies to develop machine-in-the-loop workflows that validate automatically simplified content before it is presented to readers.
6 Evaluating Automatic TS Evaluation Metrics
We now investigate to what extent automatic TS evaluation metrics frequently used in the literature capture the system rankings obtained via the RC task (Al-Thanyyan and Azmi, 2021; Maddela et al., 2023; Devaraj et al., 2022). We compute the Spearman rank correlation between system-level scores from selected automatic metrics and the RC accuracy scores in Table 3. For meaning preservation, we evaluate BLEU, BERTScore (Zhang et al., 2020), and the Levenshtein distance computed between the system output and the Elementary text (Ref) or between the system output and the Original text (Src). For the simplicity and readability dimensions, we report correlation scores with SARI and FKGL, respectively. SARI measures lexical simplicity based on the n-grams that are kept (K), added (A), and deleted (D) by the system relative to the original text and to the reference simplified (Elementary) texts. Note that all metrics are computed at the paragraph level, just like in the RC task, unlike prior evaluations that use these metrics for sentence-level simplification. We also report correlation scores with QAFactEval, a QA-based metric designed to evaluate factual consistency in summaries (Fabbri et al., 2022).13
Table 3: Spearman rank correlation between system-level automatic metric scores and RC accuracy, for all systems (All) and excluding the outlier system (All −{KIS}).

| Dimension | Metric | All | All −{KIS} |
|---|---|---|---|
| meaning (ref) | BLEU | −0.193 | 0.157 |
| meaning (ref) | BERTScore (P) | 0.418 | 0.167 |
| meaning (ref) | BERTScore (R) | 0.292 | −0.012 |
| meaning (ref) | BERTScore (F1) | 0.310 | 0.012 |
| meaning (ref) | LevDist | −0.142 | −0.634 |
| meaning (src) | BERTScore (P) | 0.084 | −0.311 |
| meaning (src) | BERTScore (R) | 0.033 | −0.383 |
| meaning (src) | BERTScore (F1) | 0.033 | −0.383 |
| meaning (src) | LevDist | −0.159 | −0.659 |
| simplicity | SARI (A) | 0.686 | 0.719 |
| simplicity | SARI (K) | −0.134 | −0.622 |
| simplicity | SARI (D) | 0.301 | 0.707 |
| simplicity | SARI (Avg.) | 0.728 | 0.778 |
| readability | FKGL | 0.126 | 0.335 |
| QA-based | QAFactEval (F1) | 0.126 | −0.252 |
| QA-based | QAFactEval (EM) | 0.025 | 0.395 |
Overall, SARI achieves the best correlation across the board, with or without the outlier system (KIS). The addition component (A) of SARI, which rewards the insertion of n-grams present in the simplified reference but absent from the original text, achieves a moderate-to-high correlation score (0.686–0.719) in both settings. The Levenshtein edit distance of the system output to the Original (−0.659) and the Elementary (−0.634) text receives a moderate-to-high negative correlation with human judgments, outperforming both the surface-form metric (BLEU) and the embedding-based metric (BERTScore) after removing the outlier system (KIS). We hypothesize that metrics that focus on similarity to only the original or only the simplified text do not fully capture the balance between simplicity and adequacy. SARI’s 3-way comparison between the input, the output, and the reference is key to yielding system rankings that are consistent with those based on our accuracy results, and could be further repurposed to align evaluation metrics more directly with the accuracy scores.
Furthermore, QAFactEval exhibits at best a weak correlation (0.395) with human judgments. This is consistent with recent findings by Kamoi et al. (2023), who show that automatically extracting facts from summaries introduces a fundamental evaluation problem: current QA-based frameworks not only struggle to accurately identify errors in generated summaries but also perform worse than straightforward exact-match comparisons.
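The system-level correlations reported in Table 3 reduce to a simple computation over paired score lists, as sketched below with illustrative values.

```python
from scipy.stats import spearmanr

def metric_correlation(metric_scores, rc_accuracy):
    """Spearman rank correlation between system-level metric scores and RC accuracy.

    Both lists contain one value per TS system, in the same order.
    """
    rho, p_value = spearmanr(metric_scores, rc_accuracy)
    return rho, p_value

# Illustrative usage with made-up metric values for three systems:
print(metric_correlation([42.1, 38.5, 25.0], [76.1, 74.4, 20.5]))
```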
7 Model-based Question Answering
Our evaluation so far has relied on human-written questions answered by crowd workers, using either model-generated or human-written texts. Automating one or both components would help scale the evaluation and port it to new settings more flexibly. Recent work suggests that this might be plausible: Krubiński et al. (2021) show that automatically generated questions and answers can be used to evaluate Machine Translation systems at the sentence level, and automatic QA techniques (Fabbri et al., 2022; Wang et al., 2020) have been used to assess the factuality and faithfulness of summarization systems.
Here, we assess the performance of a state-of-the-art QA system in recovering the gold-standard ranking induced by human judgments, leaving the more complex study of multiple-choice RC question generation to future work. We use UnifiedQA v2,14 a QA model trained to answer questions in four different formats using 20 different datasets, which has been shown to generalize better to unseen datasets than models specialized for individual datasets. We use the format recommended in the original paper, {question} \n (A) {choice 1} (B) {choice 2} ... \n {paragraph}, to generate answers for all the conditions. The Spearman rank correlation between Exact Match (EM) and ground-truth accuracies (C) is 0.838 for all systems and 0.744 when excluding KIS. However, we note that the system’s ability to distinguish closely competing systems (highlighted by the same color) is limited, as shown in Figure 6.
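A sketch of running this inference with the transformers library is shown below; the checkpoint name is one of the publicly released UnifiedQA v2 sizes and is given for illustration, not as the exact model size used here. The generated string is then compared to the gold answer option via exact match (EM) to produce the scores discussed above.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Checkpoint name is illustrative; see the UnifiedQA repository for all released v2 sizes.
MODEL_NAME = "allenai/unifiedqa-v2-t5-large-1363200"
tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)

def answer(question: str, choices: list, paragraph: str) -> str:
    """Format one RC item in the UnifiedQA multiple-choice format and decode an answer."""
    options = " ".join(f"({chr(ord('A') + i)}) {c}" for i, c in enumerate(choices))
    # UnifiedQA expects a literal "\n" token as field separator and lower-cased input.
    source = f"{question} \\n {options} \\n {paragraph}".lower()
    inputs = tokenizer(source, return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, max_new_tokens=32)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```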
Interestingly, QA using human-simplified text achieves higher accuracy than using original unmodified text. This finding is in line with prior work where TS has been shown to improve the performance of multiple downstream NLP tasks such as information extraction (Miwa et al., 2010; Schmidek and Barbosa, 2014), parsing (Chandrasekar et al., 1996), semantic role labeling (Vickrey and Koller, 2008), machine translation (Gerber and Hovy, 1998; Štajner and Popovic, 2016; Hasler et al., 2017; Štajner and Popović, 2018; Miyata and Tatsumi, 2019; Mehta et al., 2020), among others (Van et al., 2021). This suggests that automating part of the evaluation framework is a direction worth investigating in more depth in future work.
8 Conclusion
We introduced an evaluation framework based on reading comprehension to directly assess whether TS systems correctly convey salient information from the original texts to readers. This framework lets us conduct a thorough human evaluation of the adequacy of ten simplified versions of each text: a human-written simplification and the outputs of nine TS systems.
Supervised systems that leverage pre-trained knowledge (MUSS, T5) produce texts that lead to the highest reading comprehension accuracy, approaching the scores obtained on human-written texts. Prompted LLMs (ChatGPT) perform well but are not as accurate as supervised systems. However, we find that even those systems do not preserve the meaning of the original text, with at least 14% of questions marked as “unanswerable” on the basis of the text they generate.
When human evaluation is not practical, our analysis suggests that SARI is a better metric than meaning-preservation metrics such as BERTScore and BLEU to rank systems by adequacy, and that model-based QA can approximate system rankings but at the cost of reduced discriminative power across systems and can introduce other confounding factors.
Overall, these results confirm the importance of directly evaluating the accuracy of the information conveyed by TS systems, and suggest that while some systems are overall correct enough to warrant usability studies, all systems still make critical errors. This motivates future work on machine-in-the-loop workflows to let editors and readers rely on TS appropriately (Leroy et al., 2022), and on improving the over-deletion of content by current TS systems. Our human evaluation framework provides a blueprint for evaluating whether correct TS outputs improve reading comprehension for people who have difficulty understanding complex texts, which we intend to investigate in future work.
Acknowledgments
We thank our TACL action editor, the anonymous reviewers, and the members of the UMD CLIP lab for their helpful and constructive comments on the paper. We also want to thank J. Jessy Li, Ani Nenkova, Philip Resnik, Jordan Boyd-Graber and Abhinav Shrivastava for their feedback on the earlier versions of the work. This research is supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via the HIATUS Program contract #2022-22072200006, the NSF grant 2147292, funding from Adobe Research, the Portuguese Recovery and Resilience Plan through project C645008882-00000055 (Center for Responsible AI) and by Fundação para a Ciência e Tecnologia through contract UIDB/50008/2020. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.
Notes
1. Collected annotations and code are released at https://github.com/sweta20/ATS-EVAL.git.
2. We use the terms “adequacy” and “meaning preservation” interchangeably throughout this paper to convey whether the information from the original is preserved in the simplified text.
3. We do not use the gold or distractor spans in the evaluation study or when generating the TS outputs.
4. 70% of the passages have critical spans (over the three questions) covering at least 60% of the passage, showing that the questions generally cover most information conveyed in the original text.
8. The control tokens are added to the beginning of the input, acting as side constraints (Sennrich et al., 2016), and specify the text transformation, such as compression (via the length ratio between the source and the target) or degree of paraphrasing (via the character-level Levenshtein similarity). Please refer to Martin et al. (2020) for more details.
12. SARI measures lexical simplification based on the words that are added, deleted, and kept by the systems, comparing the system output against references and the input text.
Author notes
Work done while at the University of Maryland.
Action Editor: Ehud Reiter