Learning to Paraphrase Sentences to Different Complexity Levels

Abstract While sentence simplification is an active research topic in NLP, its adjacent tasks of sentence complexification and same-level paraphrasing are not. To train models on all three tasks, we present two new unsupervised datasets. We compare these datasets, one labeled by a weak classifier and the other by a rule-based approach, with a single supervised dataset. Using these three datasets for training, we perform extensive experiments on both multitasking and prompting strategies. Compared to other systems trained on unsupervised parallel data, models trained on our weak classifier labeled dataset achieve state-of-the-art performance on the ASSET simplification benchmark. Our models also outperform previous work on sentence-level targeting. Finally, we establish how a handful of Large Language Models perform on these tasks under a zero-shot setting.


Introduction
Paraphrasing a sentence to a targeted level of complexity is a natural language processing task that has not received much attention. Most work focuses solely on sentence simplification: decreasing the syntactic and lexical complexity of a sentence in order to make it easier to understand while preserving its original meaning (Siddharthan, 2002, 2006; Zhu et al., 2010; Woodsend and Lapata, 2011; Xu et al., 2015; Zhang and Lapata, 2017; Alva-Manchego et al., 2020b). This task has applications for second language (L2) learners and people with neural conditions that impede their reading comprehension abilities (Alva-Manchego et al., 2020b). There has been limited work on sentence complexification, which is the exact opposite of sentence simplification: increasing the syntactic and lexical complexity of a given sentence (Berov and Standvoss, 2018).
As far as we know, there has not been any work done on same-level paraphrasing, which we define as paraphrasing a given sentence without changing its complexity level. However, all three tasks have important potential applications in computer-assisted language learning.
Services like Grammarly and LinggleWrite (Tsai et al., 2020) aim to correct grammatical and lexical writing errors, especially for L2 learners. Others aim to generate example usage sentences for new words (Huang et al., 2017), as well as suggest potential paraphrases of learners' sentences in order to improve the diversity of their writing (Chen et al., 2015). In addition to suggesting general paraphrase rewrites, the online writing assistant WordTune allows users to control both the length (correlated to complexity) and formality level of its paraphrase suggestions (Zhao, 2022).
Despite the commercial existence of these paraphrasing systems, to the best of our knowledge, there has been no academic work on paraphrasing to different complexity levels. Writing assistants and general language learning systems could benefit from this. A learner might want to see more concise ways of expressing their ideas (simplifications), more advanced or idiomatic ways of expressing them (complexifications), or suggestions that match their writing level (same-level paraphrases). We present models for all three tasks. For these tasks, we construct two automatically labeled (unsupervised) datasets and compare them to one human-labeled (supervised) dataset.
Our first automatic labeling method is rule-based according to Flesch-Kincaid Grade Level (FKGL). FKGL can be calculated automatically as a weighted score consisting of sentence length and syllable information (Kincaid et al., 1975).
• Our CEFR-labeled ParaNMT dataset produces state-of-the-art results on the ASSET simplification benchmark for models trained on unsupervised parallel data.
• Our absolute prompting models outperform previous level targeting work on the Newsela-Manual benchmark.
• We release our ParaNMT data, CEFR classifier, and best fine-tuned paraphrasing models to the public. We also release the CEFR-CEP test data used for human evaluation. The source dataset is publicly available on EVP, EGP, and Cambridge websites and can be obtained via their data request process.

Related Work

Sentence Complexity Classification
Much work has been done on complexity level classification as a component of Automatic Readability Assessment, but it has mostly focused on the document level (Xia et al., 2016; Lee et al., 2021) and not the sentence level due to a shortage of sentence-level datasets. In English, data from Newsela, which contains articles that have been manually simplified to four different target levels, has been widely used (Xu et al., 2015; Lee et al., 2021; Lee and Vajjala, 2022). Newsela sentence levels can be automatically derived for sentence-level research. However, since Newsela levels (US grade ranges) are assigned per document, not every sentence's level corresponds to its document's level.
The OneStopEnglish corpus (Vajjala and Lučić, 2018), which consists of sentences and documents labeled at three ESL levels, is also widely used.
Since readability is highly subjective and dependent on a specific audience or set of standards, it is difficult to apply a single readability assessment scheme to a variety of domains. Lee and Vajjala (2022)'s pairwise ranking model has made progress on this, demonstrating strong accuracy on out-of-domain (OOD) data.

Changing Sentence Complexity
Most work in changing sentence complexity focuses on lowering sentence level to specific grades. The Newsela corpus (Xu et al., 2015; Jiang et al., 2020) has been used to train controlled simplification models that target a given level (Scarton and Specia, 2018; Agrawal and Carpuat, 2019; Nishihara et al., 2019; Kew and Ebling, 2022; Tani et al., 2022). To our knowledge, there have been three previous attempts at sentence complexification, also known as text or discourse embellishment. Berov and Standvoss (2018) introduce the task and train an LSTM on a story corpus and the inverse of a simplification corpus, WikiLarge, which contains aligned sentence pairs from English and Simple English Wikipedia articles (Zhang and Lapata, 2017). Naskar et al. (2019) also use WikiLarge. More recently, Sun et al. (2023) train BART (Lewis et al., 2020) on reversed simplification sentence pairs from Newsela. There has been no previous work on same-level paraphrasing.

Sentence Simplification
Supervised Data Many sentence simplification systems adopt the architecture of machine translation, requiring complex-simple sentence pairs to train (Zhu et al., 2010; Wubben et al., 2012; Narayan and Gardent, 2014; Zhang and Lapata, 2017; Alva-Manchego et al., 2020b). WikiLarge (Zhang and Lapata, 2017), described in Section 2.2, has been widely used. Models trained on this dataset can be easily applied to test sets that source their data from Wikipedia such as ASSET (Alva-Manchego et al., 2020a) and the Turk Corpus (Xu et al., 2016). Newsela, also described in Section 2.2, has been a popular source for sentence simplification datasets (Xu et al., 2015; Zhang and Lapata, 2017). Jiang et al. (2020) present a sentence alignment model to generate the larger datasets of Wiki-Auto and Newsela-Auto. Their human annotators also developed the smaller Newsela-Manual dataset. Although most of the aforementioned corpora contain sentences that are automatically aligned, they are still considered supervised because the text was simplified by humans.
Unsupervised Data Since there are few supervised datasets, methods have been proposed to generate unsupervised datasets, which often consist of mined paraphrases. Backtranslation, or translating a sentence into another language and then back into the original language, has been used to generate paraphrases (Lu et al., 2021). Other work has used heuristics like embedding similarity to mine semantically similar sentence pairs (Martin et al., 2020b). An effective way of training on unsupervised parallel data is the use of control tokens to allow models to home in on features that correlate with sentence simplicity. For example, the ACCESS method prepends tokens that specify output length, similarity of output and input, output word rank, and output tree depth to the beginning of each input sentence (Martin et al., 2020a). As these tokens are by default prepended in plain text before tokenization, they are functionally a form of prompt learning.
Multitask Learning Multitask learning has proven useful for overcoming lack of data and improving simplification quality. Entailment (Guo et al., 2018), paraphrase generation (Guo et al., 2018; Maddela et al., 2021), copy prediction (Maddela et al., 2021), translation (Agrawal and Carpuat, 2019; Mallinson et al., 2020), and summarization (Dmitrieva and Tiedemann, 2020) have all been used as auxiliary tasks for simplification models. It has been shown in the past that training a model on multiple very similar tasks can improve its performance on each individual task (Ratner et al., 2018; Liu et al., 2019). Although simplification, complexification, and same-level paraphrasing belong to the same general task of changing sentence complexity, training a multitask model with all three has not previously been attempted. The use of prompts for both training and inference has proven particularly useful for multitasking with pretrained models. Scialom et al. (2022) fine-tune a T5 model with eight new tasks, including sentence simplification, with prompts either prepended to the input text or embedded as part of a template depending on the task.
Inference with Large Language Models Research has been done on whether LLMs can simplify text without further training. Feng et al. (2023) show that GPT-3.5-Turbo produces a SARI score of 44.67 for zero-shot prompting and 47.06 for single-shot prompting, surpassing previous state-of-the-art scores. Ryan et al. (2023) find that BLOOM (Workshop et al., 2023) achieves high meaning preservation and fluency but fails to simplify as well as smaller fine-tuned models. Aumiller and Gertz (2022) use an ensemble of prompts on GPT-3 (Brown et al., 2020), producing state-of-the-art results for lexical simplification specifically.

CEFR Level Classification
In order to automatically label paraphrase data with complexity levels, we first train a sentence level classification model. In theory, any of the few English sentence-level readability datasets can be used for training. However, CEFR-SP (Arase et al., 2022) and Newsela (Xu et al., 2015) may contain data that we use for training and testing our later paraphrasing models, so we do not use either of those. The other option, OneStopEnglish (Vajjala and Lučić, 2018), has very few sentence pairs, and upon inspection, we find that its simplest level appears more complex than CEFR A1. Therefore, we create a new CEFR-labeled corpus for our needs, CEFR-CEP.

Data
We combine data from the English Profile and Cambridge Dictionary. Our main source, English Profile (Capel, 2012), contains CEFR levels that map to word senses or grammar concepts. It contains two searchable databases, English Vocabulary Profile (EVP) and English Grammar Profile (EGP). Each entry in EVP corresponds to a word, and each of its possible definitions (word senses) is marked with its CEFR level along with one or more example usage sentences or phrases from either a real learner or a dictionary. EVP words, but not example sentences, have been used in the past to create lexical simplification datasets (Uchida et al., 2018; Fujinuma and Hagiwara, 2021). EGP and the Cambridge Dictionary are structured similarly to EVP, containing CEFR levels and examples for grammar concepts and word senses respectively. We automatically label these EVP, EGP, and dictionary examples with their entries' CEFR levels. We eliminate any duplicates from our combined dataset. Further details about CEFR-CEP are shown in Table 1. This method assumes that for each word sense or grammar concept, its example sentences/phrases match its CEFR level. This is likely false some of the time. However, analysis of the CEFR-CEP sentences shows that our assumed CEFR levels correlate strongly with other metrics associated with sentence complexity: word count, tree depth, and FKGL, as shown in Figure 1.

Model
On CEFR-CEP, we train a BERT classifier (Devlin et al., 2019) in addition to SVM and LSTM baselines, with an 80-10-10 train-validation-test split. The BERT-base-cased [CLS] token embedding serves as the sentence representation and the input to our classifier, which is made up of one linear layer and trained with cross-entropy loss as in previous work (Arase et al., 2022). Its outputs are softmax probabilities for each of the six CEFR levels, and we use an Adam optimizer (Kingma and Ba, 2015) with the best learning rate of 3e-5. In addition to the BERT model, we train two baselines on the same data. The first is a Support Vector Machine (SVM) classifier with Term Frequency-Inverse Document Frequency (TF-IDF) for its embeddings and a Radial Basis Function kernel (Scholkopf et al., 1997). We use the optimal cost and gamma hyperparameters of 10 and 1 respectively. We also train an LSTM classifier with a single dense layer and Word2Vec Google News vectors (Mikolov et al., 2013) as its embedding layer, Adam optimization with an optimal learning rate of 4e-3, softmax activation, and cross-entropy loss.

Evaluation
We perform automatic evaluation on our held-out CEFR-CEP test data with four evaluation metrics. Our F1 scores are weighted to take label imbalance into account.
• 6-Level F1 (6-F1): The prediction F1 for the six CEFR levels.
• 3-Level F1 (3-F1): The prediction F1 for the three CEFR levels A, B, and C.
• Adjacent Accuracy (Adj-Acc): The percentage of predictions whose deviation from the test label is less than or equal to one.
• Mean Absolute Error (MAE): A number between 0 and 5. The average amount that the prediction deviates from the test label.
Table 2 shows the results for each metric on the baseline and BERT models. For every metric, the BERT model performs better. But 6-F1 is only 59.78%, and we posit that an exact match with the dataset CEFR level is so difficult to obtain because of the dataset flaws mentioned in Section 3.1.
Since we will use our classifier to add CEFR labels to the OOD ParaNMT dataset, we conduct a study to see to what extent its labels match human labels on the ParaNMT data. On our preprocessed ParaNMT set (see Section 4.2 for details), we sample 60 sentence pairs: 20 where the classified levels are the same and 40 where the classified levels differ by at least two (e.g. A2-B2 but not A2-B1). We split the different-level pairs into two groups: simplification, where the higher level sentence comes first, and complexification, where the lower level one does. We then ask four native English speakers to examine each sentence pair and label which sentence is simpler: the first, the second, or neither. These three labels map to the categories of complexification, simplification, and same-level paraphrasing respectively.
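The Adj-Acc and MAE metrics defined in the list above can be sketched in a few lines. This is only an illustrative sketch: the integer encoding of CEFR levels (A1 = 0 through C2 = 5) and all function names are assumptions made here, not details taken from the released classifier code.

```python
# Assumed encoding for this sketch: CEFR levels mapped to integers 0..5.
CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]
LEVEL_TO_INT = {level: i for i, level in enumerate(CEFR_LEVELS)}


def adjacent_accuracy(gold, pred):
    """Share of predictions that deviate from the gold label by at most one level."""
    hits = sum(abs(LEVEL_TO_INT[g] - LEVEL_TO_INT[p]) <= 1 for g, p in zip(gold, pred))
    return hits / len(gold)


def mean_absolute_error(gold, pred):
    """Average absolute deviation in levels, a number between 0 and 5."""
    total = sum(abs(LEVEL_TO_INT[g] - LEVEL_TO_INT[p]) for g, p in zip(gold, pred))
    return total / len(gold)


def to_three_levels(labels):
    """Collapse the six CEFR levels into the bands A, B, and C (as used for 3-F1)."""
    return [label[0] for label in labels]


# Made-up labels, purely for illustration.
gold = ["A1", "B1", "C2", "B2"]
pred = ["A2", "B1", "B2", "C1"]
print(adjacent_accuracy(gold, pred))    # 0.75
print(mean_absolute_error(gold, pred))  # 1.0
print(to_three_levels(pred))            # ['A', 'B', 'B', 'C']
```

The weighted F1 scores would additionally need per-class precision and recall, which this sketch leaves out.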
Inter-rater agreement, or nominal Krippendorff's Alpha (Krippendorff, 2011), is a fairly low 0.27, where 0 means no agreement (chance) and 1 means perfect agreement. Because we want to evaluate on only reliable labels, we consider only the sentence pairs where three or more of the raters agree. These amount to 39 out of 60 pairs, with agreement of 0.48. We test both our CEFR classifier and FKGL on these 39 gold labels.
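The rule of keeping only pairs on which at least three of the four raters agree could be implemented roughly as follows. The label strings and the helper name `majority_label` are hypothetical, and the ratings shown are invented for illustration.

```python
from collections import Counter


def majority_label(ratings, min_agree=3):
    """Return the label chosen by at least `min_agree` raters, or None otherwise."""
    label, count = Counter(ratings).most_common(1)[0]
    return label if count >= min_agree else None


# Each inner list holds the four raters' answers to "which sentence is simpler?"
ratings_per_pair = [
    ["first", "first", "first", "second"],        # 3 of 4 agree: kept
    ["first", "second", "neither", "neither"],    # no label reaches 3: dropped
    ["neither", "neither", "neither", "neither"], # unanimous: kept
]
gold_labels = [majority_label(r) for r in ratings_per_pair]
reliable = [g for g in gold_labels if g is not None]
print(gold_labels)  # ['first', None, 'neither']
```

With four raters and a threshold of three, at most one label can qualify, so no tie-breaking is needed.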
We compare our CEFR classifier's predictions with those of FKGL on the gold labels for each of the three categories of simplification, complexification, and same-level paraphrasing. FKGL performs better for classifying complexification and same-level paraphrasing, while CEFR classification performs better for simplification. However, F1 is universally low, casting doubt on the reliability of our weak labeling approaches. Our gold human labels are also potentially problematic: of the 60 sentence pairs, only six rated as same-level paraphrasing met our criterion of three out of four raters agreeing, compared to 15 and 18 for simplification and complexification respectively. From these results, we tentatively hypothesize that sentence simplification models trained on data labeled by the CEFR classifier will perform better than those trained on FKGL-labeled data, while complexification and same-level paraphrasing models trained on FKGL-labeled data will perform better than those trained on CEFR-labeled data.

Paraphrasing Data
Next, we construct datasets for simplification, complexification, and same-level paraphrasing. Details are included in Table 4.

Supervised Data
Our supervised data source is Newsela-Auto (data can be requested at https://newsela.com/data), a sentence simplification corpus derived from Newsela news articles targeted at five levels and written by education professionals, where level 0 is the complex original and 1-4 are simplifications of increasing degree (Xu et al., 2015). The sentences of these articles must be aligned to create a sentence pair corpus. Previous methods have aligned using metrics like Jaccard similarity (Zhang and Lapata, 2017). Newsela-Auto's pairs are aligned according to a neural CRF model (Jiang et al., 2020), and its pairs are more numerous (666k) and more creatively rewritten than previous Newsela alignments. Newsela-Auto does not contain level labels, so we use string matching with the original Newsela to find each sentence's level (Xu et al., 2015).
A limitation of Newsela-Auto and other simplification datasets like WikiLarge (Zhang and Lapata, 2017) and Wiki-Auto (Jiang et al., 2020) is that they are only meant to contain different-level pairs. Therefore, we only conduct simplification and complexification experiments on this dataset. For the two-task dataset, we flip the order of exactly half of the sentence pairs. For the two single-task datasets, we extract all simplification and complexification pairs from the two-task dataset but perform an additional filtering step of removing all pairs that were labeled as the same level according to our retroactive labeling algorithm. These same-level pairs number only a few thousand and are not enough to train a comparable same-level paraphrasing model.

Unsupervised Data
To contrast with our supervised dataset and fill the gap of missing same-level paraphrase pairs, we create two unsupervised datasets. We use ParaNMT, one of the largest paraphrase pair datasets available to the public, with 50 million sentence pairs generated through backtranslation of the CzEng 1.6 corpus (Wieting and Gimpel, 2018). It contains data sourced from movie and educational video subtitles, European legislation proceedings, and medical websites (Bojar et al., 2016). ParaNMT has been used for sentence simplification in the past (Martin et al., 2020b).
To determine our filtering techniques, we inspect samples from the corpus and find pairs that are identical or almost identical, very different in meaning, or that contain incomplete sentences. To alleviate these problems, we remove pairs where one sentence is contained in the other or where either sentence has fewer than three words.
To encourage our models not to directly copy the input sentence, a problem that occurs in both sentence simplification (Dong et al., 2019) and paraphrase generation (Thompson and Post, 2020), we only include aggressive paraphrases. We remove pairs where Sentence-BERT cosine similarity (Reimers and Gurevych, 2019) is below 60% or above 80%. From our observations, these thresholds exclude pairs that are different in meaning or too similarly phrased.
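The filtering steps described in the last two paragraphs can be summarized as a single predicate. The sketch below is an assumption-laden illustration: `keep_pair` and `toy_similarity` are hypothetical names, containment is taken to mean a plain case-sensitive substring test, words are split on whitespace, and the word-overlap similarity is only a dependency-free stand-in for Sentence-BERT cosine similarity.

```python
def keep_pair(src, tgt, similarity, low=0.60, high=0.80, min_words=3):
    """Return True if a paraphrase pair passes all three filters."""
    # Filter 1: drop pairs where either sentence has fewer than `min_words` words.
    if len(src.split()) < min_words or len(tgt.split()) < min_words:
        return False
    # Filter 2: drop pairs where one sentence is contained in the other
    # (this also removes identical pairs).
    if src in tgt or tgt in src:
        return False
    # Filter 3: keep only pairs whose similarity is neither below `low`
    # nor above `high`.
    return low <= similarity(src, tgt) <= high


def toy_similarity(a, b):
    """Stand-in scorer (Jaccard word overlap); NOT Sentence-BERT."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    return len(words_a & words_b) / len(words_a | words_b)


pairs = [
    ("he went to the store", "he went to the store today"),  # containment
    ("too short", "this one is long enough"),                # fewer than 3 words
    ("she quickly ran to the old house", "she ran fast to the old home"),
]
kept = [p for p in pairs if keep_pair(p[0], p[1], toy_similarity)]
print(len(kept))
```

In practice, the `similarity` argument would wrap a sentence embedding model; passing it in as a function keeps the filter logic independent of that choice.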
We want ParaNMT-CEFR and ParaNMT-FKGL to be as similar as possible for the sake of comparison. From our filtered data, we use the CEFR classifier to label the level of each sentence. To maximize the likelihood that a level difference between the two sentences exists (see Table 2's Adj-Acc), we only select pairs where the level difference is two or greater. For the same-level dataset, we select pairs where the sentences are classified as exactly the same level. We are left with 2,575,589 different-level pairs and 6,207,876 same-level pairs. For both the CEFR-based and FKGL-based labeling schemes, we derive all of our simplification, complexification, and same-level paraphrasing data from these two sets. For ParaNMT-CEFR, we halve the different-level dataset and re-order it to create one simplification and one complexification dataset. We then sample from the same-level pairs to get an equal-sized same-level set. To create ParaNMT-FKGL, we calculate the FKGL of each sentence (rounded to two decimal places). If the FKGL of the two sentences in a pair differs at all, we consider it a different-level pair. If it is exactly the same, we consider it a same-level pair. We are able to derive 65.16% of our different-level pairs from the ParaNMT-CEFR different-level set. The other 878,449 are taken from the ParaNMT-CEFR same-level pairs. We sample from the resulting data to match ParaNMT-CEFR in size. The train-validation-test split is 80-10-10 for both ParaNMT datasets. We have made these data available to the public.
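The two labeling schemes above can be contrasted in a short sketch. The FKGL coefficients are those of the standard formula, 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59, applied here to a single sentence. The syllable counter is a crude vowel-group heuristic assumed only for this example, as are the function names and the integer encoding of CEFR levels; a real pipeline would use an established FKGL implementation.

```python
import re

# Assumed integer encoding of CEFR levels for this sketch.
CEFR_TO_INT = {"A1": 0, "A2": 1, "B1": 2, "B2": 3, "C1": 4, "C2": 5}


def count_syllables(word):
    """Crude heuristic: number of vowel groups, with a minimum of one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def fkgl(sentence):
    """FKGL of a single sentence, rounded to two decimal places."""
    words = re.findall(r"[A-Za-z']+", sentence)
    if not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    # One sentence, so words-per-sentence is simply the word count.
    score = 0.39 * len(words) + 11.8 * (syllables / len(words)) - 15.59
    return round(score, 2)


def label_pair_cefr(level_a, level_b):
    """CEFR scheme: difference >= 2 is different-level, 0 is same-level."""
    diff = abs(CEFR_TO_INT[level_a] - CEFR_TO_INT[level_b])
    if diff >= 2:
        return "different"
    if diff == 0:
        return "same"
    return None  # adjacent levels are not selected


def label_pair_fkgl(sent_a, sent_b):
    """FKGL scheme: any difference in rounded FKGL is different-level."""
    return "same" if fkgl(sent_a) == fkgl(sent_b) else "different"


print(label_pair_cefr("A2", "B2"))  # different
print(label_pair_cefr("A2", "B1"))  # None
print(label_pair_fkgl("The cat sat.", "The dog ran."))  # same
```

The sketch makes the asymmetry between the schemes visible: the CEFR rule discards adjacent-level pairs, whereas the FKGL rule assigns every pair to one of the two sets.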

Paraphrasing Experiments
We train models on the three tasks of sentence simplification, sentence complexification, and same-level paraphrasing.

Models
For all models, we use a single NVIDIA GPU, a batch size of 32 after gradient accumulation, and a maximum decoding length of 300 tokens. We fine-tune 34 ablations on T5 (Raffel et al., 2020), a pretrained transformer. We also perform limited experiments with Flan-T5-base (Chung et al., 2022), a more recent instruction-tuned version of T5. We train ParaNMT-CEFR single-task and 2-task simplification and complexification ablations (6 models). However, since we find in Section 6.1.4 that Flan-T5 does not perform as well as T5, we focus our main experiments on T5.

Prompting Strategies
At inference time, we prepend the corresponding prompt to the beginning of each input sentence, as this strategy was used for T5 (Raffel et al., 2020).
Relative Simplification, complexification, and same-level paraphrasing correspond exactly to the prompts "level down: ", "level up: ", and "same level: ".We train on the data of one, two, or all three tasks, adding the corresponding task prompt to the front of each input sentence.We call this relative (REL) prompting because the prompt denotes the relative difference between the levels of the input and output sentence: down, up, or same.This scheme has 7 possible task combinations.
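As a minimal sketch of REL prompting, the snippet below prepends the three prompts quoted above and enumerates the seven task combinations. The helper name `add_rel_prompt` is an assumption made here; the u/d/s abbreviations follow the task notation used in the tables and figures.

```python
from itertools import combinations

# Prompts as quoted in the text, keyed by task: up, down, same.
REL_PROMPTS = {"u": "level up: ", "d": "level down: ", "s": "same level: "}


def add_rel_prompt(sentence, task):
    """Prepend the relative prompt for `task` to the input sentence."""
    return REL_PROMPTS[task] + sentence


# Every non-empty subset of the three tasks is one training configuration.
task_combinations = [
    "-".join(combo)
    for size in (1, 2, 3)
    for combo in combinations(("u", "d", "s"), size)
]

print(add_rel_prompt("The state capital is Aracaju.", "u"))
print(task_combinations)  # ['u', 'd', 's', 'u-d', 'u-s', 'd-s', 'u-d-s']
```

The count of seven follows from 2^3 − 1 non-empty subsets of three tasks.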
Absolute For each task combination besides single-task same-level paraphrasing, we use prompts that specify absolute (ABS) output level.
For training, we insert "change to level X: ", where X is the level of the output (see footnote 23). ABS prompting theoretically has an advantage over REL prompting because we can change the prompt to match the level of a test dataset's output sentence. To compare the two prompting strategies on equal footing, we remove this advantage. With the exception of Section 6.1.6, for ABS prompting inference, we use the same prompt for every test input no matter the output level. Therefore, we can only evaluate these models on simplification and complexification and not on same-level paraphrasing.
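The difference between ABS prompting at training time and at inference time can be illustrated as follows. The helper name `add_abs_prompt` and the training pair are invented for this sketch; only the prompt template comes from the text.

```python
def add_abs_prompt(sentence, level):
    """Prepend an absolute-level prompt; `level` may be e.g. 'B', 11.0, or 2."""
    return f"change to level {level}: {sentence}"


# Training: each input carries the level of its own target sentence.
# (Hypothetical example pair: source, target, target level.)
train_pairs = [("A big dog ran.", "A sizeable canine sprinted.", "C")]
train_inputs = [add_abs_prompt(src, level) for src, _target, level in train_pairs]

# Inference: one fixed level is used for every test input,
# regardless of the reference output's level.
FIXED_LEVEL = "B"
test_sentences = ["The state capital is Aracaju."]
test_inputs = [add_abs_prompt(s, FIXED_LEVEL) for s in test_sentences]

print(train_inputs[0])  # change to level C: A big dog ran.
print(test_inputs[0])   # change to level B: The state capital is Aracaju.
```

Fixing the level at inference is what removes the information advantage that ABS prompting would otherwise have over REL prompting.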

Baselines
We train two paraphrasing baselines, the first on the entire ParaNMT-CEFR dataset and the second on ParaNMT-FKGL. Each dataset consists of one third simplification data, one third complexification data, and one third same-level paraphrasing data, but at train time, we use the prompt "paraphrase: " for each input (see footnote 24).

Paraphrasing Evaluation
We perform both automatic and human evaluation. To compare all 40 experiment models, we only report automatic evaluation results. We perform human evaluation on just one model per task.

Automatic Evaluation
We first discuss each individual task. Then, we discuss our ablation results more generally.
Evaluation metrics We report SARI and FKGL, using the EASSE Python library to compare with previous sentence simplification research (Alva-Manchego et al., 2019).
• SARI (System output Against References and against the Input sentence) is the most important automatic metric for text simplification. Ranging from zero to 100, it represents the F1 for a model's added, kept, and deleted n-grams when compared with the input and reference sentences (Xu et al., 2016).
• FKGL (Flesch-Kincaid Grade Level) is a weighted score with sentence length and syllable information (Kincaid et al., 1975). It was introduced in Section 1. We consider the best FKGL score to be that closest to the gold reference FKGL in a given test set.
Footnote 23: For ParaNMT-CEFR, X is the CEFR level A/B/C. For ParaNMT-FKGL, X is FKGL rounded to two decimal places. And for Newsela-Auto, X is one of the Newsela levels 0-4.
Footnote 24: We also train an LSTM baseline per task per ParaNMT dataset using REL prompting, but we do not report the results because they do not add to the analysis.
Evaluation data For simplification and complexification, we use the ASSET and Newsela-Manual test sets. These simplification benchmarks can be easily reversed for the complexification task. There are no existing benchmarks that can be straightforwardly applied to same-level paraphrasing. Therefore, we use sentence pairs from the ParaNMT corpus. In all tables and figures, we denote task type as u/d/s for up (complexification), down (simplification), and same.
• ASSET has 359 test sentences, each with 10 human-written reference sentences (Alva-Manchego et al., 2020a). For simplification, we use this dataset as-is. For complexification, we consider each reference sentence to be an input and the corresponding test sentence to be an output, resulting in 3,590 one-to-one pairings.
• Newsela-Manual contains Newsela sentence pairs where each pair is annotated as aligned, partially aligned, or not aligned (Jiang et al., 2020). We collect all aligned and partially aligned pairs and follow Kew and Ebling (2022)'s method to automatically fix the alignments between partially aligned pairs. We include pairs from all input levels to all output levels and remove pairs where the output is an exact copy of the input, resulting in 2,748 pairs.
• Newsela-Manual by Level contains sentences where the complex level 0 maps to each of the simpler levels 1-4. To evaluate our models' level targeting ability, we use the same configuration as Kew and Ebling (2022), which does not filter out input-output copies. We also create a complexification version where the simple input is level 4 and the four possible outputs are levels 3-0.
• ... 128,779 pairs (see footnote 28). This corpus is inherently noisy due to its unsupervised nature. We hope that in future work, a cleaner same-level paraphrasing dataset with human labels will be available.
[Table caption fragment] ... (Martin et al., 2020a). Lu et al. (2021) create their own corpus via backtranslation. For ABS models, we enclose in parentheses the target level we used for prompting at inference time. T5-FKGL-d-ABS uses 3.0 for ASSET and 0.0 for Newsela-Manual.
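Reversing ASSET for complexification, as described in the first bullet above, amounts to flattening each one-to-many entry into one-to-one pairs with the direction swapped. The sketch below uses placeholder strings rather than real ASSET sentences, and the function name is hypothetical.

```python
def reverse_for_complexification(test_set):
    """Turn (source, [references]) entries into (reference, source) pairs.

    Each simple reference becomes an input and the original complex
    sentence becomes its single target output.
    """
    return [(ref, source) for source, refs in test_set for ref in refs]


# Placeholder data with the same shape as ASSET: 359 sources, 10 references each.
mock_asset = [
    (f"complex sentence {i}", [f"simple reference {i}-{j}" for j in range(10)])
    for i in range(359)
]
reversed_pairs = reverse_for_complexification(mock_asset)
print(len(reversed_pairs))  # 3590
print(reversed_pairs[0])    # ('simple reference 0-0', 'complex sentence 0')
```

Because each reversed input has only one target, multi-reference metrics such as SARI are computed against a single reference in this setting.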

Simplification Results
We report results in Table 6 on both the ASSET and Newsela-Manual test sets. Besides baselines, we divide the table into two sections, one for models trained on unsupervised data and the other for supervised data. We only report our two best performing ablations per training dataset. For ABS prompting CEFR and Newsela-Auto (News) models, we try all possible prompts. For FKGL models, we try a range of prompts (0.0-7.0) and pick the best ones. For MUSS models, which are open source (Martin et al., 2020b), we report their best scores on ASSET and do our own parameter search on the Newsela-Manual validation set to derive optimal prompts. On both benchmarks, all models outperform baselines in SARI score. We achieve a new state-of-the-art for unsupervised parallel data, with the highest SARI score of 43.65 on ASSET going to T5-CEFR-u-d-ABS (prompt B). Our supervised model T5-News-u-d-REL has the highest SARI score on the Newsela-Manual benchmark, outperforming baselines and MUSS.
Footnote 28: There is no overlap between our resulting test set and either of the training or validation sets.

Complexification Results
We report SARI and FKGL on reversed ASSET and reversed Newsela-Manual. Table 7 contains results arranged in the same way as for simplification. For FKGL prompts, we try a range of 10.0-17.0. We do a grid search to find MUSS parameters. MUSS-mined almost matches our best performing model's SARI on ASSET, even beating its supervised data counterpart and Sun et al. (2023)'s ComplexBART. But it falls well short of our best models on Newsela-Manual. Between ParaNMT-CEFR and ParaNMT-FKGL models, the latter produce the highest SARI on ASSET and highest FKGL on both test sets. However, after inspecting model outputs, we find that for every FKGL model whose SARI surpasses our highest ParaNMT-CEFR SARI score, the outputs contain many degenerate repetitions. For example, consider the simple ASSET input sentence "The state capital is Aracaju." T5-CEFR-u-s-ABS with prompt C produces the slightly longer sentence "the capital of the state is Aracaju." But T5-FKGL-u-s-ABS with prompt 11.0 produces a 295-word output starting with "the capital of the state is Aracaju, the capital of the state is the capital of the state of the state of".
MUSS SARI also surpasses ParaNMT-CEFR on ASSET. However, MUSS outputs contain fewer degenerate repetitions according to an inspection of the outputs. We believe this quality difference is due to problems with the ParaNMT dataset that are exacerbated by organizing it by FKGL score, a length-based metric. The MUSS-mined training data contains human-written sentences that were mined according to similarity metrics (Martin et al., 2020b). ParaNMT, on the other hand, is the result of machine translation (Wieting and Gimpel, 2018), which can sometimes enter repetitive loops during decoding (Holtzman et al., 2019; Welleck et al., 2019). Future work on backtranslation datasets could attempt to filter out sentences that contain these repetitions.
We also find that degenerate repetitions are not adequately captured by SARI, which only counts unique n-grams that are added, kept, and deleted compared to the gold references (Xu et al., 2016; Alva-Manchego et al., 2019). This means that as long as a model's repetitions have added no or very few unique new words to the sentence, they will not be reflected in SARI. Therefore, we suggest that for sentence complexification, a modified SARI should be used that takes word counts into consideration. We leave this to future work.
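To see why a metric based on sets of unique n-grams overlooks such loops, consider the toy comparison below. It is not the modified SARI suggested above, which is left to future work; `repetition_ratio` is a hypothetical heuristic introduced only for illustration, and the looped output is constructed by hand rather than taken from a model.

```python
def unique_ngrams(tokens, n):
    """Set of distinct n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def repetition_ratio(tokens, n=2):
    """Fraction of n-gram occurrences that duplicate an n-gram seen elsewhere."""
    total = len(tokens) - n + 1
    if total <= 0:
        return 0.0
    return 1 - len(unique_ngrams(tokens, n)) / total


clean = "the capital of the state is Aracaju".split()
looped = clean * 3  # hand-made stand-in for a degenerate, repetitive output

# The unique unigram sets are identical, so a set-based comparison
# cannot tell the two outputs apart at the word level.
print(unique_ngrams(clean, 1) == unique_ngrams(looped, 1))  # True

# A count-aware statistic separates them clearly.
print(repetition_ratio(clean))   # 0.0
print(repetition_ratio(looped))  # 0.65
```

A count-aware statistic of this kind could also serve as a simple filter for repetitive sentences in backtranslated training data, under the same caveat that the threshold and n-gram order would need to be tuned.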

Same-level Paraphrasing Results
In Table 8, we report results for all of our baselines along with our best performing CEFR and FKGL models. Notably, both of our CEFR and FKGL paraphrasing baselines outperform their corresponding experiment models, which were trained on the exact same data, the only difference being prompting strategy. When we compare T5-CEFR-Para's outputs with those of T5-CEFR-u-d-s-REL, we find that after tokenization, the former copies the input 4.40% of the time, while the latter does so 10.42% of the time. The ParaNMT-s test set copies input 0.31% of the time after tokenization. Since we are unable to perform a quantitative human evaluation comparing these outputs, we are left with two possible theories. The first is that our T5 paraphrasing baselines are actually learning to same-level paraphrase. When presented with data where a third increases level, a third decreases level, and a third keeps level the same, the model picks the average option, which is same-level paraphrasing. The second theory is that the sentences in our same-level paraphrasing data are not actually the same level. After all, both our CEFR and FKGL methods in Section 3.3 have extremely low F1 on human labels for same-level paraphrasing: 12.5% for CEFR and 28.57% for FKGL. However, we doubt this theory because of our positive human evaluation results (see Section 6.2).
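The copy rates quoted above can be computed with a comparison of tokenized inputs and outputs. The sketch below assumes whitespace tokenization, since the tokenizer actually used is not specified here, and the function name and example sentences are hypothetical.

```python
def copy_rate(inputs, outputs, tokenize=str.split):
    """Percentage of outputs whose token sequence equals that of the input."""
    copies = sum(tokenize(i) == tokenize(o) for i, o in zip(inputs, outputs))
    return 100.0 * copies / len(inputs)


inputs = ["The state capital is Aracaju.", "He left early."]
outputs = ["The state  capital is Aracaju.", "He departed early."]
# The first output differs from its input only in spacing, so it counts as a copy.
print(copy_rate(inputs, outputs))  # 50.0
```

Comparing token sequences rather than raw strings avoids counting outputs as rewrites when they differ from the input only in whitespace.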

Flan-T5 Results
Table 9 shows SARI for the six Flan-T5 ablations we trained along with the best SARI scores from our T5 experiments on the same dataset of ParaNMT-CEFR. Interestingly, the best Flan-T5 scores never surpass the best from T5. And when directly comparing scores for each ablation, T5 outperforms Flan-T5 in 12 out of the 16 cases.
This may be surprising, as Flan-T5 performs better on a variety of tasks and benchmarks for zero-and few-shot inference (Chung et al., 2022).But Flan-T5 has not been shown to be better than T5 for fine-tuning on new datasets.We suspect that the reason for its degraded performance compared to T5 is that fine-tuning incurs catastrophic forgetting, diminishing the benefits gained from its previous instruction-tuning.While Scialom et al. (2022) report that T5 models can continually learn new tasks without catastrophic forgetting, rehearsal (Shin et al., 2017) is still required for the models to retain their previously learned skills.

Ablation Study Results
Figure 2 shows results for all T5 experiment models on all test sets, the x-axis being number of tasks per model and the y-axis being SARI score. Each data point is annotated with task combination.
Multitasking There is no clear winner among multitasking configurations. Single and two-task models often perform better than three-task ones, with the exception of same-level models, where SARI increases with the number of tasks. Many high-scoring two-task models were trained on tasks that are not opposite (i.e., u-s and d-s but not u-d). However, for simplification, the highest scoring models for ASSET and Newsela-Manual were both trained on the u-d ablation. For T5-News-u-d-REL, this is not noteworthy because REL prompts are distinct for each task (see Table 5). But strikingly, T5-CEFR-u-d-ABS scores best on ASSET with prompt B even though in theory, upon seeing the middle prompt B (as opposed to A or C), the model should not know whether to increase or decrease a sentence's complexity. Upon further investigation, we find that the reason for this is likely that the training dataset contains approximately twice as many C → B simplifications as A → B complexifications.
Prompt type For FKGL models, ABS prompting always performs better than REL prompting. For News models, ABS prompting performs better in all but one case. For CEFR models, results are mixed, but ABS prompting performs slightly better on average. Compared to CEFR and Newsela levels, FKGL is very fine-grained, with up to two decimal point precision. The fact that FKGL models always perform better for ABS prompting than for REL, while CEFR and News models do not, suggests that using prompts that contain very fine-grained output information might improve performance. Additionally, among just single-task models, ABS prompting always performs best, but this strategy is favored less and less as the number of tasks increases. This indicates that using a more complex prompting strategy incurs a greater performance cost as the number of tasks increases.
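To make the granularity of FKGL concrete, the sketch below computes the standard Flesch-Kincaid Grade Level, 0.39 × (words / sentences) + 11.8 × (syllables / words) − 15.59, and maps a score to a coarser interval. The syllable counter is a crude vowel-group heuristic, and `to_interval` with its two-point width is an illustrative assumption rather than the labeling procedure used in our experiments.

```python
import math
import re


def count_syllables(word: str) -> int:
    # Heuristic assumption: one syllable per group of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def fkgl(text: str) -> float:
    """Flesch-Kincaid Grade Level with a heuristic syllable count."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


def to_interval(score: float, width: float = 2.0) -> tuple:
    """Map a fine-grained score to the half-open interval [low, low + width)."""
    low = math.floor(score / width) * width
    return (low, low + width)


print(round(fkgl("The cat sat."), 2))  # -2.62
print(to_interval(6.49))               # (6.0, 8.0)
```

A two-decimal score such as 6.49 distinguishes many more targets than a three-way CEFR label, whereas the interval form trades that precision for fewer distinct prompts.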
Data labeling scheme As expected, models trained on Newsela-Auto perform better on Newsela-Manual than models trained on ParaNMT data. However, they mostly fail to achieve as high a SARI on non-Newsela data as ParaNMT models achieve on Newsela data, and they are some of the worst performing models on ASSET. For ABS prompting, FKGL models often outperform CEFR models on complexification, but for REL prompting, FKGL models almost universally do worse. For same-level paraphrasing, it is notable that ParaNMT-CEFR models have much higher SARI than ParaNMT-FKGL ones despite the fact that the ParaNMT-s test dataset is half ParaNMT-CEFR and half ParaNMT-FKGL. This, and the fact that complexification FKGL model outputs contain degenerate repetitions that SARI does not reflect, shows that the CEFR method is the most robust automatic labeling method. Future work could experiment with finer-grained CEFR labels (6, not 3) and less fine-grained FKGL labels (intervals instead of two decimal precision).
Table 10: Level targeting for simplification and complexification on Newsela-Manual. We compare our scores to supervised MUSS (Martin et al., 2020b). Our simplification model is T5-News-u-d-ABS. For each level, we display reference FKGL. See Table 6 for naming conventions.

Level Targeting Results
Table 10 shows our Newsela-Auto models' abilities to target specific levels for simplification and complexification. For brevity, we show results from only one of our models per table along with the best previous work baseline, supervised MUSS (Martin et al., 2020b), for which we derive optimal parameters via grid search. For every level, our models achieve higher SARI than previous work, with the exception of 0 → 2 simplification, where MUSS wins. However, it appears that our models are better at targeting aggressive simplifications and complexifications than slight ones: SARI generally increases as target level deviates further from input level. The results from Section 6.1.5 show that even when we are not using ABS prompting to its full strength, it often surpasses REL prompting in performance. These level-targeting results confirm that ABS prompting at its full strength does better.

Human Evaluation
We carry out a human evaluation on all three tasks. We use a 1-5 Likert scale across three separate categories: task performance, meaning preservation, and fluency. Due to limited resources, we choose just one model per task. We choose ParaNMT models for our evaluation. For simplification, T5-CEFR-u-d-ABS with prompt B scores best on ASSET, but due to the prompt B task ambiguity discussed in Section 6.1.5, we choose T5-CEFR-d-ABS with prompt B, which scores second best with a SARI of 43.63. For complexification, we use the highest scoring CEFR model, T5-CEFR-u-ABS with prompt C, even though some of the FKGL models have higher SARI scores on ASSET. This is because, as mentioned in Section 6.1, FKGL models produce numerous degenerate repetitions that do not hurt SARI score.
Finally, for same-level paraphrasing, we choose T5-CEFR-u-d-s-REL because of its highest SARI score on ParaNMT-s. Due to limited human evaluation resources, out of the three tasks, we only compare our simplification model to a baseline. We choose supervised MUSS (Martin et al., 2020b), a publicly available state-of-the-art model that we also used in Section 6.1. We use its best performing ASSET prompts. So as to directly compare the three tasks of simplification, complexification, and same-level paraphrasing on the exact same dataset, something not done in Section 6.1, we do not use a benchmark simplification dataset. We instead source data from the CEFR-CEP test set, which our paraphrasing models have not seen and our CEFR classifier has not been trained or validated on. However, because of this choice, there are no reference paraphrases to compare model outputs to, preventing us from using a reference baseline. We do not use any baseline because in the absence of a single one that fits all three tasks, it would require dramatically more labeling work.
From CEFR-CEP, we sample 13 sentences from each level A2-C1, amounting to 52 sentences that we release to the public. We exclude A1 and C2 because simplifying or complexifying those sentences may not have an effect. We then run each of the four models on these sentences, producing 208 outputs. Three native English speakers each rate all outputs. For each output, we average the ratings of the three evaluators. We then compute the 95% confidence interval for each model's rating category, along with inter-rater agreement using ordinal Krippendorff's Alpha (Krippendorff, 2011), where zero indicates chance-level agreement and one indicates perfect agreement.
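The aggregation of ratings described above can be sketched as follows. The example ratings are made up, and the interval uses a normal approximation (mean ± 1.96 standard errors), which is an assumption rather than a statement of the exact procedure used; Krippendorff's Alpha is not implemented here.

```python
import math
import statistics


def mean_with_ci95(per_output_means: list) -> tuple:
    """Return (mean, half-width) of a normal-approximation 95% confidence interval."""
    n = len(per_output_means)
    if n < 2:
        raise ValueError("need at least two outputs to estimate an interval")
    mean = statistics.fmean(per_output_means)
    standard_error = statistics.stdev(per_output_means) / math.sqrt(n)
    return mean, 1.96 * standard_error


# Hypothetical ratings: one row per model output, one column per rater (1-5 Likert).
ratings = [[5, 4, 5], [3, 4, 4], [4, 4, 4], [2, 3, 4]]

# Step 1: average the three raters for each output.
per_output = [statistics.fmean(row) for row in ratings]

# Step 2: summarize one rating category for one model.
mean, half_width = mean_with_ci95(per_output)
print(f"{mean:.2f} ± {half_width:.2f}")
```

In the evaluation itself, each model contributes 52 per-output averages per rating category, so the normal approximation is less of a concern than in this four-row toy example.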
Table 11 shows our results. For simplification, our model performs better than MUSS across all categories, especially meaning preservation. Across tasks, fluency is universally very high. This is a testament to the quality of these fine-tuned language models. Agreement is highest for meaning preservation, perhaps the most objective metric. We find that task performance is lowest for complexification, which is consistent with our intuition that this is the most difficult task, demanding the most additions and leaving the most room for error. Finally, same-level paraphrasing has the highest scores out of 5 compared to the other tasks, likely because it requires the least amount of modification. This is particularly interesting because our paraphrasing baseline T5-CEFR-Para outperformed this model according to SARI on ParaNMT-s, calling into question whether the task models were effective at all. We told our raters to dock task performance points when a model exactly copied its input, but upon inspection of their ratings, we find that this was applied very inconsistently. This may be why inter-rater agreement is extremely low for task performance.

Can LLMs Change Complexity Level?
In this section, we perform an exploratory investigation into the simplification, complexification, and same-level paraphrasing abilities of LLMs.

Data
For simplification and complexification, we use ASSET as in Section 6.1. For same-level paraphrasing, we randomly sample 400 sentence pairs from ParaNMT-s.

Models
For all models, we set temperature to 1.0 and limit output length to 50 tokens. We run inference in a zero-shot setting and leave an investigation into more sophisticated inference settings to future work. Due to hardware limitations, we are unable to run inference for models with more than 20 billion parameters. We mostly select instruction-tuned models because we expect them to do better with new tasks and prompts. We select five: GPT-3.5-Turbo, GPT-NeoX-20B (Black et al., 2022), Flan-UL2 (Tay et al., 2023), Flan-T5-xxl (Chung et al., 2022), and OPT-IML-MAX-1.3B (Iyer et al., 2023).

Prompts
As in our fine-tuning experiments, we attempt both ABS and REL prompting. However, in this case, we construct prompts with more descriptive wording to better fit the zero-shot setting. Table 12 shows the prompts for each task. To determine them, we try different wording with GPT-3.5-Turbo to check for obvious differences in behavior. We find that for complexification, explicitly telling the model to "increase the complexity" of a piece of text produces undesirably long outputs, but the wording "advanced English level" does not. We keep terminology consistent across prompts.

Results and Discussion
Table 13 shows results for each LLM and task, and Figure 3 shows SARI for each LLM per task and prompt type. On all tasks, GPT-3.5-Turbo outperforms the rest of the models by a large margin. None of the other models produce SARI scores that come close to the paraphrasing baselines from Tables 6, 7, and 8, much less the fine-tuned T5 scores. We confirm this by inspecting model outputs: all besides GPT-3.5-Turbo contain hallucinations. For example, in response to CEFR prompting (and FKGL to a lesser degree), Flan-T5-xxl and Flan-UL2 often return a single letter instead of a sentence as the output, while OPT-IML-MAX-1.3B and GPT-NeoX-20B attach discussions of the CEFR to their outputs. Despite the fact that the ABS prompting outputs contain more hallucinations than those from REL prompting, Figure 3 shows that ABS prompting generally produces higher SARI, echoing our findings from the fine-tuning experiments. For GPT-3.5-Turbo in particular, the ABS-CEFR prompt produces outputs with higher SARI for simplification than Feng et al. (2023)'s REL prompting score of 44.67 in the zero-shot setting.
Notably, although GPT-3.5-Turbo outperforms our fine-tuned models on simplification, it does not on complexification, demonstrating the difficulty of the task. Models perform the worst at same-level paraphrasing, but this may be due to the unsupervised same-level dataset being worse in quality than supervised ASSET.
The huge gap in performance between GPT-3.5-Turbo and the other models may be in part due to its size of 176B parameters being much larger than the next largest size of 20B. However, there is no obvious pattern regarding model size for the other four: for example, the smallest model, OPT-IML-MAX-1.3B, performs competitively with the two 20B-parameter models.

Conclusion
In this paper, we provide a general investigation of the task of changing sentence complexity, with thorough fine-tuning experiments and brief experiments with LLMs. For sentence simplification, our models surpass or are comparable to state-of-the-art systems. For sentence complexification and same-level paraphrasing, we set new benchmarks. We show that weak classification is an effective way to create strong unsupervised datasets and that target level absolute prompting is more effective than level direction relative prompting.
This research leaves opportunities for future work. For example, using a stronger level classifier to label paraphrase data might improve performance for the paraphrasing tasks. In the same vein, different filtering of ParaNMT or another paraphrasing dataset (Hu et al., 2019) could potentially be used. A human-labeled same-level paraphrasing test dataset does not yet exist, and a modified SARI metric that adequately penalizes repetitions is needed for sentence complexification. Our methods focus on English data, but they can be easily applied to other languages if a different classifier is trained (Khallaf and Sharoff, 2021; Vásquez-Rodríguez et al., 2022) and a non-English paraphrasing dataset is used (Martin et al., 2020b; Scherrer, 2020; Lu et al., 2021). Finally, a thorough investigation on how well LLMs can change sentence complexity is necessary.

Figure 2: All ablation results. Tasks abbreviated as u (up, complexification), d (down, simplification), and s (same, same-level paraphrasing). ASSET-d and News-d correspond to the original ASSET and Newsela-Manual sets. The -u indicates that they were reversed for complexification.

Figure 3: All SARI scores per model, task, and prompt.

Table 3 :
F1 of CEFR classifier vs. FKGL predictions on 39 human labels.
that we label each example text according to the level of its corresponding word sense or grammar concept, which is not always correct. But Adj-Acc is a high value of 90.64%, showing that our model has very close estimation, and the low MAE of 0.52 is consistent with this. Our SVM baseline scores similarly to the LSTM despite having much more information-rich embeddings.

Table 4 :
Paraphrasing dataset details.

Table 5 :
Prompt(s) for each task. For same-level paraphrasing single-task models, we only train REL prompt ablations. For simplification, complexification, and all two-task and three-task configurations, both REL and ABS prompt ablations are trained.

Table 6 :
Simplification on ASSET and Newsela-

Table 7 :
Complexification on ASSET and Newsela-Manual. See Table 6's caption for naming details. We obtained model weights and data for Sun et al. (2023)'s ComplexBART model and ran inference ourselves. However, since their Newsela training data overlaps with the Newsela-Manual test set, we only report ASSET scores for ComplexBART.

Table 9 :
Flan-T5 SARI for all trained ablations. For ABS models, the best prompt(s) are shown in parentheses. CEFR-ABS-u uses B for ASSET and C for News.

Table 12 :
Prompt(s) for each task. For CEFR ABS prompting, we use A for simplification and C for complexification. For FKGL ABS prompting, in two-point intervals, we try levels 0-6 for simplification and 8-14 for complexification.

Table 13 :
LLM results based on best SARI per model, tested on ASSET. For u and d, best prompts are included in the Model column. Reference FKGL is 6.49 for simplification, 10.46 for complexification, and 2.82 for same-level paraphrasing.