Abstract
Grammatical Error Correction (GEC) is the task of automatically detecting and correcting errors in text. The task not only includes the correction of grammatical errors, such as missing prepositions and mismatched subject–verb agreement, but also orthographic and semantic errors, such as misspellings and word choice errors, respectively. The field has seen significant progress in the last decade, motivated in part by a series of five shared tasks, which drove the development of rule-based methods, statistical classifiers, statistical machine translation, and finally neural machine translation systems, which represent the current dominant state of the art. In this survey paper, we condense the field into a single article and first outline some of the linguistic challenges of the task, introduce the most popular datasets that are available to researchers (for both English and other languages), and summarize the various methods and techniques that have been developed with a particular focus on artificial error generation. We next describe the many different approaches to evaluation as well as concerns surrounding metric reliability, especially in relation to subjective human judgments, before concluding with an overview of recent progress and suggestions for future work and remaining challenges. We hope that this survey will serve as a comprehensive resource for researchers who are new to the field or who want to be kept apprised of recent developments.
1 Introduction
Writing is a learned skill that is particularly challenging for non-native language users. We all make occasional mistakes with punctuation, spelling, and minor infelicities of word choice in our native language, but non-native writers often also struggle to create grammatical and comprehensible texts. Research in the field of Natural Language Processing (NLP) has addressed the problem of “ill-formed input” at least since the 1980s because downstream parsing of text usually collapsed unless input was grammatical (Kwasny and Sondheimer 1981; Jensen et al. 1983). However, useful applications able to significantly assist non-native writers only began to appear in the 2000s, such as ETS’s Criterion (Burstein, Chodorow, and Leacock 2003) and Microsoft’s ESL Assistant (Leacock, Gamon, and Brockett 2009). These systems were largely based on hand-coded “mal-rules” applied to the output from robust parsers that suggested corrections for errors.
Around the same time, researchers began exploring more data-driven approaches using supervised machine learning models built from annotated corpora of errorful text with exemplary corrections (Brockett, Dolan, and Gamon 2006; De Felice and Pulman 2008; Rozovskaya and Roth 2010b; Tetreault, Foster, and Chodorow 2010; Dahlmeier and Ng 2011b). The Helping Our Own (HOO) shared task (Dale, Anisimoff, and Narroway 2012), which attracted 14 research groups to compete and report their results on correcting English determiner and preposition choice errors using the First Certificate in English (FCE) corpus (Yannakoudakis, Briscoe, and Medlock 2011), marked with hindsight the turning point from rule-based to data-driven methods as well as burgeoning interest in the task. Leacock et al. (2014) subsequently published a book-length survey summarizing progress in the field up to this point.
The next decade has seen three further expanded shared tasks and an explosion of research and publications, both from participants in these competitions and others benchmarking their systems against the released test sets. Performance has increased roughly three-fold, and today, most state-of-the-art systems treat the task as one of “translation” from errorful to corrected text, including the latest system deployed in Google Docs and Gmail (Hoskere 2019). Recently, Wang et al. (2021) provided another detailed survey of work on grammatical error correction summarizing most work published since Leacock et al. (2014). In this article, we provide a more in-depth focus on very recent deep learning–based approaches to the task as well as a more detailed discussion of the nature of the task, its evaluation, and other remaining challenges (such as multilingual Grammatical Error Correction [GEC]) in order to better equip researchers with the insights required to be able to contribute to further progress.
1.1 The Task
The definition of a grammatical error is surprisingly difficult. Some types of spelling errors (such as accomodation with a single m) are about equally distributed between native and non-native writers and have no grammatical reflexes, and so could reasonably be excluded. Others, such as he eated, are boundary cases as they result from over-regularization of morphology, whereas he would eated is clearly ungrammatical in the context of a modal auxiliary verb. At the interpretative boundary, infelicitous discourse organization, such as Kim fell. Sandy pushed him. where the intention is to explain why Kim fell, is not obviously a grammatical error per se but nevertheless can be “corrected” via a tense change (Sandy had pushed him.) as opposed to a reordering of the sentences. Other tense errors that span sentences appear more clearly grammatical in nature, such as Kim will make Sandy a sandwich. Sandy ate it., where the discourse is temporally incoherent and correction requires a tense change in one or other sentence.
In practice, the task has increasingly been defined in terms of what corrections are annotated in corpora used for the shared tasks. These use a variety of annotation schemes but all tend to adopt minimal modifications of errorful texts to create error-free text with the same perceived meaning. Other sources of annotated data, such as that sourced from the online language learning platform Lang-8 (Mizumoto et al. 2012; Tajiri, Komachi, and Matsumoto 2012), often contain much more extensive rewrites of entire paragraphs of text. Given this resource-derived definition of the task, systems are evaluated on their ability to correct all kinds of mistakes in text, including spelling and discourse level errors that have no or little grammatical reflex. The term “Grammatical” Error Correction is thus something of a misnomer, but is nevertheless now commonly understood to encompass errors that are not always strictly grammatical in nature. A more descriptive term is Language Error Correction.
Table 1 provides a small sample of (constructed) examples that illustrate the range of errors to be corrected and some of the issues that arise with the precise definition and evaluation of the task. Errors can be classified into three broad categories: replacement errors, such as dreamed for dreamt in the second example; omission errors, such as on in the first example; and insertion errors, such as the in the third example. Some errors are complex in the sense that they require a sequence of replacement, omission, or insertion steps to correct, as with the syntax example. Sentences may also contain multiple distinct errors that require a sequence of corrections, as in the multiple example. Both the classification of errors and the specification of their correction steps can be, and have been, achieved using different schemes and approaches. For instance, correction of the syntax example involves transposing two adjacent words, so we could introduce a fourth broad class and correction step of transposition (word order). All extant annotation schemes break these broad classes down into further subclasses based on the part-of-speech of the words involved and the perceived morphological, lexical, syntactic, semantic, or pragmatic source of the error. The schemes vary in the number of such distinctions, ranging from just over two dozen (NUCLE: Dahlmeier, Ng, and Wu 2013) to almost one hundred (CLC: Nicholls 2003). The schemes also identify different error spans in source sentences and thus suggest different sets of edit operations to obtain the suggested corrections. For instance, the agreement error example might be annotated as She likes him and [kiss→kisses] him at the token level or simply [ϵ→es] at the character level. These differing annotation decisions affected the evaluation of system performance in artefactual ways, so a two-stage automatic standardization process was developed, ERRANT (Felice, Bryant, and Briscoe 2016; Bryant, Felice, and Briscoe 2017), which maps parallel errorful and corrected sentence pairs to a single annotation scheme using a linguistically enhanced alignment algorithm and a series of error type classification rules. This scheme uses 25 main error type categories, based primarily on part-of-speech and morphology, which are further subdivided into missing (omission), unnecessary (insertion), and replacement errors. This approach allows consistent automated training and evaluation of systems on any or all parallel corpora as well as supporting a more fine-grained analysis of the strengths and weaknesses of systems in terms of different error types.
Type | Error | Correction |
---|---|---|
Preposition | I sat in the talk | I sat in on the talk |
Morphology | dreamed | dreamt |
Determiner | I like the ice cream | I like ice cream |
Tense/Aspect | I like kiss you | I like kissing you |
Agreement | She likes him and kiss him | She likes him and kisses him |
Syntax | I have not the book | I do not have the book |
Punctuation | We met they talked and left | We met, they talked and left |
Unidiomatic | We had a big conversation | We had a long conversation |
Multiple | I sea the see from the seasoar | I saw the sea from the seesaw |
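To make the ERRANT-style standardization described above concrete, the following sketch extracts and classifies the edits in the agreement example from Table 1 using the publicly released errant Python package. This is a minimal illustration that assumes the package and its English spaCy model are installed; the exact error type labels may vary across versions.

```python
# Minimal sketch of automatic edit extraction and classification with ERRANT
# (Bryant, Felice, and Briscoe 2017). Assumes `pip install errant` and the
# required English spaCy model are available.
import errant

annotator = errant.load("en")

orig = annotator.parse("She likes him and kiss him")
cor = annotator.parse("She likes him and kisses him")

# Align the two sentences, extract the edits, and classify each one.
for edit in annotator.annotate(orig, cor):
    # e.g. orig[4:5] 'kiss' -> 'kisses' (R:VERB:SVA, a replacement agreement error)
    print(f"orig[{edit.o_start}:{edit.o_end}] {edit.o_str!r} -> {edit.c_str!r} ({edit.type})")
```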
Ultimately, however, the correction of errors requires an understanding of the communicative intention of the writer. For instance, the determiner example in Table 1 implicitly assumes a “neutral” context where the intent is to make a statement about generic ice cream rather than a specific instance. In a context where, say, a specific ice cream dessert is being compared to an alternative dessert, however, the determiner is felicitous. Similarly, the preposition omission error might not be an error if the writer is describing a context in which a talk was oversubscribed and many attendees had to stand because of a lack of seats. Though annotators will most likely take both the context and perceived writer’s intention into account when identifying errors, GEC itself is instead often framed as an isolated sentence-based task that ignores the wider context. This can introduce noise into the task, in that sequences that are errorful in context may appear correct when viewed in isolation. A related issue is that correction may not only depend on communicative intent, but also on factors such as dialect and genre. For example, correcting dreamed to dreamt may be appropriate if the target is British English, but incorrect for American English.
A larger issue arises with differing possibilities for correction. For example, correcting the tense/aspect example to either kissing or to kiss in the context of likes seems equally correct. Few existing corpora provide more than one reference correction, however, which means the true performance of systems is often underestimated. Note also that the same two corrections are not equally correct as complements of a verb such as try, depending on whether the context implies that a kissing event occurred or not. The issue of multiple possible corrections arises with many, if not most, examples: for instance, I haven’t the book; We met them, talked and left; We had an important conversation; The sea I see from the seesaw (is calm) are all plausible alternative corrections for some of the examples in Table 1. For this reason, several of the shared tasks have also evaluated performance on grammatical error detection, as this is valuable in some applications. Recently, some work has explored treating the GEC task as one of document-level correction (e.g., Chollampatt, Wang, and Ng 2019; Yuan and Bryant 2021) which, in principle, could ameliorate some of these issues but is currently hampered by a lack of appropriately structured corpora.
1.2 Survey Structure
We organize the remainder of this survey according to Table 2. We note that our taxonomy of core approaches, additional techniques, and data augmentation (Sections 3–5) is similar to that of Wang et al. (2021), because these sections contain unavoidable discussions of well-established techniques. We nevertheless believe this is the most effective way of categorizing this information and have endeavored to make the sections complementary in terms of the insights and information they provide.
Section | Subject | Topics |
---|---|---|
Section 2 | Data | Data collection and annotation, benchmark English datasets, other English datasets, non-English datasets |
Section 3 | Core Approaches | Classifiers, statistical machine translation, neural machine translation, edit-based approaches, language models and low-resource systems |
Section 4 | Additional Techniques | Reranking, ensembling and system combination, multi-task learning, custom inference methods, contextual GEC, Generative Adversarial Networks (GANs) |
Section 5 | Data Augmentation | Rule-based noise injection, probabilistic error patterns, back-translation, round-trip translation |
Section 6 | Evaluation | Benchmark metrics, reference-based metrics, reference-less metrics, metric reliability and human judgments, common experimental settings |
Section 7 | System Comparison | Recent state-of-the-art systems |
Section 8 | Future Challenges | Domain generalization, personalized systems, feedback comment generation, model interpretability, semantic errors, contextual GEC, system combination, training data selection, unsupervised approaches, multilingual GEC, spoken GEC, improved evaluation |
Section 9 | Conclusion | – |
2 Data
Like most tasks in NLP, the cornerstone of modern GEC systems is data. State-of-the-art neural models depend on millions or billions of words, and the quality of this data is paramount to model success. Collecting high-quality annotated data is a slow and laborious process, however, and there are fewer resources available in GEC than in other fields such as machine translation. This section hence first outlines some key considerations of data collection in GEC and highlights the importance of robust annotation guidelines. It next introduces the most commonly used corpora in English, as well as some less commonly used corpora, before concluding with GEC corpora for other languages. Artificial data has also become a popular topic in recent years, but this section only covers human annotated data; artificial data will be covered in Section 5.
2.1 Annotation Challenges
As mentioned in Section 1.1, the notion of a grammatical error is hard to define as different errors may have different scope (e.g., local vs. contextual), complexity (e.g., orthographic vs. semantic), and corrections (e.g., [this books→this book] vs. [this books→these books]). Human annotation is thus an extremely cognitively demanding task, and so clear annotation guidelines are a crucial component of dataset quality. This section briefly outlines three important aspects of data collection: Minimal vs. Fluent Corrections, Annotation Consistency, and Preprocessing Challenges.
Minimal vs. Fluent Corrections
Most GEC corpora have been annotated on the principle of minimal corrections, that is, annotators should make the minimum number of changes to make a text grammatical. Sakaguchi et al. (2016) argue, however, that this can often lead to corrections that sound unnatural, and so it would be better to annotate corpora on the principle of fluent corrections instead. Consider the following example:
Original | I want explain to you some interesting part from my experience. |
Minimal | I want to explain to you some interesting parts of my experience. |
Fluent | I want to tell you about some interesting parts of my experience. |
While the minimal correction primarily inserts a missing infinitival to before explain to make the sentence grammatical, the fluent correction also changes explain to tell you about because it is more idiomatic to tell someone about an experience rather than explain an experience.
One of the main challenges of this distinction, however, is that it is very difficult to draw a line between what constitutes a minimal correction and what constitutes a fluent correction. This is because minimal corrections (e.g., missing determiners) are a subset of fluent corrections, and so there cannot be fluent corrections without minimal corrections. It is also the case that minimal corrections are typically easier to make than fluent corrections (for both humans and machines), although it is undeniable that fluent corrections are the more desirable outcome. Ultimately, although it is very difficult to precisely define a fluent correction, annotation guidelines should nevertheless attempt to make clear the extent to which annotators are expected to edit.
Annotation Consistency
A significant challenge of human annotation is that corrections are subjective and there is often more than one way to correct a sentence (Bryant and Ng 2015; Choshen and Abend 2018b). It is nevertheless important that annotators attempt to be consistent in their judgments, especially if they are explicitly annotating edit spans. For example, the edit [has eating→was eaten] can also be represented as [has→was] and [eating→eaten], and this choice not only affects data exploration and analysis, but can also have an impact on edit-based evaluation. Similarly, the edit [the informations→information] can also be represented as [the→ϵ] and [informations→information], but the latter may be more intuitive because it represents two independent edits of clearly distinct types. Explicit error type classification is another important aspect of annotator consistency, as working within an error type framework (if one is used) not only increases the cognitive burden on the annotator, but may also bias the annotator toward a particular correction given the error types that are available (Sakaguchi et al. 2016). Ultimately, if annotators are tasked with explicitly defining the edits they make to correct a sentence, annotator guidelines must clearly define the notion of an edit.
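To illustrate why edit representations matter, the toy sketch below applies the two alternative representations of the [has eating→was eaten] correction mentioned above, encoded as purely illustrative (start, end, replacement) token-span tuples rather than the format of any particular corpus. Both yield the same corrected sentence, but edit-based metrics would count them differently.

```python
# Illustrative only: two ways of representing the same correction as token-level
# edits. Both produce the same corrected sentence, but they differ in the number
# and type of edits, which affects edit-based evaluation.
def apply_edits(tokens, edits):
    """Apply (start, end, replacement) edits right to left so offsets stay valid."""
    out = list(tokens)
    for start, end, replacement in sorted(edits, reverse=True):
        out[start:end] = replacement
    return out

source = "The cake has eating by him".split()
single_edit = [(2, 4, ["was", "eaten"])]            # [has eating -> was eaten]
split_edits = [(2, 3, ["was"]), (3, 4, ["eaten"])]  # [has -> was], [eating -> eaten]

assert apply_edits(source, single_edit) == apply_edits(source, split_edits)
print(" ".join(apply_edits(source, single_edit)))   # The cake was eaten by him
```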
Preprocessing Challenges
While human annotators are trained to correct natural text, GEC systems are typically trained to correct word tokenized sentences (mainly for evaluation purposes). This mismatch means that human annotations typically undergo several preprocessing steps in order to produce the desired output format (Bryant and Felice 2016). The first of these transformations involves converting character-level edits to token-level edits. While this is often straightforward, it can sometimes be the case that a human-annotated character span does not map to a complete token; for example, [ing→ed] to denote the edit [dancing→danced]. Although such cases can often (but not always) be resolved automatically (e.g., by expanding the character spans of the edit or calculating token alignment), they can also be reduced by training annotators to explicitly annotate longer spans rather than sub-words.
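The following rough sketch shows one way such a conversion might work, expanding a character-level edit like [ing→ed] in dancing to the whole-token edit [dancing→danced]. It assumes simple whitespace tokenization and is illustrative rather than the procedure used by any particular corpus.

```python
# A rough sketch (assuming whitespace tokenization) of expanding a character-level
# edit, e.g. [ing -> ed] inside "dancing", to a whole-token edit [dancing -> danced].
def char_edit_to_token_edit(text, start, end, replacement):
    tokens, offsets, pos = text.split(), [], 0
    for tok in tokens:
        pos = text.index(tok, pos)
        offsets.append((pos, pos + len(tok)))
        pos += len(tok)
    # Find all tokens that overlap the character span and expand to their bounds.
    hit = [i for i, (s, e) in enumerate(offsets) if s < end and e > start]
    tok_start, tok_end = hit[0], hit[-1] + 1
    span_s, span_e = offsets[hit[0]][0], offsets[hit[-1]][1]
    # Splice the replacement into the full surface form of the affected tokens.
    new_str = text[span_s:start] + replacement + text[end:span_e]
    return tok_start, tok_end, new_str.split()

text = "She was dancing all night"
print(char_edit_to_token_edit(text, 12, 15, "ed"))  # (2, 3, ['danced'])
```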
The second transformation involves sentence tokenization, which is potentially more complex given that human edits may change sentence boundaries (e.g., [A. B, C.→A, B. C.]). Sentences are nevertheless typically tokenized based solely on the original text, with the acknowledgment that some may be sentence fragments (to be joined with the following sentence) and that edits that cross sentence boundaries are ignored (e.g., [. Because→, because]). It is worth noting that this issue only affects sentence-based GEC systems (the vast majority); paragraph- or document-based systems are unaffected.
2.2 English Datasets
A small number of English GEC datasets have become popular for training and testing GEC systems, mostly as a result of shared tasks.1 This section introduces them as well as other less popular datasets for English (Table 3). We acknowledge that this is by no means an exhaustive list, but highlight datasets that have gained some traction in the last few years.
Corpus | Use | Sents | Toks | Refs | Edit Spans | Error Types | Level | Domain |
---|---|---|---|---|---|---|---|---|
FCE | Train | 28.3k | 454k | 1 | ✓ | 71 | B1-B2 | Exams |
Dev | 2.2k | 34.7k | 1 | ✓ | 71 | B1-B2 | Exams | |
Test | 2.7k | 41.9k | 1 | ✓ | 71 | B1-B2 | Exams | |
NUCLE | Train | 57.1k | 1.16m | 1 | ✓ | 28 | C1 | Essays |
CoNLL-2013 | Dev/Test | 1.4k | 29.2k | 1 | ✓ | 28 | C1 | Essays |
CoNLL-2014 | Test | 1.3k | 30.1k | 2–18 | ✓ | 28 | C1 | Essays |
Lang-8 | Train | 1.03m | 11.8m | 1–8 | ✗ | 0 | A1-C2? | Web |
JFLEG | Dev | 754 | 14.0k | 4 | ✗ | 0 | A1-C2? | Exams |
Test | 747 | 14.1k | 4 | ✗ | 0 | A1-C2? | Exams | |
W&I+ | Train | 34.3k | 628k | 1 | ✓ | 55 | A1-C2 | Exams |
LOCNESS | Dev | 4.4k | 87.0k | 1 | ✓ | 55 | A1-Native | Exams, Essays |
(BEA-2019) | Test | 4.5k | 85.7k | 5 | ✓ | 55 | A1-Native | Exams, Essays |
CLC | Train | 1.96m | 29.1m | 1 | ✓ | 77 | A1-C2 | Exams |
EFCamDat | Train | 4.60m | 56.8m | 1 | ✓ | 25 | A1-C2 | Exams |
WikEd | Train | 28.5m | 626m | 1 | ✗ | 0 | Native | Wiki |
AESW | Train | 1.20m | 28.4m | 1 | ✓ | 0 | C1-Native | Science |
Dev | 148k | 3.51m | 1 | ✓ | 0 | C1-Native | Science | |
Test | 144k | 3.45m | 1 | ✓ | 0 | C1-Native | Science | |
GMEG | Dev | 2.9k | 60.9k | 4 | ✗ | 0 | B1-B2, Native | Exams, Web, Wiki |
Test | 2.9k | 61.5k | 4 | ✗ | 0 | B1-B2, Native | Exams, Web, Wiki | |
CWEB | Dev | 6.7k | 148k | 2 | ✓ | 55 | Native | Web |
Test | 6.8k | 149k | 2 | ✓ | 55 | Native | Web | |
GHTC | Train? | 353k edits only | – | 1 | ✓ | 0 | Native? | Documentation |
2.2.1 Benchmark English Datasets
FCE
The First Certificate in English (FCE) corpus (Yannakoudakis, Briscoe, and Medlock 2011) is a public subset of the Cambridge Learner Corpus (CLC) (Nicholls 2003) that consists of 1,244 scripts (∼531k words) written by international learners of English as a second language (L2 learners). Each script typically contains two answers to a prompt in the style of a short essay, letter, or description, and each answer has been corrected by a single annotator who has identified and classified each edit according to a framework of 88 error types (Nicholls 2003) (71 unique error types are represented in the FCE). The authors are all intermediate level (B1-B2 level on the Common European Framework of Reference for Languages (CEFR) (Council of Europe 2001)) and the data is split into standard training, development, and test sets. The FCE was used as the official dataset of the HOO-2012 shared task (Dale, Anisimoff, and Narroway 2012), one of the official training datasets of the BEA-2019 shared task (Bryant et al. 2019), and has otherwise commonly been used for grammatical error detection (Rei and Yannakoudakis 2016; Bell, Yannakoudakis, and Rei 2019; Yuan et al. 2021). It also contains essay level scores, as well as other limited metadata about the learner, and has been used for automatic essay scoring (e.g., Ke and Ng 2019).
NUCLE/CoNLL
The National University of Singapore Corpus of Learner English (NUCLE) (Dahlmeier, Ng, and Wu 2013) consists of 1,397 argumentative essays (∼1.16m words) written by NUS undergraduate students who needed L2 English language support. The essays, which are approximately C1 level, are written on a diverse range of topics including technology, healthcare, and finance, and were each corrected by a single annotator who identified and classified each edit according to a framework of 28 error types. NUCLE was used as the official training corpus of the CoNLL-2013 and CoNLL-2014 shared tasks (Ng et al. 2013, 2014) as well as one of the official training datasets of the BEA-2019 shared task (Bryant et al. 2019). The CoNLL-2013 and CoNLL-2014 test sets were annotated under similar conditions to NUCLE and respectively consist of 50 essays each (∼30k words) on the topics of (i) surveillance technology and population aging, and (ii) genetic testing and social media. The CoNLL-2014 test set was also doubly annotated by 2 independent annotators, resulting in 2 sets of official reference annotations; Bryant and Ng (2015) and Sakaguchi et al. (2016) subsequently collected another 8 sets of annotations each for a total of 18 sets of reference annotations. The CoNLL-2013 dataset is now occasionally used as a development set, while the CoNLL-2014 dataset is one of the most commonly used benchmark test sets. One limitation of the CoNLL-2014 test set is that it is not very diverse given that it consists entirely of essays written by a narrow range of learners on only two different topics.
Lang-8
The Lang-8 Corpus of Learner English (Mizumoto et al. 2012; Tajiri, Komachi, and Matsumoto 2012) is a preprocessed subset of the multilingual Lang-8 Learner Corpus (Mizumoto et al. 2011), which consists of 100,000 submissions (∼11.8m words) to the language learning social network service, Lang-8.2 The texts are wholly unconstrained by topic, and hence include the full range of ability levels (A1–C2), and were written by international L2 English language learners with a bias toward Japanese L1 speakers. Although Lang-8 is one of the largest publicly available corpora, it is also one of the noisiest as corrections are provided by other users rather than professional annotators. A small number of submissions also contain multiple sets of corrections, but all annotations are provided as parallel text and so do not contain explicit edits or error types. Lang-8 was also one of the official training datasets of the BEA-2019 shared task (Bryant et al. 2019).
JFLEG
The Johns Hopkins Fluency-Extended GUG corpus (JFLEG) (Napoles, Sakaguchi, and Tetreault 2017) is a collection of 1,501 sentences (∼28.1k words) split roughly equally into a development and test set. The sentences were randomly sampled from essays written by L2 learners of English of an unspecified ability level (Heilman et al. 2014) and corrected by crowdsourced annotators on Amazon Mechanical Turk (Crowston 2012). Each sentence was annotated a total of 4 times, resulting in 4 sets of parallel reference annotations, but edits were not explicitly defined or classified. The main innovation of JFLEG is that sentences were corrected to be fluent rather than minimally grammatical (Section 2.1). The main criticisms of JFLEG are that it is much smaller than other test sets, the sentences are presented out of context, and it was not corrected by professional annotators (Napoles, Nădejde, and Tetreault 2019).
W&I+LOCNESS
The Write & Improve (W&I) and LOCNESS corpora (Bryant et al. 2019) respectively consist of 3,600 essays (∼755k words) written by international learners of all ability levels (A1–C2) and 100 essays (∼46.2k words) written by native British/American English undergraduates. It was released as the official training, development, and test corpus of the BEA-2019 shared task and was designed to be more balanced than other corpora such that there are roughly an equal number of sentences at each ability level: Beginner, Intermediate, Advanced, Native. The W&I essays come from submissions to the Write & Improve online essay-writing platform3 (Yannakoudakis et al. 2018), and the LOCNESS essays, which only constitute part of the development and test sets, come from the LOCNESS corpus (Granger 1998). The training and development set essays were each corrected by a single annotator, while the test set essays were corrected by 5 annotators resulting in 5 sets of parallel reference annotations. Edits were explicitly defined, but not manually classified, so error types were added automatically using the ERRANT framework (Bryant, Felice, and Briscoe 2017). The test set references are not currently publicly available, so all evaluation on this dataset is done via the BEA-2019 Codalab competition platform,4 which ensures that all systems are evaluated under the same conditions.
2.2.2 Other English Datasets
CLC
The Cambridge Learner Corpus (CLC) (Nicholls 2003) is a proprietary collection of over 130,000 scripts (∼29.1m words) written by international learners of English (130 different first language backgrounds) for different Cambridge exams of all levels (A1–C2) (Yuan, Briscoe, and Felice 2016; Bryant 2019). It is the superset of the public FCE and annotated in the same way.
EFCAMDAT
The Education First Cambridge Database (EFCamDat) (Geertzen, Alexopoulou, and Korhonen 2013) consists of 1.18m scripts (∼83.5m words) written by international learners of all ability levels (A1–C2) submitted to the English First online school platform. Approximately 66% of the scripts (∼56.8m words) have been annotated with explicit edits that have been classified according to a framework of 25 error types (Huang et al. 2017). Since the annotations were made by teachers for the purposes of giving feedback to students rather than for GEC system development, they are not always complete (too many corrections may dishearten the learner).
WikEd
The Wikipedia Edit Error Corpus (WikEd) (Grundkiewicz and Junczys-Dowmunt 2014) consists of tens of millions of sentences of revision histories from articles on English Wikipedia. The texts are written and edited by native speakers rather than L2 learners and not all changes are grammatical edits (e.g., information updates). A preprocessed version of the corpus is available5 (28.5m sentences, 626m words) that filters and modifies sentences such that they only contain edits similar to those in NUCLE. The corpus also includes tools to facilitate the collection of similar Wiki-based corpora for other languages.
AESW
The Automatic Evaluation of Scientific Writing (AESW) dataset consists of 316k paragraphs (∼35.5m words) extracted from 9,919 published scientific journal articles and split into a training, development, and test set for the AESW shared task (Daudaravicius et al. 2016). A majority of the paragraphs come from Physics, Mathematics, and Engineering journals and were written by advanced or native speakers. The articles were edited by professional language editors who explicitly identified the required edits but did not classify them by error type. Although large, one of the main limitations of the AESW dataset is that the texts come from a very specific domain and many sentences contain placeholder tokens for mathematical notation and reference citations, which do not generalize to other domains.
GMEG
The Grammarly Multidomain Evaluation for GEC (GMEG) dataset (Napoles, Nădejde, and Tetreault 2019) consists of 5,919 sentences (∼122.4k words) split approximately equally across 3 different domains: formal native, informal native, and learner text. Specifically, the formal text is sampled from the WikEd corpus (Grundkiewicz and Junczys-Dowmunt 2014), the informal text is sampled from Yahoo Answers, and the learner text is sampled from the FCE (Yannakoudakis, Briscoe, and Medlock 2011). The sentences were sampled at the paragraph level (except for WikEd) to include some context and were annotated by 4 professional annotators to produce 4 sets of alternative references. One of the goals of GMEG was to diversify researchers away from purely L2 learner-based corpora.
CWEB
The Corrected Websites (CWEB) dataset (Flachs et al. 2020) consists of 13.6k sentences (297k words) sampled from random paragraphs on the web in the CommonCrawl dataset.6 Paragraphs were filtered to reduce noise (e.g., non-English and duplicates) and loosely defined as formal (“sponsored”) and informal (“generic”) based on the domain of the URL. The paragraphs, which are split equally between a development set and a test set, were doubly annotated by 2 professional annotators and edits were extracted and classified automatically using ERRANT (Bryant, Felice, and Briscoe 2017). Like GMEG, one of the aims of CWEB was to introduce a dataset that extended beyond learner corpora.
GHTC
The GitHub Typo Corpus (GHTC) (Hagiwara and Mita 2020) consists of 353k edits from 203k commits to repositories in the GitHub software hosting website.7 All the edits were gathered from repositories that met certain conditions (e.g., a permissive license) and from commits that contained the word “typo” in the commit message. The intuition behind the corpus was that developers often make small commits to correct minor spelling/grammatical errors and that these annotations can be used for GEC. The main limitation of GHTC is that the majority of edits are spelling or orthographic errors from a specific domain (i.e., software documentation) and that the context of the edit is not always a full sentence.
2.3 Non-English Datasets
Although most work on GEC has focused on English, corpora for other languages are also slowly being created and publicly released for the purposes of developing GEC models. This section introduces some of the most prominent (Table 4), along with other relevant resources, but is again by no means an exhaustive list. These resources are ultimately helping to pave the way for research into multilingual GEC (Náplava and Straka 2019; Katsumata and Komachi 2020; Rothe et al. 2021).
Language | Corpus | Use | Sents | Toks | Refs | Edit Spans | Error Types | Level | Domain |
---|---|---|---|---|---|---|---|---|---|
Arabic | QALB-2014 | Train | 19.4k* | 1m | 1 | 7 | Native | Web | |
Dev | 1k* | 53.8k | 1 | 7 | Native | Web | |||
Test | 948* | 51.3k | 1 | 7 | Native | Web | |||
QALB-2015 | Train | 310* | 43.3k | 1 | 7 | A1-C2 | Essays | ||
Dev | 154* | 24.7k | 1 | 7 | A1-C2 | Essays | |||
Test | 158* | 22.8k | 1 | 7 | A1-C2 | Essays | |||
Test | 920* | 48.5k | 1 | 7 | Native | Web | |||
Chinese | NLPTEA-2020 | Train | 1.1k† | 36.9k‡ | 1 | 4 | A1-C2 | Exams | |
Test | 1.4k† | 55.2k‡ | 1 | 4 | A1-C2 | Exams | |||
NLPCC-2018 | Train | 717k | 14.1m‡ | 1–21 | ✗ | 0 | A1-C2? | Web | |
Test | 2k | 61.3k‡ | 1–2 | 4 | A1-C2? | Essays | |||
MuCGEC | Dev | 1.1k | 50k‡ | 2.3 | 19 | A1-C2? | Exams | ||
Test | 5.9k | 228k‡ | 2.3 | 19 | A1-C2? | Essays, Exams, Web | |||
Czech | AKCES-GEC | Train | 42.2k | 447k | 1 | 25 | A1-Native | Essays, Exams | |
Dev | 2.5k | 28.0k | 2 | 25 | A1-Native | Essays, Exams | |||
Test | 2.7k | 30.4k | 2 | 25 | A1-Native | Essays, Exams | |||
GECCC | Train | 66.6k | 750k | 1 | 65 | A1-Native | Essays, Exams, Web | ||
Dev | 8.5k | 101k | 1–2 | 65 | A1-Native | Essays, Exams, Web | |||
Test | 7.9k | 98.1k | 2 | 65 | A1-Native | Essays, Exams, Web | |||
German | Falko-MERLIN | Train | 19.2k | 305k | 1 | 56 | A1-C2 | Essays, Exams | |
Dev | 2.5k | 39.5k | 1 | 56 | A1-C2 | Essays, Exams | |||
Test | 2.3k | 36.6k | 1 | 56 | A1-C2 | Essays, Exams | |||
Japanese | TEC-JL | Test | 1.9k | 41.5k‡ | 2 | ✗ | 0 | A1-C2? | Forum |
Russian | RULEC-GEC | Train | 5k | 83.4k | 1 | 23 | C1-C2 | Essays | |
Dev | 2.5k | 41.2k | 1 | 23 | C1-C2 | Essays | |||
Test | 5k | 81.7k | 1 | 23 | C1-C2 | Essays | |||
Ukrainian | UA-GEC | Train | 18.2k | 285k | 1 | 4 | B1-Native | Essays, Fiction | |
Test | 2.5k | 43.5k | 1 | 4 | B1-Native | Essays, Fiction |
* The Arabic datasets are split into documents rather than sentences.
† The Chinese NLPTEA datasets are split into paragraphs (1–5 sentences) rather than sentences.
‡ The Chinese and Japanese datasets are split into characters rather than tokens.
Arabic
The Qatar Arabic Language Bank (QALB) project (Zaghouani et al. 2014) is an initiative that aims to collect large corpora of annotated Arabic for the purposes of Arabic GEC system development. A subset of this corpus was used as the official training, development, and test data of the QALB-2014 and QALB-2015 shared tasks on Arabic text correction (Mohit et al. 2014; Rozovskaya et al. 2015). In particular, QALB-2014 released 21.3k documents (1.1m words) of annotated user comments submitted to the Al Jazeera news website by native speakers, while QALB-2015 released 622 documents (90.8k words) of annotated essays written by the full range of Arabic L2 learners (A1–C2) (Zaghouani et al. 2015) along with an additional 920 documents (48.5k words) of unreleased Al Jazeera comments. QALB-2015 thus had 2 test sets: one on native Al Jazeera data and one on Arabic L2 learner essays. In all cases, files were provided at the document level (rather than the sentence level) and edits were explicitly identified by trained annotators and classified automatically using a framework of 7 error types.
Chinese
The Test of Chinese as a Foreign Language (TOCFL) corpus (Lee, Tseng, and Chang 2018) and the Hanyu Shuiping Kaoshi (HSK: Chinese Proficiency Test) corpus8 (Zhang 2009) respectively consist of 2.8k essays (1m characters) and 11k essays (4m characters) written by the full range of language learners (A1–C2) who took Mandarin Chinese language proficiency exams. Various subsets of these corpora were used as the official training and test sets in the NLPTEA series of shared tasks on Chinese Grammatical Error Diagnosis (i.e., error detection) between 2014 and 2020 (Yu, Lee, and Chang 2014; Rao, Yang, and Zhang 2020). The most recent of these shared tasks, NLPTEA-2020, released a total of 2.6k paragraphs (92.1k characters, 1–5 sentences each), which were annotated by a single annotator according to a framework of 4 error types: Redundant (R), Missing (M), Word Selection (S), or Word Order (W).
The NLPCC-2018 shared task (Zhao et al. 2018), which was the first shared task on full error correction in Mandarin Chinese, released a further 717k training sentences (14.1m characters) that were extracted from a cleaned subset of Lang-8 user submissions (Mizumoto et al. 2011). Like the Lang-8 Corpus of Learner English, the ability level of the authors in this dataset is unknown and corrections were provided by other users. The test data for this shared task came from the PKU Chinese Learner Corpus and consists of 2,000 sentences (61.3k characters) written by foreign college students. All test sentences were first annotated by a single annotator, who also classified edits according to the same 4-error-type framework as NLPTEA, and subsequently checked by a second annotator who was allowed to make changes to the annotations if necessary.
The Multi-Reference Multi-Source Evaluation Dataset for Chinese Grammatical Error Correction (MuCGEC) (Zhang et al. 2022b) is a new corpus that is intended to be a more robust test set for Chinese GEC. It contains a total of 7,063 sentences (∼278k characters) sampled approximately equally from the NLPCC-2018 training set (Lang-8), the NLPCC-2018 test set (PKU Chinese Learner Corpus), and the NLPTEA-2018/2020 test sets (HSK Corpus). All sentences were annotated by multiple annotators, but identical references were removed, so we report an average of 2.3 references per sentence (90% of all sentences have 1–3 references). Edits were also classified according to a scheme of 19 error types, including 5 main error types and 14 minor sub-types.
Czech
The AKCES-GEC corpus (Náplava and Straka 2019) consists of 47.3k sentences (505k words) written by both learners of Czech as a second language (from both Slavic and non-Slavic backgrounds) and Romani children who speak a Czech ethnolect as a first language. The essays and exam-style scripts come from the Learner Corpus of Czech as a Second Language (CzeSL) (Rosen 2016), which falls under the larger Czech Language Acquisition Corpora (AKCES) project (Šebesta 2010). The essays in the training set were annotated once (1 set of annotations) and the essays in the development and test sets were annotated twice (2 sets of annotations), all with explicit edits that were classified according to a framework of 25 error types.
The Grammar Error Correction Corpus for Czech (GECCC) (Náplava et al. 2022) is an extension of AKCES-GEC that includes both formal texts written by native Czech primary and secondary school students as well as informal website discussions on Facebook and Czech news websites, in addition to the texts written by Czech language learners and Romani children. The total corpus consists of 83k sentences (949k words), all of which were manually annotated (or re-annotated in order to preserve annotation style) by 5 experienced annotators who explicitly identified edits. Edits were then classified automatically by a variant of ERRANT (Bryant, Felice, and Briscoe 2017) for Czech, which included a customized tagset of 65 error types. GECCC is currently one of the largest non-English corpora and is also larger than most popular English benchmarks.
German
The Falko-MERLIN GEC corpus (Boyd 2018) consists of 24k sentences (381k words) written by learners of all ability levels (A1–C2). Approximately half the data comes from the Falko corpus (Reznicek et al. 2012), which consists of minimally corrected advanced German learner essays (C1–C2), while the other half comes from the MERLIN corpus (Boyd et al. 2014), which consists of standardized German language exam scripts from a wide range of ability levels (A1–C1). Edits were not explicitly annotated, but extracted and classified automatically using a variation of ERRANT (Bryant, Felice, and Briscoe 2017) that was adapted for German and included a customized tagset for German error types.
Japanese
The TMU Evaluation Corpus for Japanese Learners (TEC-JL) (Koyama et al. 2020) consists of 1.9k sentences (41.5k characters) written by language learners of unknown level (A1–C2?) and submitted to the language learning social network service Lang-8. TEC-JL is a subset of the multilingual Lang-8 Learner Corpus (Mizumoto et al. 2011) and was doubly annotated by 3 native Japanese university students (2 sets of annotations) to be a more reliable test set than the original Lang-8 Learner Corpus, which can be quite noisy.
Russian
The Russian Learner Corpus of Academic Writing (RULEC) (Alsufieva, Kisselev, and Freels 2012) consists of essays written by L2 university students and heritage Russian speakers in the United States. A subset of this corpus, 12.5k sentences (206k words), was annotated by 2 native speakers of Russian with backgrounds in linguistics and released as the RULEC-GEC corpus (Rozovskaya and Roth 2019). Edits were explicitly annotated and classified according to a framework of 23 error types. Another corpus of annotated Russian errors, the Russian Lang-8 corpus (RU-Lang8) (Trinh and Rozovskaya 2021), which is a subset of the aforementioned multilingual Lang-8 Learner Corpus (Mizumoto et al. 2011), was also recently announced; however, the data has not yet been publicly released.
Ukrainian
The UA-GEC corpus (Syvokon and Nahorna 2021) consists of 20.7k sentences (329k words) written by almost 500 authors from a wide variety of backgrounds (mostly technical and humanities) and ability levels (two-thirds native). The texts cover a wide range of topics, including short essays (formal, informal, fictional, or journalistic) and translated works of literature, and were annotated by two native speakers with degrees in Ukrainian linguistics. Edits were explicitly annotated and classified according to a scheme of 4 error types: Grammar, Spelling, Punctuation, or Fluency.
3 Core Approaches
This section introduces some of the core approaches to GEC including classifiers (statistical and neural), machine translation (statistical and neural), edit-based approaches, and language models. We provide a high-level overview of how each of these approaches works and highlight notable models that have led to breakthroughs in system development. These approaches provide the foundation on which additional techniques (Section 4) and artificial error generation (Section 5) are built.
3.1 Classifiers
Machine learning classifiers were historically one of the most popular approaches to GEC. The main reason for this was that some of the most common error types for English as a second language (ESL) learners, such as article and preposition errors, have small confusion sets and so are well-suited to multiclass classification. For example, it is intuitive to build a classifier that predicts one of {a/an, the, ϵ} before every noun phrase in a sentence. To do this, a classifier receives a number of features representing the context of the analyzed word or phrase in a sentence and outputs a predicted class that constitutes a correction. Errors are flagged and corrected by comparing the original word used in the text with the most likely candidate predicted by the classifier. This approach has been applied to several common error types including:
articles (Lee 2004; Han, Chodorow, and Leacock 2006; De Felice 2008; Gamon et al. 2008; Gamon 2010; Dahlmeier and Ng 2011b; Kochmar, Andersen, and Briscoe 2012; Rozovskaya and Roth 2013, 2014);
prepositions (Chodorow, Tetreault, and Han 2007; De Felice 2008; Gamon et al. 2008; Tetreault and Chodorow 2008; Gamon 2010; Dahlmeier and Ng 2011b; Kochmar, Andersen, and Briscoe 2012; Rozovskaya and Roth 2013, 2014);
noun number (Berend et al. 2013; van den Bosch and Berck 2013; Jia, Wang, and Zhao 2013; Xiang et al. 2013; Yoshimoto et al. 2013; Kunchukuttan, Chaudhury, and Bhattacharyya 2014);
verb form (Lee and Seneff 2008; Tajiri, Komachi, and Matsumoto 2012; van den Bosch and Berck 2013; Jia, Wang, and Zhao 2013; Rozovskaya and Roth 2013, 2014; Rozovskaya, Roth, and Srikumar 2014).
Training examples consisting of native and/or learner data are represented as vectors of features that are considered useful for the error type. Since the most useful features often depend on the word class, it is necessary to build separate classifiers for each error type and most of the prior classification-based approaches have focused on feature engineering. For the vast majority of syntactically-motivated errors, features such as contextual word and part-of-speech (POS) n-grams, lemmas, phrase constituency information, and dependency relations are generally useful (Felice and Yuan 2014b; Leacock et al. 2014; Rozovskaya and Roth 2014; Wang et al. 2021). The details of training vary depending upon the classification algorithm, but popular examples include naive Bayes (Rozovskaya and Roth 2011; Kochmar, Andersen, and Briscoe 2012), maximum entropy (Lee 2004; Han, Chodorow, and Leacock 2006; Chodorow, Tetreault, and Han 2007; De Felice 2008), decision trees (Gamon et al. 2008), support-vector machines (Putra and Szabó 2013), and the averaged perceptron (Rozovskaya and Roth 2010a, 2010b, 2011).
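To make the classifier formulation concrete, the sketch below trains a toy maximum entropy (logistic regression) article classifier over the confusion set {a/an, the, ϵ} using scikit-learn. The training examples and the two context features are invented purely for illustration; real systems use much richer feature sets and far more data.

```python
# Toy sketch of a confusion-set classifier for articles ({a/an, the, eps}),
# in the spirit of the maximum entropy approaches described above. The training
# examples and features below are invented for illustration only.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def features(prev_word, head_noun):
    # Real systems use POS n-grams, head nouns, dependency relations, etc.
    return {"prev=" + prev_word.lower(): 1, "head=" + head_noun.lower(): 1}

# (previous word, head noun of the noun phrase) -> gold article ("eps" = no article)
train = [
    (("eat", "sandwich"), "a/an"),
    (("read", "book"), "a/an"),
    (("like", "weather"), "the"),
    (("like", "music"), "eps"),
    (("drink", "water"), "eps"),
]

vec = DictVectorizer()
X = vec.fit_transform([features(prev, head) for (prev, head), _ in train])
y = [label for _, label in train]
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Predict the article to insert before the noun phrase headed by "apple" after "eat";
# an error is flagged if the prediction differs from what the writer actually used.
x = vec.transform([features("eat", "apple")])
print(clf.predict(x))  # e.g. ['a/an']
```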
More recently, neural network techniques have been applied to classification-based GEC, where neural classifiers have been built using context words with pre-trained word embeddings, like Word2Vec (Mikolov et al. 2013) and GloVe (Pennington, Socher, and Manning 2014). Different neural network models have been proposed, including convolutional neural networks (CNNs) (Sun et al. 2015), recurrent neural networks (RNNs) (Wang, Li, and Lin 2017; Li et al. 2019), and pointer networks (Li et al. 2019).
One limitation of these classifiers, however, is that they only target very specific error types with small confusion sets and do not extend well to errors involving open-class words (such as word choice errors). Another weakness is that they heavily rely on local context and treat errors independently, assuming that there is only one error in the context and all the surrounding information is correct. When multiple classifiers are combined for multiple error types, classifier order also matters and predictions from individual classifiers may become inconsistent (Yuan 2017). These limitations consequently mean that classifiers are generally no longer explored in GEC in favor of other methods.
3.2 Statistical Machine Translation
The use of statistical machine translation (SMT) for GEC was pioneered by Brockett, Dolan, and Gamon (2006), who built a system to correct errors involving 14 countable and uncountable nouns. Their training data comprised a large corpus of sentences extracted from news articles that were deliberately modified to include artificial mass noun errors. Mizumoto et al. (2011) applied the same techniques to Japanese error correction but improved on them by not only considering a wider set of error types, but also training on real learner examples extracted from the language learning social network website Lang-8. Yuan and Felice (2013) subsequently trained a POS-factored SMT system to correct five types of errors in learner text for the CoNLL-2013 shared task, and revealed the potential of using SMT as a general approach for correcting multiple error types and interacting errors simultaneously. In the following year, the two top-performing systems in the CoNLL-2014 shared task demonstrated that SMT yielded state-of-the-art performance on general error correction, in contrast with other methods (Felice et al. 2014; Junczys-Dowmunt and Grundkiewicz 2014). This success led to SMT becoming a dominant approach in the field and inspired other researchers to adapt SMT technology for GEC, including:
Adding GEC-specific features to the model to allow for the fact that most words translate into themselves and errors are often similar to their correct forms. Two types of these features include the Levenshtein distance (Felice et al. 2014; Junczys-Dowmunt and Grundkiewicz 2014, 2016; Yuan, Briscoe, and Felice 2016; Grundkiewicz and Junczys-Dowmunt 2018) and edit operations (Junczys-Dowmunt and Grundkiewicz 2016; Chollampatt and Ng 2017; Grundkiewicz and Junczys-Dowmunt 2018); a word-level Levenshtein feature is sketched after this list.
Tuning parameter weights with different algorithms, including minimum error rate training (MERT) (Kunchukuttan, Chaudhury, and Bhattacharyya 2014; Junczys-Dowmunt and Grundkiewicz 2014), the margin infused relaxed algorithm (MIRA) (Junczys-Dowmunt and Grundkiewicz 2014), and pairwise ranking optimization (PRO) (Junczys-Dowmunt and Grundkiewicz 2016).
Training additional large-scale language models (LMs) on monolingual native data, such as the British National Corpus (BNC) (Yuan 2017), Wikipedia (Junczys-Dowmunt and Grundkiewicz 2014; Chollampatt and Ng 2017), and Common Crawl (Junczys-Dowmunt and Grundkiewicz 2014, 2016; Chollampatt and Ng 2017).
Introducing neural network components, such as a neural network global lexicon model (NNGLM) and neural network joint model (NNJM) (Chollampatt, Taghipour, and Ng 2016; Chollampatt and Ng 2017).
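As referenced in the first item above, a word-level Levenshtein distance between the source sentence and a candidate correction is a cheap but informative GEC-specific feature, because it rewards hypotheses that stay close to the input. The sketch below computes it with standard dynamic programming; it is illustrative and not tied to any particular SMT toolkit.

```python
# Illustrative word-level Levenshtein distance, usable as a GEC-specific SMT
# feature rewarding candidates that stay close to the source sentence.
def levenshtein(src_tokens, hyp_tokens):
    m, n = len(src_tokens), len(hyp_tokens)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if src_tokens[i - 1] == hyp_tokens[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return dp[m][n]

src = "I want explain to you some interesting part".split()
hyp = "I want to explain to you some interesting parts".split()
print(levenshtein(src, hyp))  # 2 (one insertion, one substitution)
```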
Despite their success in GEC, SMT-based approaches suffer from a few shortcomings. In particular, they (i) tend to produce locally well-formed phrases with poor overall grammar, (ii) exhibit a predilection for changing phrases to more frequent versions even when the original is correct, resulting in unnecessary corrections, (iii) are unable to process long-range dependencies, and (iv) are hard to constrain to particular error types (Felice 2016; Yuan 2017). Last but not least, the performance of SMT systems depends heavily on the amount and quality of parallel data available for training, which is very limited in GEC. A common solution to this problem is to generate artificial datasets, where errors are injected into well-formed text to produce pseudo-incorrect sentences, as described in Section 5.
3.3 Neural Machine Translation
With the advent of deep learning and the promising results reported in machine translation and other sequence-to-sequence tasks, neural machine translation (NMT) was naturally extended to GEC. Compared with SMT, NMT uses a single large neural network to model the entire correction process, eliminating the need for complex GEC-specific feature engineering. Training an NMT system is furthermore an end-to-end process and so does not require separately trained and tuned components as in SMT. Despite its simplicity, NMT has achieved state-of-the-art performance on various GEC tasks (Flachs, Stahlberg, and Kumar 2021; Rothe et al. 2021).
3.3.1 Recurrent Neural Networks
RNNs are a type of neural network specifically designed to process sequential data, and can be used to transform a variable-length input sequence into another variable-length output sequence (Cho et al. 2014; Sutskever, Vinyals, and Le 2014). To handle long-term dependencies, gated units are usually used in RNNs (Goodfellow, Bengio, and Courville 2016), the two most popular gated variants being Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber 1997) and Gated Recurrent Units (GRU) (Cho et al. 2014). Bahdanau, Cho, and Bengio (2015) introduced an attention mechanism that allows the decoder to attend to all encoder states rather than a single fixed-length vector, which eased optimization and improved performance. Yuan and Briscoe (2016) presented the first work on the NMT-based approach for GEC; their model consists of a bidirectional RNN encoder and an attention-based RNN decoder. Xie et al. (2016) proposed the use of a character-level RNN sequence-to-sequence model for GEC. Following their work, a hybrid model with nested attention at both the word and character level was later introduced by Ji et al. (2017).
3.3.2 Convolutional Neural Networks
Another way of processing sequential data is by using a CNN across a temporal sequence. CNNs are a type of neural network that is designed to process grid-like data and specializes in capturing local dependencies (Goodfellow, Bengio, and Courville 2016). CNNs were first applied to NMT by Kalchbrenner and Blunsom (2013), but they were not as successful as RNNs until Gehring et al. (2017) stacked several CNN layers followed by non-linearities. Inspired by this work, Chollampatt and Ng (2018a) proposed a 7-layer CNN sequence-to-sequence model for GEC. In their model, local context is captured by the convolution operations performed over smaller windows and wider context is captured by the multilayer structure. Their model was the first NMT-based model that significantly outperformed prior SMT-based models. This model was later used in combination with Transformers to build a state-of-the-art GEC system (Yuan et al. 2019).
3.3.3 Transformers
The Transformer (Vaswani et al. 2017) is the first sequence transducer network that entirely relies on a self-attention mechanism to compute the representations of its input, without the need for recurrence or convolution. Its architecture allows better parallelization on multiple GPUs, overcoming the weakness of RNNs.
The Transformer has become the architecture of choice for machine translation since its inception (Edunov et al. 2018; Wang et al. 2019; Liu et al. 2020). Previous work has investigated the adaptation of NMT to GEC, such as optimizing the model with edit-weighted loss (Junczys-Dowmunt et al. 2018) and adding a copy mechanism (Zhao et al. 2019; Yuan et al. 2019). A copy mechanism allows the model to directly copy tokens from the source sentence, which often has substantial overlap with the target sentence in GEC. The Copy-Augmented Transformer has become a popular alternative architecture for GEC (Hotate, Kaneko, and Komachi 2020; Wan, Wan, and Wang 2020). Another modification to the Transformer architecture is altering the encoder-decoder attention mechanism in the decoder to accept and make use of additional context. For example, Kaneko et al. (2020) added the BERT representation of the input sentence as additional context for GEC, while Yuan and Bryant (2021) added the previous sentences in the document, and Zhang et al. (2022c) added a tree-based syntactic representation of the input sentence.
As the Transformer architecture has a large number of parameters, yet parallel GEC training data is limited, pre-training has become a standard procedure in building GEC systems. The first Transformer-based GEC system (Junczys-Dowmunt et al. 2018) pre-trained the Transformer decoder on a language modeling task, but it has since become more common to pre-train on synthetic GEC data. The top two systems in the BEA-2019 shared task (Grundkiewicz, Junczys-Dowmunt, and Heafield 2019; Choe et al. 2019) and a recent state-of-the-art GEC system (Stahlberg and Kumar 2021) all pre-trained their Transformer models with synthetic data, but they generated their synthetic data in different ways. We discuss different techniques for generating synthetic data in Section 5.1. More recently, with the advances in large pre-trained language models, directly fine-tuning large pre-trained language models with GEC parallel data has been shown to achieve comparable performance with synthetic data pre-training (Katsumata and Komachi 2020), even reaching state-of-the-art performance (Rothe et al. 2021; Tarnavskyi, Chernodub, and Omelianchuk 2022).
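To make this fine-tuning recipe concrete, the sketch below fine-tunes a small pretrained sequence-to-sequence checkpoint on a handful of GEC sentence pairs using the Hugging Face transformers library. The checkpoint name, the "gec: " task prefix, and the hyperparameters are illustrative assumptions rather than the settings used in the cited systems, which train on far larger corpora.

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Illustrative parallel data: (errorful source, corrected target).
pairs = [
    ("She go to school every day .", "She goes to school every day ."),
    ("I am agree with you .", "I agree with you ."),
]

model_name = "t5-small"  # assumption: any pretrained seq2seq checkpoint could be used
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

model.train()
for epoch in range(3):  # in practice, far more data and training steps are needed
    for src, tgt in pairs:
        inputs = tokenizer("gec: " + src, return_tensors="pt")
        labels = tokenizer(tgt, return_tensors="pt").input_ids
        loss = model(**inputs, labels=labels).loss  # cross-entropy over target tokens
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# Inference: generate a correction for a new sentence.
model.eval()
test = tokenizer("gec: He have three dog .", return_tensors="pt")
print(tokenizer.decode(model.generate(**test, max_new_tokens=32)[0],
                       skip_special_tokens=True))
```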
Irrespective of the type of NMT architecture (RNN, CNN, Transformer), however, NMT systems share several weaknesses with SMT systems, most notably in terms of data requirements. In particular, although NMT systems are more capable of correcting longer-range and more complex errors than SMT, they also require as much training data as possible, which can lead to extreme resource and time requirements: It is not uncommon for some models to require several days of training time on a cluster of GPUs. Moreover, neural models are almost completely uninterpretable (which furthermore makes them difficult to customize) and it is nearly impossible for a human to determine the reasoning behind a given decision; this is particularly problematic if we also want to explain the cause of an error to a user rather than just correct it. Ultimately, however, a key strength of NMT is that it is an end-to-end approach, and so does not require feature engineering or much human intervention, and it is undeniable that it produces some of the most convincing output to date.
3.4 Edit-based Approaches
While most GEC approaches generate a corrected sentence from an input sentence, the edit generation approach generates a sequence of edits to be applied to the input sentence instead. As GEC has a high degree of token copying from the input to the output, Stahlberg and Kumar (2020) argued that generating the full sequence is wasteful. By generating edit operations instead of all tokens in a sentence, the edit generation approach typically has a faster inference speed, reported to be five to ten times faster than GEC systems that generate the whole sentence. One limitation of this approach, however, is that edit operations tend to be token-based, and so sometimes fail to capture more complex, multi-token fluency edits (Lai et al. 2022). Edit generation has been cast as a sequence tagging task (Malmi et al. 2019; Awasthi et al. 2019; Omelianchuk et al. 2020; Tarnavskyi, Chernodub, and Omelianchuk 2022) or a sequence-to-sequence task (Stahlberg and Kumar 2020).
In the sequence tagging approach, for each token of an input sentence, the system predicts an edit operation to be applied to that token (Table 5). This approach requires the user to define a set of tags representing the edit operations to be modeled by the system. Some edits can be universally modeled, such as conversion of verb forms or conversion of nouns from singular to plural form. Some others, such as word insertion and word replacement, are token-dependent. Token-dependent edits need a different tag for each possible word in the vocabulary, resulting in the number of tags growing linearly with the number of unique words in the training data. Thus, the number of token-dependent tags to be modeled in the system becomes a trade-off between coverage and model size.
Source | After | many | years | he | still | dream | to | become | a | super | hero
Target | After | many | years , | he | still | dreams | of | becoming | a | super | hero
Edits | KEEP | KEEP | APP_, | KEEP | KEEP | VB_VBZ | REP_of | VB_VBG | KEEP | KEEP | KEEP
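To make the tagging scheme in Table 5 concrete, the following sketch applies a predicted tag sequence to the source tokens to reconstruct the correction. The tag inventory and the toy inflection table are illustrative assumptions; real tag sets such as GECToR's are much larger and handle morphology with dedicated transformation rules.

```python
# Minimal sketch: apply token-level edit tags (as in Table 5) to source tokens.
# Assumption: VB_VBZ / VB_VBG are resolved with a toy inflection table.
INFLECT = {("dream", "VB_VBZ"): "dreams", ("become", "VB_VBG"): "becoming"}

def apply_tags(tokens, tags):
    out = []
    for tok, tag in zip(tokens, tags):
        if tag == "KEEP":
            out.append(tok)
        elif tag.startswith("APP_"):          # keep the token, then append a new token
            out.extend([tok, tag[len("APP_"):]])
        elif tag.startswith("REP_"):          # replace the token with another token
            out.append(tag[len("REP_"):])
        elif tag == "DEL":                    # drop the token entirely (hypothetical tag)
            continue
        else:                                 # grammatical transformation tags
            out.append(INFLECT.get((tok, tag), tok))
    return out

src = "After many years he still dream to become a super hero".split()
tags = ["KEEP", "KEEP", "APP_,", "KEEP", "KEEP", "VB_VBZ",
        "REP_of", "VB_VBG", "KEEP", "KEEP", "KEEP"]
print(" ".join(apply_tags(src, tags)))
# After many years , he still dreams of becoming a super hero
```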
On the other hand, the sequence-to-sequence approach is more flexible as it does not limit the output to pre-defined edit operation tags. It produces a sequence of edits, each consisting of a span position, a replacement string, and an optional tag for edit type (Table 6). These tags add interpretability to the process and have been shown to improve model performance. As generation in the sequence-to-sequence approach has a left-to-right dependency, inference is slower than in the sequence tagging approach, but it is still about five times faster than whole-sentence generation because the generated edit sequence is much shorter than the sequence of all tokens in the sentence (Stahlberg and Kumar 2020).
Source | After many years he still dream to become a super hero . |
Target | After many years , he still dreams of becoming a super hero . |
Edits | (SELF,3,SELF), (PUNCT,3,‘,’), (SELF,5,SELF), (SVA,6,‘dreams’), (PART,7,‘of’), (FORM,8,‘becoming’), (SELF,12,SELF) |
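The span-based edits in Table 6 can be applied in a similar fashion. The sketch below reads each (type, span-end, replacement) triple against the source tokens; this is one plausible interpretation of the format, in which SELF copies the untouched span and an unchanged span end marks an insertion.

```python
# Minimal sketch: apply (type, end, replacement) span edits (as in Table 6).
# Assumption: each edit covers source tokens [prev_end, end); SELF copies them verbatim.
def apply_span_edits(tokens, edits):
    out, prev = [], 0
    for _type, end, repl in edits:
        if repl == "SELF":
            out.extend(tokens[prev:end])   # copy the untouched span
        else:
            out.append(repl)               # replace the span (or insert if the span is empty)
        prev = end
    return out

src = "After many years he still dream to become a super hero .".split()
edits = [("SELF", 3, "SELF"), ("PUNCT", 3, ","), ("SELF", 5, "SELF"),
         ("SVA", 6, "dreams"), ("PART", 7, "of"), ("FORM", 8, "becoming"),
         ("SELF", 12, "SELF")]
print(" ".join(apply_span_edits(src, edits)))
# After many years , he still dreams of becoming a super hero .
```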
The main advantages of edit-based approaches to GEC are thus that they not only add much needed transparency and explainability to the correction process, but they are also much faster at inference time than NMT. Their main disadvantages, however, are that they generally require human engineering to define the size and scope of the edit label set, and that it is more difficult to represent interacting and complex multi-token edits with token-based labels. Like all neural approaches, they also depend on as much training data as possible, but when data is available, edit-based approaches are very competitive with state-of-the-art NMT models.
3.5 Language Models for Low-Resource and Unsupervised GEC
Unlike previous strategies, language model–based GEC does not require training a system with parallel data. Instead, it relies on various techniques built around n-gram or Transformer language models. LM-based GEC was a common approach before machine translation–based GEC became popular (Dahlmeier and Ng 2012a; Lee and Lee 2014), but has experienced a recent resurgence with low-resource GEC and unsupervised GEC due to the effectiveness of large Transformer-based language models (Alikaniotis and Raheja 2019; Grundkiewicz and Junczys-Dowmunt 2019; Flachs, Lacroix, and Søgaard 2019). Recent advances have enabled Transformer-based language models to more adequately capture syntactic phenomena (Jawahar, Sagot, and Seddah 2019; Wei et al. 2021), making them capable GEC systems when little or no data is available. These systems can, however, become even more capable when exposed to a small amount of parallel data (Mita and Yanaka 2021).
3.5.1 Language Models as Discriminators
The traditional LM-based approach to GEC makes the assumption that low probability sentences are more likely to contain grammatical errors than high probability sentences, and so a GEC system must determine how to transform the former into the latter based on language model probabilities (Bryant and Briscoe 2018). Correction candidates can be generated from confusion sets (Dahlmeier and Ng 2011a; Bryant and Briscoe 2018), classification-based GEC models (Dahlmeier and Ng 2012a), or finite-state transducers (Stahlberg, Bryant, and Byrne 2019).
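As a minimal illustration of the discriminative use of a language model, the sketch below scores a hand-crafted confusion set of candidate corrections with GPT-2 and keeps the lowest-perplexity candidate. The candidates are hard-coded here purely for illustration; as noted above, real systems derive them from confusion sets, classifiers, or finite-state transducers.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(sentence):
    # Token-level perplexity of the sentence under the language model.
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss   # mean negative log-likelihood
    return torch.exp(loss).item()

# Hard-coded candidates built from a confusion set over prepositions.
source = "He is interested on music ."
candidates = [source,
              "He is interested in music .",
              "He is interested at music ."]

best = min(candidates, key=perplexity)
print(best)   # the LM is expected to prefer "interested in"
```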
Yasunaga, Leskovec, and Liang (2021) proposed an alternative method using the break-it-fix-it (BIFI) approach (Yasunaga and Liang 2021), with a language model as the critic (LM-critic). Specifically, BIFI trains a breaker (noising channel) and a fixer (GEC model) on multiple rounds of feedback loops. An initial fixer is used to correct erroneous text, then the sentence pairs are filtered using LM-critic. Using this filtered data, the breaker is trained and used to generate new synthetic data from a clean corpus. These new sentence pairs are then also filtered using LM-critic and subsequently used to train the fixer again. The BIFI approach can be used for unsupervised GEC by training the fixer on synthetic data.
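At a high level, one round of the BIFI loop can be sketched as below. The helper callables train_seq2seq, generate, and lm_critic_accepts are hypothetical placeholders standing in for ordinary seq2seq training, decoding, and the LM-critic filter, so this is a structural sketch of the feedback loop rather than a faithful re-implementation.

```python
# High-level sketch of one BIFI-style round (hypothetical helper functions).
def bifi_round(fixer, unlabeled_bad, clean_corpus,
               lm_critic_accepts, train_seq2seq, generate):
    # 1. Use the current fixer to correct real errorful sentences.
    fixed = [(bad, generate(fixer, bad)) for bad in unlabeled_bad]
    # 2. Keep only pairs the LM-critic judges as genuinely improved.
    fixer_data = [(b, f) for b, f in fixed if lm_critic_accepts(b, f)]
    # 3. Train a breaker on the inverse direction (good -> bad).
    breaker = train_seq2seq([(f, b) for b, f in fixer_data])
    # 4. Use the breaker to corrupt clean text, filtering again with the critic.
    broken = [(generate(breaker, good), good) for good in clean_corpus]
    breaker_data = [(b, g) for b, g in broken if lm_critic_accepts(b, g)]
    # 5. Retrain the fixer on both filtered sets.
    return train_seq2seq(fixer_data + breaker_data)
```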
3.5.2 Language Models as Generators
A more recent LM-based approach to GEC is to use a language model as a zero-shot or few-shot generator to generate a correction given a prompt and noisy input sentence. For example, given the prompt “Correct the grammatical errors in the following text:” followed by an input sentence, the language model is expected to generate a corrected form of the input sentence given the prompt as context. This approach has become possible largely due to the advent of Large Language Models (LLMs), such as GPT-2 (Radford et al. 2019), GPT-3 (Brown et al. 2020), OPT (Zhang et al. 2022a), and PaLM (Chowdhery et al. 2022), which have been trained on up to a trillion words and parameterized using tens or hundreds of billions of parameters. These models have furthermore been shown to be capable of generalizing to new unseen tasks or languages by being fine-tuned on a wide variety of other NLP tasks (Sanh et al. 2022; Wei et al. 2022; Muennighoff et al. 2022), and so it is possible, for the first time, to build a system that is capable of carrying out multilingual GEC without having been explicitly trained to do so.
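A zero-shot prompting setup of this kind can be sketched as follows. The checkpoint below is only a placeholder (a genuinely instruction-tuned LLM would be needed for usable corrections), and the prompt wording is an arbitrary choice; as discussed later, performance is sensitive to exactly how the instruction is phrased.

```python
from transformers import pipeline

# Placeholder checkpoint; an instruction-tuned LLM would be substituted in practice.
generator = pipeline("text-generation", model="gpt2")

sentence = "She go to school every day ."
prompt = ("Correct the grammatical errors in the following text:\n"
          f"{sentence}\nCorrected text:")

output = generator(prompt, max_new_tokens=32, do_sample=False)[0]["generated_text"]
# The correction is whatever the model generates after the prompt.
print(output[len(prompt):].strip())
```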
Despite their potential, however, there have not yet been any published studies that have formally benchmarked generative LLMs against any of the standard GEC test sets. Although a number of studies were beginning to appear at the time of final submission of this survey paper, most only evaluated LLM performance on a small sample (100 sentences) of the official test sets (Wu et al. 2023; Coyne and Sakaguchi 2023). These studies generally conclude, however, that LLMs have a tendency to overcorrect for fluency, which causes them to underperform on datasets that were developed for minimal corrections (Fang et al. 2023). We expect further investigation of this phenomenon in the coming year.
Regardless of the type of language model, the main advantage of language model–based approaches is that they only require unannotated monolingual data and so are much easier to extend to other languages than all other approaches. While discriminative LMs may not perform as well as state-of-the-art models and generative LLMs have not been formally benchmarked, LMs have nevertheless proven themselves capable and can theoretically correct all types of errors, including complex fluency errors. The main disadvantage of language model approaches, however, is that it can be hard to adequately constrain the model, and so models sometimes replace grammatical words with other words that simply occur more frequently in a given context. An additional challenge in generative LLM-based GEC is that prompt engineering is important (Liu et al. 2023) and output may vary depending on whether a system was asked to “correct” a grammatical error or “fix” a grammatical error (Coyne and Sakaguchi 2023). Ultimately, all LM-based approaches suffer from the limitation that probability is not grammaticality, and so rare words may be mistaken for errors.
4 Additional Techniques
While Section 3 introduced the core technologies underpinning modern GEC systems, a number of other techniques are also commonly applied to further boost performance. Several of these techniques are introduced in this section, including re-ranking, ensembling and system combination, multi-task learning, custom inference methods (e.g., iterative decoding), contextual GEC, and Generative Adversarial Networks (GANs).
4.1 Re-ranking
Machine translation–based (both SMT and NMT) systems can produce an n-best list of alternative corrections for a single sentence. This has led to much work on n-best list re-ranking, which is based on the observation that the best correction for a sentence is not always the most likely candidate produced by the system (i.e., n = 1), but may instead lie further down the list of the top n most likely candidates (Yuan, Briscoe, and Felice 2016; Mizumoto and Matsumoto 2016; Hoang, Chollampatt, and Ng 2016). As a separate post-processing step, candidates produced by an SMT-based or NMT-based GEC system can be re-ranked using a rich set of features that were not available to the decoder, so that better candidates can be selected as “optimal” corrections. During re-ranking, GEC-specific features can be incorporated easily without worrying about fine-grained model smoothing issues. In addition to the original model scores of the candidates, useful features include:
sentence fluency scores calculated from: LMs (Yuan, Briscoe, and Felice 2016; Chollampatt and Ng 2018a), neural error detection models (Yannakoudakis et al. 2017; Yuan et al. 2019), neural quality estimation models (Chollampatt and Ng 2018c), and BERT (Kaneko et al. 2019);
similarity measures like Levenshtein Distance (Yannakoudakis et al. 2017; Yuan et al. 2019) and edit operations (Chollampatt and Ng 2018a; Kaneko et al. 2019);
length-based features (Yuan, Briscoe, and Felice 2016);
right-to-left models (Grundkiewicz, Junczys-Dowmunt, and Heafield 2019; Kaneko et al. 2020);
syntactic features such as POS n-grams and dependency relations (Mizumoto and Matsumoto 2016);
error detection information that has been used in a binary setting (Yannakoudakis et al. 2017; Yuan et al. 2019), as well as a multiclass setting (Yuan et al. 2021).
N-best list reranking has traditionally been one of the simplest and most popular methods of boosting system performance. An alternative form of reranking is to collect all the edits from the N-best corrections and filter them using an edit-scorer (Sorokin 2022).
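As a simplified illustration, the sketch below re-ranks an n-best list with a linear combination of the decoder score and a few of the features listed above. The feature set, the weights, and the lm_log_prob scorer are illustrative assumptions; in practice the weights would be tuned on a development set (e.g., with MERT or PRO).

```python
def levenshtein(a, b):
    """Token-level edit distance between two token lists."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (x != y))
    return dp[-1]

def rerank(source, nbest, lm_log_prob, weights=(1.0, 1.0, -0.1, 0.05)):
    """Pick the best candidate from nbest = [(candidate, decoder_score), ...]."""
    w_dec, w_lm, w_lev, w_len = weights
    def score(item):
        cand, dec_score = item
        return (w_dec * dec_score
                + w_lm * lm_log_prob(cand)                            # fluency
                + w_lev * levenshtein(source.split(), cand.split())   # similarity to source
                + w_len * len(cand.split()))                          # length feature
    return max(nbest, key=score)[0]
```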
4.2 Ensembling and System Combination
Ensembling is a common technique in machine learning to combine the predictions of multiple individually trained models. Ensembles often generate better predictions than any of the single models that are combined (Opitz and Maclin 1999). In GEC, ensembling usually refers to averaging the probabilities of individually trained GEC models when predicting the next token in the sequence-to-sequence approach or the edit tag in the edit-based approach. GEC models that are combined into ensembles usually have similar properties with only slight variations, which can be the random seed (Stahlberg and Kumar 2021), the pre-trained model (Omelianchuk et al. 2020), or the architecture (Choe et al. 2019).
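The probability averaging described above amounts to the following at each decoding step; model_probs is a hypothetical interface returning a member model's next-token distribution over a shared vocabulary.

```python
import numpy as np

def ensemble_next_token(models, prefix_ids, model_probs):
    """Uniformly average next-token distributions from several models and take the argmax."""
    dists = [model_probs(m, prefix_ids) for m in models]   # one distribution per member
    avg = np.mean(dists, axis=0)                           # ensemble distribution
    return int(np.argmax(avg))
```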
On the other hand, different GEC approaches have different strengths and weaknesses. Susanto, Phandi, and Ng (2014) have shown that combining different GEC systems can produce a better system with higher accuracy. When combining systems that have substantial differences, training a system combination model is preferred over ensembles. A system combination model allows the combined system to properly integrate the strengths of the GEC systems and has been shown to produce better results than ensembles (Kantor et al. 2019; Qorib, Na, and Ng 2022). The combination model can be trained by learning the characteristics of the GEC systems (Kantor et al. 2019; Lin and Ng 2021; Qorib, Na, and Ng 2022) or by learning how to score a correction by supplying the model with examples of good and bad corrections for different kinds of student sentences (Sorokin 2022). Moreover, most system combination methods for GEC work in a black-box setup (Kantor et al. 2019; Lin and Ng 2021; Qorib, Na, and Ng 2022), only requiring the systems’ outputs without any access to the systems’ internals or their prediction probabilities. When the individual component systems are not different enough, encouraging the individual systems to be more diverse before combining them can also improve performance (Han and Ng 2021).
4.3 Multi-task Learning
Multi-task learning allows systems to use information from related tasks and learn from multiple objectives via shared representations, leading to performance gains on individual tasks. Rei and Yannakoudakis (2017) were the first to investigate the use of different auxiliary objectives for the task of error detection in learner writing through a neural sequence-labeling model. In addition to predicting the binary error labels (i.e., correct or incorrect), they experimented with also predicting specific error type information, including the learner’s L1, token frequency, POS tags, and dependency relations. Asano et al. (2019) utilized a similar approach in which their error correction model additionally estimated the learner’s language proficiency level and performed sentence-level error detection simultaneously. Token-level and sentence-level error detection have also both been explored as auxiliary objectives in NMT-based GEC (Yuan et al. 2019; Zhao et al. 2019), where systems have been trained to jointly generate a correction and predict whether the source sentence (or any token in it) is correct or incorrect. Labels for these auxiliary error detection tasks can be extracted automatically from existing datasets using automatic alignment tools like ERRANT (Bryant, Felice, and Briscoe 2017).
4.4 Custom Inference Methods
Various inference techniques have been proposed to improve the quality of system output or speed up inference time in GEC. The most common of these, which specifically improves output quality, is to apply multiple rounds of inference, known as iterative decoding or multi-turn decoding. Because the input and output of GEC are in the same language, the output of the model can be passed through the model again to produce a second iteration of output. The advantage of this is that the model gets a second chance to correct errors it might have missed during the first iteration. Lichtarge et al. (2019) thus proposed an iterative decoding algorithm that allows a model to make multiple incremental corrections. In each iteration, the model is allowed to generate a different output only if it has high confidence. This technique proved effective for GEC systems trained on noisy data such as Wikipedia edits, but not as effective on GEC systems trained on clean data. Ge, Wei, and Zhou (2018) proposed an alternative iterative decoding technique called fluency boost, in which the model performs multiple rounds of inference until a fluency score stops increasing, while Lai et al. (2022) proposed an iterative approach that investigated the effect of correcting different types of errors (missing, replacement, unnecessary words) in different orders. Iterative decoding is commonly used in sequence-labeling GEC systems, which cannot typically correct all errors in a single pass. In these systems, iterative decoding is applied until the model stops making changes to the output or the number of iterations reaches a limit (Awasthi et al. 2019; Omelianchuk et al. 2020; Tarnavskyi, Chernodub, and Omelianchuk 2022).
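The stopping criteria described above reduce to a loop like the following, where correct_once stands in for a single inference pass of any GEC model.

```python
def iterative_correct(sentence, correct_once, max_rounds=5):
    """Re-apply the model to its own output until it converges or a round limit is hit."""
    for _ in range(max_rounds):
        corrected = correct_once(sentence)
        if corrected == sentence:   # no further edits proposed: stop
            return sentence
        sentence = corrected
    return sentence
```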
Other inference techniques have been proposed to speed up inference time in GEC. As many tokens in GEC are copied from the input to the output, standard left-to-right inference can be inefficient. Chen et al. (2020a) thus proposed a two-step process that only performs correction on text spans that are predicted to contain grammatical errors. Specifically, their system first predicts erroneous spans using an erroneous span detection (ESD) model, and then corrects only the detected spans using an erroneous span correction (ESC) model. They reported reductions in inference time of almost 50% compared with a standard sequence-to-sequence model. In contrast, Sun et al. (2021) proposed a parallelization technique to speed up inference, aggressive decoding, which can be applied to any sequence-to-sequence model. Specifically, aggressive decoding first decodes as many tokens as possible in parallel and then only re-decodes tokens one-by-one at the point where the input and predictions differ (if any). When the input and predicted tokens start to match again, aggressive decoding again decodes the remainder in parallel until either the tokens no longer match or the end-of-sentence token is predicted. Since the input and output sequences in GEC are often very similar, most tokens can be decoded aggressively, yielding an almost ten-fold speedup in inference time.
4.5 Contextual GEC
Context provides valuable information that is crucial for correcting many types of grammatical errors and resolving inconsistencies. However, existing GEC systems typically perform correction at the sentence level; that is, each sentence is processed independently, and so cross-sentence information is ignored. These systems thus frequently fail to correct contextual errors, such as verb tense, pronoun, run-on sentence, and discourse errors, which typically rely on information outside the scope of a single sentence. Corrections proposed by such narrow systems are furthermore likely to be inconsistent throughout a paragraph or entire document.
Chollampatt, Wang, and Ng (2019) were the first to address this problem by adapting a CNN sequence-to-sequence model to be more context-aware. Specifically, they introduced an auxiliary encoder to encode the two previous sentences along with the input sentence and incorporated the encoding in the decoder via attention and gating mechanisms. Yuan and Bryant (2021) subsequently compared different architectures for capturing wider context in Transformer-based GEC and showed that local context is useful (≤ 2 sentences) but very long context (>2 sentences) is not necessary for improved performance.
Because human reference edits are not annotated for whether an error depends on local context or long range context, it is often difficult to evaluate the extent to which a context-aware system improves the correction of context-sensitive errors. Chollampatt, Wang, and Ng (2019) thus constructed a synthetic dataset of verb tense errors that required cross-sentence context for correction, and Yuan and Bryant (2021) proposed a document-level evaluation framework to address this problem.
4.6 Generative Adversarial Networks
Generative Adversarial Networks (GANs) (Goodfellow et al. 2014) are an approach to model training that makes use of both a generator, to generate some output, and a discriminator, to discriminate between real data and artificial output. In the context of GEC, Raheja and Alikaniotis (2020) were the first to apply this methodology to error correction, in which they trained a standard sequence-to-sequence Transformer model to generate grammatical sentences from parallel data (the generator) and a sentence classification model to discriminate between these generated output sentences and human-annotated reference sentences (the discriminator). During training, the models competed adversarially such that the generator learned to generate corrected sentences that are indistinguishable from the reference sentences (and thus fooled the discriminator), while the discriminator learned to identify the differences between real and generated sentences (and thus defeated the generator). This adversarial training process was ultimately shown to produce a better sequence-to-sequence model.
In addition to sequence-to-sequence generation, GANs have also been applied to sequence-labeling for GEC. In particular, Parnow, Li, and Zhao (2021) trained a generator to generate increasingly realistic errors (in the form of token-based edit labels) and a discriminator to differentiate between artificially generated edits and real human edits. They similarly reported improvements over a baseline that was not trained adversarially.
5 Data Augmentation
A common problem in GEC is that the largest publicly available high-quality parallel corpora only contain roughly 50k sentence pairs, and larger corpora, such as Lang-8, are noisy (Mita et al. 2020; Rothe et al. 2021). This data sparsity problem has motivated considerable research into synthetic data generation, especially in the context of resource-heavy NMT approaches, because synthetic data primarily requires a native monolingual source corpus rather than a labor-intensive manual annotation process. In this section, we introduce several different types of data augmentation methods, including rule-based noise injection and back-translation, but also noise reduction, which aims to improve the quality of existing datasets by removing/down-weighting noisy examples. It remains an open question how best to evaluate the quality of synthetic data (Htut and Tetreault 2019; White and Rozovskaya 2020). Kiyono et al. (2019) made an effort to compare noise injection and back-translation, but it is hard to compare synthetic data generation methods comprehensively and directly, so most research evaluates synthetic data indirectly in terms of its impact on downstream GEC performance. Data augmentation has nevertheless contributed greatly to GEC system improvement and has become a staple component of recent models.
5.1 Synthetic Data Generation
GEC is sometimes regarded as a low-resource machine translation task (Junczys-Dowmunt et al. 2018). With the dominance of neural network approaches, the need for more data grows as model size continues to increase. However, obtaining human annotations is expensive and difficult. Thus, techniques to generate synthetic parallel corpora from clean monolingual corpora have been intensely explored. A synthetic parallel corpus is generated by adding noise to a sentence and pairing it with the original sentence. The corrupted sentence is then regarded as a learner’s sentence (source) and the original clean sentence is regarded as the reference (target). There are many ways to generate synthetic sentences, and the dominant techniques usually fall under the category of noise injection or back-translation (Kiyono et al. 2019).
5.1.1 Noise Injection
One way to artificially introduce grammatical errors into a clean monolingual corpus is to perturb well-formed text so that it becomes grammatically incorrect. The perturbations can take the form of rule-based noising operations or error patterns observed in GEC parallel corpora.
Rule-based
The most intuitive way of adding noise to a clean corpus is to apply a series of perturbation operations based on pre-defined rules. Each rule is applied with a probability that can be set arbitrarily, empirically, or based on observations of available data. Ehsan and Faili (2013) apply one error to each sentence from pre-defined error templates that include omitting prepositions, repeating words, and so on. Lichtarge et al. (2019) introduce spelling errors into Wikipedia edit history by performing deletion, insertion, replacement, and transposition of characters. Zhao et al. (2019) apply a similar noising strategy but at the word level, that is, deleting, adding, shuffling, and replacing words in a sentence. Grundkiewicz, Junczys-Dowmunt, and Heafield (2019) combine both character-level and word-level noising, but limit word substitution to pairs from a confusion set built from an inverted spellchecker. Similarly, Xu et al. (2019) also combine both approaches but with a more complex word substitution strategy that makes use of POS tags. The rule-based injection technique can also be applied dynamically during training to increase the error rate in a parallel corpus instead of creating additional training data (Zhao and Wang 2020).
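A typical rule-based corruption function follows the pattern sketched below. The operation probabilities and the tiny confusion set are illustrative assumptions; in practice they would be tuned against error statistics from real learner corpora, as several of the works above do.

```python
import random

# Toy confusion set; "" represents dropping the article entirely.
CONFUSION = {"the": ["a", "an", ""], "in": ["on", "at"], "is": ["are", "was"]}

def corrupt(sentence, p=0.15, seed=None):
    """Inject word-level deletion, duplication, swap, and replacement errors."""
    rng = random.Random(seed)
    noisy = []
    for tok in sentence.split():
        r = rng.random()
        if r < p * 0.25:                          # delete the token
            continue
        elif r < p * 0.5:                         # duplicate (spurious insertion)
            noisy.extend([tok, tok])
        elif r < p * 0.75 and noisy:              # swap with the previous token
            noisy.insert(len(noisy) - 1, tok)
        elif r < p and tok.lower() in CONFUSION:  # replace from the confusion set
            noisy.append(rng.choice(CONFUSION[tok.lower()]))
        else:
            noisy.append(tok)
    return " ".join(t for t in noisy if t)

clean = "The cat is sleeping in the garden ."
print(corrupt(clean, seed=0), "->", clean)   # (noisy source, clean target) pair
```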
Error Patterns
Another way of generating synthetic data is through injecting errors that frequently occur in GEC parallel corpora. In this way, the errors are more similar to the ones that humans usually make. Rozovskaya and Roth (2010b) proposed three different methods of injecting article errors, based on the error distribution in English as a Second Language data. They proposed adding article errors based on the distribution of articles in a text before correction, the distribution of articles in the corrected text, and the distribution of article corrections themselves. Felice and Yuan (2014a) later improved the method by taking into consideration the morphology, POS tag, semantic concept, and word sense information of a text when generating the artificial errors. Rei et al. (2017) further extended it to all types of errors. Another direction of emulating human errors is by extracting the correction patterns from GEC parallel corpora and applying the inverse of those corrections on grammatically correct sentences, as done by Yuan and Felice (2013) using the corrections from the NUCLE corpus and by Choe et al. (2019) using the corrections from the W&I training data. The correction patterns are extracted both in lexical form (an → the) and POS (NN → NNS).
5.1.2 Back-translation
Human errors can be emulated in a more automated and dynamic way via a noisy channel model. The noisy channel model is trained on the inverse of a GEC parallel corpus, treating the learner’s sentence as the target and the reference sentence as the source. This technique is commonly called back-translation. It was originally proposed for generating additional data in machine translation (Sennrich, Haddow, and Birch 2016), but is also directly applicable to GEC. Rei et al. (2017) were the first to apply back-translation to grammatical error detection (GED) and Xie et al. (2018) were the first to apply it to GEC. Yuan et al. (2019) add a form of quality control to Rei et al. (2017) based on language model probabilities in an effort to make sure that the generated synthetic sentences are less probable (and hence hopefully less grammatical) than the original input sentences. Comparing the rule-based and back-translation strategies, Kiyono et al. (2019) report that back-translation performs better empirically. They also compare back-translation with a noisy beam-search strategy (Xie et al. 2018) and back-translation with a sampling strategy (Edunov et al. 2018), and report that both achieve competitive performance. Koyama et al. (2021) furthermore compare the effect of using different architectures (e.g., CNN, LSTM, Transformer) for back-translation, and find that interpolating multiple generation systems tends to produce better synthetic data for training a GEC system. Another variant of back-translation was proposed by Stahlberg and Kumar (2021) to generate more complex edits. They found that generating a sequence of edits using Seq2Edit (Stahlberg and Kumar 2020) works better than generating the corrupted sentences directly. They also reported that back-translation with sampling worked better than beam search in their experiments.
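In its simplest form, back-translation for GEC only requires reversing the direction of the parallel corpus before training the noising model and then sampling noisy outputs over clean text. The sketch below shows this data flow with hypothetical train_seq2seq and sample helpers standing in for any seq2seq toolkit.

```python
def build_backtranslation_data(parallel, clean_corpus, train_seq2seq, sample):
    """parallel: list of (learner_sentence, reference_sentence) pairs."""
    # 1. Reverse the corpus: learn to map clean text to errorful text.
    reversed_pairs = [(ref, learner) for learner, ref in parallel]
    noiser = train_seq2seq(reversed_pairs)
    # 2. Sample noisy "translations" of clean monolingual sentences.
    #    Sampling (rather than beam search) keeps the synthetic errors diverse.
    synthetic = [(sample(noiser, clean), clean) for clean in clean_corpus]
    return synthetic   # (noisy source, clean target) pairs for GEC training
```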
5.1.3 Round-trip Translation
A less popular alternative to back-translation is round-trip translation, which generates synthetic sentence pairs via a bridge language (e.g., English-Chinese-English). The assumption is that the MT system will make translation errors and so the output via the bridge language will be noisy in relation to the input. This strategy was used by Madnani, Tetreault, and Chodorow (2012) and Lichtarge et al. (2019), both of whom also explored the effect of using different bridge languages. Zhou et al. (2020) explore a similar technique, except that they feed a bridge-language sentence into both a low-quality and a high-quality translation system (namely, SMT vs. NMT), treating the output of the former as an ungrammatical noisy sentence and the output of the latter as the reference.
5.2 Augmenting Official Datasets
Besides generating synthetic data to address the data sparsity problem in GEC, other studies focus on augmenting official datasets, via noise reduction or model enhancement.
Noise reduction aims to reduce the impact of wrong corrections in the official GEC datasets. One direction focuses on correcting noisy sentences. Mita et al. (2020) and Rothe et al. (2021) achieve this by incorporating a well-trained GEC model to reduce wrong corrections. The other direction attempts to down-weight noisy sentences. Lichtarge, Alberti, and Kumar (2020) introduce an offline re-weighting method to score each training sentence based on delta-log perplexity, Δppl, which measures the model’s log perplexity difference between checkpoints for a single sentence. Sentences with lower Δppl are preferred and assigned a higher weight during training.
Model enhancement augments official datasets to address the model’s weakness. Parnow, Li, and Zhao (2021) aim to enhance performance by reducing the error density mismatch between training and inference. They use a GAN (Goodfellow et al. 2014) to produce an ungrammatical sentence that could better represent the error density at inference time. Lai et al. (2022) also address the mismatch between training and inference, but specific to multi-round inference. They propose additional training stages that make the model consider edit type interdependence when predicting the corrections. Cao, Yang, and Ng (2021) aim to enhance model performance in low-error density domains. The augmented sentences are generated by beam search to capture wrong corrections that the model tends to make. Supervised contrastive learning (Chen et al. 2020b) is then applied to enhance model performance. Cao, Yang, and Ng (2023) use augmented sentences generated during beam search to address the exposure bias problem in seq2seq GEC models. A dynamic data reweighting method through reinforcement learning is used to select an optimal sampling strategy for different beam search candidates.
6 Evaluation
A core component of any NLP system is the ability to measure model performance. This section hence first introduces the most commonly used evaluation metrics in GEC, namely, the MaxMatch (M2) scorer (Dahlmeier and Ng 2012b), ERRANT (Bryant, Felice, and Briscoe 2017; Felice, Bryant, and Briscoe 2016), and GLEU (Napoles et al. 2015, 2016), as well as other reference-based and reference-less metrics that have been proposed. It next discusses the problem of metric reliability, particularly in relation to correlation with human judgments, and explains why it is difficult to draw any robust conclusions. The section concludes with a discussion of best practices in GEC evaluation, including defining standard experimental settings and highlighting their limitations. To date, almost all evaluation in GEC has been carried out at the sentence level.
6.1 MaxMatch
One of the most prevalent evaluation methods used in current GEC research is the MaxMatch (M2) scorer (Dahlmeier and Ng 2012b), which calculates an Fβ-score (van Rijsbergen 1979). Specifically, the M2 scorer is a reference-based metric that compares system hypothesis edits against human-annotated reference edits and counts a True Positive (TP) if a hypothesis edit matches a reference edit, a False Positive (FP) if a hypothesis edit does not match any reference edit, and a False Negative (FN) if a reference edit does not match any hypothesis edit. For example, if the reference edit is [has → have] and the hypothesis proposes [has → have], this counts as a TP; if the hypothesis instead proposes [has → had], this counts as both an FP and an FN; and if the hypothesis leaves the text unchanged, the unmatched reference edit counts as an FN.
The total number of TPs, FPs, and FNs for a dataset can then be used to calculate Precision (P; Equation (3)) and Recall (R; Equation (4)), which respectively denote the proportion of hypothesis edits that were correct and the proportion of reference edits that were found in the hypothesis edits, and which in turn can be used to calculate the Fβ-score (Equation (5)). In current GEC research, it is common practice to use β = 0.5, first introduced in Ng et al. (2014), which weights precision twice as much as recall, because it is generally considered more important for a GEC system to be precise than to necessarily correct all errors.
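For reference, these quantities take the standard form referred to above as Equations (3)–(5):

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F_{\beta} = \frac{(1 + \beta^{2}) \cdot P \cdot R}{\beta^{2} \cdot P + R}
```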
One issue with using edit overlap to measure performance is that there is often more than one way to define an edit. For example, the edit [has eating→was eaten] can also be realized as [has→was] and [eating→eaten]. If the hypothesis combines them but the reference does not, the edit will not be counted as a TP even though it produces the same valid correction, and system performance is consequently underestimated.
The innovation of the M2 scorer is that it uses a Levenshtein alignment (Levenshtein 1966) between the original text and a system hypothesis to dynamically explore the different ways of combining edits such that the hypothesis edits maximally match the reference edits. As such, it overcomes a limitation of the previous scorer used in the HOO shared tasks which could return erroneous scores. Whenever there is more than one set of reference edits for a test sentence, the M2 scorer tries each set in turn and chooses the one that leads to the best performance for that test sentence.
6.2 ERRANT
The ERRANT scorer (Bryant, Felice, and Briscoe 2017) is similar to the M2 scorer, in that it is a reference-based metric that measures performance in terms of an edit-based F-score, but differs primarily in that it is also able to calculate error type scores. Specifically, unlike the M2 scorer, it uses a linguistically enhanced Damerau-Levenshtein alignment algorithm to extract edits from the hypothesis text (Felice, Bryant, and Briscoe 2016), and then classifies them according to a rule-based error type framework. This facilitates the calculation of F-scores for each error type rather than just overall, which can be invaluable for a detailed system analysis. For example, System A might outperform System B overall, but System B might outperform System A on certain error types, and this information can be used to improve System A.
ERRANT was the first scorer to be able to evaluate GEC systems in terms of error types and is moreover able to do so at three different levels of granularity:
Edit Operation (3 labels): Missing, Replacement, Unnecessary
Main Type (25 labels): e.g., Noun, Spelling, Verb Tense
Full Type (55 labels): e.g., Missing Noun, Replacement Noun, Unnecessary Noun
It is also able to carry out this analysis in terms of both error detection and correction. ERRANT currently only supports English, but other researchers have independently extended it for German (Boyd 2018), Greek (Korre, Chatzipanagiotou, and Pavlopoulos 2021), Arabic (Belkebir and Habash 2021), and Czech (Náplava et al. 2022).
6.3 GLEU
Like M2 and ERRANT, GLEU (Napoles et al. 2015, 2016) is also a reference-based metric, except that it does not require explicit edit annotations but only corrected reference sentences. It was inspired by the BLEU score (Papineni et al. 2002) commonly used in machine translation and was motivated by the fact that human-annotated edit spans are somewhat arbitrary and time-consuming to collect. The main intuition behind GLEU is that it rewards hypothesis n-grams that overlap with the reference but not the original text, and penalizes hypothesis n-grams that overlap with the original text but not the reference. It is important to be aware that GLEU is often attributed to Napoles et al. (2015), but is actually implemented according to the revised formulation of Napoles et al. (2016). The revised formulation is calculated as follows.
Like the BLEU score, GLEU also has a Brevity Penalty (BP) to penalize hypotheses that are shorter than the references (Equation (7)), where lh denotes the total number of tokens in the hypothesis corpus and lr denotes the total number of tokens in the sampled reference corpus. It is important to note that when there is more than one reference sentence, GLEU iteratively selects one at random and averages the score over 500 iterations. GLEU is finally calculated as in Equation (8).
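In BLEU-style notation, the brevity penalty and the final score take the form below, where p'_n denotes the modified n-gram precision described above; the uniform weights over n-gram orders up to four follow the usual BLEU convention and are assumed here rather than quoted from the original papers.

```latex
BP =
\begin{cases}
1 & \text{if } l_h > l_r \\
e^{\,1 - l_r / l_h} & \text{if } l_h \le l_r
\end{cases}
\qquad
GLEU = BP \cdot \exp\left( \sum_{n=1}^{4} \frac{1}{4} \log p'_n \right)
```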
6.4 Other Metrics
In addition to M2, ERRANT, and GLEU, other metrics have also been proposed in GEC. Some of these are reference-based, that is, they require human-annotated target sentences, while others are reference-less, that is, they do not require human-annotated target sentences. This section briefly introduces metrics of both types.
6.4.1 Reference-based Metrics
I-measure
The I-measure (Felice and Briscoe 2015) was designed to overcome certain shortcomings of the M2 scorer, for example its inability to differentiate between a bad system (TP=0, FP>0) and a do-nothing system (TP=0, FP=0), both of which result in F=0, and instead measures system performance in terms of relative textual Improvement. The I-measure is calculated by carrying out a 3-way alignment between the original, hypothesis, and reference texts and classifying each token according to an extended version of the Writer-Annotator-System (WAS) evaluation scheme (Chodorow et al. 2012). This ultimately enables the calculation of accuracy, which Felice and Briscoe (2015) modify to weight TPs and FPs differently to more intuitively reward or punish a system. Having calculated a weighted accuracy score for a system, a baseline weighted accuracy score is computed in the same manner using a copy of the original text as the hypothesis. The difference between these scores is then normalized to fall between −1 and 1, where I < 0 indicates text degradation and I > 0 indicates text improvement.
GMEG
The GMEG metric (Napoles, Nădejde, and Tetreault 2019) is an ensemble metric that was designed to correlate with human judgments on three different datasets. It was motivated by the observation that different metrics correlate very differently with human judgments in different domains, and so a better metric would be more consistent. As an ensemble metric, GMEG depends on features (e.g., precision and recall) from several other metrics, including M2, ERRANT, GLEU, and the I-measure (73 features in total). The authors then use these features to train a ridge regression model that was optimized to predict the human scores for different systems.
GoToScorer
The GoToScorer (Gotou et al. 2020) was motivated by the observation that some errors are more difficult to correct than others, yet all metrics treat them equally. The GoToScorer hence models error difficulty by weighting edits according to how many different systems were able to correct them; for example, edits that were successfully corrected by all systems yield a smaller reward than those successfully corrected by fewer systems. Although this methodology confirmed the intuition that some error types are easier to correct than others (e.g., spelling errors are easier than synonym errors), one disadvantage of the approach is that the difficulty weights depend entirely on the type and number of systems involved. Consequently, results do not generalize well and error difficulty (or gravity) remains an unsolved problem.
SERCL/SERRANT
SErCl (Choshen et al. 2020) is not a metric per se, but rather a method of automatically classifying grammatical errors by their syntactic properties using the Universal Dependencies formalism (Nivre et al. 2020). It is hence similar to ERRANT except it can more easily support other languages. The main disadvantage of SErCl is that it is not always meaningful to classify errors entirely based on their syntactic properties (e.g., spelling and orthography errors), and some error types are not very informative (e.g., “VERB→ADJ”). SERRANT (Choshen et al. 2021) is hence a compromise that attempts to combine the advantages of both SErCl and ERRANT.
PT-M2
The pretraining-based MaxMatch (PT-M2) metric (Gong et al. 2022) is a hybrid metric that combines traditional edit-based metrics, such as M2, with recent pretraining-based metrics, such as BERTScore (Zhang et al. 2020). The main advantage of pretraining-based metrics over edit-based metrics is that they are more capable of measuring the semantic similarity between pairs of sentences, rather than just comparing edits. Since Gong et al. (2022) found that off-the-shelf pretraining metrics correlated poorly with human judgments on GEC at the sentence level, they instead proposed measuring performance at the edit level. This approach ultimately produced the highest correlation with human judgments on the CoNLL-2014 test set to date, but should be considered with caution, as Hanna and Bojar (2021) also highlight some of the limitations of pretraining metrics and cite sources that claim correlation with human judgments may not be the best way to evaluate a metric (see Section 6.5).
6.4.2 Reference-less Metrics
GBMs
The first work to explore the idea of a reference-less metric for GEC (Napoles, Sakaguchi, and Tetreault 2016) was inspired by similar work on quality estimation in machine translation (e.g., Specia et al. 2020). Specifically, the authors proposed three Grammaticality-Based Metrics (GBMs) that either use a benchmark GEC system to count the errors in the output produced by other GEC systems or else predict a grammaticality score using a pretrained ridge regression model (Heilman et al. 2014). The main limitation of these metrics is that they (i) require an existing GEC system to evaluate other GEC systems and (ii) are insensitive to changes in meaning. The authors thus proposed interpolating reference-less metrics with other reference-based metrics.
GFM
Asano, Mizumoto, and Inui (2017) extended the work on GBMs by introducing three reference-less metrics for Grammaticality, Fluency, and Meaning preservation (GFM). Specifically, the Grammaticality metric combines Napoles, Sakaguchi, and Tetreault’s (2016) GBMs into a single model, the Fluency metric computes a score using a language model, and the Meaning preservation metric computes a score using the METEOR metric from machine translation (Denkowski and Lavie 2014). A weighted linear sum of the three scores is then used as the final score. The main weaknesses of the GFM metric are that the Grammaticality and Fluency metrics suffer from the same limitations as GBMs, and the Meaning preservation metric only models shallow text similarity in terms of overlapping content words.
USIM
The USim metric (Choshen and Abend 2018c) was motivated by the fact that no other metric takes deep semantic similarity into account and it is possible that a GEC system might change the intended meaning of the original text—for example, by inserting/deleting “not” or replacing a content word with an incorrect synonym. It is calculated by first automatically annotating the original and hypothesis texts as semantic graphs using the UCCA semantic scheme (Abend and Rappoport 2013) and then measuring the overlap between the graphs (in terms of matching edges) as an F-score. USim was thus designed to operate as a complementary metric to other metrics.
SOME
Sub-metrics Optimized for Manual Evaluation (SOME) (Yoshimura et al. 2020) is an extension of GFM that was designed to optimize each Grammaticality, Fluency, and Meaning preservation metric to more closely correlate with human judgments. The authors achieved this by annotating the system output of five recent systems on a 5-point scale for each metric and then fine-tuning BERT (Devlin et al. 2019) to predict these human scores. This differs from GFM in that GFM was fine-tuned to predict the human ranking of different systems rather than explicit human scores. While the authors found SOME correlates more strongly with human judgments than GFM, both metrics nevertheless suffer from the same limitations.
Scribendi Score
The Scribendi Score (Islam and Magnani 2021) was designed to be simpler than other reference-less metrics in that it requires neither an existing GEC system nor fine-tuning. Instead, it calculates an absolute score (1=positive, −1=negative, 0=no change) from a combination of language model perplexity (GPT2: Radford et al. 2019) and sorted token/Levenshtein distance ratios, which respectively ensure that (i) the corrected sentence is more probable than the original and (ii) both sentences are not significantly different from each other. While it is intuitive that these scores correlate with the grammaticality of a sentence, they are not, however, a robust way of evaluating a GEC system. For example, the sentence “I saw the cat” is more probable than “I saw a cat” in GPT2 (160.8 vs 156.4), and both sentences are moreover very similar, yet we would not want to always reward this as a valid correction since both sentences are grammatical. We observe the same effect in “I ate the cake.” (130.2) vs. “I ate the pie.” (230.7) and so conclude that the Scribendi Score is highly likely to erroneously reward false positives.
IMPARA
The Impact-based Metric for GEC using Parallel data (IMPARA) (Maeda, Kaneko, and Okazaki 2022) is a hybrid reference-based/reference-less metric that requires parallel data to train an edit-based quality estimation and semantic similarity model, but can be used as a reference-less metric after training. It is sensitive to the corpus it is trained on (i.e., it does not generalize well to unseen domains) but shows comparable or better performance to SOME in terms of correlation with human judgments. Its main advantage is that it only requires parallel data for training (i.e., not system output or human judgments), but its main disadvantage is that IMPARA scores are not currently interpretable by humans.
6.5 Metric Reliability
Given the number of metrics that have been proposed, it is natural to wonder which metric is best. This is not straightforward to answer, however, as all metrics have different strengths and weaknesses. There has nevertheless been a great deal of work based on the assumption that the “best” metric is the one that correlates most closely with ground-truth human judgments.
With this in mind, the first work to compare metric performance with human judgments was by Napoles et al. (2015) and Grundkiewicz, Junczys-Dowmunt, and Gillian (2015), who independently collected human ratings for the 13 system outputs from the CoNLL-2014 shared task (including the unchanged original text) using the Appraise evaluation framework (Federmann 2010) commonly used in MT. This framework essentially asks humans to rank randomly chosen samples of 5 system outputs (ties are permitted) in order to build up a collection of pairwise judgments that can be used to extrapolate an overall system ranking. A metric can then be judged in terms of how well it correlates with this extrapolated ranking. The judgments collected by Grundkiewicz, Junczys-Dowmunt, and Gillian (2015) in particular proved especially influential (their dataset was much larger than Napoles et al. [2015]) and were variously used to justify GLEU as a better metric than M2 (Napoles et al. 2015; Napoles, Sakaguchi, and Tetreault 2016; Sakaguchi et al. 2016) and motivate almost all reference-less metrics to date (except USim).
Unfortunately, however, this methodology was later found to be problematic and many of the conclusions drawn using these datasets were thrown into doubt. Notable observations included:
The correlation coefficients reported by Napoles et al. (2015) and Grundkiewicz, Junczys-Dowmunt, and Gillian (2015) were very different even though they essentially carried out the same experiment (albeit on different samples) (Choshen and Abend 2018a).
This method of human evaluation was abandoned in machine translation due to unreliability (Choshen and Abend 2018a; Graham, Baldwin, and Mathur 2015).
Chollampatt and Ng (2018b) found no evidence of GLEU being a better metric than M2 for ranking systems.
Choshen and Abend (2018a) surmise that one of the reasons these metric correlation experiments proved unreliable is that rating sentences for grammaticality is a highly subjective task which often shows very low inter-annotator agreement (IAA); for example, it is difficult to determine whether a sentence containing one major error should be considered “more grammatical” than a sentence containing two minor errors.
Napoles, Nădejde, and Tetreault (2019) nevertheless carried out a follow-up study that not only used a continuous scale to judge sentences (rather than rank them) (Sakaguchi and Van Durme 2018), but also collected judgments on all pairs of sentences to overcome sampling bias. They furthermore reported results on different datasets from different domains, rather than just CoNLL-2014, in an effort to determine the most generalizable metric. Their results, partially recreated in Table 7, show that the choice of dataset does indeed have an effect on metric performance, most likely because different error type distributions are judged inconsistently by humans. In fact, although Napoles, Nădejde, and Tetreault (2019) reported very high IAA at the corpus level (0.9–0.99 Pearson/Spearman), IAA at the sentence level was still low to average (0.3–0.6 Pearson/Spearman).
Metric | FCE r | FCE ρ | Wiki r | Wiki ρ | Yahoo r | Yahoo ρ |
---|---|---|---|---|---|---|
ERRANT F0.5 | 0.919 | 0.887 | 0.401 | 0.555 | 0.532 | 0.601 |
GLEU | 0.838 | 0.813 | 0.426 | 0.538 | 0.740 | 0.775 |
I-measure | 0.819 | 0.839 | 0.854 | 0.875 | 0.915 | 0.900 |
M2 F0.5 | 0.860 | 0.849 | 0.346 | 0.552 | 0.580 | 0.699 |
Table 8: Recent state-of-the-art GEC systems, showing the synthetic data, training corpora, pre-trained models, architectures, and performance-boosting techniques used, together with scores on the CoNLL-2014 (M2 F0.5) and BEA-2019 (ERRANT F0.5) test sets.

| System | Synthetic Sents | Corpora | Pre-trained Model | Architecture | Techniques | CoNLL14 M2 | BEA19 ERRANT |
|---|---|---|---|---|---|---|---|
Qorib, Na, and Ng (2022) | – | W (dev) | Various1 | T5-large, RoBERTa-base, XLNet-base, Transformer-big | SC | 69.5 | 79.9 |
Lai et al. (2022) | 9m | N+F+L+W | RoBERTa, XLNet | RoBERTa-base, XLNet-base | ENS+PRT+MTD | 67.0 | 77.9 |
Sorokin (2022) | 9m | cL+N+F+W | RoBERTa | RoBERTa-large | RE+MTD | 64.0 | 77.1 |
Tarnavskyi, Chernodub, and Omelianchuk (2022) | – | N+F+L+W | RoBERTa, XLNet, DeBERTa | RoBERTa-large, XLNet-large, DeBERTa-large | VT+PRT+MTD | 65.3 | 76.1 |
Rothe et al. (2021) | – | cL | T5-xxl | T5-xxl | – | 68.9 | 75.9 |
Sun and Wang (2022) | 300m | N+F+L+W | BART | BART (12+2) | PRT | – | 75.0 |
Stahlberg and Kumar (2021) | 546m | F+L+W | – | Transformer-big | ENS | 68.3 | 74.9 |
Cao, Yang, and Ng (2023) | 200m | cL+N+F+W | – | Transformer-big | ENS | 68.5 | 74.8 |
Omelianchuk et al. (2020) | 9m | N+F+L+W | BERT, RoBERTa, XLNet | BERT-base, RoBERTa-base, XLNet-base | ENS+PRT+MTD | 66.5 | 73.7 |
Lichtarge, Alberti, and Kumar (2020) | 340m | F+L+W | – | Transformer-big | ENS | 66.8 | 73.0 |
Zhang et al. (2022c) | – | cL+N+F+W | BART | BART-large | – | 67.6 | 72.9 |
Sun et al. (2021) | 300m | N+F+L+W | BART | BART (12+2) | – | 66.4 | 72.9 |
Yasunaga, Leskovec, and Liang (2021) | 9m | N+F+L+W | XLNet | XLNet-base | PRT+MTD | 65.8 | 72.9 |
Parnow, Li, and Zhao (2021) | 9m | N+F+L+W | XLNet | XLNet-base | PRT+MTD | 65.7 | 72.8 |
Yuan et al. (2021) | – | N+F+L+W +CLC | ELECTRA | Multi-encoder, Transformer-base | RE | 63.5 | 70.6 |
Stahlberg and Kumar (2020) | 346m | F+L+W | – | Seq2Edits (modified Transformer-big) | ENS+RE | 62.7 | 70.5 |
Kaneko et al. (2020) | 70m | N+F+L+W | – | Transformer-big | ENS+RE | 65.2 | 69.8 |
Mita et al. (2020) | 70m | N+F+L+W | – | Transformer-big | ENS+RE | 63.1 | 67.8 |
Chen et al. (2020a) | 260m | N+F+L+W | RoBERTa | Transformer-big | – | 61.0 | 66.9 |
Katsumata and Komachi (2020) | – | N+F+L+W | BART | BART-large | ENS | 63.0 | 66.1 |
Ultimately, although ground-truth human judgments may be an intuitive way to benchmark metric performance, they are also highly subjective and should be treated with caution. Nothing illustrates this better than the conclusions drawn about the I-measure, which was initially found to have a weak negative correlation with human judgments (Napoles et al. 2015; Grundkiewicz, Junczys-Dowmunt, and Gillian 2015; Sakaguchi et al. 2016), was subsequently found to correlate well at the sentence level (Napoles, Sakaguchi, and Tetreault 2016), and was finally considered the best single metric across multiple domains (Napoles, Nădejde, and Tetreault 2019). Reliable methods of evaluating automatic metrics thus remain an active area of research.
6.6 Evaluation Best Practices
A common pitfall for new researchers in GEC concerns which metric to use with which dataset; for example, the M2 scorer with JFLEG, or the I-measure with BEA-2019. While there is no technical reason a given metric cannot be applied to a given dataset, in practice the most popular GEC test sets are almost always evaluated with a single, specific metric:
CoNLL-2014 is evaluated with the M2 scorer
JFLEG is evaluated with GLEU
BEA-2019 is evaluated with ERRANT
This choice of experimental setup is largely motivated by historical reasons (e.g., GLEU and ERRANT did not exist at the time of CoNLL-2014), but it has nevertheless persisted in order to ensure fair comparison with subsequent work. One particularly common mistake is to evaluate CoNLL-2014 with ERRANT or BEA-2019 with the M2 scorer because both metrics return an F-score, yet M2 F0.5 is not equivalent to ERRANT F0.5 (Bryant, Felice, and Briscoe 2017). It is thus imperative that a dataset be evaluated with its associated metric in order to facilitate a meaningful comparison.
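To be clear, the two scorers share the same F0.5 formula; the discrepancy arises from how each tool extracts and aligns the edits that are counted as true positives, false positives, and false negatives. A minimal sketch of the shared computation:

```python
def f_beta(tp, fp, fn, beta=0.5):
    """F-beta over edit counts; beta=0.5 weights precision twice as heavily
    as recall, the standard setting in GEC evaluation."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    if p + r == 0.0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)

# The formula is shared, but the M2 scorer and ERRANT will generally produce
# different TP/FP/FN counts for the same hypothesis because they extract and
# align edits differently, so their F0.5 scores are not interchangeable.
print(round(f_beta(tp=40, fp=20, fn=60), 3))   # precision 0.667, recall 0.400 -> 0.588
```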
6.6.1 Caveats
Despite this convention, it is also important to highlight the limitations of this setup, as it is not always desirable to optimize different systems for different test sets using different metrics. Instead, we should remember that the ultimate goal of GEC is to build systems that generalize well, and so we should not place too much emphasis on specific experimental configurations. It is with this in mind that Mita et al. (2019) recommend evaluating on multiple corpora in order to reveal any systematic biases toward particular domains or user demographics, whereas Napoles, Nădejde, and Tetreault (2019) recommend using their trained metric, which was designed to be less sensitive to dataset biases. These approaches add greater confidence that a model is versatile and does not overfit to a specific type of input.
6.6.2 Recommendations
In light of the confusion surrounding different experimental setups, we make the following recommendations for ensuring a meaningful comparison in English GEC evaluation. This is not an exhaustive list, but it summarizes the current standard setups that facilitate the most informative comparison with previous work.
Evaluate on the BEA-2019 test set using ERRANT.
The BEA-2019 test set is one of the most diverse test sets available, containing texts from the full range of learner backgrounds and ability levels on a wide range of topics. This makes it a good benchmark for system robustness and generalizability. It is also the official test set of the most recent shared task.
Evaluate on the CoNLL-2014 test set using the M2 scorer.
The CoNLL-2014 test set is one of the best-known test sets and has been widely used to benchmark progress in the field; it is thus an important indicator of system performance. It is also the official test set of the second most recent shared task.
Evaluate on the GMEG and/or CWEB test sets using ERRANT.
One of the main limitations of the BEA-2019 and CoNLL-2014 test sets is that they mainly represent non-native language learners. It can therefore be beneficial to evaluate on native speaker errors in GMEG and CWEB to obtain a more complete picture of system generalizability.
Evaluate on JFLEG using GLEU.
The main reason to evaluate on JFLEG is to test systems on more complex fluency edits rather than minimal edits. Not all edits in JFLEG are fluency edits, however, and the test set is very small, so researchers have seldom reported GLEU on JFLEG in recent years (Gong et al. 2022).
Ultimately, robust evaluation is rarely as straightforward as directly comparing one number against another, and it is important to consider, for example, whether a model has been trained/fine-tuned on in-domain data, optimized for a specific metric, or only evaluated on a specific target test set. Each of these factors affects how a score should be interpreted, especially in relation to previous work, and there is a real danger of rewarding a highly optimized, specialized system over a lower-scoring but more versatile system that may actually be more desirable.
7 System Comparison
In this section, we compare the most recent state-of-the-art systems from the past couple of years and comment on the innovations that led to them performing better than previous work. The full list of systems we compare is shown in Table 8. For a comparison of systems between 2014 and 2020, we refer the reader to Wang et al. (2021, Table 7).
7.1 System Description
We first note that many of the systems in Table 8 are extensions of three other systems: Omelianchuk et al. (2020), Sun et al. (2021), and Kiyono et al. (2019). Specifically, Omelianchuk et al. (2020) built a sequence tagging model (Section 3.4) using a pre-trained language model (e.g., BERT) and 9 million synthetic sentence pairs; Sun et al. (2021) used a rule-based approach to generate 300 million synthetic sentence pairs (Section 5.1.1) to train a modified BART model with 12 encoder layers and 2 decoder layers; and Kiyono et al. (2019) used 70 million synthetic sentence pairs generated through back-translation (Section 5.1.2) to train a Transformer-big model.
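As a rough illustration of what rule-based synthetic data generation involves, the sketch below corrupts a clean sentence with simple confusion-set substitutions, deletions, and insertions to produce an (errorful source, clean target) training pair. The rules, probabilities, and noise types are invented for illustration and are not those of Sun et al. (2021):

```python
import random

# Toy confusion sets for common function-word errors; real rule-based
# generators use much larger, linguistically informed sets plus
# character-level and word-order noise.
CONFUSION = {
    "the": ["a", ""], "a": ["the", ""], "in": ["on", "at"],
    "is": ["are", "was"], "are": ["is", "were"],
}

def corrupt(tokens, p_sub=0.3, p_ins=0.05, seed=None):
    """Return a noised copy of `tokens`; the (corrupted, clean) pair then
    serves as one synthetic training example (errorful source, clean target)."""
    rng = random.Random(seed)
    out = []
    for tok in tokens:
        if tok.lower() in CONFUSION and rng.random() < p_sub:
            repl = rng.choice(CONFUSION[tok.lower()])
            if repl:                      # an empty replacement means deletion
                out.append(repl)
        else:
            out.append(tok)
        if rng.random() < p_ins:          # occasionally insert a spurious word
            out.append(rng.choice(["the", "to", "of"]))
    return out

clean = "the children are playing in the park".split()
noisy = corrupt(clean, seed=1)
print(" ".join(noisy), "->", " ".join(clean))
```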
Many of these systems build directly on top of Omelianchuk et al. (2020), including those of Sorokin (2022); Lai et al. (2022); Tarnavskyi, Chernodub, and Omelianchuk (2022); Parnow, Li, and Zhao (2021); and Yasunaga, Leskovec, and Liang (2021). Specifically, Sorokin (2022) and Tarnavskyi, Chernodub, and Omelianchuk (2022) upgraded the pre-trained language model from base to large (e.g., RoBERTa-base vs. RoBERTa-large) and used an additional mechanism to select the final edits by means of edit-scoring or majority voting (VT), respectively. Parnow, Li, and Zhao (2021) and Lai et al. (2022) addressed the problem of edit interdependence, that is, when the correction of one error depends on another, by means of GANs and multi-turn training, respectively. Yasunaga, Leskovec, and Liang (2021) applied the BIFI framework (Yasunaga and Liang 2021) to Omelianchuk et al. (2020) (Section 3.5) to gradually train a system that iteratively generates and learns from more realistic synthetic data. In contrast, Sun and Wang (2022) added a single hyperparameter to Sun et al. (2021) to control the trade-off between precision and recall (PRT), Kaneko et al. (2020) incorporated BERT into Kiyono et al. (2019) (Section 3.3.3), and Mita et al. (2020) applied a self-refinement data augmentation strategy to Kiyono et al. (2019) (Section 5.2).
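Edit interdependence can also be mitigated at inference time by iterative decoding, in which the model is applied repeatedly so that corrections that depend on earlier corrections can still be made in later passes. A minimal sketch (our own, with a toy stand-in for the model):

```python
def iterative_correct(sentence, correct_once, max_iters=5):
    """Apply a single-pass correction function repeatedly until the output
    stops changing, so that edits which depend on earlier edits can still
    be made in later passes."""
    current = sentence
    for _ in range(max_iters):
        corrected = correct_once(current)
        if corrected == current:
            break
        current = corrected
    return current

# Toy stand-in for a trained model that only fixes one error per pass.
def toy_model(sentence):
    if "He go " in sentence:
        return sentence.replace("He go ", "He went ", 1)
    if "a apple" in sentence:
        return sentence.replace("a apple", "an apple", 1)
    return sentence

print(iterative_correct("He go to buy a apple .", toy_model))
# -> "He went to buy an apple ."
```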
Other systems include Katsumata and Komachi (2020) and Rothe et al. (2021), who respectively explored the effectiveness of using pre-trained BART (Lewis et al. 2020) and T5 (Raffel et al. 2020) as the base model for GEC; Zhang et al. (2022c) subsequently extended Katsumata and Komachi (2020) by adding syntactic information (Section 3.3.3). Chen et al. (2020a) and Yuan et al. (2021) meanwhile both combined error detection with error correction, by respectively constraining the output of a GEC system based on a separate GED system and jointly training GED as an auxiliary task (Section 4.3). Stahlberg and Kumar (2020) proposed a seq2edit approach that explicitly predicts a sequence of tuple edit operations to apply to an input sentence (Section 3.4), and Stahlberg and Kumar (2021) developed a method to generate a specific type of error in a sentence (given a clean sentence and an error tag), which can be used to create synthetic datasets that more closely match the error distribution of a real corpus (Section 5.1.2). Finally, Lichtarge, Alberti, and Kumar (2020) used delta-log-perplexity to weight the contribution of each sentence in the training set toward overall model performance, downweighting those that added the most noise (Section 5.2), and Qorib, Na, and Ng (2022) used a binary classifier based on logistic regression to combine multiple GEC systems using only the output from each individual component system.
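To give a flavor of the last of these, a binary-classifier combination can be as simple as learning, from a development set, which combinations of component systems tend to propose correct edits. The toy sketch below is our own illustration under that assumption; the actual features and thresholding used by Qorib, Na, and Ng (2022) are richer:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row describes one candidate edit: binary indicators of which component
# systems proposed it.  Labels come from a development set: 1 if the edit
# matches a reference edit, 0 otherwise.  (Toy data; a real feature set would
# also encode information such as the edit type.)
X_dev = np.array([
    [1, 1, 0],   # proposed by systems 1 and 2 -> correct
    [1, 0, 0],   # proposed only by system 1   -> incorrect
    [0, 1, 1],
    [1, 1, 1],
    [0, 0, 1],
])
y_dev = np.array([1, 0, 1, 1, 0])

clf = LogisticRegression().fit(X_dev, y_dev)

# At test time, keep a candidate edit only if the classifier deems it more
# likely correct than not.
candidate = np.array([[1, 0, 1]])
keep = clf.predict_proba(candidate)[0, 1] >= 0.5
print("keep edit:", keep)
```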
7.2 Analysis
Despite all these enhancements, we first observe that it is very difficult to draw conclusions about the efficacy of different techniques in Table 8, because different systems were trained using different amounts/types of data (both real and artificial) and developed using different pre-trained models and performance-boosting techniques. Consequently, the systems are rarely directly comparable and we can only infer the relative advantages of different approaches from the wider context. With this in mind, the general trend in the past couple of years has been to scale models up using (i) more artificial data, (ii) multiple pre-trained models/architectures, and (iii) multiple performance-boosting techniques.
In terms of artificial data, the trend is somewhat mixed, as, on the one hand, Stahlberg and Kumar (2021) introduced a system trained on more than half a billion synthetic sentences, but on the other hand, they were still outperformed by systems that used orders of magnitude less data (Lai et al. 2022; Tarnavskyi, Chernodub, and Omelianchuk 2022). This pattern has been consistent for several years now and reveals a delicate trade-off between artificial data quantity and quality. There is ultimately no clear relationship between data quantity and performance, and some systems still achieve competitive performance without artificial data (Rothe et al. 2021; Yuan et al. 2021; Katsumata and Komachi 2020).
The use of several pre-trained model architectures, however, tells a different story: using multiple architectures generally improves performance, and the three most recent state-of-the-art systems all use at least two different pre-trained models (Qorib, Na, and Ng 2022; Lai et al. 2022; Tarnavskyi, Chernodub, and Omelianchuk 2022). This suggests that different pre-training tasks capture different aspects of natural language that complement each other in GEC. In contrast, approaches that rely on a single pre-trained model typically perform slightly worse than those that combine architectures, although it is worth keeping in mind that there is also a trade-off between model complexity and run-time that is seldom reported (Omelianchuk et al. 2020; Sun et al. 2021).
Finally, adding more performance-boosting techniques also tends to result in better performance, and the systems that incorporate the most techniques typically score highest. Among these techniques, the use of model ensembling or system combination (Section 4.2) mitigates the instability of neural models and allows a final system to make use of the strengths of several other systems. However, this comes at a cost to model complexity and run-time.
8 Future Challenges
While much progress has been made in the past decade, several important challenges remain (Qorib and Ng 2022). This section highlights some of them and offers suggestions for future work.
Domain Generalization
Robustness is an important attribute of any NLP system. In the case of GEC, we want our systems to work well not only for language learners, but also for native speakers writing in different domains, such as business emails, literature, and instruction manuals. Some efforts have been made in this direction, such as the native web texts in CWEB (Flachs et al. 2020), scientific articles in AESW (Daudaravicius et al. 2016), and conversational dialog in ErAConD (Yuan et al. 2022), but more effort is needed to create new corpora that represent a wider variety of domains. This is important because previous research has shown that systems that perform well in one domain do not necessarily perform well in others (Napoles, Nădejde, and Tetreault 2019).
Personalized Systems
Related to domain generalization is the fact that system performance is also tied to the profiles of the users in the training data. For example, a system trained on L2 English data produced by advanced L1 Spanish learners is unlikely to perform as well on L2 English data produced by beginner L1 Japanese learners because of the mismatch in ability level and first language. It is thus important to develop corpora and tools that can adapt to different users (Chollampatt, Hoang, and Ng 2016), since different ability levels and L1s can significantly affect the distribution of errors that authors are likely to make (Nadejde and Tetreault 2019).
Feedback Comment Generation
GEC systems are currently trained to correct errors without explaining why a correction was needed. This is insufficient in an educational context, however, where it is desirable for a system to explain the cause of an error such that a user may learn from it and not make the same mistake again. Resources have begun to emerge to support this endeavor but much more work is needed to generate robust feedback comments to support explainable GEC (Nagata 2019; Nagata, Inui, and Ishikawa 2020; Hanawa, Nagata, and Inui 2021; Nagata et al. 2021).
Model Interpretability
Related to feedback generation, it is also important that model output is interpretable by humans. For example, although a system may make a prediction with high confidence, there is no guarantee that the prediction will be consistent with human intuition. Researchers have thus begun to build systems that estimate the quality of model output in an effort to provide more confidence that a given prediction is correct (Chollampatt and Ng 2018c; Liu et al. 2021). Similarly, Kaneko et al. (2022) propose an example-based approach, where a model additionally outputs similar corrections in different contexts in order to add credibility to the notion that the model truly understood the error.
Semantic Errors
One of the areas where state-of-the-art systems still underperform is semantic errors, which include complex phenomena such as collocations, idioms, multi-word expressions, and fluency edits. Considerable work in GEC has focused on correcting function word errors, which typically have small confusion sets and constitute the majority of error types, but this does not mean we can neglect the correction of content word errors. Although there has been some work on correcting collocations (Kochmar and Briscoe 2014; Herbelot and Kochmar 2016) and multi-word expressions (Mizumoto, Mita, and Matsumoto 2015; Taslimipoor, Bryant, and Yuan 2022), semantic errors remain a notable area in which GEC systems could improve.
Contextual GEC
To date, most GEC systems operate at the sentence level, and so do not perform well on errors that require cross-sentence context or document-level understanding. Although work has already been done to incorporate multi-sentence context into GEC systems (Chollampatt, Wang, and Ng 2019; Yuan and Bryant 2021; Mita et al. 2022), almost all current datasets expect sentence-tokenized input and so do not facilitate multi-sentence evaluation. Paragraph- or document-level datasets, as in the Arabic QALB shared tasks (Mohit et al. 2014; Rozovskaya et al. 2015), should thus be developed to encourage contextual GEC in the future.
System Combination
Although much recent work focuses on NMT for GEC, this does not mean that other approaches have nothing to offer. Work on system combination has shown that systems built with different approaches have complementary strengths and weaknesses, such that a combined system can achieve significantly improved performance (Susanto, Phandi, and Ng 2014; Han and Ng 2021; Lin and Ng 2021; Qorib, Na, and Ng 2022). A better understanding of these strengths and weaknesses, and of when and how to combine approaches, is a promising area of research. One practical tool in this space is ALLECS (Qorib, Moon, and Ng 2023), a web interface that produces text corrections using GEC system combination methods.
Training Data Selection
Current state-of-the-art systems rely on pre-training on massive amounts of synthetic parallel data; however, this is both computationally expensive and environmentally costly. It is also questionable whether so much training data is really necessary, as humans are not exposed to training data on such a massive scale, yet can still correct errors without issue. A more economical approach to effective training data selection is thus an important research question that would go a long way toward reducing training time and developing more efficient GEC systems (Lichtarge, Alberti, and Kumar 2020; Takahashi, Katsumata, and Komachi 2020; Mita and Yanaka 2021; Rothe et al. 2021).
Unsupervised Approaches
The dependency on parallel corpora (both real and synthetic) is a major limiting factor in GEC system development, in that it is both laborious and time-consuming to train human annotators to manually correct errors, and also surprisingly difficult to generate high-quality synthetic errors that reliably imitate human error patterns. It is furthermore noteworthy that humans can correct errors without access to a large corpus of erroneous examples and instead rely on their knowledge of grammatical language in order to detect and correct mistakes. It should thus be intuitive that a GEC system might be able to do the same by interpreting deviations from grammatical data as anomalies that need to be corrected. The success of such an unsupervised approach would significantly hasten the development of multilingual GEC systems and also eliminate the need to compile parallel corpora.
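As a toy illustration of this intuition (ours, not a published unsupervised GEC system), a language model trained on predominantly well-formed text can be used to prefer whichever candidate sentence it finds less surprising; a real unsupervised system would also need to generate candidate corrections and guard against meaning-changing rewrites:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def mean_token_nll(sentence):
    """Mean negative log-likelihood of the sentence under the language model;
    lower values mean the model finds the sentence less surprising."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        return model(ids, labels=ids).loss.item()

source = "He go to school every day ."
candidate = "He goes to school every day ."
print(min([source, candidate], key=mean_token_nll))
```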
Multilingual GEC
Although most work on GEC has focused on English, work on other languages is also beginning to take off as new resources become available, e.g., in German (Boyd 2018), Russian (Rozovskaya and Roth 2019), and Czech (Náplava et al. 2022). While it is important to encourage research into GEC systems for specific languages, it is also important to remember that it is ultimately not scalable to build a separate system for every language. It is therefore desirable to work toward a single multilingual system that can correct all languages simultaneously, as in machine translation (Katsumata and Komachi 2020; Rothe et al. 2021).
Spoken GEC
Another aspect of GEC that has seldom been explored in the literature is that of spoken GEC. While progress has largely been hindered by a lack of available data, researchers have recently begun to build systems capable of detecting and correcting errors in learner speech (Knill et al. 2019; Caines et al. 2020; Kyriakopoulos, Knill, and Gales; Lu, Gales, and Wang 2020; Lu, Bannò, and Gales 2022). Compared with text-based GEC, additional challenges include recognizing non-native accented speech (possibly including non-standard pronunciation), disfluency detection, and utterance segmentation.
Improved Evaluation
Finally, robust evaluation of GEC system output is still an unsolved problem and current evaluation practices may actually hinder progress (Rozovskaya and Roth 2021). For example, almost all metrics to date require tokenized text, yet end-users require untokenized text, and so there is a disconnect between system capability and user expectation. Similarly, GEC systems are typically trained to output a single best correction for a sentence, yet end-users may prefer a short n-best list of possible corrections for each edit, as in most spellcheckers. Ultimately, alternative answers and untokenized text are not yet properly accounted for in GEC system evaluation, leaving room for new metrics to drive the field toward better practices.
9 Conclusion
In this survey, we set out to provide a comprehensive overview of the state of the art in the field of Grammatical Error Correction. Our main goal was to summarize the progress that has been made since Leacock et al. (2014), but also to complement the work of Wang et al. (2021) with more in-depth and recent coverage of various topics.
With this in mind, we first explored the nature of the task and illustrated the inherent difficulties in defining an error according to the perceived communicative intent of the author. We next alluded to how these difficulties can manifest in human-annotated corpora, before introducing the most commonly used benchmark corpora for English, several less commonly used corpora for English, and new corpora for GEC systems in other languages, including Arabic, Chinese, Czech, German, and Russian. Research into GEC for non-English languages has begun to take off in the last couple of years and will no doubt continue to grow in the future.
We next characterized the evolution of approaches to GEC, from error-type specific classifiers to state-of-the-art NMT and edit-based sequence labeling, and summarized some of the supplementary techniques that are commonly used to boost performance, such as re-ranking, multi-task learning, and iterative decoding. We also described different methods of artificial data generation and augmentation, which have become core components of recent GEC systems, but also drew attention to the benefits of low-resource GEC systems that may be less resource intensive and more easily extended to other languages.
Robust evaluation is still an unsolved problem in GEC, but we introduced the most commonly used metrics in the field, along with their strengths and weaknesses, and listed previous attempts at both reference-based and reference-less metrics that were designed to overcome various shortcomings. We furthermore highlighted the difficulty in correlating human judgments with metric performance in light of the highly subjective nature of the task.
Finally, we provided an analysis of very recent progress in the field, including observations about which techniques and resources seem to perform best (particularly in the context of model efficiency), before concluding with several possibilities for future work. We hope that this survey will serve as a comprehensive resource for researchers who are new to the field or who want to be kept apprised of recent developments.
Acknowledgments
We thank the anonymous reviewers for their helpful comments. This research is supported by both Cambridge University Press & Assessment and the National Research Foundation, Singapore under its AI Singapore Programme (AISG Award No: AISG-RP-2019-014).