Data Weighted Training Strategies for Grammatical Error Correction

Abstract Recent progress in the task of Grammatical Error Correction (GEC) has been driven by addressing data sparsity, both through new methods for generating large and noisy pretraining data and through the publication of small and higher-quality finetuning data in the BEA-2019 shared task. Building upon recent work in Neural Machine Translation (NMT), we make use of both kinds of data by deriving example-level scores on our large pretraining data based on a smaller, higher-quality dataset. In this work, we perform an empirical study to discover how to best incorporate delta-log-perplexity, a type of example scoring, into a training schedule for GEC. In doing so, we perform experiments that shed light on the function and applicability of delta-log-perplexity. Models trained on scored data achieve state-of-the-art results on common GEC test sets.


Introduction
Grammatical Error Correction (GEC), the task of automatically correcting errors in written text, can be framed as a translation task from 'bad grammar' to 'good grammar.' This framing has enabled GEC to borrow models and techniques from the vast literature in machine translation (MT). Neural approaches have dominated recent state-of-the-art advances in GEC, and have been shown to be more effective in direct comparison with non-neural methods (Chollampatt and Ng, 2018). Nevertheless, GEC continues to pose a challenge for data-reliant neural models given that the publicly available training data is relatively limited, with the largest corpus numbering only 2M examples (Mizumoto et al., 2012). Therefore, much recent work in GEC has focused on diverse methods to address data sparsity by supplementing available annotated corpora with much larger pretraining data (Ge et al., 2018a; Kasewa et al., 2018; Lichtarge et al., 2019; Grundkiewicz et al., 2019; Zhao et al., 2019).
A contrasting approach to addressing data sparsity in GEC has been explored in the Building Educational Applications (BEA) 2019 Shared Task on Grammatical Error Correction (Bryant et al., 2019). The task introduced the Write and Improve training set, a new high-quality annotated corpus numbering only ~34k examples (referred to in this work as BEA-19 train), and encouraged exploration of low-resource methods by organizing two tracks specifically for data-restricted competition. Despite its relatively small size, many approaches using the BEA-19 train data achieved better results on common GEC test sets than previous approaches that did not have access to this small but high-quality data (Bryant et al., 2019).
In the context of neural MT (NMT), models have been shown to be sensitive to noise in the training data (Khayrallah and Koehn, 2018). While much effort has been dedicated to methods which either filter or downweight noisy pretraining data in NMT (Junczys-Dowmunt, 2018), less attention has thus far been paid to this problem in GEC. To the best of our knowledge, previously explored techniques for filtering pretraining data in GEC are limited to hand-engineered heuristic cutoffs (Grundkiewicz and Junczys-Dowmunt, 2014) and n-gram language model filtering (Ge et al., 2018a).
Recent work in NMT (Wang et al., 2018) presents a training technique for scoring the 'noise' of training data by employing a much smaller, higher-quality 'trusted' dataset. The authors describe a curriculum-style training over data scored by this metric, and demonstrate significant improvements over a baseline. We refer to this score as delta-log-perplexity (∆ppl).

Contributions of this work
This work presents an empirical study of training strategies for GEC in multiple dimensions. Using a standard training setup (without scoring), we explore arrangements of GEC corpora into pretraining and finetuning data, establishing a strong baseline. We then apply data scoring via ∆ppl to the GEC task, demonstrating the value of ∆ppl as a heuristic for example quality. By comparing multiple plausible methods for applying ∆ppl, we gain some insight into the interpretation and practical applicability of the metric. We train on the scored data via four simple methods that instantiate different intuitions about how to treat a heuristic score for data quality. We demonstrate performance gains for various strategies incorporating scoring into the training, and present state-of-the-art results on the CoNLL-14 (Ng et al., 2014), JFLEG (Napoles et al., 2017), and BEA-19 (Bryant et al., 2019) test sets.

Related Work
In recent GEC work, most approaches pretrain on some synthetic data and then finetune on the union of multiple annotated data sources, with some variation in which datasets are included for finetuning (Grundkiewicz et al., 2019; Lichtarge et al., 2019). In a thorough study of incorporating generated pseudo-data into GEC training, Kiyono et al. (2019) report that this typical pretrain-finetune setup scales with the size of the pretraining data better than a setup in which all data is trained on simultaneously. Choe et al. (2019) describe a 'sequential transfer learning' approach in which the pretrained model, finetuned on all available annotated data, is finetuned again separately for each test set. A thorough review of the GEC field is given by Wang et al. (2020).
Data selection in MT has been performed along two dimensions: domain-relevance and denoising. Multiple researchers (Moore and Lewis, 2010; Axelrod et al., 2011; van der Wees et al., 2017) have used the difference in cross-entropy between two language models as a criterion for the selection of in-domain sentences. In contrast, Wang et al. (2018) and Junczys-Dowmunt (2018) have used data selection for denoising. More recent work demonstrates that co-curriculum training for dynamic selection of data that is both clean and in-domain can outperform independent selection along each of the two dimensions.
Wang et al. (2018) present a metric defined as the difference in log-probability of an individual training example before and after improving a pretrained model by finetuning on a small trusted dataset. Wang et al. use this metric to order the pretraining data, and train a new model via a curriculum-style strategy using this ordering. In their setup, this metric is interpreted as measuring 'noise', describing the change in log-probability of an example between a noisy pretrained model and its 'denoised' finetuned counterpart. Since log perplexity for an example is the negative of the log-probability, we refer to this score as 'delta-log-perplexity' (∆ppl).

Calculation
In the most general case, ∆ppl describes the change in a model's log perplexity for an individual example between two checkpoints in model training. If the first checkpoint (with parameterization θ−) is sampled after model convergence on a base dataset D−, and the second checkpoint (θ+) after further finetuning on a second target dataset D+, then the ∆ppl between those models for a given example (an input-output pair (i, o)) should suggest which of the datasets the example is more similar to, from the perspective of the successive models θ− and θ+.

$$\Delta\mathrm{ppl}(i, o; \theta^-, \theta^+) = \log p(o \mid i; \theta^-) - \log p(o \mid i; \theta^+) \tag{1}$$

In the course of this work, we make use of the relative ordering of examples from the scored dataset D−_∆ when sorted by their ∆ppl scores, rather than the actual ∆ppl score values. We refer to this quantity as 'delta-perplexity-rank' (δppl):

$$\delta\mathrm{ppl}(i, o) = 1 - \mathrm{\%ile\_rank}\big(\Delta\mathrm{ppl}(i, o)\big) \tag{2}$$

'%ile_rank' refers to percentile rank within the scored dataset, expressed as a fraction. δppl has range [0,1], and is computed such that the example with the most negative ∆ppl will have the highest δppl score of 1. The median example will have a δppl of 0.5.
Algorithm 1: Score base data with ∆ppl, and calculate δppl for each sentence pair. The symbols x.i and x.o refer to the input and output sequences of the example.
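The scoring procedure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the two `LogProbFn` callables are hypothetical stand-ins for the checkpoints θ− and θ+, and the rank is computed as one minus the normalized position in the sorted order, which gives 1 to the most negative ∆ppl and 0.5 to the median. Ties are broken arbitrarily by sort order.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, str]  # (input sequence x.i, output sequence x.o)
LogProbFn = Callable[[str, str], float]  # hypothetical: log p(o | i) under one checkpoint


def delta_ppl(example: Example, logp_base: LogProbFn, logp_target: LogProbFn) -> float:
    """Delta-log-perplexity: log p(o|i; theta-) minus log p(o|i; theta+)."""
    i, o = example
    return logp_base(i, o) - logp_target(i, o)


def delta_ppl_rank(scores: Sequence[float]) -> List[float]:
    """Map delta-ppl scores to ranks in [0, 1]; the most negative score gets 1."""
    n = len(scores)
    if n == 1:
        return [1.0]
    # Indices sorted from most negative (best) to most positive (worst).
    order = sorted(range(n), key=lambda idx: scores[idx])
    ranks = [0.0] * n
    for position, idx in enumerate(order):
        ranks[idx] = 1.0 - position / (n - 1)
    return ranks
```

In practice the two log-probability functions would wrap forced decoding with the base and finetuned checkpoints; only the relative ordering of the resulting scores is used afterwards.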

Explanation
Any example drawn from D+ should trivially be expected to have a negative ∆ppl because θ+ has just been trained directly upon the exact example, whereas θ− has never seen the example before. The negative ∆ppl can be explained by suggesting θ+ has begun to memorize the specific examples in D+.
Scoring examples drawn from D− reveals the value of the technique; both checkpoints have been trained on D− and no example in D− was present during further training on D+, so the ∆ppl reflects the general changes learned during the transition from θ− to θ+. Examples from D− which are similar to examples from D+ can be expected to have relatively lower log perplexity for θ+, and thus lower ∆ppl. Examples from D− which are markedly different from those of D+ should be expected to have higher ∆ppl scores.
While D− (base data) and D+ (target data) refer to the pretraining and finetuning datasets, respectively, in our setup, we note that these two datasets could be selected according to alternative criteria. The only requirement is that these sets differ in terms of some observable qualitative aspect, for which ∆ppl becomes a heuristic. While in this work we use a target dataset to focus on example quality, it may also be feasible to employ a target dataset that differs from the base data chiefly in domain, and use ∆ppl to negotiate domain transfer.

Annealing strategies
When D+ is selected to be 'higher quality' than D−, the ∆ppl scores of examples drawn from D− provide a heuristic for example quality. Given a heuristic score for example quality, there are many plausible strategies to incorporate the score into a training schedule. We explore the following schemes:
[a] Filter the pretraining data by discarding examples for which δppl < k, where k is a fixed cutoff parameter.
[b] Instead of discarding data, down-weight the loss on low-scoring examples during training in proportion to their rank: weight_x = δppl_x.
[c] A more sophisticated variation of filtering, employed by Wang et al. (2018): define a curriculum by an exponentially decaying function over training, so that by the end of training only the best-scoring examples remain in the training data. The schedule is $k(t) = 0.5^{t/H}$ for training step t and constant H.
[d] To combine the benefits of down-weighting and curriculum-style annealing, we also implement a mixed strategy, using the same schedule $k(t) = 0.5^{t/H}$ for training step t and constant H.
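The first three schemes can be sketched as per-example loss weights. This is an illustrative sketch, not the authors' code. It reads the schedule as k(t) = 0.5^(t/H), and for the curriculum it assumes that an example is kept while its δppl is at least 1 − k(t), so that all data is used at step 0 and only the best-scoring examples survive late in training; that inclusion rule is an assumption, as are all function names. The mixed strategy [d] is left out because its exact weighting rule is not stated above.

```python
def curriculum_k(step: int, half_life: float) -> float:
    """Exponentially decaying schedule k(t) = 0.5 ** (t / H)."""
    return 0.5 ** (step / half_life)


def hard_weight(rank: float, cutoff: float) -> float:
    """[a] Fixed filter: weight 0 for examples whose rank is below the cutoff k."""
    return 1.0 if rank >= cutoff else 0.0


def soft_weight(rank: float) -> float:
    """[b] Down-weighting: the loss weight equals the example's rank."""
    return rank


def hard_cclm_weight(rank: float, step: int, half_life: float) -> float:
    """[c] Curriculum filter. ASSUMPTION: keep the example while rank >= 1 - k(t)."""
    return 1.0 if rank >= 1.0 - curriculum_k(step, half_life) else 0.0
```

Under this reading, `half_life` (H) is the number of steps after which the retained portion of the data halves, so a smaller H narrows the curriculum more quickly.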

Model
We use the Transformer sequence-to-sequence model (Vaswani et al., 2017), using the Tensor2Tensor open-source implementation with the "transformer_clean_big_tpu" setting. We use a 32k word piece dictionary (Schuster and Nakajima, 2012). For all training stages we use the Adafactor optimizer (Shazeer and Stern, 2018).

Data
We train on the public version of the Lang-8 corpus (Mizumoto et al., 2012), the FCE corpus (Yannakoudakis et al., 2011), and the Cambridge English Write & Improve training split described in the BEA-2019 shared task (BEA-19 train) (Bryant et al., 2019).
The Lang-8 corpus is scraped from the Lang-8 social language learning website, and is composed of potentially erroneous sentences from English-language learners with crowd-sourced corrections. The corpus includes many sentence pairs that are noisy or irrelevant to GEC for a variety of reasons. In contrast, FCE and BEA-19 train are much smaller corpora that have been carefully annotated by a small number of professional annotators. Due to their highly-curated origin, these datasets have a much higher proportion of high-quality GEC-relevant sentence pairs than Lang-8.
For pretraining data, we follow Lichtarge et al. (2019) in using a large and noisy corpus of edits crawled from Wikipedia's publicly available revision histories (REV). We also use a similar-sized corpus of sentence pairs, where the target sentences are drawn from Wikipedia, and the source sentences are generated via round-trip-translation through a bridge language (RT) (Lichtarge et al., 2019). We generate four parallel datasets of equal size by round-trip-translating the same 'clean' sequences through four bridge languages. Both pretraining corpora are further probabilistically corrupted via character-level insertions, deletions, transpositions, and replacements. We corrupt each character of REV, which already contains some 'natural' spelling errors, at a rate of 0.003 per character. For the RT data, which does not already have spelling errors, we use a rate of 0.005 per character.
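The character-level corruption can be illustrated with a short sketch. It is a hypothetical reconstruction under stated assumptions: the four operations are chosen uniformly at random, inserted and replacement characters are drawn from lowercase ASCII letters, and a transposition at the final character falls back to a replacement. None of these details are specified in the text; only the operation types and the per-character rates (0.003 for REV, 0.005 for RT) come from it.

```python
import random
import string


def corrupt(text: str, rate: float, rng: random.Random) -> str:
    """Apply one random character-level edit at each position with probability `rate`."""
    chars = list(text)
    out = []
    pos = 0
    while pos < len(chars):
        if rng.random() < rate:
            # ASSUMPTION: the four operations are equally likely.
            op = rng.choice(["insert", "delete", "transpose", "replace"])
            if op == "insert":
                out.append(rng.choice(string.ascii_lowercase))
                out.append(chars[pos])
            elif op == "delete":
                pass  # drop the current character
            elif op == "transpose" and pos + 1 < len(chars):
                out.append(chars[pos + 1])
                out.append(chars[pos])
                pos += 1  # the neighbouring character has been consumed
            else:
                # Replace; also the fallback for a transpose at the last character.
                out.append(rng.choice(string.ascii_lowercase))
        else:
            out.append(chars[pos])
        pos += 1
    return "".join(out)
```

For example, `corrupt(sentence, 0.003, random.Random(0))` would correspond to the REV setting and a rate of `0.005` to the RT setting.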
Prior research on GEC has employed the NUCLE corpus (Dahlmeier et al., 2013) for model training. Our pilot experiments showed that a model pretrained on REV/RT yielded similar performance when finetuned on either Lang-8 or a combination of Lang-8 and NUCLE. Since both corpora contain corrections of sentences written by non-native speakers, and NUCLE, which has only a fourth as many sentences as Lang-8, did not give additional improvements on top of Lang-8, we decided to exclude the corpus in our experiments to simplify the presentation.

Non-Scored Training and Finetuning
When pretraining, we train the Transformer model for 1M steps. We set the learning rate to 0.01 for the first 10,000 steps, after which we decrease it proportionally to the inverse square root of the number of steps. When finetuning, we set the learning rate to a constant 3 × 10^−5. Regardless of the dataset being used, we run finetuning for 30 epochs.
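The pretraining schedule described above can be written as a small function. This is a sketch under one assumption: the decay is scaled so that the learning rate is continuous at step 10,000, which the text does not state explicitly. The function and constant names are illustrative.

```python
import math

FINETUNE_LR = 3e-5  # constant learning rate used for finetuning


def pretrain_lr(step: int, peak: float = 0.01, constant_steps: int = 10_000) -> float:
    """Constant rate for the first `constant_steps`, then inverse-square-root decay."""
    if step <= constant_steps:
        return peak
    # ASSUMPTION: scaled so the rate is continuous at the boundary.
    return peak * math.sqrt(constant_steps / step)
```

Under this assumption the rate falls to 0.005 at step 40,000 and to 0.001 by the end of the 1M pretraining steps.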

Scored Training and Finetuning
When applying the scored training strategies to Lang-8, we discard the base model that was used in calculating the ∆ppl scores (which was trained on: Pretrain → Lang-8), and start a new finetuning run on the scored Lang-8, from a model initialized on the same pretraining data. When applying our scored training strategies to the much larger pretraining data, rather than start the model from random initialization and repeat 1M steps of training, we continue training from the 1M checkpoint of the base model and train on the scored data for an additional 100,000 steps (using the same pretraining settings).

Evaluation
In the course of our experiments, we evaluate on the development set of the BEA-2019 shared task (BEA-19 dev), which includes examples from both W&I and the LOCNESS corpus (Granger, 1998), using the ERRANT scorer (Bryant et al., 2017). In our analysis (Section 7), we report on BEA-19 test, with scores provided by the official Codalab of the BEA-2019 task. We also report on the popular GEC evaluation corpora CoNLL-2014 (Ng et al., 2014) and JFLEG (Napoles et al., 2017; Heilman et al., 2014), for which we report F0.5 with the M2 scorer (Dahlmeier and Ng, 2012) and the GLEU+ metric (Napoles et al., 2016), respectively. For BEA-19 dev and BEA-19 test, following the conventions of the shared task, we post-process the model output with the spaCy tokenizer.
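For readers unfamiliar with F0.5, the following sketch shows the generic F-beta formula with beta = 0.5, which weights precision more heavily than recall. It is not the M2 or ERRANT scorer: those tools also determine which edits count as matches, and their handling of edge cases (such as empty edit sets) may differ from the zero-default assumed here.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 0.5) -> float:
    """Generic F-beta from true-positive, false-positive and false-negative edit counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0  # ASSUMPTION: score 0 when nothing is matched
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

With 6 correct edits, 2 spurious edits and 6 missed edits, precision is 0.75 and recall is 0.5, giving an F0.5 of about 0.68, noticeably closer to the precision than the balanced F1 of 0.6 would be.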
For decoding, we use iterative decoding (Lichtarge et al., 2019).
The datasets presented in Table 1 can be sorted into three categories by their relative quality. REV and RT are noisiest, with most data not appearing relevant to GEC. FCE and BEA-19 train are cleanest, as they are professionally annotated. Lang-8 occupies a middle ground, as the data, which is largely relevant to GEC but scraped from a crowdsourced medium, does not rise to the standard of professional annotation. In light of this, we combine the single REV dataset with each of the four RT datasets to produce four large pretraining datasets, each containing half Wiki revisions and half round-trip translated data (PRE). All experiments are run for each of these merged datasets, and all reported figures are the average of those four models. We also merge the FCE and BEA-19 train into a single finetuning set, which we refer to as 'BEA-FCE' (BF).
We explore three training schemes (see Table 2): including Lang-8 with the higher-quality annotated data, including Lang-8 with the pretraining data, and a two-stage finetuning scheme, with Lang-8 as the intermediate step.

Training with scored examples
Given a set of training data for which each example has an associated heuristic 'quality' score, there are many plausible options for incorporating that score into a training schedule. For the best-performing scoring arrangement, [D] in Table 3, we repeat the scored training stage in order [...]
[...], which seems catastrophically out-of-domain, has a better ∆ppl than [e], which simply changes the tense of the sentence. ppl+ is much higher for examples that have significant information change. This explains why the REV data in the scatter-plot extends thinly along the ∆ppl=0 diagonal; REV contains many examples with information change, for which both source and target are grammatically correct. For these examples, the absolute values of both ppl+ and ppl− are large, but their difference, ∆ppl, is relatively small. This demonstrates a shortcoming of using only ∆ppl as a heuristic for example quality: REV has a higher percentage of 'good' examples than Lang-8 according to ∆ppl, but many of those examples actually have large ppl+, and do not capture grammatical changes. Example [a] illustrates a related failure case; it has high ppl−, but according to ∆ppl alone, is the 'best' example in the table.

Roundtrip-translations
The roundtrip-translated data does not suffer from large information changes, except when the meaning is so garbled as to produce a semantically irreconcilable sequence, as in [n]. As a result, the distribution of RT examples has lower ppl+ than that of REV. However, many examples include rearrangements or re-phrasings that are out of scope [..., s, u]. Interestingly, partial corrections, even apparently good ones, also perform poorly [t]. This may be a result of the relatively complete nature of the corrections made in BF, in which few if any target sequences appear to need further correction.

Training strategies
The scored training strategies (Table 4) explore approaches to making use of an example-level quality heuristic that accommodate distinct intuitions about how to treat the data. Filtering out examples beforehand (hard) follows the intuition that bad examples only hurt performance and should be excluded. Down-weighting the loss (soft) modifies the relative importance of examples, but avoids throwing any out, maintaining the value of having a large dataset. The 'curriculum'-style counterparts of each apply the same logic, while incorporating (albeit in a hard-coded manner) the intuition that the value of some examples changes over the course of training. It is worthwhile to note that the optimal strategy, even amongst these simple hard-coded strategies, is a function of the characteristics of the dataset in question. The hard-cclm strategy is worst for Lang-8 BF , where it gradually isolates a small portion of an already small dataset, but is best for PRE BF , which is so large that 5% of the dataset is still considerable. Also, much of what is lost in the 'bad' portion of PRE BF is lower-quality data than that which exists in Lang-8 BF , which may explain both why hard-cclm does so well for PRE BF and why soft-cclm, which does not throw out the large portion of bad examples, does relatively poorly.
The hard strategy outperforms both soft and soft-cclm for the first stage of both experiments, but the advantage disappears following finetuning on BF. This suggests that cutting out the 'worst' examples entirely, while beneficial in the scored training stage, may prevent a sort of regularization that is beneficial to the ultimate finetuned model.
That all strategies similarly outperform the baseline suggests that ∆ppl is a robust heuristic for quality; that all are simple and untuned to the data suggests that there remains headroom for more sophisticated training strategies to do even better.

Scoring with less target data
We observe that scoring any combination of lower-quality datasets using BF as the target data leads to large improvements over unscored pretraining models, and modest performance gains over those unscored models after finetuning (Table 3).
We now explore how each of these effects varies as a function of the target data size. For the scoring setup with the largest relative gains over unscored pretraining ([A] in Table 3), we repeat the same experiment multiple times, but using nested subsets of BF for both scoring and finetuning, each half the size of the previous one. While halving the datasets, we maintain the ratio of BEA-19 train and FCE data within each subset. Because using the same finetuning learning rate would quickly overfit for the smaller datasets, learning rates were tuned for each subset using the test set of the CoNLL-2013 shared task (Ng et al., 2013) (Table 6).

Table 6
Dataset proportion   Examples   Learning rate
full                 60011      3 × 10^−5
1/2                  29998      3 × 10^−5
1/4                  15121      25 × 10^−6
1/8                  7608       1 × 10^−7
1/16                 3749       1 × 10^−7
1/32                 1841       1 × 10^−7
1/64                 905        1 × 10^−7
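One simple way to build nested subsets that preserve the source ratio is to shuffle each source once and take ever-shorter prefixes. The sketch below is an assumption about how such subsets could be constructed, not a description of the authors' procedure (the example counts reported for the subsets are not exact halves, so their sampling evidently differed in detail). The function name and the dictionary layout are illustrative.

```python
import random
from typing import Dict, List, Sequence


def nested_halvings(
    sources: Dict[str, Sequence[str]], levels: int, seed: int = 0
) -> List[Dict[str, List[str]]]:
    """Return subsets for proportions 1, 1/2, ..., 1/2**levels, each nested in the last."""
    rng = random.Random(seed)
    # Shuffle each source once; prefixes of a fixed shuffle are automatically nested.
    shuffled: Dict[str, List[str]] = {}
    for name, examples in sources.items():
        items = list(examples)
        rng.shuffle(items)
        shuffled[name] = items
    subsets = []
    for level in range(levels + 1):
        # Taking the same fraction of every source keeps the source ratio fixed.
        subsets.append(
            {name: items[: round(len(items) / 2 ** level)] for name, items in shuffled.items()}
        )
    return subsets
```

Calling `nested_halvings({"bea19": bea_examples, "fce": fce_examples}, levels=6)` would yield seven subsets, from the full set down to 1/64 of each source.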
All models are trained via the hard-cclm strategy, which, prior to finetuning, significantly outperforms other strategies for training on scored pretraining data (section 'ii' in Table 4). Results are shown in Figure 2.

Understanding the benefits of scoring
The marginal benefit of scoring the pretraining data yields a drastic performance gain over unscored pretraining, even for very small amounts of target data (see Figure 2 and Table 3). This pretrain gain reflects the value of obliquely incorporating the information of the target dataset into the pretraining data via ∆ppl scoring. Because finetuning on the target dataset directly incorporates that same information again, this gain is diminished once the scored models are finetuned (see "∆ vs unscored" column in Table 3). However, the benefits of finetuning are limited by over-fitting to the finetuning dataset, which is likely to occur given that it is substantially smaller (≈ 1M words) than pretraining data (≈ 8B words). Thus the scored pretrained model, which has already incorporated some of the information of the target dataset without yet having seen any of the specific examples therein, is able to make better use of the finetuning set before the harm of over-fitting outweighs the benefit of further training. This difference explains why even after finetuning, the models with scored training stages outperform the unscored models, though by less than if directly comparing the scored and unscored stages themselves.
In Figure 2, the marginal benefit of scoring for the 30k dataset size is +0.5 F0.5, compared to +0.9 F0.5 for doubling the size of the finetuning data (without scoring). For tasks constrained by the availability of high-quality data, and for which labeling costs are high, scoring noisy pretraining data may be a thrifty path to performance gains.

Test set results
We evaluate our best unscored and scored systems at all stages of training on BEA-19 test, CoNLL-14 and JFLEG. Results are shown in Table 7. Results for BEA-19 test are provided by the official Codalab competition of the BEA-2019 shared task, where this work qualifies as Unrestricted due to its reliance on additional parallel data like the Wikipedia revisions pretraining dataset. Because the most competitive results in the BEA-2019 task were submitted to the Restricted track, the results of this work are not perfectly comparable to most recent and competitive GEC publications. Additionally, many of the cited works make use of the NUCLE dataset (Dahlmeier et al., 2013), which was not used in this work. Nonetheless it is useful to contextualize the results within the scope of recent progress in GEC. A comparison to recent prior work is made in Table 8. This work achieves state-of-the-art results for the JFLEG and CoNLL-14 test sets, and obtains competitive results on BEA-19 test.

Future Work
The huge jump in performance between unscored and scored pretraining data demonstrates the possibility of making much more effective use of large and noisy datasets through the incorporation of example-level quality scores. While ∆ppl is one such score, there is significant room for improvement, as seen in the example-level analysis in Section 7. Other methods for scoring individual examples should be explored. In our scored training, we have presented hard-coded training strategies selected for their simplicity. These untuned strategies are easy to implement, but do not represent optimal uses of an example-level heuristic score. The fact that there is such variability between them in the two experiments of Table 4 suggests that training methods that are sensitive to the particularities of the scored dataset and the model may be able to make much better use of the same scored data. For example, a training scheme that dynamically decides during training which data to include or exclude (or how to weight the included data) could be expected to outperform our hard-coded strategies and hyperparameters. A training strategy along these lines has been implemented successfully for NMT.
These two complementary directions of future work, the development of new example-level quality heuristics, and the techniques to apply them in scored training, present an intriguing path for future exploration.