Table 7: Test set evaluation results. For each test set, the finetuning checkpoint selected, the identity-correction threshold, and the number of rounds of iterative decoding are tuned to the respective dev sets. BEA-19 test results are provided via the Codalab competition website of the BEA-2019 shared task. Each non-ensemble row represents the average of four models, whose construction is described in Section 6. The ensembles combine the four models from the preceding row.
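|  | Training Strategy | BEA-19 test Prec. | BEA-19 test Rec. | BEA-19 test F0.5 (ERRANT) | CoNLL-14 test Prec. | CoNLL-14 test Rec. | CoNLL-14 test F0.5 (M2) | JFLEG test GLEU+ |
|---|---|---|---|---|---|---|---|---|
| unscored | PRE | 35.7 | 41.7 | 36.8 | 44.6 | 36.2 | 42.6 | 54.1 |
|  | → Lang-8 | 62.7 | 52.4 | 60.3 | 64.0 | 42.8 | 58.3 | 62.5 |
|  | → BF | 67.4 | 61.7 | 66.1 | 67.6 | 44.3 | 61.1 | 63.6 |
|  | ensemble | 74.1 | 64.3 | 71.9 | 72.6 | 46.7 | 65.3 | 64.7 |
| scored | PREBF (soft) | 56.6 | 47.1 | 54.4 | 61.6 | 38.2 | 54.8 | 59.4 |
|  | Lang-8BF (soft) | 68.0 | 57.8 | 65.7 | 68.6 | 44.7 | 62.0 | 63.7 |
|  | → BF | 67.6 | 62.5 | 66.5 | 69.4 | 43.9 | 62.1 | 63.8 |
|  | ensemble | 75.4 | 64.7 | 73.0 | 74.7 | 46.9 | 66.8 | 64.5 |
|  | PREBF (soft) → Lang-8 | 64.1 | 52.2 | 61.3 | 66.0 | 41.8 | 59.2 | 62.5 |
|  | → BF | 66.8 | 61.5 | 65.7 | 68.3 | 45.4 | 62.0 | 63.6 |
|  | ensemble | 71.7 | 67.4 | 70.8 | 71.2 | 49.9 | 65.6 | 64.9 |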
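The F0.5 columns weight precision twice as heavily as recall. As a rough illustration of how the three BEA-19 and CoNLL-14 columns relate, the sketch below applies the standard F-beta formula to a precision/recall pair taken from the table. The helper name `f_beta` is hypothetical, and this is not the ERRANT or M2 scorer, both of which compute the score from edit-level counts over the whole test set. Because each non-ensemble row is an average over four models, recomputing F0.5 from the averaged precision and recall need not reproduce the reported value exactly.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Weighted harmonic mean of precision and recall (standard F-beta).

    With beta = 0.5, precision counts twice as much as recall.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when nothing is correct
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall (in percent) from the unscored "→ Lang-8" row.
bea19 = f_beta(62.7, 52.4)    # table reports 60.3
conll14 = f_beta(64.0, 42.8)  # table reports 58.3

print(round(bea19, 1), round(conll14, 1))  # prints: 60.3 58.2
```

The CoNLL-14 value comes out 0.1 below the reported figure, which is consistent with the reported number being an average of per-model scores rather than a score computed from averaged precision and recall.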