Table 7: Mean score of human judgments and $M^2_{0.5}$ score for each system in each domain (NF = Native Formal, NWI = Native Web Informal, R = Romani, SL = Second Learners, Σ = whole dataset). All results on the whole dataset (the Σ column) are statistically significant with p-value < 0.001, except for the AG finetuned and Joint GEC+NMT systems, where the p-value is less than 6.2% for the $M^2_{0.5}$ score and less than 4.3% for the human score, using the Monte Carlo permutation test with 10M samples and probability of error at most $10^{-6}$ (Fay and Follmann, 2002; Gandy, 2009).

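| System | $M^2_{0.5}$ NF | $M^2_{0.5}$ NWI | $M^2_{0.5}$ R | $M^2_{0.5}$ SL | $M^2_{0.5}$ Σ | Human NF | Human NWI | Human R | Human SL | Human Σ |
|---|---|---|---|---|---|---|---|---|---|---|
| Original | — | — | — | — | — | 8.47 | 7.99 | 7.76 | 7.18 | 7.61 |
| Korektor | 28.99 | 31.51 | 46.77 | 55.93 | 45.09 | 8.26 | 7.60 | 7.90 | 7.55 | 7.63 |
| Synthetic trained | 46.83 | 38.63 | 46.36 | 62.20 | 53.07 | 8.55 | 7.99 | 8.10 | 7.88 | 7.98 |
| AG finetuned | 65.77 | 55.20 | 69.71 | 71.41 | 68.08 | 8.97 | 8.22 | 8.91 | 8.35 | 8.38 |
| GECCC finetuned | 72.50 | 71.09 | 72.23 | 73.21 | 72.96 | 9.19 | 8.72 | 8.91 | 8.67 | 8.74 |
| Joint GEC+NMT | 68.14 | 66.64 | 65.21 | 70.43 | 67.40 | 9.06 | 8.37 | 8.69 | 8.19 | 8.35 |
| Reference | — | — | — | — | — | 9.58 | 9.48 | 9.60 | 9.63 | 9.57 |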
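
The significance figures in the caption come from a Monte Carlo permutation test. As a rough illustration of the idea only, the sketch below runs a paired sign-flip permutation test on the mean difference between two systems' per-item scores. It is not the authors' procedure: the function name `paired_permutation_test`, the paired per-item setup, the fixed sample count, and the add-one correction are all assumptions made for this example, and it does not implement the bounded-error schemes of the cited works.

```python
import random


def paired_permutation_test(scores_a, scores_b, num_samples=10_000, seed=0):
    """Estimate a two-sided p-value for the mean difference of paired scores.

    Hypothetical helper for illustration; assumes scores_a[i] and scores_b[i]
    were obtained for the same item (for example, the same sentence).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must have equal length")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    hits = 0
    for _ in range(num_samples):
        # Under the null hypothesis the two systems are exchangeable,
        # so each per-item difference is equally likely to have either sign.
        permuted = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(permuted) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (num_samples + 1)
```

Called with two equal-length lists of per-item scores, the function returns an estimated p-value. With a fixed sample count the smallest reportable value is 1 / (num_samples + 1), so resolving very small p-values requires a correspondingly large number of samples.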