Comparison of GEC corpora in size, token error rate, domain, and number of reference annotations in the test portion. SL = second language learners.
Language | Corpus | Sentences | Error rate | Domain | # Refs |
---|---|---|---|---|---|
English | Lang-8 | 1 147 451 | 14.1% | SL | 1 |
English | NUCLE | 57 151 | 6.6% | SL | 1 |
English | FCE | 33 236 | 11.5% | SL | 1 |
English | W&I+LOCNESS | 43 169 | 11.8% | SL, native students | 5 |
English | CoNLL-2014 test | 1 312 | 8.2% | SL | 2,10,8 |
English | JFLEG | 1 511 | — | SL | 4 |
English | GMEG | 6 000 | — | web, formal articles, SL | 4 |
English | AESW | over 1M | — | scientific writing | 1 |
English | CWEB | 13 574 | ∼2% | web | 2 |
Czech | AKCES-GEC | 47 371 | 21.4% | SL essays, Romani ethnolect of Czech | 2 |
German | Falko-MERLIN | 24 077 | 16.8% | SL essays | 1 |
Russian | RULEC-GEC | 12 480 | 6.4% | SL, heritage speakers | 1 |
Spanish | COWS-L2H | 12 336 | — | SL, heritage speakers | 2 |
Ukrainian | UA-GEC | 20 715 | 7.1% | natives/SL, translations and personal texts | 2 |
Romanian | RONACC | 10 119 | — | native speakers transcriptions | 1 |