Czech Grammar Error Correction with a Large and Diverse Corpus

We introduce a large and diverse Czech corpus annotated for grammatical error correction (GEC) with the aim of contributing to the still scarce data resources in this domain for languages other than English. The Grammar Error Correction Corpus for Czech (GECCC) offers a variety of four domains, covering error distributions ranging from high error density essays written by non-native speakers to website texts, where errors are expected to be much less common. We compare several Czech GEC systems, including several Transformer-based ones, setting a strong baseline for future research. Finally, we meta-evaluate common GEC metrics against human judgments on our data. We make the new Czech GEC corpus publicly available under the CC BY-SA 4.0 license at http://hdl.handle.net/11234/1-4639.


Introduction
Representative data both in terms of size and domain coverage are vital for NLP systems development. However, in the field of grammar error correction (GEC), most GEC corpora are limited to corrections of mistakes made by foreign or second language learners even in the case of English (Tajiri et al., 2012; Dahlmeier et al., 2013; Yannakoudakis et al., 2011, 2018; Ng et al., 2014; Napoles et al., 2017). At the same time, as recently pointed out by Flachs et al. (2020), learner corpora are only a part of the full spectrum of GEC applications. To alleviate the skewed perspective, the authors released a corpus of website texts.
Despite recent efforts aimed to mitigate the notorious shortage of national GEC-annotated corpora (Boyd, 2018;Rozovskaya and Roth, 2019;Davidson et al., 2020;Syvokon and Nahorna, 2021;Cotet et al., 2020;Náplava and Straka, 2019), the lack of adequate data is even more acute in languages other than English. We aim to address both the issue of scarcity of non-English data and the ubiquitous need for broad domain coverage by presenting a new, large and diverse Czech corpus, expertly annotated for GEC.
Grammar Error Correction Corpus for Czech (GECCC) includes texts from multiple domains in a total of 83 058 sentences, being, to our knowledge, the largest non-English GEC corpus, as well as being one of the largest GEC corpora overall.
In order to represent a diversity of writing styles and origins, besides essays of both native and non-native speakers from Czech learner corpora, we also scraped website texts to complement the learner domain with supposedly lower error density texts, encompassing a representation of the following four domains:
• Native Formal - essays written by native students of elementary and secondary schools
• Native Web Informal - informal website discussions
• Romani - essays written by children and teenagers of the Romani ethnic minority
• Second Learners - essays written by non-native learners
Using the presented data, we compare several state-of-the-art Czech GEC systems, including some Transformer-based ones.
Finally, we conduct a meta-evaluation of GEC metrics against human judgments to select the most appropriate metric for evaluating corrections on the new dataset. The analysis is performed across domains, in line with Napoles et al. (2019).
Our contributions include (i) a large and diverse Czech GEC corpus, covering learner corpora and website texts, with unified and, in some domains, completely new GEC annotations, (ii) a comparison of Czech GEC systems, and (iii) a meta-evaluation of common GEC metrics against human judgment on the released corpus.
Related Work

Grammar Error Correction Corpora
Until recently, attention has been focused mostly on English, while GEC data resources for other languages were in short supply. Here we list a few examples of English GEC corpora, collected mostly within an English-as-a-second-language (ESL) paradigm. For a comparison of their relevant statistics see Table 1.
Lang-8 Corpus of Learner English (Tajiri et al., 2012) is a corpus of English language learner texts from the Lang-8 social networking system.
NUCLE (Dahlmeier et al., 2013) consists of essays written by undergraduate students of the National University of Singapore.
FCE (Yannakoudakis et al., 2011) includes short essays written by non-native learners for the Cambridge ESOL First Certificate in English.
W&I+LOCNESS is a union of two datasets, the W&I (Write & Improve) dataset (Yannakoudakis et al., 2018) of non-native learners' essays, complemented by the LOCNESS corpus (Granger, 1998), a collection of essays written by native English students.
The GEC error annotations for the learner corpora above were distributed with the BEA-2019 Shared Task on Grammatical Error Correction (Bryant et al., 2019).
The CoNLL-2014 shared task test set (Ng et al., 2014) is often used for GEC systems evaluation. This small corpus consists of 50 essays written by 25 South-East Asian undergraduates.
JFLEG (Napoles et al., 2017) is another frequently used GEC corpus with fluency edits in addition to usual grammatical edits.
To broaden the restricted variety of domains, focused primarily on learner essays, a CWEB collection (Flachs et al., 2020) of website texts was recently released, aiming at contributing lower error density data.
AESW (Daudaravicius et al., 2016) is a large corpus of scientific writing (over 1M sentences), edited by professional editors.
Finally, Napoles et al. (2019) recently released GMEG, a corpus for the evaluation of GEC metrics across domains.
To better account for multiple correction options, datasets often contain several reference sentences for each original noisy sentence in the test set, proposed by multiple annotators. As we can see in Table 1, the number of annotations typically ranges between 1 and 5, with the exception of the CoNLL14 test set, which, on top of the official 2 reference corrections, later received 10 annotations from Bryant and Ng (2015) and 8 alternative annotations from Sakaguchi et al. (2016).

Czech Learner Corpora
By the early 2010s, Czech was one of a few languages other than English to boast a series of learner corpora, compiled under the umbrella project AKCES, evoking the concept of acquisition corpora (Šebesta, 2010).
The native section includes transcripts of hand-written essays (SKRIPT 2012) and classroom conversation (SCHOLA 2010) from elementary and secondary schools. Both have their counterparts documenting the Roma ethnolect of Czech:¹ essays (ROMi 2013) and recordings and transcripts of dialogues (ROMi 1.0).
The non-native section goes by the name of CzeSL, the acronym of Czech as the Second Language. CzeSL consists of transcripts of short hand-written essays collected from non-native learners with various levels of proficiency and native languages, mostly students attending Czech language courses before or during their studies at a Czech university. There are several releases of CzeSL, which differ mainly in the extent and manner of their annotation (Rosen et al., 2020). More recently, hand-written essays have been transcribed and annotated in TEITOK (Janssen, 2016), a tool combining a number of corpus compilation, annotation and exploitation functionalities.
¹ The Romani ethnolect of Czech is the result of contact with Romani as the linguistic substrate. To a lesser (and weakening) extent the ethnolect shows some influence of Slovak or even Hungarian, because most of its speakers have roots in Slovakia. The ethnolect can exhibit various specifics across all linguistic levels. However, nearly all of them are complementary with their colloquial or standard Czech counterparts. A short written text, devoid of phonological properties, may be hard to distinguish from texts written by learners without the Romani background. The only striking exception is misspellings in contexts where the latter benefit from more exposure to written Czech. The typical example is the omission of word boundaries within phonological words, e.g., between a clitic and its host. In other respects, the pattern of error distribution in texts produced by ethnolect speakers is closer to that of native rather than foreign learners (Bořkovcová, 2007, 2017).
Learner Czech is also represented in MERLIN, a multilingual (German, Italian, and Czech) corpus built in 2012-2014 from texts submitted as a part of tests for language proficiency levels (Boyd et al., 2014).
Finally, AKCES-GEC (Náplava and Straka, 2019) is a GEC corpus for Czech created from a subset of the above-mentioned AKCES resources (Šebesta, 2010): the CzeSL-man corpus (non-native Czech learners with manual annotation) and a part of the ROMi corpus (speakers of the Romani ethnolect).
Compared to the AKCES-GEC, the new GECCC corpus contains much more data (47 371 sentences vs. 83 058 sentences, respectively), by extending data in the existing domains and also adding two new domains: essays written by native learners and website texts, making it the largest non-English GEC corpus and one of the largest GEC corpora overall.

Data Selection
We draw the original uncorrected data from the following Czech learner corpora or Czech websites:
• Native Formal - essays written by native students of elementary and secondary schools from the SKRIPT 2012 learner corpus, compiled in the AKCES project
• Native Web Informal - newly annotated informal website discussions from the Czech Facebook Dataset (Habernal et al., 2013a,b) and the Czech news site novinky.cz
• Romani - essays written by children and teenagers of the Romani ethnic minority, from the ROMi corpus of the AKCES project and the Romani section of the AKCES-GEC corpus
• Second Learners - essays written by non-native learners, from the Foreigners section of the AKCES-GEC corpus, and the MERLIN corpus
Since we draw our data from several Czech corpora originally created in different tools with different annotation schemes and instructions, we re-annotated the errors in a unified manner for the entire development and test set and partially also for the training set. The data split was carefully designed to maintain representativeness, coverage and backwards compatibility. Specifically, (i) test and development data contain roughly the same amount of annotated data from all domains, (ii) original AKCES-GEC dataset splits remain unchanged, and (iii) additional available detailed annotations such as user proficiency level in MERLIN were leveraged to support the split balance. Overall, the main objective was to achieve representative coverage over development and testing data.
Table 2 presents the sizes of data resources in the number of documents. The first column (Documents) shows the number of all available documents collected in an initial scan. The second column (Selected) is a subset of the available documents, selected because of budgetary constraints and to achieve a representative sample over all domains and data portions. The relatively higher number of documents selected for the Native Web Informal domain is due to its substantially shorter texts, yielding fewer sentences; also, we needed to populate this part of the corpus as a completely new domain with no previously annotated data.
To achieve more fine-grained balancing of the splits, we used additional metadata where available: the users' proficiency levels and native languages from MERLIN and the age group from AKCES.

Preprocessing
De/tokenization is an important part of data preprocessing in grammar error correction. Some formats, such as the M2 format (Dahlmeier and Ng, 2012), require tokenized text to track and evaluate correction edits. On the other hand, detokenized text in its natural form is required for other applications. We therefore release our corpus in two formats: a tokenized M2 format and a detokenized format aligned at sentence, paragraph, and document level. As part of our data is drawn from earlier, tokenized GEC corpora AKCES-GEC and MERLIN, this data had to be detokenized. A slightly modified Moses detokenizer is attached to the corpus. To tokenize the data for the M2 format, we use the UDPipe tokenizer (Straka et al., 2016).
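To illustrate why the M2 format depends on tokenization, the following minimal sketch parses one M2 entry and applies its edits to the tokenized source. It assumes the commonly used M2 layout (an `S` line with space-separated tokens, followed by `A` lines with `|||`-separated fields whose first field holds token offsets). The example sentence, the error-type labels, and the function names are illustrative assumptions, not taken from the released corpus.

```python
from dataclasses import dataclass


@dataclass
class Edit:
    start: int        # index of the first source token covered by the edit
    end: int          # index one past the last covered source token
    error_type: str
    correction: str   # replacement text; an empty string deletes the span
    annotator: int


def parse_m2_block(block):
    """Parse one M2 entry: an 'S' line followed by zero or more 'A' lines."""
    tokens, edits = [], []
    for line in block.strip().splitlines():
        if line.startswith("S "):
            tokens = line[2:].split()
        elif line.startswith("A "):
            fields = line[2:].split("|||")
            if fields[1] == "noop":      # 'noop' marks a sentence left unchanged
                continue
            start, end = map(int, fields[0].split())
            edits.append(Edit(start, end, fields[1], fields[2], int(fields[-1])))
    return tokens, edits


def apply_edits(tokens, edits, annotator=0):
    """Apply one annotator's edits right to left so earlier offsets stay valid."""
    out = list(tokens)
    chosen = [e for e in edits if e.annotator == annotator]
    for e in sorted(chosen, key=lambda e: (e.start, e.end), reverse=True):
        out[e.start:e.end] = e.correction.split()
    return out


# Hypothetical entry: a missing comma and a missing diacritic.
example = """S Vidím že prši .
A 1 1|||M:PUNCT|||,|||REQUIRED|||-NONE-|||0
A 2 3|||R:DIACR|||prší|||REQUIRED|||-NONE-|||0"""

tokens, edits = parse_m2_block(example)
corrected = apply_edits(tokens, edits)
print(" ".join(corrected))
```

Because the offsets index tokens rather than characters, the same edits cannot be applied to detokenized text without an alignment, which is one reason for distributing both formats.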

Annotation
The test and development sets in all domains were annotated from scratch by five in-house expert annotators, including re-annotations of the development and test data of the earlier GEC corpora to achieve a unified annotation style. All the test sentences were annotated by two annotators; one half of the development sentences received two annotations and the second half one annotation. The annotation process took about 350 hours in total.
The annotation instructions were unified across all domains: The corrected text must not contain any grammatical or spelling errors and should sound fluent. Fluency edits are allowed if the original is incoherent. The entire document was given as a context for the annotation. Annotators were instructed to remove documents that were too incomprehensible or those containing private information.
To keep the annotation process simple for the annotators, the sentences were annotated (corrected) in a text editor and postprocessed automatically to retrieve and categorize the GEC edits by the ERRor ANnotation Toolkit (ERRANT) (Bryant et al., 2017).

Train Data
The first source of the training data is the data from SKRIPT 2012, the MERLIN corpus, and the AKCES-GEC train set that were not re-annotated and thus retain their original annotations. These data cover the Native Formal, the Romani, and the Second Learners domains. The second part of the training data is newly annotated data. Specifically, these are all Native Web Informal data and also a small part of the Second Learners domain. All data in the training set were annotated with one annotation.

Corpus Alignment
The majority of models proposed for grammatical error correction operate over sentences. However, preliminary studies on document-level grammatical error correction recently appeared (Chollampatt et al., 2019; Yuan and Bryant, 2021). The models were shown to benefit from larger context, as certain errors such as errors in articles or tense choice do require larger context. To simplify future work with our dataset, we release three alignment levels: (i) sentence-level, (ii) paragraph-level, and (iii) document-level. Given that the state-of-the-art grammatical error correction systems still operate on sentence level despite the initial attempts with document-level systems, we perform model training and evaluation at the usual sentence level.⁸

Inter-Annotator Agreement
As suggested by Rozovskaya and Roth (2010), followed later by Rozovskaya and Roth (2019) and Syvokon and Nahorna (2021), we evaluate inter-annotator agreement by asking a second annotator to judge the need for a correction in a sentence already annotated by someone else, in a single-blind setting as to the status of the sentence (corrected/uncorrected).⁹ Five annotators annotated the first pass and three annotators judged the sentence correctness in the second pass. In the second pass, each of the three annotators judged a disjoint set of 120 sentences. Table 3 summarizes the inter-annotator agreement based on second-pass judgments: The numbers represent the percentage of sentences judged correct in the second pass. Both the average and the standard deviation (82.96 ± 12.12) of our inter-annotator agreement are similar to inter-annotator agreement measured on English (63 ± 18.46, Rozovskaya and Roth 2010), Russian (80 ± 16.26, Rozovskaya and Roth 2019), and Ukrainian (69.5 ± 7.78, Syvokon and Nahorna 2021).
Table 3: Inter-annotator agreement based on second-pass judgments. Numbers represent percentage of sentences judged correct in second-pass proofreading. Five annotators annotated the first pass, and three annotators judged the sentence correctness in the second pass.
⁸ Note that even if human evaluation in Section 5 is performed on sentence-aligned data, human annotators process whole documents, and thus take the full context into account.
⁹ A sentence-level agreement on sentence correctness is generally preferred in GEC annotations to an exact inter-annotator match on token edits, since different series of corrections may possibly lead to a correct sentence (Bryant and Ng, 2015).

Error Type Analysis
To retrieve and categorize the correction edits from the erroneous-corrected sentence pairs, the ERRor ANnotation Toolkit (ERRANT) (Bryant et al., 2017) was used. Inspired by Boyd (2018), we adapted the original English error types to the Czech language. For the resulting set see Table 4. The POS error types are based on the UD POS tags (Nivre et al., 2020) and may contain an optional :INFL subtype when the original and the corrected words share a common lemma. The word-order error type was extended by an optional :SPELL subtype to allow for capturing word order errors including words with minor spelling errors. The original orthography error type ORTH covering both errors in casing and whitespaces is now subtyped with :WSPACE and :CASING to better distinguish between the two phenomena. Finally, we add two error types specific to Czech: DIACR for errors in either missing or redundant diacritics and QUOTATION for wrongly used quotation marks. Two original error types remain unchanged: MORPH, indicating replacement of a token by another with the same lemma but different POS, and SPELL, indicating incorrect spelling. For part-of-speech tagging and lemmatization we rely on UDPipe (Straka et al., 2016).¹⁰ The word list for detecting spelling errors comes from MorfFlex.¹¹
We release the Czech ERRANT at https://github.com/ufal/errant_czech. We assume that it is applicable to other languages with a similar set of errors, especially Slavic languages, if a lemmatizer, tagger, and morphological dictionary are available.
Table 5: Corpus statistics at three alignment levels: sentence-aligned, paragraph-aligned, and doc-aligned. Average error rate was computed on the concatenation of development and test data at all three alignment levels.
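The surface-level error types described above (ORTH:CASING, ORTH:WSPACE, DIACR) can be told apart by simple string comparisons. The sketch below is only an illustration of that idea; it is not the rule set of the released Czech ERRANT, and the function names, the order of the checks, and the fallback label are assumptions made for the example.

```python
import unicodedata


def strip_diacritics(text):
    """Remove combining marks after canonical decomposition (e.g. 'š' -> 's')."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def surface_error_type(original, corrected):
    """Guess a surface-level error type for one original/corrected span pair.

    Returns None when the spans are identical and 'OTHER' when none of the
    three surface checks applies (a real classifier would continue with
    lemma- and POS-based rules at that point).
    """
    if original == corrected:
        return None
    if original.lower() == corrected.lower():
        return "ORTH:CASING"          # only letter case differs
    if original.replace(" ", "") == corrected.replace(" ", ""):
        return "ORTH:WSPACE"          # only whitespace differs
    if strip_diacritics(original) == strip_diacritics(corrected):
        return "DIACR"                # only diacritics differ
    return "OTHER"


for pair in [("praha", "Praha"), ("na příklad", "například"), ("prsi", "prší")]:
    print(pair, surface_error_type(*pair))
```

The POS-based types and the :INFL subtype cannot be derived from strings alone, which is why a tagger and a lemmatizer are needed for the full classification.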

Final Dataset
The final corpus consists of 83 058 sentences and is distributed in two formats: the tokenized M2 format (Dahlmeier and Ng, 2012) and the detokenized format with alignments at the sentence, paragraph, and document levels. Although the detokenized format does not include correction edits, it does retain full information about the original spacing.
The statistics of the final dataset are presented in Table 5. The individual domains are balanced on the sentence level in the development and testing sets, each of them containing about 8 000 sentences. The number of paragraphs and documents varies: on average, the Native Web Informal domain contains fewer than 2 sentences per document, while the Native Formal domain contains more than 20.
¹⁰ Using the czech-pdt-ud-2.5-191206.udpipe model.
¹¹ We also use the aggressive variant of the stemmer from https://research.variancia.com/czech_stemmer/.
As expected, the domains differ also in the error rate, that is, the proportion of erroneous tokens (see Table 5). The students' essays in the Native Formal domain are almost 3 times less erroneous than any other domain, while in the Romani and Second Learners domain, approximately every fourth token is incorrect.
Furthermore, the prevalence of error types differs for each individual domain. The 10 most common error types in each domain are presented in Figure 1. Overall, errors in punctuation (PUNCT) constitute the most common error type. They are the most common error in three domains, although their relative frequency varies. We further estimated that of these errors, 9% (Native Formal) to 27% (Native Web Informal) are uninteresting from the linguistic perspective, as they are only omissions of the sentence formal ending, probably purposeful in the case of Native Web Informal. The rest (75-91%) appears within a sentence, and most of these (35-68% Native Formal) are misplaced commas: in Czech, the syntactic status of finite clauses strictly determines the use of commas in the sentence. Finally, in 5-7% of all punctuation errors, the correction included joining two sentences or splitting a sentence into two sentences. Errors in either missing or wrongly used diacritics (DIACR), spelling errors (SPELL), and errors in orthography (ORTH) are also common, with varying frequency across domains.
Compared to the AKCES-GEC corpus, the Grammar Error Correction Corpus for Czech contains more than 3 times as many sentences in the development and test sets, more than 50% more sentences in the training set and also two new domains.
Table 6: [...] Náplava and Straka (2019). Note that models vastly differ in training/fine-tuning data and size (e.g., Rothe et al. (2021) xxl is 50 times larger than AG finetuned).
To the best of our knowledge, the newly introduced GECCC dataset is the largest among GEC corpora in languages other than English and it is surpassed in size only by the English Lang-8 and AESW datasets. With the exclusion of these two datasets, the GECCC dataset contains more sentences than any other GEC corpus currently known to us.

Model
In this section, we describe five systems for automatic error correction in Czech and analyze their performance on the new dataset. Four of these systems represent previously published Czech work (Richter et al., 2012;Náplava and Straka, 2019;Náplava et al., 2021) and one is our new implementation. The first system is a pre-neural approach, published and available for Czech (Richter et al., 2012), included for historical reasons as a previously known and available Czech GEC tool; the following four systems represent the current state of the art in GEC: They are all neural network architectures based on Transformers, differing in the training procedure, training data, or training objective. A comparison of systems, trained and evaluated on English, Czech, German, and Russian, with state of the art, is given in Table 6.

Models
Table 7: Mean score of human judgments and M2_0.5 score for each system in domains (NF = Native Formal, NWI = Native Web Informal, R = Romani, SL = Second Learners, Σ = whole dataset). All results in the whole dataset (the Σ column) are statistically significant with p-value < 0.001, except for the AG finetuned and Joint GEC+NMT systems, where the p-value is less than 6.2% for M2_0.5 score and less than 4.3% for human score, using the Monte Carlo permutation test with 10M samples and probability of error at most 10^−6 (Fay and Follmann, 2002; Gandy, 2009).
We experiment with the following models:
Korektor (Richter et al., 2012) is a pre-neural statistical spellchecker and (occasional) grammar checker. It uses the noisy channel approach with a candidate model that for each word suggests its variants up to a predefined edit distance. Internally, a hidden Markov model (Baum and Petrie, 1966) is built. Its hidden states are the variants of words proposed by the candidate model, and the transition costs are determined from three N-gram language models built over word forms, lemmas, and part-of-speech tags. To find an optimal correction, the Viterbi algorithm (Forney, 1973) is used.
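The decoding step of such a noisy-channel spellchecker can be sketched as a Viterbi search over candidate variants. The following minimal example is not Korektor's implementation: it assumes a single bigram cost function instead of three N-gram language models, and the candidate lists and all cost values are invented for illustration.

```python
def viterbi(candidates, emission_cost, transition_cost):
    """Return the lowest-cost sequence of variants.

    candidates[i] lists the variants proposed for the i-th observed word,
    emission_cost(i, v) is the cost of reading variant v at position i, and
    transition_cost(u, v) is the language-model cost of v following u.
    """
    # best[v] = (cost of the cheapest path ending in variant v, that path)
    best = {v: (emission_cost(0, v), [v]) for v in candidates[0]}
    for i in range(1, len(candidates)):
        new_best = {}
        for v in candidates[i]:
            prev, (cost, path) = min(
                best.items(),
                key=lambda kv: kv[1][0] + transition_cost(kv[0], v),
            )
            total = cost + transition_cost(prev, v) + emission_cost(i, v)
            new_best[v] = (total, path + [v])
        best = new_best
    return min(best.values(), key=lambda cost_path: cost_path[0])[1]


# Toy example: restoring diacritics in "mam rad" ("I like").
observed = ["mam", "rad"]
candidates = [["mam", "mám"], ["rad", "rád", "řad"]]
bigram_cost = {("mám", "rád"): 0.5}   # hypothetical language-model costs


def emission(i, variant):
    # Keeping the observed form is free; changing it costs one unit.
    return 0.0 if variant == observed[i] else 1.0


def transition(prev, variant):
    return bigram_cost.get((prev, variant), 3.0)


print(viterbi(candidates, emission, transition))
```

Costs play the role of negative log-probabilities here, so minimizing their sum corresponds to maximizing the product of channel and language-model probabilities.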
Synthetic trained (Náplava and Straka, 2019) is a neural-based Transformer model that is trained to translate the original ungrammatical text to a well formed text. The original Transformer model (Vaswani et al., 2017) is regularized with an additional source and target word dropout and the training objective is modified to focus on tokens that should change (Grundkiewicz and Junczys-Dowmunt, 2019). As the amount of existing annotated data is small, an unsupervised approach with a spelling dictionary is used to generate a large amount of synthetic training data. The model is trained solely on these synthetic data.
AKCES-GEC (AG) finetuned (Náplava and Straka, 2019) is based on Synthetic trained, but finetunes its weights on a mixture of synthetic and authentic data from the AKCES-GEC corpus, namely, on data from the Romani and Second Learners domains. See Table 6 for comparison with state of the art in English, Czech, German and Russian.
GECCC finetuned uses the same architecture as Synthetic trained, but we finetune its weights on a mixture of synthetic and (much larger) authentic data from the newly released GECCC corpus. We use the official code of Náplava and Straka (2019) with the default settings and mix the synthetic and new authentic data in a ratio of 2:1.
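One simple way to realize a 2:1 mixture of synthetic and authentic sentence pairs is sketched below. This is an assumption about how such mixing could be done, not a description of the official training code; the function name and the choice to cycle through the smaller authentic set are hypothetical.

```python
import itertools
import random


def mix_two_to_one(synthetic, authentic, n_examples, seed=0):
    """Build a training stream with two synthetic pairs per authentic pair.

    The authentic data are cycled, so they are repeated when the synthetic
    data are more than twice as large.
    """
    rng = random.Random(seed)
    synthetic_iter = iter(synthetic)
    authentic_cycle = itertools.cycle(authentic)
    mixed = []
    while len(mixed) + 3 <= n_examples:
        try:
            group = [next(synthetic_iter), next(synthetic_iter), next(authentic_cycle)]
        except StopIteration:        # one of the sources ran out
            break
        rng.shuffle(group)           # avoid a fixed position for authentic pairs
        mixed.extend(group)
    return mixed


synthetic_pairs = [("synthetic", i) for i in range(10)]
authentic_pairs = [("authentic", i) for i in range(2)]
mixed_pairs = mix_two_to_one(synthetic_pairs, authentic_pairs, n_examples=9)
print(mixed_pairs)
```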
Joint GEC+NMT (Náplava et al., 2021) is a Transformer model trained in a multi-task setting. It pursues two objectives: (i) to correct Czech and English texts; (ii) to translate the noised Czech texts into English texts and the noised English texts into Czech texts. The source data come from the CzEng v2.0 corpus (Kocmi et al., 2020) and were noised using a statistical system, KaziText (Náplava et al., 2021), that tries to model several of the most frequently occurring errors such as diacritics, spelling or word ordering. The statistics of the Czech noise were estimated on the new training set; therefore, the system was indirectly trained also on data from the Native Formal and Native Web Informal domains, unlike the AG finetuned system. The statistics of the English noise were estimated on NUCLE (Dahlmeier et al., 2013), FCE (Yannakoudakis et al., 2011), and W&I+LOCNESS (Yannakoudakis et al., 2018; Granger, 1998).

Results and Analysis
Table 7 summarizes the evaluation of the five grammar error correction systems (described in the previous Section 4.1), evaluated with the highest-correlating and widely used metric, the M2 score with β = 0.5, denoted as M2_0.5 (left); and with human judgments (right). For the meta-evaluation of GEC metrics against human judgments, see the following Section 5.
Clearly, learning on GEC annotated data improves performance significantly, as evidenced by a giant leap between the systems without GEC data (Korektor, Synthetic trained) and the remaining systems. Increasing the GEC data volume and the number of domains is statistically significantly better (p < 0.001), as the only difference between the AG finetuned and GECCC finetuned systems is that the former uses the AKCES-GEC corpus, while the latter is trained on the larger and domain-richer GECCC. Access to larger data and more domains in the multi-task setting is useful (compare Joint GEC+NMT and AG finetuned on the newly added Native Formal and Native Web Informal domains), although direct training seems superior (GECCC finetuned over Joint GEC+NMT).
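The significance levels reported with Table 7 come from a Monte Carlo permutation test. A minimal paired variant for per-sentence scores (such as human judgments) is sketched below. It assumes that the test statistic is the difference of mean scores, and it omits the sequential error bounds of Fay and Follmann (2002) and Gandy (2009); the function name and the example scores are hypothetical.

```python
import random


def paired_permutation_test(scores_a, scores_b, n_samples=10_000, seed=0):
    """Two-sided Monte Carlo permutation test for a difference in mean score.

    scores_a[i] and scores_b[i] are the scores of two systems on the same
    sentence. Under the null hypothesis the system labels are exchangeable,
    so each per-sentence difference may have its sign flipped at random.
    """
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs)) / n
    hits = 0
    for _ in range(n_samples):
        total = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(total) / n >= observed:
            hits += 1
    # Add-one smoothing keeps the estimate strictly positive.
    return (hits + 1) / (n_samples + 1)


p_same = paired_permutation_test([7, 8, 9, 6], [7, 8, 9, 6], n_samples=200)
p_diff = paired_permutation_test([9] * 50, [5] * 50, n_samples=2000)
print(p_same, p_diff)
```

For a corpus-level metric such as M2_0.5, the same scheme would swap the per-sentence edit counts of the two systems and recompute the F-score on each permuted assignment, rather than averaging per-sentence values.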
We further analyze the best model (GECCC finetuned) and inspect its performance with respect to individual error types. For simpler analysis, we grouped all POS-related errors into two error types: POS and POS:INFL for words that are erroneous only in inflection and share the same lemma with their correction.
As we can see in Table 8, the model is very good at correcting local errors in diacritics (DIACR), quotation (QUOTATION), spelling (SPELL), and casing (ORTH:CASING). Unsurprisingly, small changes are easier than longer edits; similarly, the system is better at inflection corrections (POS:INFL, words with the same lemma) than at POS corrections (where the correction involves finding a word with a different lemma).
Should the word be split or joined with an adjacent word, the model does so with a relatively high success rate (ORTH:WSPACE). The model is also able to correctly reorder words (WO), but here its recall is rather low. The model performs worst on errors categorized as OTHER, which includes edits that often require rewriting larger pieces of text. Generally, the model has higher precision than recall, which suits the needs of standard GEC, where proposing a bad correction for a good text is worse than being inert to an existing error.

Meta-evaluation of Metrics
There are several automatic metrics used for evaluating system performance on GEC datasets, although it is not clear which of them is preferable in terms of high correlation with human judgments on our dataset.
The most popular GEC metrics are the MaxMatch (M2) scorer (Dahlmeier and Ng, 2012) and the ERRANT scorer (Bryant et al., 2017).
The MaxMatch (M2) scorer reports the F-score over the optimal phrasal alignment between a source sentence and a system hypothesis reaching the highest overlap with the gold standard annotation. It was used as the official metric for the CoNLL 2013 and 2014 Shared Tasks (Ng et al., 2013, 2014) and is also used on various other datasets such as the German Falko-MERLIN GEC (Boyd, 2018) or Russian RULEC-GEC (Rozovskaya and Roth, 2019).
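Both scorers ultimately report an F_β score computed from counts of matched, spurious, and missed edits. The following sketch shows the standard F_β formula on such counts; it does not reproduce the edit extraction or alignment of either scorer. Treating precision and recall as 1 when their denominators are empty is an assumption of this example, and the counts are invented.

```python
def f_beta(tp, fp, fn, beta=0.5):
    """F-beta from edit counts.

    tp = system edits matching a gold edit, fp = system edits with no match,
    fn = gold edits the system missed. beta < 1 weights precision more
    heavily than recall; beta = 0.5 is the usual GEC setting.
    """
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    b2 = beta * beta
    denominator = b2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + b2) * precision * recall / denominator


# Hypothetical counts: precision 0.75, recall 0.5.
print(f_beta(60, 20, 60, beta=0.5))
print(f_beta(60, 20, 60, beta=1.0))
```

With these counts the β = 0.5 score is higher than the β = 1 score, because the system's precision exceeds its recall and β = 0.5 rewards precision.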
The ERRANT scorer was used as the official metric of the recent Building Educational Applications (BEA) 2019 Shared Task on GEC (Bryant et al., 2019). The ERRANT scorer also contains a set of rules operating over a set of linguistic annotations to construct the alignment and extract individual edits.
Other popular automatic metrics are the Generalized Language Evaluation Understanding (GLEU) metric (Napoles et al., 2015), which additionally measures text fluency, and I-Measure (Felice and Briscoe, 2015), which calculates weighted accuracy of both error detection and correction.

Human Judgments Annotation
In order to evaluate the correlation of several GEC metrics with human judgments, we collected annotations of the original erroneous sentences, the manually corrected gold references, and automatic corrections made by five GEC systems described in Section 4. We used the hybrid partial ranking with scalars (Sakaguchi and Van Durme, 2018), in which the annotators judged the sentences on a scale from 0 to 10 (from ungrammatical to correct).¹² The sentences were evaluated with respect to the context of the document. In total, three annotators judged 1 100 documents, sampled from the test set, comprising about 4 300 original sentences and about 15 500 unique corrected variants and gold references of the sentences. The annotators annotated 127 documents jointly and the rest was annotated by a single annotator. This annotation process took about 170 hours. Together with the model training, data preparation, and management of the annotation process, our rough estimate is about 300+ man-hours for the correlation analysis per corpus (language).

Agreement in Human Judgments
For the agreement in human judgments, we report the Pearson correlation and Spearman's rank correlation coefficient between 3 human judgments of 5 automatic sentence corrections at the system- and sentence-level. At the sentence level, the correlation of the judgments about the 5 sentence corrections is calculated for each sentence and each pair of the three annotators. The final sentence-level annotator agreement is the mean of these values over all sentences.
At the system level, the annotators' judgments for each system are averaged over the sentences, and the correlation of these averaged judgments is computed for each pair of the three annotators. In order to obtain smoother estimates (especially for Spearman's ρ), we utilize bootstrap resampling with 100 samples of a test set.
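The two agreement computations described above can be sketched as follows for the Pearson coefficient. The data layout, the function names, and the decision to skip annotator pairs with constant judgments (for which the correlation is undefined) are assumptions of this example; Spearman's ρ and the bootstrap resampling are left out for brevity.

```python
from itertools import combinations
from statistics import mean


def pearson(x, y):
    """Pearson correlation, or None when either vector is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return None
    return cov / (vx * vy) ** 0.5


def mean_pairwise(vectors):
    """Defined correlations for all pairs of annotator vectors."""
    values = (pearson(x, y) for x, y in combinations(vectors, 2))
    return [v for v in values if v is not None]


def sentence_level_agreement(judgments):
    """judgments[s][a][k] = score by annotator a for system k on sentence s."""
    values = []
    for per_annotator in judgments:
        values.extend(mean_pairwise(per_annotator))
    return mean(values)


def system_level_agreement(judgments):
    n_annotators = len(judgments[0])
    n_systems = len(judgments[0][0])
    # averaged[a][k] = mean score annotator a gave to system k over sentences
    averaged = [
        [mean(sentence[a][k] for sentence in judgments) for k in range(n_systems)]
        for a in range(n_annotators)
    ]
    return mean(mean_pairwise(averaged))


# Hypothetical judgments: 2 sentences, 3 annotators, 5 systems each.
toy = [
    [[1, 2, 3, 4, 5], [2, 4, 6, 8, 10], [0, 1, 2, 3, 4]],
    [[2, 3, 4, 5, 6], [1, 3, 5, 7, 9], [5, 6, 7, 8, 9]],
]
print(sentence_level_agreement(toy), system_level_agreement(toy))
```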
The human judgments agreement across domains is shown in Table 9. On the sentence level, the human judgments correlation is high on the least erroneous domain Native Formal, implying that it is easier to judge the corrections in a low error density setting, and it is more difficult in high error density domains, such as Romani and Second Learners (compare error rates in Table 5).

Metrics Correlations with Judgments
Following Napoles et al. (2019), we provide a meta-evaluation of the robustness of the following common GEC metrics on our corpus:
• MaxMatch (M2) (Dahlmeier and Ng, 2012)
• ERRANT (Bryant et al., 2017)
• GLEU (Napoles et al., 2015)
• I-measure (Felice and Briscoe, 2015)
Moreover, we vary the weight β of recall relative to precision from 0 to 2.0 for the M2 scorer and ERRANT, as Grundkiewicz et al. (2015) report that the standard choice of considering precision two times as important as recall may be sub-optimal.
¹² Recent work (Sakaguchi and Van Durme, 2018; Novikova et al., 2018) found partial ranking with scalars to be more reliable than the direct assessment framework used by WMT (Bojar et al., 2016) and earlier GEC evaluation approaches (Grundkiewicz et al., 2015; Napoles et al., 2015).
Table 9: Human judgments agreement: Pearson (r) and Spearman (ρ) mean correlation between 3 human judgments of 5 sentence versions at sentence- and system-level.
While we considered both sentence-level and system-level evaluation in Section 5.2, the automatic metrics should by design be used on a whole corpus, leaving us with only system-level evaluation. Given that the GEC systems perform differently on the individual domains (as indicated by Table 7), we perform the correlation computation on each domain separately and report the average. For a given domain and metric, we compute the correlation between the automatic metric evaluations of the five systems on one side and the (average of) human judgments on the other side. In order to obtain a smoother estimate of Spearman's ρ and also to estimate standard deviations, we employ bootstrap resampling again, with 100 samples.
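A rough sketch of this bootstrap procedure for one domain, followed by averaging over domains, might look like the code below. The callable `corpus_metric`, the `human[system][sentence]` layout, and resampling sentences with replacement are assumptions for illustration; in practice `corpus_metric` would wrap a real scorer such as the M 2 -scorer run on the resampled sentences.

```python
import math
import random
import statistics


def pearson(x, y):
    """Pearson correlation of two equally long sequences (NaN if one is constant)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy) if vx > 0 and vy > 0 else float("nan")


def bootstrap_correlation(corpus_metric, human, n_boot=100, seed=0):
    """Return (mean, standard deviation) of the system-level correlation.

    corpus_metric(system, sentence_ids) -> corpus-level score of that system
        on the given sentences (hypothetical callable wrapping a real scorer).
    human[system][sentence] -> human score, already averaged over annotators.
    """
    rng = random.Random(seed)
    n_systems, n_sentences = len(human), len(human[0])
    values = []
    for _ in range(n_boot):
        # Resample the test set with replacement (assumed bootstrap scheme).
        ids = [rng.randrange(n_sentences) for _ in range(n_sentences)]
        metric_scores = [corpus_metric(k, ids) for k in range(n_systems)]
        human_scores = [
            statistics.fmean(human[k][i] for i in ids) for k in range(n_systems)
        ]
        r = pearson(metric_scores, human_scores)
        if not math.isnan(r):  # assumption: undefined correlations are skipped
            values.append(r)
    return statistics.fmean(values), statistics.stdev(values)


def average_over_domains(per_domain):
    """Average the bootstrap means obtained on the individual domains."""
    return statistics.fmean(mean for mean, _ in per_domain)
```

Calling `bootstrap_correlation` once per domain and passing the resulting list of `(mean, std)` pairs to `average_over_domains` yields a single figure per metric. Swapping `pearson` for a rank correlation would give the Spearman variant.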
The results are presented in Table 10. While Spearman's ρ has a more straightforward interpretation, it also has a much higher variance, because it harshly penalizes differences in the ranking of systems with similar performance (namely, AG finetuned and Joint GEC+NMT in our case). This fact has previously been observed by Macháček and Bojar (2013).
Therefore, we choose the most suitable GEC metric for our GECCC dataset according to Pearson r, which implies that M 2 0.5 and ERRANT 0.5 are the metrics most correlating with human judgments. Of those two, we prefer the M 2 0.5 score, not due to its marginal superiority in correlation (Table 10), but rather because it is much more language-agnostic compared to ERRANT, which requires a POS tagger, lemmatizer, morphological dictionary, and language-specific rules.
Table 10: System-level Pearson (r) and Spearman (ρ) correlation between the automatic metric scores and human annotations.
Our results confirm that both M 2 -scorer and ERRANT with β = 0.5 (chosen only by intuition for the CoNLL 2014 Shared task; Ng et al., 2014) correlate much better with human judgments, compared to β = 0.2 and β = 1. The detailed plots of correlations of M 2 β score and ERRANT β score with human judgments for β ranging between 0 and 2, presented in Figure 2, show that optimal β in our case lies between 0.4 and 0.5. However, we opt to employ the widely used β = 0.5 because of its prevalence and because the difference to the optimal β is marginal.
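For reference, the role of β can be made explicit with the standard F-score definition on which both scorers are based. The symbols TP, FP and FN are introduced here for illustration only: system edits that match a reference edit, system edits that match none, and reference edits that the system missed.

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F_\beta = \frac{(1 + \beta^2)\, P\, R}{\beta^2 P + R}
```

With β = 1 the score is the harmonic mean of precision and recall, β = 0.5 weights precision twice as much as recall, and as β approaches 0 the score approaches precision alone.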
Our results differ from those of Grundkiewicz et al. (2015), where β = 0.18 correlates best on the CoNLL 14 test set. Nevertheless, Napoles et al. (2019) demonstrate that β = 0.5 correlates slightly better than β = 0.2 on the FCE dataset, but that β = 0.2 correlates substantially better than β = 0.5 on Wikipedia and also on Yahoo discussions (a dataset containing paragraphs of Yahoo! Answers, which are informal user answers to other users' questions).
In the latter work, Napoles et al. (2019) propose that a larger β (such as 0.5) correlates better on datasets with a higher error rate and vice versa, given that the FCE dataset has a 20.2% token error rate, compared to the error rates of 9.8% and 10.5% of Wikipedia and Yahoo, respectively. The hypothesis seems to extend to our results and the results of Grundkiewicz et al. (2015), considering that the GECCC dataset and the CoNLL 14 test set have token error rates of 18.2% and 8.2%, respectively.

GEC Systems Results
Table 7 presents both human scores for the GEC systems described in Section 4 and also results obtained by the chosen M 2 0.5 metric. The results are presented both on the individual domains and the entire dataset. Measuring over the entire dataset, human judgments and the M 2 -scorer rank the systems in accordance.
Judged by the human annotators, all systems are better than the "do nothing" baseline (the Original) measured over the entire dataset, although Korektor makes harmful changes in two domains: Native Formal and Native Web Informal. These two domains contain frequent named entities, which upon an eager change disturb the meaning of a sentence, leading to severe penalization by human annotators. Korektor is also not capable of deleting, inserting, splitting or joining tokens. The fact that Korektor sometimes performs detrimental changes cannot be revealed by the M 2 -scorer as it assigns zero score to the Original baseline and does not allow negative scores.
The human judgments confirm that there is still a large gap between the optimal Reference score and the best performing models. Regarding the domains, the neural models in the finetuned mode that had access to data from all domains seemed to improve the results consistently across each domain. However, given the fact that the source sentences in the Second Learners domain received the worst scores by human annotators, this domain seems to hold the greatest potential for future improvements.

Conclusions
We release a new Czech GEC corpus, the Grammar Error Correction Corpus for Czech (GECCC). This large corpus with 83 058 sentences covers four diverse domains, including essays written by native students, informal website texts, essays written by Romani ethnic minority children and teenagers and essays written by non-native speakers. All domains are professionally annotated for GEC errors in a unified manner, and errors were automatically categorized with a Czech-specific version of ERRANT released at https://github.com/ufal/errant_czech. We compare several strong Czech GEC systems, and finally, we provide a meta-evaluation of common GEC metrics across domains in our data. We conclude that M 2 and ERRANT scores with β = 0.5 are the measures most correlating with human judgments on our dataset, and we choose the M 2 0.5 as the preferred metric for the GECCC dataset.