An Error Analysis Framework for Shallow Surface Realization

Abstract The metrics standardly used to evaluate Natural Language Generation (NLG) models, such as BLEU or METEOR, fail to provide information on which linguistic factors impact performance. Focusing on Surface Realization (SR), the task of converting an unordered dependency tree into a well-formed sentence, we propose a framework for error analysis which permits identifying which features of the input affect the models' results. This framework consists of two main components: (i) correlation analyses between a wide range of syntactic metrics and standard performance metrics and (ii) a set of techniques to automatically identify syntactic constructs that often co-occur with low performance scores. We demonstrate the advantages of our framework by performing error analysis on the results of 174 system runs submitted to the Multilingual SR shared tasks; we show that dependency edge accuracy correlates with automatic metrics, thereby providing a more interpretable basis for evaluation; and we suggest ways in which our framework could be used to improve models and data. The framework is available in the form of a toolkit which can be used both by campaign organizers to provide detailed, linguistically interpretable feedback on the state of the art in multilingual SR, and by individual researchers to improve models and datasets.1


Introduction
Surface Realization (SR) is a natural language generation task that consists in converting a linguistic representation into a well-formed sentence.
SR is a key module in pipeline generation models, where it is usually the last item in a pipeline of modules designed to convert the input (knowledge graphs, tabular data, numerical data) into a text. While end-to-end generation models have been proposed that do away with such a pipeline architecture, and therefore with SR, pipeline generation models (Dušek and Jurčíček, 2016; Castro Ferreira et al., 2019; Elder et al., 2019; Moryossef et al., 2019) have been shown to perform on a par with these end-to-end models while providing increased controllability and interpretability (each step of the pipeline provides explicit intermediate representations that can be examined and evaluated).

1 Our code and settings to reproduce the experiments are available at https://gitlab.com/shimorina/tacl-2021.
As illustrated in, for example, Dušek and Jurčíček (2016), Elder et al. (2019), and Li (2015), SR also has potential applications in tasks such as summarization and dialogue response generation. In such approaches, shallow dependency trees are viewed as intermediate structures used to mediate between input and output, and SR permits regenerating a summary or a dialogue turn from these intermediate structures.
Finally, multilingual SR is an important task in its own right, in that it permits a detailed evaluation of how neural models handle the varying word order and morphology of different natural languages. While neural language models are powerful at producing high quality text, the results of the multilingual SR tasks (Mille et al., 2018, 2019) clearly show that the generation, from shallow dependency trees, of morphologically and syntactically correct sentences in multiple languages remains an open problem.
As the use of multiple input formats made the comparison and evaluation of existing surface realisers difficult, Belz et al. (2011) and Mille et al. (2018, 2019) organized the SR shared tasks, which provide two standardized input formats for surface realizers: deep and shallow dependency trees. Shallow dependency trees are unordered, lemmatized dependency trees. Deep dependency trees include semantic rather than syntactic relations and abstract over function words.
While the SR tasks provide a common benchmark on which to evaluate and compare SR systems, the evaluation protocol they use (automatic metrics and human evaluation) does not support a detailed error analysis. Metrics (BLEU, DIST, NIST, METEOR, TER) and human assessments are reported at the system level, and so do not provide detailed feedback for each participant. Neither do they give information about which syntactic phenomena impact performance.
In this work, we propose a framework for error analysis that allows for an interpretable, linguistically informed analysis of SR results. While shallow surface realization involves both determining word order (linearization) and inflecting lemmas (morphological realization), since inflection error detection is already covered in morphological shared tasks (Cotterell et al., 2017; Gorman et al., 2019), we focus on error analysis for word order.
Motivated by extensive linguistic studies that deal with syntactic dependencies and their relation to cognitive language processing (Liu, 2008; Futrell et al., 2015; Kahane et al., 2017), we investigate word ordering performance in SR models using various tree-based metrics. Specifically, we explore the hypothesis that these metrics, which provide a measure of the SR input complexity, correlate with the automatic metrics commonly used in NLG. We find that Dependency Edge Accuracy (DEA) correlates with BLEU, which suggests that DEA could be used as an alternative, more interpretable, automatic evaluation metric for surface realizers.
We apply our framework to the results of two evaluation campaigns and demonstrate how it can be used to highlight some global results about the state of the art (e.g., that certain dependency relations such as the list dependency have low accuracy across the board for all 174 submitted runs).
We indicate various ways in which our error analysis framework could be used to improve a model or a dataset, thereby arguing for approaches to model and dataset improvement that are more linguistically guided.
Finally, we make our code available in the form of a toolkit that can be used both by campaign organizers to provide detailed feedback on the state of the art for surface realization and by researchers to better analyze, interpret, and improve their models.

Related Work
There has been a long tradition in NLP of exploring syntactic and semantic evaluation measures based on linguistic structures (Liu and Gildea, 2005; Mehay and Brew, 2007; Giménez and Màrquez, 2009; Tratz and Hovy, 2009; Lo et al., 2012). In particular, dependency-based automatic metrics have been developed for summarization (Hovy et al., 2005; Katragadda, 2009; Owczarzak, 2009) and machine translation (Owczarzak et al., 2007; Yu et al., 2014). Relations between metrics have also been studied: Dang and Owczarzak (2008) found that automatic metrics perform on a par with the dependency-based metric of Hovy et al. (2005) when evaluating summaries. The research closest to ours, which focused on evaluating how dependency-based metrics correlate with human ratings, is Cahill (2009), who showed that, for a German surface realizer, syntactic-based metrics correlate with human judgments as well as standard automatic metrics do.
Researchers working on SR and word ordering have resorted to different metrics to report their models' performance. Zhang et al. (2012), Zhang (2013), Zhang and Clark (2015), Puduppully et al. (2016), and Song et al. (2018) used BLEU; Schmaltz et al. (2016) parsed their outputs and calculated the UAS parsing metric; Filippova and Strube (2009) used Kendall correlation together with edit distance to account for English word order. Similarly, Dyer (2019) used Spearman correlation between produced and gold word order for a dozen languages. White and Rajkumar (2012), in their CCG-based realization, compared average dependency lengths between grammar-generated sentences and the gold standard. Gardent and Narayan (2012) and Narayan and Gardent (2012) proposed an error mining algorithm for generation grammars to identify the most likely sources of failure when generating from dependency trees. Their algorithm mines suspicious subtrees of a dependency tree, which are likely to cause errors. King and White (2018) drew attention to their model's performance on non-projective sentences. Puzikov et al. (2019) assessed their binary classifier for word ordering using the accuracy of predicting the position of a dependent with respect to its head and a sibling. Yu et al. (2019) showed that, for their system, error rates correlate with word order freedom, and reported linearization error rates for some frequent dependency types. In a similar vein, Shimorina and Gardent (2019) looked at their system's performance in terms of dependency relations, which shed light on the differences between their non-delexicalized and delexicalized models.
In sum, multiple metrics and tools have been developed by individual researchers to evaluate and interpret their model results: dependency-based metrics, correlation between these metrics and human ratings, performance on projective vs. non-projective input, linearization error rate, and so forth. At a more global level, however, automatic metrics and human evaluation continue to be massively used.
In this study, we gather a set of linguistically informed, interpretable metrics and tools within a unified framework, apply this framework to the results of two evaluation campaigns (174 participant submissions), and generally argue for a more interpretable evaluation approach for surface realizers.

Framework for Error Analysis
Our error analysis framework gathers a set of performance metrics together with a wide range of tree-based metrics designed to measure the syntactic complexity of the sentence to be generated.We apply correlation tests between these two types of metrics and mine a model output to automatically identify the syntactic constructs that often co-occur with low performance scores.

Syntactic Complexity Metrics
To measure syntactic complexity, we use several metrics commonly used for dependency trees (tree depth and length, mean dependency distance) as well as the ratio, in a test set, of sentences with non-projective structures.
We also consider the entropy of dependency relations and a set of metrics based on ''flux'', recently proposed by Kahane et al. (2017). The flux of an inter-word position is the set of dependency edges linking a word on the left of that position to a word on its right.
The flux size is its cardinality, that is, the number of edges it contains: for the tree in Figure 1, 2 for the position 5-6 and 4 for the position 7-8.
The flux weight is the size of the largest disjoint subset of edges in the flux (Kahane et al., 2017, p. 74). A set of edges is disjoint if the edges it contains do not share any node. For instance, in the inter-word position 5-6, nmod and case share the common node 8, so the flux weight is 1 (i.e., it is impossible to find two disjoint edges). The flux-based metrics aim to account for the cognitive complexity of syntactic structures, in the same fashion as Miller (1956), who showed a processing limitation on syntactic constituents in spoken language.
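To make the flux metrics concrete, the following sketch computes flux size and flux weight for a tree given as a dependent-to-head map over 1-based word positions. This is illustrative code, not the toolkit's implementation, and the example tree is hypothetical (chosen so that its flux values match those discussed above).

```python
from itertools import combinations

def flux(heads, gap):
    """Edges (head, dependent) crossing the inter-word position between gap and gap+1."""
    return [(h, d) for d, h in heads.items() if min(h, d) <= gap < max(h, d)]

def flux_weight(edges):
    """Size of the largest subset of pairwise node-disjoint edges (brute force)."""
    for size in range(len(edges), 0, -1):
        for subset in combinations(edges, size):
            nodes = [n for edge in subset for n in edge]
            if len(nodes) == len(set(nodes)):  # no shared node
                return size
    return 0

# Illustrative 8-word tree: dependent position -> head position (root omitted).
heads = {1: 2, 3: 4, 4: 2, 5: 8, 6: 8, 7: 8, 8: 4}
sizes = [len(flux(heads, gap)) for gap in range(1, 8)]
weights = [flux_weight(flux(heads, gap)) for gap in range(1, 8)]
mean_flux_size = sum(sizes) / len(sizes)        # 2.0 for this tree
mean_flux_weight = sum(weights) / len(weights)  # 1.0 for this tree
```

The brute-force search over subsets is acceptable here because a flux rarely contains more than a handful of edges.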
For each reference dependency tree, we calculate the metrics listed in Table 1. These can then be averaged over different dimensions (all runs, all runs of a given participant, runs on a given corpus or language, etc.). Table 2 shows the statistics obtained for each corpus used in the SR shared tasks. We refer the reader to the Universal Dependencies project2 to learn more about differences between specific treebanks.

Dependency Relation Entropy. Entropy has been used in typological studies to quantify word order freedom across languages (Liu, 2010; Futrell et al., 2015; Gulordava and Merlo, 2016). It gives an estimate of how regular or irregular a dependency relation is with respect to word order. A relation d with high entropy indicates that d-dependents sometimes occur to the left and sometimes to the right of their head, that is, their order is not fixed.
The entropy H of a dependency relation d is calculated as

H(d) = −p(L) log2 p(L) − p(R) log2 p(R)

where p(L) is the probability for a dependent to be on the left of its head, and p(R) is the probability for a dependent to be on the right of its head. For instance, if the dependency relation amod is found to be head-final 20 times in a treebank, and head-initial 80 times, its entropy is equal to 0.72. Entropy ranges from 0 to 1: values close to zero indicate low word order freedom; values close to one mark high variation in head directionality.
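As a sketch (not the toolkit's code), the directional entropy of a relation can be computed from its left/right attachment counts:

```python
import math

def direction_entropy(left_count, right_count):
    """Entropy of head directionality for one dependency relation.

    left_count / right_count: how often the dependent occurs to the
    left / right of its head in the treebank.
    """
    total = left_count + right_count
    entropy = 0.0
    for count in (left_count, right_count):
        if count:  # 0 * log2(0) is taken as 0
            p = count / total
            entropy -= p * math.log2(p)
    return entropy

# amod head-final 20 times, head-initial 80 times -> H ≈ 0.72
h = direction_entropy(20, 80)
```

A relation with a 50/50 split gets entropy 1.0; a relation that always attaches on the same side gets 0.0.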

Performance Metrics
Performance is assessed using sentence-level BLEU-4, DEA, and human evaluation scores.
DEA. DEA measures how many edges from a reference tree can be found in a system output, given the gold lemmas and dependency distances as markers. An edge is represented as a triple (head lemma, dependent lemma, distance), for example, (I, enjoy, −1) or (time, school, +4) in Figure 1.3 In the output, the same triples are searched for based on the lemmas, the direction (after or before the head), and the dependency distance. In our example, two out of the seven dependency relations cannot be found in the output: (school, high, −1) and (school, franklin, −2). Thus, DEA is 0.71 (5/7).

Table 1: Metrics for syntactic complexity of a sentence (the last column gives the corresponding value for the tree in Figure 1).

| Syntactic Complexity | Explanation | Value for Figure 1 |
| tree depth | the depth of the deepest node | 3 |
| tree length | number of nodes | 8 |
| mean dependency distance | average distance between a head and a dependent; for a dependency linking two adjacent nodes, the distance is equal to one (e.g., nsubj in Figure 1) | (1 + 2 + 1 + 4 + 1 + 2 + 3)/7 = 2 |
| mean flux size | average flux size over all inter-word positions | (1 + 1 + 2 + 1 + 2 + 3 + 4)/7 = 2 |
| mean flux weight | average flux weight over all inter-word positions | (1 + 1 + 1 + 1 + 1 + 1 + 1)/7 = 1 |
| mean arity | average number of direct dependents of a node | (0 + 2 + 0 + 2 + 0 + 0 + 0 + 3)/8 = 0.875 |
| projectivity | True if the sentence has a projective parse tree (no crossing dependency edges and/or projection lines) | True |
Human Evaluation Scores. The framework includes sentence-level z-scores for Adequacy and Fluency4 reported in the SR'18 and SR'19 shared tasks. The z-scores were calculated on the set of all raw scores by a given annotator, using each annotator's mean and standard deviation. Note that those were available only for a sample of test instances for some languages, and were calculated using the final system outputs rather than the lemmatized ones.

Correlation Tests
The majority of our metrics are numerical, which allows us to measure dependence between them using correlation. One of the metrics, projectivity, is nominal, so we apply a non-parametric test to measure whether two independent samples (''projective sentences'' and ''non-projective sentences'') have the same distribution of scores.

Error Mining
The tree error mining of Narayan and Gardent (2012) was initially developed to explain errors in grammar-based generators. The algorithm takes as input two groups of dependency trees: those whose derivation was covered (P for Pass) and those whose derivation was not covered (F for Fail) by the generator. Based on these two groups, the algorithm computes a suspicion score S for each subtree f in the input data as follows:

S(f) = (c(f|F) + c(¬f|P)) / (c(f) + c(¬f))

Intuitively, a high suspicion score indicates a subtree (a syntactic construct) in the input data which often co-occurs with failure and seldom with success. The score is inspired by the information gain metric used in decision tree classifiers (Quinlan, 1986) to cluster the input data into subclusters with maximal purity; it is adapted to take into account the degree to which a subtree associates with failure rather than the entropy of the subclusters.
To imitate these two groups of successful and unsuccessful generation, we applied a threshold based on BLEU. All the instances in a model output are divided into two parts: the first quartile (25% of instances)5 with the lowest sentence-level BLEU is considered as failure, the rest as success. Error mining can then be used to automatically identify subtrees of the input tree that often co-occur with failure and rarely with success. Moreover, mining can be applied to trees decorated with any combination of lemmas, dependency relations, and/or POS tags.
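The quartile split and a suspicion-style score can be sketched as below. Note that the exact scoring function used here, an accuracy-like combination of the four counts defined for the mining algorithm, is an assumption on our part; treat it as a stand-in for the published formula.

```python
def split_by_bleu(bleu_scores):
    """Mark the first quartile (lowest sentence-level BLEU) as Fail, the rest as Pass."""
    threshold = sorted(bleu_scores)[len(bleu_scores) // 4]
    return [score < threshold for score in bleu_scores]  # True = Fail

def suspicion(has_subtree, failed):
    """Score a subtree f by how well its presence predicts failure.

    has_subtree[i] / failed[i]: whether sentence i contains f / was a failure.
    """
    c_f_fail = sum(1 for h, f in zip(has_subtree, failed) if h and f)
    c_not_f_pass = sum(1 for h, f in zip(has_subtree, failed) if not h and not f)
    return (c_f_fail + c_not_f_pass) / len(has_subtree)
```

A subtree present in every failed sentence and absent from every passed one would score 1.0; one unrelated to failure scores near 0.5.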

Data and Experimental Setting
We apply our error analysis methods to 174 system outputs (runs) submitted to the shallow track of the SR'18 and SR'19 shared tasks (Mille et al., 2018, 2019). For each generated sentence in the submissions, we compute the metrics described in the preceding section as follows.
Computing Syntactic Complexity Metrics. Tree-based metrics, dependency relation entropy, and projectivity are computed on the gold parse trees from Universal Dependencies v2.0 and v2.3 (Nivre et al., 2017) for SR'18 and SR'19, respectively. Following common practice in computational studies of dependency linguistics, punctuation marks were stripped from the reference trees (based on the punct dependency relation). If a node to be removed had children, these were assigned to the parent of the node.
Computing Performance Metrics. We compute sentence-level BLEU-4 with smoothing method 2 from Chen and Cherry (2014), as implemented in NLTK.6 To compute dependency edge accuracy, we process systems' outputs to allow for comparison with the lemmatized dependency tree of the reference sentence. Systems' outputs were tokenized and lemmatized; contractions were also split to match lemmas in the UD treebanks. Finally, to be consistent with the punctuation-less references, punctuation was also removed from systems' outputs. The preprocessing was done with the stanfordnlp library (Qi et al., 2018).
For human judgments, we collect those provided by the shared tasks for a sample of test data and for some languages (en, es, fr for SR'18 and es_ancora, en_ewt, ru_syntagrus, zh_gsd for SR'19). Table 2 shows how many submissions each language received.
Computing Correlation. For all numerical variables, we assess the relationship between the rankings of two variables using Spearman's ρ correlation. When calculating correlation coefficients, missing values were ignored (this was the case for human evaluations). Correlations were calculated separately for each submission (one system run on one corpus). Because up to 45 comparisons can be made for one submission, we controlled for the multiple testing problem using the Holm-Bonferroni method when performing significance tests. We also calculated means and medians of the correlations for each corpus (all submissions mixed), for each team (a team has multiple submissions), and average correlations across all 174 submissions. For projectivity (a nominal variable), we use a Mann-Whitney U test to determine whether there is a difference in performance between projective and non-projective sentences. We ran three tests where performance was defined in terms of BLEU, fluency, and adequacy. As the count of non-projective sentences in some test sets is low (e.g., 1.56% in en_ewt), we ran the test on the corpora that have more than 5% of non-projective sentences, that is, cs (10%), fi (6%), nl (20%), and ru (8%) for SR'18, and hi_hdtb (9%), ko_gsd (9%), ko_kaist (19%), and ru_syntagrus (6%) for SR'19. For the calculation of the Mann-Whitney U test, we used scipy 1.4.1. As in the correlation analysis, the test was run separately for each submission and for each corpus.
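The Holm-Bonferroni step can be implemented in a few lines of pure Python. This is a sketch; in practice, scipy.stats.spearmanr and scipy.stats.mannwhitneyu would supply the p-values being corrected.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Step-down Holm correction: returns which hypotheses are rejected.

    The smallest p-value is compared against alpha/m, the next against
    alpha/(m-1), and so on; once one comparison fails, all remaining
    (larger) p-values are retained.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break
    return reject

decisions = holm_bonferroni([0.01, 0.04, 0.03])  # only 0.01 survives here
```

With 45 comparisons per submission, the smallest p-value must clear alpha/45, which is what keeps the family-wise error rate at alpha.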
Mining the Input Trees.The error mining algorithm was run for each submission separately and with three different settings: (i) dependency relations (dep); (ii) POS tags (POS); (iii) dependency relations and POS tags (POS-dep).

Error Analysis
We analyze results focusing successively on: tree-based syntactic complexity (are sentences with more complex syntactic trees harder to generate?), projectivity (how much does non-projectivity impact results?), entropy (how much do word order variations affect performance?), and DEA and error mining (which syntactic constructions lead to decreased scores?).

Tree-Based Syntactic Complexity
We examine correlation test results for all metrics at the system level (all submissions together) and for a single model, the BME-UW system (Kovács et al., 2019), on a single corpus/language (zh_gsd, Chinese). Figure 2a shows median Spearman ρ coefficients across all 174 submissions, and Figure 2b shows the coefficients for the BME-UW system on the zh_gsd corpus.
We investigate correlations both between syntactic complexity and performance metrics and within each category. Similar observations can be made for both settings.

Correlation between Performance Metrics.
As often remarked in the NLG context (Stent et al., 2005; Novikova et al., 2017; Reiter, 2018), BLEU shows a weak correlation with Fluency and Adequacy at the sentence level. Similarly, dependency edge accuracy shows weak correlations with human judgments (ρ_ad = 0.2 and ρ_fl = 0.24 for the median; ρ_ad = 0.14 and ρ_fl = 0.23 for BME-UW).7 In contrast, BLEU shows a strong correlation with dependency edge accuracy (median: ρ = 0.68; BME-UW: ρ = 0.88). Contrary to BLEU, however, DEA has a direct linguistic interpretation (it indicates which dependency relations are harder to handle) and can be exploited to analyze and improve a model. We therefore advocate for a more informative evaluation that incorporates DEA in addition to the standard metrics. We believe this will lead to more easily interpretable results and possibly to the development of better, linguistically informed SR models.

Correlation between Syntactic Complexity Metrics.
Unsurprisingly, tree-based metrics correlate positively with each other (the reddish area on the right), with coefficients ranging from weak to strong. Due to overlap in how they are calculated, some of them show strong correlations (e.g., mean dependency distance and mean flux size). In general, no correlation between tree-based metrics and system performance was found globally (i.e., for all models and all test sets). We can use the framework to analyze results on specific corpora or languages, however. For instance, zooming in on the fr corpus, we observe a weak negative correlation at the system level (correlation with the median) between tree-based metrics (e.g., ρ = −0.38 for mean arity and tree length) and DEA. Thus, on this corpus, performance (as measured by DEA) decreases as syntactic complexity increases. Similarly, for ar, cs, fi, it, and nl, tree-based metrics show some negative correlation with BLEU,8 whereby the median ρ values between dependency metrics and BLEU vary from −0.21 to −0.38 for ar, from −0.43 to −0.57 for cs, from −0.2 to −0.46 for fi, from −0.17 to −0.34 for it, and from −0.29 to −0.42 for nl.
Such increases in correlation were observed mainly for corpora for which performance was not high across submissions (see Mille et al. (2018)). We hypothesize that BLEU correlates more with the tree-based metrics when system performance is low.
Significance Testing.Overall, across submissions, coefficients were found non-significant only when they were close to zero (see Figure 2b).

Projectivity
Table 3 shows performance results with respect to the projectivity parameter.
Zooming in on the ru_syntagrus corpus and two models, one that can produce non-projective trees, BME-UW (Kovács et al., 2019), and one that cannot, the IMS system (Yu et al., 2019), we observe two opposite trends.
For the BME-UW model, the median values for fluency and adequacy are higher for non-projective sentences. Fluency medians (proj/non-proj) are 0.15/0.19 (Mann-Whitney U = 4109131.0, n1 = 6070, n2 = 421, p < 0.001 two-tailed); adequacy medians (proj/non-proj) are 0.31/0.48 (U = 2564235.0, n1 = 6070, n2 = 421, p < 0.001). In other words, while the model can handle non-projective structures, a key drawback revealed by our error analysis is that for sentences with projective structures (which, incidentally, are much more frequent in the data), the model output is in fact judged less fluent and less adequate by human annotators than for non-projective sentences.
Conversely, for the IMS system, the median value for fluency is higher for projective sentences (0.42 vs. 0.18 for non-projective sentences), and the distributions in the two groups differed significantly (U = 4038434.0, p < 0.001 two-tailed). For adequacy, the median value for projective sentences (0.58) is also significantly higher than that for non-projective sentences (0.37, U = 2583463.0, p < 0.001 two-tailed). This in turn confirms the need for models that can handle non-projective structures.
Another interesting point highlighted by the results on the ru_syntagrus corpus in Table 3 is that similar BLEU scores for projective and non-projective structures do not necessarily mean similar human evaluation scores.
In terms of BLEU only, that is, taking all other corpora with no human evaluations, and modulo the caveat just made about the relation between BLEU and human evaluation, we find that the median values for non-projective sentences were always lower than those for projective ones, and the distributions showed significant differences across all 25 comparisons made. This underlines the need for models that can handle both projective and non-projective structures.

Entropy
Correlation between dependency relation entropy and dependency edge accuracy permits identifying which model, language, or corpus is particularly affected by word order freedom.
For instance,9 for the id_gsd corpus, three teams have a Spearman's ρ in the range from −0.62 to −0.67, indicating that their model underperforms for dependency relations with free word order. Conversely, two other teams showed weaker correlations (ρ = −0.31 and ρ = −0.36) for the same id_gsd corpus.
The impact of entropy also varies depending on the language, the corpus, and, more generally, the entropy of the data. For instance, for Japanese (the ja_gsd corpus), dependency relations have low entropy (the mean entropy averaged over all relations is 0.02), and so we observe no correlation between entropy and performance. Conversely, for Czech (the treebank with the highest mean entropy, H = 0.52), two teams show non-trivial negative correlations (ρ = −0.54 and ρ = −0.6) between entropy and DEA.

Which Syntactic Constructions Are Harder to Handle?
DEA. For a given dependency relation, DEA assesses how well a model succeeds in realizing that relation. To identify which syntactic constructs are problematic for surface realization models, we therefore compute dependency edge accuracy per relation, averaging over all submissions. Table 4 shows the results. Unsurprisingly, relations with low counts (the first five relations in the table) have low accuracy. Because they are rare (in fact, they are often absent from most corpora), SR models struggle to realize them.
Other relations with low accuracy are either relations with free word order (i.e., advcl, discourse, obl, advmod) or relations whose semantics is vague (dep, the unspecified dependency). Clearly, in the case of the latter, systems cannot make a good prediction; as for the former, the low DEA score may be an artefact of the fact that it is computed with respect to a single reference. As the construct may occur in different positions in a sentence, several equally correct sentences may match the input, but only one will not be penalised by the comparison with the reference. This underlines once again the need for an evaluation setup with multiple references.
Relations with the highest accuracy are those for function words (case, case-marking elements; det, determiners; clf, classifiers), fixed multiword expressions (fixed), and nominal dependents (amod, nmod, nummod). On average, those dependencies are more stable with respect to their head in terms of distance, more often demonstrate a fixed word order, and do not exhibit the positional variability of the relations described above. Due to those factors, their realization performance is higher. Interestingly, when computing DEA per dependency relation and per corpus, we found similar DEA scores for all corpora. That is, dependency relations have consistently low/high DEA scores across all corpora, indicating that improvement on a given relation is likely to improve performance on all corpora/languages.
Finally, we note that, at the model level, DEA scores are a useful metric for researchers, as they bring interpretability and a separation of errors into subcases by type.
Error Mining for Syntactic Trees. We can also obtain a more detailed picture of which syntactic constructs degrade performance using error mining. After running error mining on all submissions, we examine the subtrees in the input that have the highest coverage, that is, for which the percentage of submissions tagging these forms as suspicious10 is highest. Tables 5, 6, and 7 show the results when using different views of the data (i.e., focusing only on dependency information, only on POS tags, or on both).

Data Augmentation. The training data can be augmented with instances targeting the phenomena identified by the error analysis (e.g., for dependency relations with low dependency edge accuracy, for constructions with a high suspicion score, or for input trees with large depth, length, or mean dependency distance). This could be done either manually (by annotating sentences containing the relevant constructions) or automatically, by parsing text and then filtering for those parse trees which contain the dependency relations and subtrees for which the model underperforms. For those cases where the problematic construction is frequent, we conjecture that this might lead to a better overall score increase than ''blind'' global data augmentation.
Language Specific Adaptation. Languages exhibit different word order schemas and have different ways of constraining word order. Error analysis can help identify which language-specific constructs impact performance and how to improve a language-specific model with respect to these constructs.
For instance, a dependency relation with high entropy and low accuracy indicates that the model has difficulty learning the word order freedom of that relation. Model improvement can then target a better modelling of the factors which determine word order for that relation. In Romance languages, for example, adjectives mostly occur after the noun they modify. However, some adjectives are pre-posed. As the pre-posed adjectives form a relatively small, closed set, a plausible way to improve the model would be to enrich the input representation by indicating, for each adjective, whether it belongs to the class of pre- or post-posed adjectives.
Global Model Improvement. Error analysis can suggest directions for model improvement. For instance, a high proportion of non-projective sentences in the language's reference treebank, together with lower performance metrics for those sentences, suggests improving the ability of the model to handle non-projective structures. Indeed, Yu et al. (2020) showed that the performance of the model of Yu et al. (2019) could be greatly improved by extending it to handle non-projective structures.
Treebank Specific Improvement. Previous research has shown that treebanks contain inconsistencies, which impact both learning and evaluation (Zeman, 2016).
The tree-based metrics and the error mining techniques provided in our toolkit can help identify those dependency relations and constructions which have consistently low scores across different models or diverging scores across different treebanks for the same language.For instance, a case of strong inconsistencies in the annotation of multi-word expressions (MWE) may be highlighted by a low DEA for the fixed dependency relation (which should be used to annotate MWE).Such annotation errors could also be detected using lemma-based error mining, namely, error mining for forms decorated with lemmas.Such mining would then show that the most suspicious forms are decorated with multi-word expressions (e.g., ''in order to'').
Ensemble Model. Given a model M and a test set T, our toolkit can be used to compute, for each dependency relation d present in the test set, the average DEA of that model for that relation (DEA_d^M, the sum of the model's DEA for all d-edges in T normalized by the number of these edges). This could be used to learn an ensemble model which, for each input, outputs the sentence generated by the model whose score according to this metric is highest. Given an input tree t consisting of a set of edges D, the score of a model M could, for instance, be the sum of the model's average DEA for the edges contained in the input tree, normalized by the number of edges in that tree, namely, (1/|D|) × Σ_{d∈D} DEA_d^M.
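The proposed ensemble selection can be sketched as follows. The names and the DEA numbers are illustrative; dea_per_relation would in practice come from the toolkit's per-relation DEA computation.

```python
def ensemble_score(dea_per_relation, tree_relations):
    """(1/|D|) * sum of the model's average per-relation DEA over the tree's edges."""
    return sum(dea_per_relation.get(rel, 0.0) for rel in tree_relations) / len(tree_relations)

def pick_model(models, tree_relations):
    """Select the model whose expected edge accuracy on this input tree is highest."""
    return max(models, key=lambda name: ensemble_score(models[name], tree_relations))

# Hypothetical per-relation DEA profiles for two models:
models = {
    "A": {"nsubj": 0.9, "obj": 0.5},
    "B": {"nsubj": 0.6, "obj": 0.8},
}
best = pick_model(models, ["nsubj", "obj", "obj"])  # "B": 0.733 vs "A": 0.633
```

Relations unseen in the DEA table default to 0.0 here; a real implementation would need a backoff for such cases.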

Conclusion
We presented a framework for error analysis that supports a detailed assessment of which syntactic factors impact the performance of surface realization models. We applied it to the results of two SR shared task campaigns and suggested ways in which it could be used to improve models and datasets for shallow surface realization. More generally, we believe that scores such as BLEU and, to some extent, human ratings do not provide a clear picture of the extent to which SR models can capture the complex constraints governing word order in the world's natural languages. We hope that the metrics and tools gathered in this evaluation toolkit can help address this issue.

Figure 1: A reference UD dependency tree (nodes are lemmas) and a possible SR model output. The final output is used to compute human judgments, and the lemmatized output to compute BLEU and dependency edge accuracy (both are given without punctuation).
c(f) is the number of sentences containing a subtree f, c(¬f) is the number of sentences where f is not present, c(f|F) is the number of sentences containing f for which generation failed, and c(¬f|P) is the number of sentences not containing f for which generation succeeded.

Table 2: Descriptive statistics (mean and stdev, apart from the first two and the last column) for the UD treebanks used in SR'18 (UD v2.0) and SR'19 (UD v2.3). S: number of submissions, count: number of sentences in a test set, MDD: mean dependency distance, MFS: mean flux size, MFW: mean flux weight, MA: mean arity, NP: percentage of non-projective sentences. For the tree-based metrics (MDD, MFS, MFW, MA), macro-average values are reported. For SR'18, we follow the notation for treebanks as used in the shared task (language code only); in parentheses we list treebank names.

Table 3: Median values for BLEU, Fluency, and Adequacy for projective/non-projective sentences for each submission. Medians for non-projective sentences which are higher than those for projective sentences are in bold. All comparisons were significant with p < 0.001. Human judgments were available for ru_syntagrus only.

Table 5: Top-20 most frequent suspicious trees (dep-based) across all submissions. In the case of conj, when tree patterns were similar, they were merged, with X serving as a placeholder. Coverage: percentage of submissions where a subtree was mined as suspicious. MSS: mean suspicion score for a subtree.
Shashi Narayan and Claire Gardent. 2012. Error mining with suspicion trees: Seeing the forest for the trees. In Proceedings of COLING 2012, pages 2011-2026, Mumbai, India. The COLING 2012 Organizing Committee.