Abstract
The metrics standardly used to evaluate Natural Language Generation (NLG) models, such as BLEU or METEOR, fail to provide information on which linguistic factors impact performance. Focusing on Surface Realization (SR), the task of converting an unordered dependency tree into a well-formed sentence, we propose a framework for error analysis which permits identifying which features of the input affect the models’ results. This framework consists of two main components: (i) correlation analyses between a wide range of syntactic metrics and standard performance metrics and (ii) a set of techniques to automatically identify syntactic constructs that often co-occur with low performance scores. We demonstrate the advantages of our framework by performing error analysis on the results of 174 system runs submitted to the Multilingual SR shared tasks; we show that dependency edge accuracy correlates with automatic metrics, thereby providing a more interpretable basis for evaluation; and we suggest ways in which our framework could be used to improve models and data. The framework is available in the form of a toolkit which can be used both by campaign organizers to provide detailed, linguistically interpretable feedback on the state of the art in multilingual SR, and by individual researchers to improve models and datasets.1
1 Introduction
Surface Realization (SR) is a natural language generation task that consists in converting a linguistic representation into a well-formed sentence.
SR is a key module in pipeline generation models, where it is usually the last item in a pipeline of modules designed to convert the input (knowledge graph, tabular data, numerical data) into a text. While end-to-end generation models have been proposed that do away with such pipeline architecture and therefore with SR, pipeline generation models (Dušek and Jurčíček, 2016; Castro Ferreira et al., 2019; Elder et al., 2019; Moryossef et al., 2019) have been shown to perform on a par with these end-to-end models while providing increased controllability and interpretability (each step of the pipeline provides explicit intermediate representations that can be examined and evaluated).
As illustrated in, for example, Dušek and Jurčíček (2016), Elder et al. (2019), and Li (2015), SR also has potential applications in tasks such as summarization and dialogue response generation. In such approaches, shallow dependency trees are viewed as intermediate structures used to mediate between input and output, and SR permits regenerating a summary or a dialogue turn from these intermediate structures.
Finally, multilingual SR is an important task in its own right in that it permits a detailed evaluation of how neural models handle the varying word order and morphology of the different natural languages. While neural language models are powerful at producing high quality text, the results of the multilingual SR tasks (Mille et al., 2018, 2019) clearly show that the generation, from shallow dependency trees, of morphologically and syntactically correct sentences in multiple languages remains an open problem.
As the use of multiple input formats made the comparison and evaluation of existing surface realisers difficult, Belz et al. (2011) and Mille et al. (2018, 2019) organized the SR shared tasks, which provide two standardized input formats for surface realizers: deep and shallow dependency trees. Shallow dependency trees are unordered, lemmatized dependency trees. Deep dependency trees include semantic rather than syntactic relations and abstract over function words.
While the SR tasks provide a common benchmark on which to evaluate and compare SR systems, the evaluation protocol they use (automatic metrics and human evaluation) does not support a detailed error analysis. Metrics (BLEU, DIST, NIST, METEOR, TER) and human assessments are reported at the system level, and so do not provide detailed feedback to each participant. Neither do they give information about which syntactic phenomena impact performance.
In this work, we propose a framework for error analysis that allows for an interpretable, linguistically informed analysis of SR results. While shallow surface realization involves both determining word order (linearization) and inflecting lemmas (morphological realization), since inflection error detection is already covered in morphological shared tasks (Cotterell et al., 2017; Gorman et al., 2019), we focus on error analysis for word order.
Motivated by extensive linguistic studies that deal with syntactic dependencies and their relation to cognitive language processing (Liu, 2008; Futrell et al., 2015; Kahane et al., 2017), we investigate word ordering performance in SR models given various tree-based metrics. Specifically, we explore the hypothesis according to which these metrics, which provide a measure of the SR input complexity, correlate with automatic metrics commonly used in NLG. We find that Dependency Edge Accuracy (DEA) correlates with BLEU, which suggests that DEA could be used as an alternative, more interpretable, automatic evaluation metric for surface realizers.
We apply our framework to the results of two evaluation campaigns and demonstrate how it can be used to highlight some global results about the state of the art (e.g., that certain dependency relations such as the list dependency have low accuracy across the board for all 174 submitted runs).
We indicate various ways in which our error analysis framework could be used to improve a model or a dataset, thereby arguing for approaches to model and dataset improvement that are more linguistically guided.
Finally, we make our code available in the form of a toolkit that can be used both by campaign organizers to provide a detailed feedback on the state of the art for surface realization and by researchers to better analyze, interpret, and improve their models.
2 Related Work
There has been a long tradition in NLP exploring syntactic and semantic evaluation measures based on linguistic structures (Liu and Gildea, 2005; Mehay and Brew, 2007; Giménez and Màrquez, 2009; Tratz and Hovy, 2009; Lo et al., 2012). In particular, dependency-based automatic metrics have been developed for summarization (Hovy et al., 2005; Katragadda, 2009; Owczarzak, 2009) and machine translation (Owczarzak et al., 2007; Yu et al., 2014). Relations between metrics were also studied: Dang and Owczarzak (2008) found that automatic metrics perform on a par with the dependency-based metric of Hovy et al. (2005) when evaluating summaries. The closest research to ours, which focused on evaluating how dependency-based metrics correlate with human ratings, is Cahill (2009), who showed that, for a German surface realizer, syntax-based metrics correlate with human judgments as well as standard automatic metrics do.
Researchers working on SR and word ordering have resorted to different metrics to report their models’ performance. Zhang et al. (2012), Zhang (2013), Zhang and Clark (2015), Puduppully et al. (2016), and Song et al. (2018) used BLEU; Schmaltz et al. (2016) parsed their outputs and calculated the UAS parsing metric; Filippova and Strube (2009) used Kendall correlation together with edit distance to account for English word order. Similarly, Dyer (2019) used Spearman correlation between produced and gold word order for a dozen languages. White and Rajkumar (2012), in their CCG-based realization, compared average dependency lengths between grammar-generated sentences and the gold standard. Gardent and Narayan (2012) and Narayan and Gardent (2012) proposed an error mining algorithm for generation grammars to identify the most likely sources of failures when generating from dependency trees. Their algorithm mines suspicious subtrees in a dependency tree, which are likely to cause errors. King and White (2018) drew attention to their model performance for non-projective sentences. Puzikov et al. (2019) assessed their binary classifier for word ordering using the accuracy of predicting the position of a dependent with respect to its head and a sibling. Yu et al. (2019) showed that, for their system, error rates correlate with word order freedom, and reported linearization error rates for some frequent dependency types. In a similar vein, Shimorina and Gardent (2019) looked at their system performance in terms of dependency relations, which shed light on the differences between their non-delexicalized and delexicalized models.
In sum, multiple metrics and tools have been developed by individual researchers to evaluate and interpret their model results: dependency-based metrics, correlation between these metrics and human ratings, performance on projective vs. non-projective input, linearization error rate, and so forth. At a more global level, however, automatic metrics and human evaluation continue to be massively used.
In this study, we gather a set of linguistically informed, interpretable metrics and tools within a unified framework, apply this framework to the results of two evaluation campaigns (174 participant submissions) and generally argue for a more interpretable evaluation approach for surface realizers.
3 Framework for Error Analysis
Our error analysis framework gathers a set of performance metrics together with a wide range of tree-based metrics designed to measure the syntactic complexity of the sentence to be generated. We apply correlation tests between these two types of metrics and mine a model output to automatically identify the syntactic constructs that often co-occur with low performance scores.
3.1 Syntactic Complexity Metrics
To measure syntactic complexity, we use several metrics commonly used for dependency trees (tree depth and length, mean dependency distance) as well as the ratio, in a test set, of sentences with non-projective structures.
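To make these metrics concrete, the following sketch (our illustration, not the toolkit's actual code) computes the tree-based metrics just listed, plus mean arity and projectivity, from a dependency tree given as a mapping from each word position to its head position (0 denoting the root):

```python
# A dependency tree as {dependent_position: head_position}; positions are
# 1-based and the root's head is 0.
def tree_metrics(heads):
    nodes = sorted(heads)
    edges = [(h, d) for d, h in heads.items() if h != 0]

    def depth(n):
        # number of arcs from node n up to the root
        d = 0
        while heads[n] != 0:
            n, d = heads[n], d + 1
        return d

    def subtree(n):
        # n together with all of its descendants
        out, frontier = {n}, [n]
        while frontier:
            cur = frontier.pop()
            for child in (c for c, h in heads.items() if h == cur):
                if child not in out:
                    out.add(child)
                    frontier.append(child)
        return out

    def contiguous(span):
        return span == set(range(min(span), max(span) + 1))

    # projective iff every subtree covers a contiguous span of positions
    projective = all(contiguous(subtree(n)) for n in nodes)

    return {
        "tree_depth": max(depth(n) for n in nodes),
        "tree_length": len(nodes),
        "mean_dependency_distance": sum(abs(h - d) for h, d in edges) / len(edges) if edges else 0.0,
        "mean_arity": len(edges) / len(nodes),  # each non-root node is exactly one node's dependent
        "projectivity": projective,
    }
```

The non-projective ratio of a test set is then simply the proportion of its sentences for which the projectivity flag is False.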
We also consider the entropy of the dependency relations and a set of metrics based on “flux” recently proposed by Kahane et al. (2017).
Flux.
The flux is defined for each inter-word position (e.g., 5–6 in Figure 1). Given the inter-word position (i,j), the flux of (i,j) is the set of edges (d,k,l) such that d is a dependency relation, k ≤ i and j ≤ l. For example, in Figure 1 the flux for the inter-word position between the nodes 5 and 6 is {(nmod, 4, 8), (case, 5, 8)} and {(nmod, 4, 8), (case, 5, 8), (compound, 6, 8), (compound, 7, 8)} for the position between the nodes 7 and 8.
The flux size is its cardinality, that is, the number of edges it contains: 2 for 5–6 and 4 for 7–8.
The flux weight is the size of the largest disjoint subset of edges in the flux (Kahane et al., 2017, p. 74). A set of edges is disjoint if the edges it contains do not share any node. For instance, in the inter-word position 5–6, nmod and case share the node 8, so the flux weight is 1 (i.e., no two disjoint edges can be found). The flux-based metrics aim to capture the cognitive complexity of syntactic structures, in the spirit of Miller (1956), who showed processing limitations on syntactic constituents in spoken language.
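The following sketch (again our illustration, not the authors' code) shows how flux size and flux weight can be computed for a given inter-word position, with edges given as (head, dependent, relation) triples over 1-based positions; the brute-force search for disjoint edges is exact and cheap because fluxes are small in practice.

```python
from itertools import combinations

def flux(edges, i):
    """Edges spanning the inter-word position between words i and i+1."""
    return [e for e in edges if min(e[0], e[1]) <= i < max(e[0], e[1])]

def flux_size(edges, i):
    return len(flux(edges, i))

def flux_weight(edges, i):
    """Size of the largest subset of pairwise node-disjoint edges in the flux."""
    bucket = flux(edges, i)
    for k in range(len(bucket), 0, -1):
        for subset in combinations(bucket, k):
            nodes = [n for h, d, _ in subset for n in (h, d)]
            if len(nodes) == len(set(nodes)):  # no node shared by two edges
                return k
    return 0

# For the tree of Figure 1, flux_size(edges, 5) is 2 and, since the nmod and
# case edges share node 8, flux_weight(edges, 5) is 1.
```

Mean flux size and mean flux weight then average these values over the n−1 inter-word positions of an n-word sentence.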
For each reference dependency tree, we calculate the metrics listed in Table 1. These can then be averaged over different dimensions (all runs, all runs of a given participant, runs on a given corpus, language, etc.). Table 2 shows the statistics obtained for each corpus used in the SR shared tasks. We refer the reader to the Universal Dependencies project2 to learn more about differences between specific treebanks.
Syntactic Complexity | Explanation |
---|---|
tree depth | the depth of the deepest node {3} |
tree length | number of nodes {8} |
mean dependency distance | average distance between a head and a dependent. For a dependency linking two adjacent nodes, the distance is equal to one (e.g., nsubj in Figure 1). {(1 + 2 + 1 + 4 + 1 + 2 + 3)/7 = 2} |
mean flux size | average flux size of each inter-word position {(1 + 1 + 2 + 1 + 2 + 3 + 4)/7 = 2} |
mean flux weight | average flux weight of each inter-word position {(1 + 1 + 1 + 1 + 1 + 1 + 1)/7 = 1} |
mean arity | average number of direct dependents of a node {(0 + 2 + 0 + 2 + 0 + 0 + 0 + 3)/8 = 0.875} |
projectivity | True if the sentence has a projective parse tree (there are no crossing dependency edges and/or projection lines) {True} |
Task | Corpus | S | count | depth | length | MDD | MFS | MFW | MA | NP (%)
---|---|---|---|---|---|---|---|---|---|---
SR’18 | ar (padt) | 3 | 676 | 7.37±3.29 | 38.5±30.38 | 2.61±0.93 | 2.61±0.93 | 1.44±0.26 | 0.94±0.08 | 1.48 |
cs (pdt) | 2 | 9,876 | 3.95±1.99 | 14.49±9.43 | 2.12±0.74 | 2.12±0.74 | 1.19±0.29 | 0.86±0.18 | 9.91 | |
es (ancora) | 6 | 1,719 | 5.21±2.2 | 26.88±15.7 | 2.47±0.66 | 2.47±0.66 | 1.33±0.25 | 0.93±0.09 | 2.39 | |
en (ewt) | 8 | 2,061 | 2.71±1.88 | 10.57±9.55 | 1.86±0.95 | 1.86±0.95 | 1.02±0.42 | 0.75±0.3 | 1.65 | |
fi (tdt) | 3 | 1,525 | 3.48±1.81 | 11.42±7.22 | 2.02±0.62 | 2.02±0.62 | 1.16±0.23 | 0.86±0.12 | 5.57 | |
fr (gsd) | 5 | 416 | 4.33±1.75 | 21.21±12.57 | 2.44±0.59 | 2.44±0.59 | 1.28±0.25 | 0.93±0.07 | 2.16 | |
it (isdt) | 4 | 480 | 4.38±2.23 | 19.14±14.07 | 2.19±0.61 | 2.19±0.61 | 1.23±0.23 | 0.91±0.06 | 2.29 | |
nl (alpino) | 4 | 685 | 3.74±1.86 | 15.03±9.11 | 2.48±1.05 | 2.48±1.05 | 1.21±0.39 | 0.85±0.22 | 20.15 | |
pt (bosque) | 4 | 476 | 4.32±2.12 | 18.58±12.11 | 2.25±0.63 | 2.25±0.63 | 1.23±0.27 | 0.9±0.13 | 4.20 | |
ru (syntagrus) | 2 | 6,366 | 4.1±1.96 | 14.65±9.14 | 2.12±0.66 | 2.12±0.66 | 1.23±0.27 | 0.88±0.13 | 8.37 | |
SR’19 | ar_padt | 4 | 680 | 7.38±3.28 | 38.54±30.34 | 2.6±0.93 | 2.6±0.93 | 1.45±0.26 | 0.94±0.08 | 1.76 |
en_ewt | 5 | 2,077 | 2.72±1.88 | 10.6±9.62 | 1.87±0.95 | 1.87±0.95 | 1.02±0.42 | 0.75±0.3 | 1.54 | |
en_gum | 11 | 778 | 3.69±1.91 | 15.0±10.63 | 2.14±0.75 | 2.14±0.75 | 1.16±0.31 | 0.85±0.2 | 3.08 | |
en_lines | 11 | 914 | 3.55±1.6 | 14.97±9.56 | 2.27±0.62 | 2.27±0.62 | 1.2±0.23 | 0.89±0.11 | 4.60 | |
en_partut | 11 | 153 | 4.52±2.01 | 20.06±9.77 | 2.48±0.51 | 2.48±0.51 | 1.26±0.21 | 0.93±0.05 | 0.65 | |
es_ancora | 6 | 1,721 | 5.2±2.2 | 26.87±15.7 | 2.47±0.66 | 2.47±0.66 | 1.33±0.25 | 0.93±0.09 | 2.38 | |
es_gsd | 6 | 426 | 5.06±2.25 | 25.18±16.43 | 2.41±0.57 | 2.41±0.57 | 1.31±0.23 | 0.94±0.05 | 4.69 | |
fr_gsd | 7 | 416 | 4.41±1.78 | 21.22±12.58 | 2.41±0.58 | 2.41±0.58 | 1.28±0.25 | 0.93±0.07 | 1.20 | |
fr_partut | 7 | 110 | 4.85±1.82 | 21.84±10.01 | 2.44±0.46 | 2.44±0.46 | 1.29±0.21 | 0.94±0.03 | 0.91 | |
fr_sequoia | 7 | 456 | 4.01±2.21 | 19.66±15.61 | 2.13±0.84 | 2.13±0.84 | 1.16±0.37 | 0.84±0.25 | 0.88 | |
hi_hdtb | 5 | 1,684 | 4.19±1.48 | 19.6±8.99 | 2.96±0.82 | 2.96±0.82 | 1.48±0.23 | 0.94±0.03 | 8.91 | |
id_gsd | 5 | 557 | 4.57±1.85 | 18.02±12.39 | 2.04±0.54 | 2.04±0.54 | 1.22±0.2 | 0.92±0.07 | 0.72 | |
ja_gsd | 6 | 551 | 4.36±1.97 | 20.25±13.35 | 2.43±0.66 | 2.43±0.66 | 1.4±0.32 | 0.92±0.09 | 0.00 | |
ko_gsd | 5 | 989 | 3.59±1.78 | 10.29±6.77 | 2.21±0.79 | 2.21±0.79 | 1.33±0.36 | 0.86±0.1 | 9.20 | |
ko_kaist | 4 | 2,287 | 3.86±1.54 | 11.0±4.56 | 2.27±0.67 | 2.27±0.67 | 1.44±0.32 | 0.89±0.07 | 19.15 | |
pt_bosque | 5 | 477 | 4.32±2.11 | 18.57±12.09 | 2.25±0.63 | 2.25±0.63 | 1.23±0.27 | 0.9±0.13 | 4.40 | |
pt_gsd | 5 | 1,204 | 4.85±1.87 | 22.74±12.2 | 2.39±0.55 | 2.39±0.55 | 1.31±0.23 | 0.94±0.05 | 1.66 | |
ru_gsd | 5 | 601 | 4.11±1.69 | 15.83±10.24 | 2.12±0.69 | 2.12±0.69 | 1.24±0.21 | 0.91±0.06 | 4.49 | |
ru_syntagrus | 4 | 6,491 | 4.08±1.94 | 14.78±9.24 | 2.13±0.65 | 2.13±0.65 | 1.23±0.26 | 0.88±0.13 | 6.49 | |
zh_gsd | 7 | 500 | 4.22±1.08 | 20.64±10.17 | 2.98±0.84 | 2.98±0.84 | 1.46±0.27 | 0.94±0.03 | 0.40 |
Dependency Relation Entropy.
Entropy has been used in typological studies to quantify word order freedom across languages (Liu, 2010; Futrell et al., 2015; Gulordava and Merlo, 2016). It gives an estimate of how regular or irregular a dependency relation is with respect to word order. A relation d with high entropy indicates that d-dependents sometimes occur to the left and sometimes to the right of their head—that is, their order is not fixed.
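As a concrete (and simplified) illustration, assuming entropy is computed over the binary head-dependent direction of each relation, the computation might look as follows:

```python
from collections import Counter
from math import log2

def direction_entropy(edges):
    """edges: (head_position, dependent_position, relation) triples for a whole
    treebank. Returns the entropy (in bits) of the head-dependent direction for
    each relation: 0 = fixed order, 1 = dependents equally often on both sides."""
    counts = {}
    for head, dep, rel in edges:
        side = "before" if dep < head else "after"
        counts.setdefault(rel, Counter())[side] += 1
    entropy = {}
    for rel, c in counts.items():
        total = sum(c.values())
        entropy[rel] = -sum((n / total) * log2(n / total) for n in c.values())
    return entropy
```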
3.2 Performance Metrics
Performance is assessed using sentence-level BLEU-4, DEA, and human evaluation scores.
DEA.
DEA measures how many edges from a reference tree can be found in a system output, given the gold lemmas and dependency distance as markers. An edge is represented as a triple (head lemma, dependent lemma, distance), for example, (I, enjoy, −1) or (time, school, +4) in Figure 1.3 In the output, the same triples can be found based on the lemmas, the direction (after or before the head), and the dependency distance. In our example, two out of the seven dependency relations cannot be found in the output: (school, high, −1) and (school, franklin, −2). Thus, DEA is 0.71 (5/7).
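A minimal sketch of this matching procedure, assuming the signed distance is the dependent's position minus the head's position in the reference sentence:

```python
def dependency_edge_accuracy(reference_edges, output_lemmas):
    """reference_edges: (head_lemma, dependent_lemma, signed_distance) triples;
    output_lemmas: the lemmatized system output as a list of strings."""
    if not reference_edges:
        return 1.0
    found = 0
    for head, dep, dist in reference_edges:
        for i, lemma in enumerate(output_lemmas):
            j = i + dist
            # an edge is recovered if the head and dependent lemmas occur in the
            # output at the same signed offset as in the reference
            if lemma == head and 0 <= j < len(output_lemmas) and output_lemmas[j] == dep:
                found += 1
                break
    return found / len(reference_edges)
```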
Human Evaluation Scores.
The framework includes sentence-level z-scores for Adequacy and Fluency4 reported in the SR’18 and SR’19 shared tasks. The z-scores were calculated over the set of all raw scores given by an annotator, using that annotator’s mean and standard deviation. Note that these scores were available only for a sample of test instances and only for some languages, and were calculated on the final system outputs, rather than the lemmatized ones.
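A sketch of this per-annotator standardization, assuming ratings come as (annotator, raw_score) pairs:

```python
from statistics import mean, stdev

def standardize(ratings):
    """ratings: list of (annotator_id, raw_score) pairs. Replaces each raw score
    by its z-score relative to that annotator's own mean and standard deviation
    (assumes every annotator rated at least two items)."""
    by_annotator = {}
    for annotator, score in ratings:
        by_annotator.setdefault(annotator, []).append(score)
    stats = {a: (mean(s), stdev(s)) for a, s in by_annotator.items()}
    return [(a, (score - stats[a][0]) / stats[a][1]) for a, score in ratings]
```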
3.3 Correlation Tests
The majority of our metrics are numerical, which allows us to measure dependence between them using correlation. One of the metrics, projectivity, is nominal, so we apply a non-parametric test to measure whether two independent samples (“projective sentences” and “non-projective sentences”) have the same distribution of scores.
3.4 Error Mining
Error mining requires splitting instances into successful and unsuccessful generation cases; to obtain these two groups, we apply a threshold based on BLEU. All the instances in a model output are divided into two parts: the first quartile (the 25% of instances with the lowest sentence-level BLEU)5 is considered failure, and the rest success. Error mining can then be used to automatically identify subtrees of the input tree that often co-occur with failure and rarely with success. Moreover, mining can be applied to trees decorated with any combination of lemmas, dependency relations, and/or POS tags.
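A sketch of this split, assuming sentence-level BLEU has already been computed for every instance of a model output:

```python
import numpy as np

def split_for_error_mining(instances, bleu_scores, quantile=0.25):
    """Label the lowest-scoring quartile as failures and the rest as successes."""
    threshold = np.quantile(bleu_scores, quantile)
    failures  = [x for x, b in zip(instances, bleu_scores) if b <= threshold]
    successes = [x for x, b in zip(instances, bleu_scores) if b > threshold]
    return failures, successes
```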
4 Data and Experimental Setting
Computing Syntactic Complexity Metrics.
Tree-based metrics, dependency relation entropy, and projectivity are computed on the gold parse trees from Universal Dependencies v2.0 and v2.3 (Nivre et al., 2017) for SR’18 and SR’19, respectively. Following common practice in computational studies of dependency linguistics, punctuation marks were stripped from the reference trees (based on the punct dependency relation). If a node to be removed had children, these were reattached to the parent of the removed node.
Computing Performance Metrics.
We compute sentence-level BLEU-4 with the smoothing method 2 from Chen and Cherry (2014), implemented in NLTK.6
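This corresponds to the following NLTK call (a sketch of the setup we assume, applied to already tokenized sequences):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def sentence_bleu4(reference_tokens, hypothesis_tokens):
    """Sentence-level BLEU-4 with smoothing method 2 (Chen and Cherry, 2014)."""
    return sentence_bleu(
        [reference_tokens],                # NLTK expects a list of references
        hypothesis_tokens,
        weights=(0.25, 0.25, 0.25, 0.25),  # n-grams up to length 4
        smoothing_function=SmoothingFunction().method2,
    )
```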
To compute dependency edge accuracy, we process systems’ outputs to allow for comparison with the lemmatized dependency tree of the reference sentence. Systems’ outputs were tokenized and lemmatized; contractions were also split to match lemmas in the UD treebanks. Finally, to be consistent with punctuation-less references, punctuation was also removed from systems’ outputs. The preprocessing was done with the stanfordnlp library (Qi et al., 2018).
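The preprocessing step can be sketched as follows with the stanfordnlp library (the exact processor list and options are our assumption); punctuation can then be filtered out using the predicted UPOS tags:

```python
import stanfordnlp

# one pipeline per language; pos is included because the lemmatizer relies on it
nlp = stanfordnlp.Pipeline(lang="en", processors="tokenize,mwt,pos,lemma")

def output_lemmas(sentence_text):
    """Tokenize and lemmatize a system output, dropping punctuation tokens so
    that it can be compared against the punctuation-less reference tree."""
    doc = nlp(sentence_text)
    return [word.lemma for sent in doc.sentences for word in sent.words
            if word.upos != "PUNCT"]
```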
For human judgments, we collect those provided by the shared tasks for a sample of test data and for some languages (en, es, fr for SR’18 and es_ancora, en_ewt, ru_syntagrus, zh_gsd for SR’19). Table 2 shows how many submissions each language received.
Computing Correlation.
For all numerical variables, we assess the relationship between the rankings of two variables using Spearman’s ρ correlation. When calculating correlation coefficients, missing values were ignored (as was the case for human evaluations). Correlations were calculated separately for each submission (one system run on one corpus). Because up to 45 comparisons can be made for one submission, we controlled for multiple testing using the Holm-Bonferroni method when testing significance. We also calculated means and medians of the correlations for each corpus (all submissions mixed), for each team (a team has multiple submissions), and average correlations across all 174 submissions.
For projectivity (a nominal variable) we use a Mann–Whitney U test to determine whether there is a difference in performance between projective and non-projective sentences. We ran three tests where performance was defined in terms of BLEU, fluency, and adequacy. Because some corpora have very few non-projective sentences in their test set (e.g., 1.56% in en_ewt), we ran the test only on the corpora with more than 5% of non-projective sentences, that is, cs (10%), fi (6%), nl (20%), and ru (8%) for SR’18, and hi_hdtb (9%), ko_gsd (9%), ko_kaist (19%), and ru_syntagrus (6%) for SR’19. The Mann–Whitney U test was computed with scipy-1.4.1. As with the correlation analysis, the test was run separately for each submission and for each corpus.
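A sketch of these tests on per-sentence score lists (the Holm correction is shown here with statsmodels, which is one option; the paper only specifies scipy for the Mann–Whitney U test):

```python
from scipy.stats import spearmanr, mannwhitneyu
from statsmodels.stats.multitest import multipletests

def spearman_with_holm(variable_pairs, alpha=0.05):
    """variable_pairs: list of (x, y) score lists for one submission.
    Returns (rho, significant after Holm correction) for each comparison."""
    results = [spearmanr(x, y, nan_policy="omit") for x, y in variable_pairs]
    reject, _, _, _ = multipletests([r.pvalue for r in results],
                                    alpha=alpha, method="holm")
    return [(r.correlation, sig) for r, sig in zip(results, reject)]

def projectivity_test(scores_projective, scores_non_projective):
    """Two-sided Mann-Whitney U test between the two groups of sentence scores."""
    return mannwhitneyu(scores_projective, scores_non_projective,
                        alternative="two-sided")
```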
Mining the Input Trees.
The error mining algorithm was run for each submission separately and with three different settings: (i) dependency relations (dep); (ii) POS tags (POS); (iii) dependency relations and POS tags (POS-dep).
5 Error Analysis
We analyze results focusing successively on: tree-based syntactic complexity (are sentences with more complex syntactic trees harder to generate?), projectivity (how much does non-projectivity impact results?), entropy (how much do word order variations affect performance?), DEA and error mining (which syntactic constructions lead to decreased scores?).
5.1 Tree-Based Syntactic Complexity
We examine correlation tests results for all metrics on the system level (all submissions together) and for a single model, the BME-UW system (Kovács et al., 2019) on a single corpus/language (zh_gsd, Chinese). Figure 2a shows median Spearman ρ coefficients across all the 174 submissions, and Figure 2b shows the coefficients for the BME-UW system on the zh_gsd corpus.
We investigate both correlations between syntactic complexity and performance metrics and within each category. Similar observations can be made for both settings.
Correlation between Performance Metrics.
As often remarked in the NLG context (Stent et al., 2005; Novikova et al., 2017; Reiter, 2018), BLEU shows a weak correlation with Fluency and Adequacy on the sentence level. Similarly, dependency edge accuracy shows weak correlations with human judgments (ρad = 0.2 and ρfl = 0.24 for the median; ρad = 0.14 and ρfl = 0.23 for BME-UW).7
In contrast, BLEU shows a strong correlation with dependency edge accuracy (median: ρ = 0.68; BME-UW: ρ = 0.88). Unlike BLEU, however, DEA has a direct linguistic interpretation (it indicates which dependency relations are harder to handle) and can be exploited to analyze and improve a model. We therefore advocate for a more informative evaluation that incorporates DEA in addition to the standard metrics. We believe this will lead to more easily interpretable results and possibly to the development of better, linguistically informed SR models.
Correlation between Syntactic Complexity Metrics.
Unsurprisingly, tree-based metrics correlate positively with each other (the reddish area on the right), with coefficients ranging from weak to strong. Because their calculations overlap, some of them show strong correlations (e.g., mean dependency distance and mean flux size).
Correlation between Syntactic Complexity and Performance Metrics.
Tree-based metrics do not correlate with human assessments (ρ fluctuates around zero for the median and from −0.06 to −0.29 for BME-UW).
No correlation between tree-based metrics and system performance was found globally (i.e., for all models and all test sets). We can, however, use the framework to analyze results on specific corpora or languages. For instance, zooming in on the fr corpus, we observe a weak negative correlation at the system level (median over submissions) between tree-based metrics and DEA (e.g., ρ = −0.38 for mean arity and for tree length). Thus, on this corpus, performance as measured by DEA decreases as syntactic complexity increases. Similarly, for ar, cs, fi, it, and nl, tree-based metrics show some negative correlation with BLEU:8 median ρ values between the tree-based metrics and BLEU for those corpora range from −0.21 to −0.38 for ar, from −0.43 to −0.57 for cs, from −0.2 to −0.46 for fi, from −0.17 to −0.34 for it, and from −0.29 to −0.42 for nl.
Such increases in correlation were observed mainly for corpora on which performance was low across submissions (see Mille et al., 2018). We hypothesize that BLEU correlates more strongly with the tree-based metrics when system performance is poor.
Significance Testing.
Overall, across submissions, coefficients were found non-significant only when they were close to zero (see Figure 2b).
5.2 Projectivity
Table 3 shows performance results with respect to the projectivity parameter.
team | corpus | BLEU (Proj/Non-Proj) | Fluency z (Proj/Non-Proj) | Adequacy z (Proj/Non-Proj) | Sample sizes (Proj/Non-Proj)
---|---|---|---|---|---
AX | cs | 0.25/0.19 | −/− | −/− | 8897/979 |
BinLin | cs | 0.49/0.38 | −/− | −/− | 8897/979 |
AX | fi | 0.25/0.2 | −/− | −/− | 1440/85 |
BinLin | fi | 0.44/0.33 | −/− | −/− | 1440/85 |
OSU | fi | 0.47/0.38 | −/− | −/− | 1440/85 |
AX | nl | 0.28/0.2 | −/− | −/− | 547/138 |
BinLin | nl | 0.39/0.3 | −/− | −/− | 547/138 |
OSU | nl | 0.38/0.28 | −/− | −/− | 547/138 |
Tilburg | nl | 0.43/0.36 | −/− | −/− | 547/138 |
AX | ru | 0.27/0.22 | −/− | −/− | 5833/533 |
BinLin | ru | 0.44/0.36 | −/− | −/− | 5833/533 |
BME-UW | hi_hdtb | 0.66/0.6 | −/− | −/− | 1534/150 |
DepDist | hi_hdtb | 0.66/0.62 | −/− | −/− | 1534/150 |
IMS | hi_hdtb | 0.82/0.73 | −/− | −/− | 1534/150 |
LORIA | hi_hdtb | 0.29/0.22 | −/− | −/− | 1534/150 |
Tilburg | hi_hdtb | 0.68/0.64 | −/− | −/− | 1534/150 |
BME-UW | ko_gsd | 0.54/0.38 | −/− | −/− | 898/91 |
DepDist | ko_gsd | 0.51/0.37 | −/− | −/− | 898/91 |
IMS | ko_gsd | 0.84/0.56 | −/− | −/− | 898/91 |
LORIA | ko_gsd | 0.43/0.4 | −/− | −/− | 898/91 |
Tilburg | ko_gsd | 0.08/0.06 | −/− | −/− | 898/91 |
BME-UW | ko_kaist | 0.51/0.39 | −/− | −/− | 1849/438 |
IMS | ko_kaist | 0.82/0.6 | −/− | −/− | 1849/438 |
LORIA | ko_kaist | 0.43/0.37 | −/− | −/− | 1849/438 |
Tilburg | ko_kaist | 0.14/0.11 | −/− | −/− | 1849/438 |
BME-UW | ru_syntagrus | 0.58/0.59 | 0.15/0.19 | 0.31/0.48 | 6070/421 |
IMS | ru_syntagrus | 0.76/0.77 | 0.42/0.18 | 0.58/0.37 | 6070/421 |
LORIA | ru_syntagrus | 0.61/0.62 | 0.33/0.3 | 0.39/0.55 | 6070/421 |
Tilburg | ru_syntagrus | 0.46/0.47 | −0.2/−0.37 | −0.01/−0.2 | 6070/421 |
Zooming in on the ru_syntagrus corpus and two models, one that can produce non-projective trees, BME-UW (Kovács et al., 2019), and one that cannot, the IMS system (Yu et al., 2019), we observe two opposite trends.
For the BME-UW model, the median values for fluency and adequacy are higher for non-projective sentences. Fluency medians (proj/non-proj) are 0.15/0.19 (Mann–Whitney U = 4109131.0, n1 = 6070, n2 = 421, p < 0.001 two-tailed); adequacy medians (proj/non-proj) are 0.31/0.48 (U = 2564235.0, n1 = 6070, n2 = 421, p < 0.001). In other words, while the model can handle non-projective structures, a key drawback revealed by our error analysis is that for sentences with projective structures (which, incidentally, are much more frequent in the data), the model output is in fact judged less fluent and less adequate by human annotators than for non-projective sentences.
Conversely, for the IMS system, the median value for fluency is higher for projective sentences (0.42 vs. 0.18 for non-projective sentences), and the distributions in the two groups differ significantly (U = 4038434.0, p < 0.001 two-tailed). For adequacy, the median value for projective sentences (0.58) is also significantly higher than that for non-projective sentences (0.37; U = 2583463.0, p < 0.001 two-tailed). This in turn confirms the need for models that can handle non-projective structures.
Another interesting point highlighted by the results on the ru_syntagrus corpus in Table 3 is that similar BLEU scores for projective and non-projective structures do not necessarily mean similar human evaluation scores.
In terms of BLEU only (i.e., for all the other corpora, which have no human evaluations), and modulo the caveat just made about the relation between BLEU and human evaluation, we find that the non-projective median values were always lower than the projective ones, and that the distributions differed significantly, across all 25 comparisons made. This underlines the need for models that can handle both projective and non-projective structures.
5.3 Entropy
Correlation between dependency relation entropy and dependency edge accuracy permits identifying which model, language, or corpus is particularly affected by word order freedom.
For instance,9 for the id_gsd corpus, three teams have a Spearman’s ρ between −0.62 and −0.67, indicating that their models under-perform for dependency relations with free word order. Conversely, two other teams showed only weak correlations (ρ = −0.31 and ρ = −0.36) on the same id_gsd corpus.
The impact of entropy also varies depending on the language, the corpus, and, more generally, the entropy of the data. For instance, for Japanese (ja_gsd corpus), dependency relations have low entropy (the mean entropy averaged on all relations is 0.02) and so we observe no correlation between entropy and performance. Conversely, for Czech (the treebank with the highest mean entropy, H = 0.52), two teams show non-trivial negative correlations (ρ = −0.54 and ρ = −0.6) between entropy and DEA.
5.4 Which Syntactic Constructions Are Harder to Handle?
DEA.
For a given dependency relation, DEA assesses how well a model succeeds in realizing that relation. To identify which syntactic constructs are problematic for surface realization models, we therefore compute dependency edge accuracy per relation, averaging over all submissions. Table 4 shows the results.
deprel | count | Accuracy (%)
---|---|---
list | 4,914 | 17.75 |
vocative | 974 | 21.91 |
dislocated | 7,832 | 23.11 |
reparandum | 33 | 27.27 |
goeswith | 1,453 | 27.98 |
parataxis | 27,484 | 28.76 |
dep | 14,496 | 29.80 |
advcl | 60,719 | 32.52 |
csubj | 8,229 | 36.60 |
discourse | 3,862 | 37.45 |
ccomp | 33,513 | 41.74 |
obl | 232,097 | 42.39 |
appos | 35,781 | 43.59 |
advmod | 180,678 | 44.84 |
iobj | 16,240 | 44.96 |
conj | 149,299 | 45.77 |
orphan | 843 | 48.49 |
expl | 10,137 | 50.90 |
acl | 79,168 | 51.24 |
cop | 45,187 | 51.78 |
nsubj | 268,686 | 51.80 |
xcomp | 36,633 | 56.12 |
obj | 190,140 | 57.87 |
nummod | 61,459 | 58.46 |
aux | 95,748 | 58.47 |
mark | 105,993 | 59.77 |
compound | 82,314 | 59.99 |
nmod | 357,367 | 60.94 |
flat | 62,686 | 61.28 |
amod | 246,733 | 61.68 |
cc | 123,866 | 61.94 |
clf | 1,668 | 67.47 |
fixed | 27,978 | 73.08 |
det | 280,978 | 73.51 |
case | 465,583 | 74.15 |
Unsurprisingly, relations with low counts (the first five relations in the table) have low accuracy. Because they are rare (in fact, they are often absent from most corpora), SR models struggle to realize them.
Other relations with low accuracy either have free word order (i.e., advcl, discourse, obl, advmod) or vague semantics (dep, an unspecified dependency). Clearly, in the latter case, systems cannot make a good prediction; as for the former, the low DEA score may be an artefact of the fact that it is computed with respect to a single reference. As the construct may occur in different positions in a sentence, several equally correct sentences may realize the input, but only the one matching the reference escapes the penalty. This underlines once again the need for an evaluation setup with multiple references.
Relations with the highest accuracy are those for function words (case: case-marking elements, det: determiners, clf: classifiers), fixed multiword expressions (fixed), and nominal dependents (amod, nmod, nummod). These dependents tend to occur at a stable distance from their head, more often have a fixed word order, and are less prone to shifting than the relations discussed above, which explains their higher realization accuracy.
Interestingly, when computing DEA per dependency relation and per corpus, we found similar DEA scores for all corpora. That is, dependency relations have consistently low or high DEA scores across all corpora, indicating that improvement on a given relation should improve performance on all corpora and languages.
Finally, we note that, at the model level, DEA scores are useful for researchers, as they bring interpretability and break errors down into subcases by relation type.
Error Mining for Syntactic Trees.
We can also obtain a more detailed picture of which syntactic constructs degrade performance using error mining. After running error mining on all submissions, we examine the subtrees in the input that have the highest coverage, that is, for which the percentage of submissions tagging these forms as suspicious10 is highest. Tables 5, 6, and 7 show the results when using different views of the data (i.e., focusing only on dependency relations, only on POS tags, or on both).
rank | subtree | coverage (%) | MSS
---|---|---|---
1–2 | (conj (X)) | 70–73 | 1.17 |
3 | (advcl (nsubj)) | 62 | 0.91 |
4 | (advcl (advmod)) | 62 | 0.95 |
5 | (advmod (advmod)) | 59 | 0.77 |
6 | (conj (advcl)) | 57 | 0.75 |
7 | (nsubj (conj)) | 56 | 0.68 |
8–11 | (conj (X)) | 52–56 | 0.87 |
12 | (nmod (advmod)) | 52 | 0.56 |
13 | (nsubj (amod)) | 52 | 0.75 |
14–15 | (conj (X)) | 49–50 | 0.73 |
16 | (parataxis (nsubj)) | 49 | 0.75 |
17 | (conj (advmod advmod)) | 48 | 0.65 |
18 | (advcl (cop)) | 48 | 0.60 |
19 | (advcl (aux)) | 47 | 0.59 |
20 | (ccomp (advmod)) | 47 | 0.68 |
tree | coverage (%) | MSS
---|---|---
(ADJ (PRON)) | 70 | 0.90 |
(VERB (VERB)) | 69 | 1.21 |
(ADJ (ADJ)) | 68 | 0.89 |
(NOUN (ADV)) | 67 | 1.03 |
(ADJ (ADP)) | 66 | 0.77 |
(VERB (ADJ)) | 65 | 0.98 |
(ADV (ADV)) | 63 | 0.87 |
(NOUN (AUX)) | 62 | 0.90 |
(ADJ (VERB)) | 60 | 0.80 |
(VERB (CCONJ)) | 60 | 1.02 |
(PRON (ADP)) | 56 | 0.81 |
(VERB (VERB VERB)) | 55 | 0.89 |
(NUM (NUM)) | 55 | 0.72 |
(PROPN (NOUN)) | 53 | 0.79 |
(PRON (VERB)) | 53 | 0.63 |
(ADJ (CCONJ)) | 52 | 0.65 |
(VERB (ADV)) | 52 | 0.96 |
(ADJ (SCONJ)) | 52 | 0.62 |
(VERB (ADP)) | 51 | 0.76 |
(VERB (PROPN)) | 51 | 0.83 |
subtree | coverage (%) | MSS
---|---|---
(VERB∼conj (ADV∼advmod)) | 60 | 0.90
(VERB∼conj (PRON∼nsubj)) | 60 | 0.78 |
(NOUN∼nsubj (ADJ∼amod)) | 55 | 0.77 |
(ADV∼advmod (ADV∼advmod)) | 54 | 0.69 |
(VERB∼advcl (ADV∼advmod)) | 53 | 0.76 |
(VERB∼advcl (NOUN∼nsubj)) | 53 | 0.70 |
(VERB∼conj (VERB∼advcl)) | 50 | 0.60 |
(VERB∼advcl (PRON∼obj)) | 48 | 0.53 |
(VERB∼ccomp (ADV∼advmod)) | 47 | 0.57 |
(NOUN∼nsubj (NOUN∼conj)) | 46 | 0.46 |
(VERB∼advcl (NOUN∼obl)) | 46 | 0.68 |
(VERB∼conj (PRON∼obj)) | 45 | 0.57 |
(VERB∼advcl (AUX∼aux)) | 44 | 0.56 |
(VERB∼conj (AUX∼aux)) | 41 | 0.59 |
(NOUN∼obl (ADJ∼amod)) | 40 | 0.62 |
(NOUN∼nsubj (VERB∼acl)) | 40 | 0.46 |
(VERB∼acl (ADV∼advmod)) | 40 | 0.47 |
(NOUN∼obl (ADV∼advmod)) | 38 | 0.43 |
(NOUN∼conj (VERB∼acl)) | 38 | 0.38 |
(VERB∼ccomp (AUX∼aux)) | 38 | 0.48 |
Table 5 highlights coordination (conj, 13 subtrees out of 20) and adverbial clause modifiers (advcl, 5 cases) as a main source of low BLEU scores. This mirrors the results shown for single dependency relations (cf. Table 4) but additionally indicates the specific configurations in which these relations are most problematic, such as the combination of an adverbial clause modifier with a nominal subject (nsubj, 62% coverage) or with an adverbial modifier (advmod, 62% coverage), or the combination of two adverbial modifiers (e.g., down there, far away, very seriously).
Table 6 shows the results for the POS setting. Unlike the dep-based view, it highlights head-dependent constructs with identical POS tags, for example, (ADV (ADV)), (ADJ (ADJ)), (NUM (NUM)), (VERB (VERB)), and (VERB (VERB VERB)), as a frequent source of errors. For instance, the relative order of two adjectives (ADJ (ADJ)) is sometimes lexically driven and therefore difficult to predict (Malouf, 2000).
6 Using Error Analysis for Improving Models or Datasets
As shown in the preceding section, the error analysis framework introduced in Section 3 can be used by evaluation campaign organizers to provide a linguistically informed interpretation of campaign results aggregated over multiple system runs, languages or corpora.
For individual researchers and model developers, our framework also provides a means to have a fine-grained interpretation of their model results that they can then use to guide model improvement, to develop new models, or to improve training data. We illustrate this point by giving some examples of how the toolkit could be used to help improve a model or a dataset.
Data Augmentation.
Augmenting the training set with silver data has repeatedly been shown to increase performance (Konstas et al., 2017; Elder and Hokamp, 2018). In those approaches, performance is improved simply by increasing the size of the training data. In contrast, information from the error analysis toolkit could be used to support error-focused data augmentation, that is, to specifically augment the training data with instances of those cases for which the model underperforms (e.g., dependency relations with low dependency edge accuracy, constructions with high suspicion scores, or input trees with large depth, length, or mean dependency distance). This could be done either manually (by annotating sentences containing the relevant constructions) or automatically, by parsing text and then filtering for those parse trees that contain the dependency relations and subtrees for which the model underperforms; a sketch of such a filter is given below. For those cases where the problematic construction is frequent, we conjecture that this might lead to a larger overall score increase than “blind” global data augmentation.
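A hypothetical sketch of such a filter; the relation set is illustrative, taken from the low-accuracy rows of Table 4, and the parse format is an assumption.

```python
LOW_DEA_RELATIONS = {"list", "dislocated", "parataxis", "advcl"}

def select_for_augmentation(silver_parses, max_sentences=10000):
    """silver_parses: iterable of automatically parsed sentences, each a list of
    (head, dependent, relation) triples. Keeps only those sentences that contain
    at least one construction the model handles poorly."""
    selected = []
    for parse in silver_parses:
        if any(rel in LOW_DEA_RELATIONS for _, _, rel in parse):
            selected.append(parse)
            if len(selected) >= max_sentences:
                break
    return selected
```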
Language Specific Adaptation.
Languages exhibit different word order schemas and have different ways of constraining word order. Error analysis can help identify which language-specific constructs impact performance and how to improve a language-specific model with respect to these constructs.
For instance, a dependency relation with high entropy and low accuracy indicates that the model has difficulty learning the word order freedom of that relation. Model improvement can then target a better modelling of the factors which determine word order for that relation. In Romance languages, for example, adjectives mostly occur after the noun they modify, but some adjectives are pre-posed. As pre-posed adjectives form a relatively small, closed set, a plausible way to improve the model would be to enrich the input representation by indicating, for each adjective, whether it belongs to the class of pre-posed or post-posed adjectives.
Global Model Improvement.
Error analysis can suggest directions for model improvement. For instance, a high proportion of non-projective sentences in the language’s reference treebank, together with lower performance metrics for those sentences, suggests improving the ability of the model to handle non-projective structures. Indeed, Yu et al. (2020) showed that the performance of the model of Yu et al. (2019) could be greatly improved by extending it to handle non-projective structures.
Treebank Specific Improvement.
Previous research has shown that treebanks contain inconsistencies that impact both learning and evaluation (Zeman, 2016).
The tree-based metrics and the error mining techniques provided in our toolkit can help identify those dependency relations and constructions which have consistently low scores across different models, or diverging scores across different treebanks for the same language. For instance, strong inconsistencies in the annotation of multi-word expressions (MWEs) may be signalled by a low DEA for the fixed dependency relation (which should be used to annotate MWEs). Such annotation errors could also be detected using lemma-based error mining, that is, error mining on trees decorated with lemmas; such mining would then show that the most suspicious forms involve multi-word expressions (e.g., “in order to”).
Ensemble Model.
Given a model M and a test set T, our toolkit can be used to compute, for each dependency relation d present in the test set, the average DEA of that model for that relation, DEA_M(d), that is, the sum of the model’s DEA over all d-edges in T normalized by the number of those edges. This could be used to learn an ensemble model which, for each input, outputs the sentence generated by the model whose score according to this metric is highest. Given an input tree t consisting of a set of edges D, the score of a model M could, for instance, be the sum of the model’s average DEA for the edges contained in the input tree normalized by the number of edges in that tree, namely, score(M, t) = (1/|D|) Σ_{d ∈ D} DEA_M(d); a sketch is given below.
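A sketch of this ensemble selection rule (our illustration of the idea; the per-relation average DEA table for each model would come from the toolkit):

```python
def model_score(avg_dea, input_relations):
    """avg_dea: dict mapping a dependency relation to the model's average DEA
    on the test set; input_relations: relations of the edges of the input tree."""
    return sum(avg_dea.get(rel, 0.0) for rel in input_relations) / len(input_relations)

def ensemble_pick(models, input_relations):
    """models: dict mapping a model name to its avg_dea table. Returns the name
    of the model predicted to handle this input best."""
    return max(models, key=lambda name: model_score(models[name], input_relations))
```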
7 Conclusion
We presented a framework for error analysis that supports a detailed assessment of which syntactic factors impact the performance of surface realisation models. We applied it to the results of two SR shared task campaigns and suggested ways in which it could be used to improve models and datasets for shallow surface realisation. More generally, we believe that scores such as BLEU and, to some extent, human ratings do not provide a clear picture of the extent to which SR models can capture the complex constraints governing word order in the world’s natural languages. We hope that the metrics and tools gathered in this evaluation toolkit can help address this issue.
Acknowledgments
We are grateful to Kim Gerdes for sharing his thoughts at the initial stage of this research project and giving us useful literature pointers, and we thank Shashi Narayan for making his tree error mining code available to us. This research project would also not have been possible without the data provided by the Surface Realization shared task organisers, whose support and responsiveness we gratefully acknowledge. We also thank our reviewers for their constructive and valuable feedback. This project was supported by the French National Research Agency (Gardent; award ANR-20-CHIA-0003, XNLG “Multi-lingual, Multi-Source Text Generation”).
Notes
Our code and settings to reproduce the experiments are available at https://gitlab.com/shimorina/tacl-2021.
We report signed values for dependency distance, rather than absolute ones, to account for the dependent position—after or before the head.
The quartile threshold is an empirical choice; any other threshold could be used.
We do not include other automatic n-gram-based metrics used in the SR shared tasks because they usually correlate with each other.
Bear in mind that using human assessments for word ordering evaluation has a downside: the assessments were collected for the final sentences and were not specifically designed for word ordering evaluation. A more detailed human evaluation focused on word ordering might be needed to confirm the findings that involve human judgments.
Unfortunately no human evaluations were available for those corpora.
As indicated in Section 4, we computed correlation scores between entropy and all performance scores, for all systems and all corpora. These are not shown here for reasons of space.
A form is suspicious if its suspicion score is not null.