Abstract
In this article, we explore the potential of using sentence-level discourse structure for machine translation evaluation. We first design discourse-aware similarity measures, which use all-subtree kernels to compare discourse parse trees in accordance with Rhetorical Structure Theory (RST). Then, we show that a simple linear combination with these measures can help improve various existing machine translation evaluation metrics in terms of correlation with human judgments both at the segment level and at the system level. This suggests that discourse information is complementary to the information used by many existing evaluation metrics, and thus it could be taken into account when developing richer evaluation metrics, such as the WMT-14 winning combined metric DiscoTKparty. We also provide a detailed analysis of the relevance of various discourse elements and relations from the RST parse trees for machine translation evaluation. In particular, we show that (i) all aspects of the RST tree are relevant, (ii) nuclearity is more useful than relation type, and (iii) the similarity of the translation RST tree to the reference RST tree is positively correlated with translation quality.
1. Introduction
From its foundations, Statistical Machine Translation (SMT) as a field had two defining characteristics. First, translation was modeled as a generative process at the sentence level. Second, it was purely statistical over words or word sequences and made little to no use of linguistic information (Brown et al., 1993; Koehn, Och, and Marcu, 2003).
Although modern SMT systems switched to a discriminative log-linear framework (Och, 2003; Watanabe et al., 2007; Chiang, Marton, and Resnik, 2008; Hopkins and May, 2011), which allows for incorporating additional sources of information as features, it is generally hard to model dependencies beyond a small window of adjacent words, which makes it difficult to use linguistically rich models.
One of the fruitful research directions for improving SMT has been the usage of more structured linguistic information. For instance, in SMT we find systems based on syntax (Galley et al., 2004; Quirk, Menezes, and Cherry, 2005), hierarchical structures (Chiang, 2005), and semantic roles (Wu and Fung, 2009; Lo, Tumuluru, and Wu, 2012; Bazrafshan and Gildea, 2014). However, it was not until recently that syntax-based SMT systems started to outperform their phrase-based counterparts, especially for language pairs that need long-distance reordering such as Chinese–English and German–English (Nadejde, Williams, and Koehn, 2013).
Another, less-explored, direction consists of going beyond the sentence level; for example, translating at the document level or taking into account broader contextual information. The idea is to obtain adequate translations respecting cross-sentence relations and enforcing cohesion and consistency at the document level (Hardmeier, Nivre, and Tiedemann, 2012; Ben et al., 2013; Louis and Webber, 2014; Tu, Zhou, and Zong, 2014; Xiong, Zhang, and Wang, 2015). Research in this direction has also been the focus of the two editions of the DiscoMT workshop, in 2013 and 2015 (Webber et al., 2013, 2015; Hardmeier et al., 2015).
Automatic MT evaluation is an integral part of the process of developing and tuning an SMT system. Reference-based evaluation measures compare the output of a system to one or more human translations (called references) and produce a similarity score indicating the quality of the translation. The first metrics approached similarity as a shallow word n-gram matching between the translation and one or more references, with a limited use of linguistic information. BLEU (Papineni et al., 2002) is the best-known metric in this family, and has been used for years as the evaluation standard in the MT community. BLEU can be efficiently calculated and has shown good correlation with human assessments when evaluating systems on large quantities of text. However, it is also known that BLEU and similar metrics are unreliable for high-quality translation output (Doddington, 2002; Lavie and Agarwal, 2007), and they cannot tell apart raw machine translation output from a fully fluent professionally post-edited version thereof (Denkowski and Lavie, 2012). Moreover, lexical-matching similarity has been shown to be both insufficient and not strictly necessary for two sentences to convey the same meaning (Coughlin, 2003; Culy and Riehemann, 2003; Callison-Burch, Osborne, and Koehn, 2006).
Several alternatives emerged to overcome these limitations, most notably TER (Snover et al., 2006) and METEOR (Lavie and Denkowski, 2009). Researchers have explored, with good results, the addition of other levels of linguistic information, including synonymy and paraphrasing (Lavie and Denkowski, 2009), syntax (Liu and Gildea, 2005; Giménez and Màrquez, 2007; Popovic and Ney, 2007), semantic roles (Giménez and Màrquez, 2007; Lo, Tumuluru, and Wu, 2012), and, most recently, discourse (Giménez et al., 2010; Wong and Kit, 2012; Guzmán et al., 2014a, 2014b; Joty et al., 2014).
Beyond these considerations, MT systems are usually evaluated by computing translation quality on individual sentences and performing some simple aggregation to produce the system-level evaluation scores. To the best of our knowledge, semantic relations between clauses in a sentence and between sentences in a text have not been seriously explored. However, clauses and sentences rarely stand on their own in a well-written text; rather, the logical relationship between them carries significant information that allows the text to express a meaning as a whole. Each clause follows smoothly from the ones before it and leads into the ones that come afterward. This logical relationship between clauses forms a coherence structure (Hobbs, 1979). In discourse analysis, we seek to uncover this coherence structure underlying the text.
Several formal theories of discourse have been proposed to describe the coherence structure (Mann and Thompson, 1988; Asher and Lascarides, 2003; Webber, 2004). Rhetorical Structure Theory (RST; Mann and Thompson 1988) is perhaps the most influential of these in computational linguistics, where it is used either to parse the text in language understanding or to plan a coherent text in language generation (Taboada and Mann, 2006). RST describes coherence using discourse relations between parts of a text and postulates a hierarchical tree structure called a discourse tree. For example, Figure 1 in the next section shows discourse trees for three different translations of a source sentence.
Figure 1: Example of three different discourse trees for the translations of a source sentence: (a) the reference, (b) a higher-quality translation, and (c) a lower-quality translation.
Modeling discourse brings together the usage of higher-level linguistic information and the exploration of relations between clauses and sentences in a text, which makes it a very attractive goal for MT and its evaluation. We believe that the semantic and pragmatic information captured in the form of discourse trees (i) can yield better MT evaluation metrics, and (ii) can help develop discourse-aware SMT systems that produce more coherent translations.
In this work, we focus on the first of the two previous research hypotheses. Specifically, we show that sentence-level discourse information can be used to produce reference-based evaluation measures that perform well on their own, but more importantly, can be used to improve over many existing MT evaluation metrics in terms of correlation with human assessments. We conduct our research in three steps. First, we design a simple discourse-aware similarity measure, DR-lex, based on RST trees, generated with a publicly available discourse parser (Joty, Carenini, and Ng, 2015), and the well-known all-subtree kernel (Collins and Duffy, 2001). The subtree kernel computes a similarity value by comparing the discourse tree representation of a system translation with that of a reference translation. We show that a simple uniform linear combination with this metric helps to improve a large number of MT evaluation metrics at the segment level and at the system level in the context of the WMT11 and the WMT12 metrics shared task benchmarks (Callison-Burch et al., 2011, 2012). Second, we show that tuning (i.e., learning) the weights in the linear combination of metrics using human-assessed examples is a robust way to improve the effectiveness of the DR-lex metric significantly. Following the idea of an interpolated combination, we put together several variants of our discourse metric (using different tree-based representations) with many strong pre-existing metrics provided by the Asiya toolkit for MT evaluation (Gonzàlez, Giménez, and Màrquez, 2012). The result is DiscoTKparty, which scored best at the WMT14 Metrics task (Bojar et al., 2014), both at the system level and at the segment level. Third, we conduct an ablation study that helps us understand which elements of the discourse parse tree have the highest impact on the quality of the evaluation measure. Interestingly enough, the nuclearity feature (i.e., the distinction between main and subordinate units) of the RST tree turns out to be more important than the discourse relation types (e.g., Elaboration, Contrast).
Note that, although extensive, this study is restricted to sentence-level evaluation, which arguably can limit the benefits of using global discourse properties (i.e., document-level discourse structure). Fortunately, many sentences are long and complex enough to present rich discourse structures connecting their basic clauses. Thus, although limited, this setting can demonstrate the potential of discourse-level information for MT evaluation. Furthermore, sentence-level scoring is compatible with most translation systems, which work on a sentence-by-sentence basis. It could also be beneficial to modern MT tuning mechanisms such as PRO (Hopkins and May, 2011) and MIRA (Watanabe et al., 2007; Chiang, Marton, and Resnik, 2008), which also work at the sentence level. Finally, it could also be used for re-ranking n-best lists of translation hypotheses.
The rest of the paper is organized as follows. Section 2 introduces our proposal for a family of discourse-based similarity metrics. Sections 3 and 4 describe the experimental setting and the evaluation of the discourse-based metrics, alone and in combination with other pre-existing measures. Section 5 empirically analyzes the main discourse-based metric and performs an ablation study to better understand its contributions. Finally, Sections 6 and 7 discuss related work and present the conclusions together with some directions for future research.
2. Discourse-Based Similarity Measures
Different formal theories of discourse have been proposed in the literature, reflecting different viewpoints about what is the best way to describe the coherence structure of a text. For example, Asher and Lascarides (2003) proposed the Segmented Discourse Representation Theory, which is driven by sentence semantics. Webber (2004) and Danlos (2009) extended the sentence grammar to formalize discourse structure. Mann and Thompson (1988) proposed RST, which was inspired by empirical analysis of authentic texts. Although RST was initially intended to be used for text generation, it later became popular as a framework for parsing the structure of a text. This work relies on RST-based coherence structure.
RST posits a tree representation of a text, known as a discourse tree. As shown in Figure 1(a), the leaves of a discourse tree (three in this example) correspond to contiguous atomic clause-like text spans, called elementary discourse units (EDUs), which serve as building blocks for constructing the tree. In the tree, adjacent EDUs are connected by certain coherence relations (e.g., Elaboration, Attribution), thus forming larger discourse units, which in turn are also subject to this process of relation-based linking. Discourse units that are linked by a relation are further distinguished based on their relative importance in the text: nuclei are the core arguments of the relation, and satellites are supportive ones. A discourse relation can be either mononuclear or multinuclear. A mononuclear relation connects a nucleus and a satellite (e.g., Elaboration, Attribution in Figure 1(a)), whereas a multinuclear relation connects two or more nuclei (e.g., Joint, Contrast). Thus, an RST discourse tree comprises four types of elements: (i) the EDUs, which carry the textual information, (ii) the structure or skeleton of the tree, (iii) the nuclearity statuses of the discourse units, and (iv) the coherence relations by which adjacent discourse units are linked.
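To make this data structure concrete, the following is a minimal Python sketch of how such a sentence-level discourse tree could be represented; the class layout, field names, relation labels, and the example sentence are illustrative assumptions rather than the output format of any particular parser.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class DiscourseNode:
    nuclearity: str                       # "Nucleus", "Satellite", or "Root"
    relation: Optional[str] = None        # e.g., "Elaboration"; None for EDU leaves
    children: List["DiscourseNode"] = field(default_factory=list)
    edu_text: Optional[str] = None        # clause text for EDU leaves only

def edus(node: DiscourseNode) -> List[str]:
    """Collect the EDU texts in left-to-right order."""
    if node.edu_text is not None:
        return [node.edu_text]
    return [text for child in node.children for text in edus(child)]

# A hypothetical three-EDU sentence; the relation labels are chosen for illustration only.
tree = DiscourseNode("Root", "Elaboration", [
    DiscourseNode("Nucleus", edu_text="the bank rejected the proposal,"),
    DiscourseNode("Satellite", "Attribution", [
        DiscourseNode("Satellite", edu_text="analysts said,"),
        DiscourseNode("Nucleus", edu_text="because it carried too much risk."),
    ]),
])

print(edus(tree))
```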
Our hypothesis in this article is that the similarity between the discourse trees of an automatic translation and of a reference translation provides additional information that can be valuable for evaluating MT systems. In particular, we believe that better system translations should be more similar to the human translations in their discourse structures than worse ones. As an example, consider the three discourse trees shown in Figure 1: (a) for a reference translation, and (b) and (c) for translations of two different systems from the WMT12 competition. Notice that the tree structure, the nuclearity statuses, and the relation labels in the reference translation are also realized in the system translation in Figure 1(b), but not in Figure 1(c); this makes (b) a better translation compared with (c), according to our hypothesis. We argue that existing metrics that only use lexical and syntactic information cannot distinguish well between the translations in Figure 1(b) and Figure 1(c).
In order to develop a discourse-aware evaluation metric, we first generate discourse trees for the reference-translated and the system-translated sentences using an RST discourse parser, and then we measure the similarity between the two trees. We describe these two steps in more detail next.
2.1 Generating Discourse Trees
Conventionally, discourse analysis in RST involves two main subtasks: (i) discourse segmentation, or breaking the text into a sequence of EDUs, and (ii) discourse parsing, or the task of linking the discourse units (which could be EDUs or larger units) into labeled discourse trees. Recently, Joty, Carenini, and Ng (2012, 2015) proposed discriminative models for discourse segmentation and discourse parsing. Their discourse segmenter uses a maximum entropy model and achieves state-of-the-art performance with an F1-score of 90.5, whereas human agreement for this task is 98.3 in F1-score.
The discourse parser uses a dynamic Conditional Random Field (Sutton, McCallum, and Rohanimanesh, 2007) as a parsing model to infer the probability of all possible discourse tree constituents. The inferred (posterior) probabilities are then used in a probabilistic CKY-like bottom–up parsing algorithm to find the most likely parse. Using the standard set of 18 coarse-grained discourse relations, the discourse parser achieved an F1-score of 79.8% at the sentence level, which is close to the human agreement of 83%. These high numbers inspired us to develop discourse-aware MT evaluation metrics.
2.2 Measuring Similarity Between Discourse Trees
A number of metrics have been proposed to measure the similarity between two labeled trees—for example, Tree Edit Distance (Tai, 1979) and various Tree Kernels (TKs) (Collins and Duffy, 2001; Smola and Vishwanathan, 2003; Moschitti, 2006). One advantage of tree kernels is that they provide an effective way to integrate tree structures in kernel-based learning algorithms like SVMs, and learn from arbitrary tree fragments as features.
Collins and Duffy (2001) proposed a syntactic tree kernel to efficiently compute the number of common subtrees in two syntactic trees. To comply with the rules (or productions) of a context-free grammar in syntactic parsing, the subtrees in this kernel are subject to the constraint that their nodes are taken with either all or none of the children. Because the same constraint applies to discourse trees, we use the same tree kernel in our work. Figure 2 shows the valid subtrees according to the syntactic tree kernel for the discourse tree in Figure 1(a). Note that in this work we use the tree kernel only to measure the similarity between two discourse trees rather than to learn subtree features in a supervised kernel-based learning framework like SVM. As an example of the latter, see our more recent work (Guzmán et al., 2014a), which uses tree kernels over syntactic and discourse structures in an SVM preference ranking framework.
Figure 2: Discourse subtrees used by the syntactic tree kernel for the tree in Figure 1(a).
Collins and Duffy (2001) proposed two modifications of the kernel when using it in a classifier (e.g., SVM) to avoid the classifier behaving like a nearest neighbor rule: (i) to restrict the tree fragments considered in the kernel computation based on their depth, and/or (ii) to assign relative weights to the tree fragments based on their size. Because we do not use the kernel in a learning algorithm, these modifications do not apply to us; all subtrees are equally weighted in our kernel.
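As a concrete illustration of this kernel, the following is a hedged Python sketch of the all-subtree computation with equal weighting of fragments, as described above; the (label, children) tuple encoding of trees and the final normalization to [0, 1] are our own assumptions, not necessarily the exact implementation used in the experiments.

```python
import math

def production(node):
    """A node's label together with the sequence of its children's labels."""
    label, children = node
    return (label, tuple(child[0] for child in children))

def internal_nodes(node):
    """All nodes that have at least one child."""
    label, children = node
    nodes = [node] if children else []
    for child in children:
        nodes.extend(internal_nodes(child))
    return nodes

def is_preterminal(node):
    return all(not child[1] for child in node[1])

def delta(n1, n2):
    """Number of shared subtree fragments rooted at n1 and n2 (all-or-none children)."""
    if production(n1) != production(n2):
        return 0
    if not n1[1] or is_preterminal(n1):
        return 1
    return math.prod(1 + delta(c1, c2) for c1, c2 in zip(n1[1], n2[1]))

def subtree_kernel(t1, t2):
    """Total number of common subtree fragments, with no decay factor."""
    return sum(delta(a, b) for a in internal_nodes(t1) for b in internal_nodes(t2))

def similarity(t1, t2):
    """Kernel normalized to [0, 1]; the normalization itself is an assumption."""
    norm = math.sqrt(subtree_kernel(t1, t1) * subtree_kernel(t2, t2))
    return subtree_kernel(t1, t2) / norm if norm else 0.0
```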
Figure 2 shows that, when applied to discourse trees, the syntactic tree kernel may limit the types of substructures that we can compare. For example, although matching the complete production (i.e., a parent with all of its children) may make more sense for subtrees with internal nodes only (i.e., non-terminals), we may want to relax this constraint at the terminal (text) level to allow word subsequence matches.
One way to cope with this limitation of the tree kernel is to change the representation of the trees to a form that is suitable to capture the relevant information for our task. For example, in order to allow for the syntactic tree kernel to find subtree matches at the word unigram level, we can include an artificial layer of leaves (e.g., by copying the same dummy label below each word). In this way, the words become pre-terminal nodes and can be matched against the words in the other tree.
Apart from this modification to match subtrees at the word level, we experimented with different representations of a discourse tree, each of which produces a different discourse-based evaluation metric. In this section we present two basic representations of the discourse tree, namely, DR and DR-lex, which we will use in Section 4 to demonstrate that the discourse measures are synergetic with several widely used MT evaluation metrics (DR stands for discourse representation).
Figure 3 shows the two representations DR and DR-lex for the highlighted subtree in Figure 1(b), which spans the text: suggest the ECB should be the lender of last resort. As shown in Figure 3(a), DR does not include any lexical items. Therefore, the syntactic tree kernel, when applied to this representation of the discourse tree, measures the similarity between two candidate translations in terms of their discourse representations only.
Figure 3: Two discourse tree representations for the highlighted subtree in Figure 1(b).
In contrast, DR-lex, as shown in Figure 3(b), includes the lexical items to account for lexical matching; moreover, it separates the structure (skeleton) of the tree from its labels (i.e., the nuclearity statuses and the relation labels). This allows the syntactic tree kernel to give partial credit to subtrees that differ in labels but match in their skeletons, or vice versa. More specifically, DR-lex uses the predefined tags SPAN and EDU to build the skeleton of the tree, and considers the nuclearity and/or the relation labels as properties, added as children, of these tags. For example, a SPAN has two properties (its nuclearity status and its relation label), whereas an EDU has only one property (its nuclearity status). The words of an EDU are placed under another predefined tag NGRAM. To allow the tree kernel to find subtree matches at the word level, we also include an additional layer of dummy leaves (for simplicity, not shown in Figure 3(b)).
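The following is a rough Python sketch of how a DR-lex-style representation could be assembled from a segmented, parsed sentence; the tag spellings (SPAN, EDU, NGRAM, DUMMY), the nested-dict input format, and the EDU segmentation and labels of the example are illustrative assumptions. The resulting (label, children) tuples can be fed directly to a tree kernel such as the one sketched above.

```python
def to_drlex(node):
    """node: {'nuc': ..., 'rel': ..., 'children': [...]} for an internal span,
    or {'nuc': ..., 'words': [...]} for an EDU leaf. Returns (label, children) tuples."""
    if "words" in node:                                     # EDU leaf
        # Words go under NGRAM; a dummy leaf below each word enables unigram matches.
        ngram = ("NGRAM", tuple((w, (("DUMMY", ()),)) for w in node["words"]))
        return ("EDU", ((node["nuc"], ()), ngram))          # nuclearity as a property child
    children = tuple(to_drlex(child) for child in node["children"])
    properties = ((node["nuc"], ()), (node["rel"], ()))     # nuclearity and relation as properties
    return ("SPAN", properties + children)

# Hypothetical segmentation and labels for the subtree discussed above.
subtree = {
    "nuc": "NUCLEUS", "rel": "ATTRIBUTION",
    "children": [
        {"nuc": "SATELLITE", "words": ["suggest"]},
        {"nuc": "NUCLEUS", "words": ["the", "ECB", "should", "be",
                                     "the", "lender", "of", "last", "resort"]},
    ],
}
print(to_drlex(subtree))
```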
3. Experimental Setting
In this section, we describe the data sets we used in our experiments, the interpolation approach we applied to combine our discourse-based metrics with pre-existing evaluation metrics, and all the correlation measures we used for evaluation.
3.1 Data Sets
In our experiments, we used the data available for the WMT11, WMT12, WMT13, and WMT14 metrics shared tasks for translations into English. This includes the output from the systems that participated in the MT evaluation campaigns in those four years and the corresponding English reference translations. The WMT11 and WMT12 data sets contain 2,000 and 3,003 sentences, respectively, for each of the following four language pairs: Czech–English (cs-en), French–English (fr-en), German–English (de-en), and Spanish–English (es-en). In WMT13, the Russian–English (ru-en) pair was added to the mix, and the data set has 3,000 sentences for each of the five language pairs. WMT14 dropped es-en and included Hindi–English (hi-en), with each language pair having 3,003 sentences, except for hi-en, for which there were 2,507 sentences.
The task organizers provided human judgments on the quality of the systems' translations. These judgments represent rankings of the output of five systems chosen at random, for a particular language pair and for a particular sentence. The overall coverage (i.e., the number of unique sentences that were evaluated) was only a fraction of the total (see Table 1). For example, for WMT11 fr-en, only 247 out of 3,000 sentences have human judgments. Although the WMT evaluation set-up operates at the sentence level, we believe that it is adequate for our purpose. The annotation interface allowed human judges to take longer-range discourse structure into account, as they were shown the source and the human reference translations in the context of one preceding and one following sentence.
Table 1: Number of systems (sys), unique non-tied translation pairs (pairs), and unique sentences for which such pairs exist (sent) for the different language pairs, for the human evaluation of the WMT11–WMT14 metrics shared tasks. These statistics show what we use for training; the numbers for testing are higher, as explained in the text.
| | WMT11 sys | WMT11 pairs | WMT11 sent | WMT12 sys | WMT12 pairs | WMT12 sent | WMT13 sys | WMT13 pairs | WMT13 sent | WMT14 sys | WMT14 pairs | WMT14 sent |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| cs-en | 12 | 2,477 | 190 | 6 | 8,269 | 937 | 11 | 46,397 | 2,572 | 5 | 10,301 | 1,288 |
| de-en | 28 | 7,358 | 346 | 16 | 9,084 | 968 | 17 | 75,856 | 2,589 | 13 | 15,971 | 1,472 |
| es-en | 21 | 4,799 | 274 | 12 | 8,751 | 910 | 12 | 36,626 | 2,172 | – | – | – |
| fr-en | 24 | 5,085 | 247 | 15 | 8,747 | 932 | 13 | 43,234 | 2,272 | 8 | 15,033 | 1,365 |
| ru-en | – | – | – | – | – | – | 19 | 94,509 | 2,740 | 13 | 24,595 | 1,800 |
| hi-en | – | – | – | – | – | – | – | – | – | 9 | 14,678 | 1,180 |
Table 1 shows the main statistics about the data that we used for training, where we excluded all pairs for which: (i) both translations were judged as equally good, or (ii) the number of votes for translation1 being better than translation2 equals the number of votes for it being worse than translation2. Moreover, we ignored repetitions—that is, if two judges voted the same way, we did not create two training examples, but just one (note, however, that on testing, repetitions will be accounted for). Excluding ties and repetitions reduces the number of training pairs significantly (e.g., for WMT13 cs-en, we have 46,397 pairs, whereas initially there were 85,469 judgments in total).
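The following Python sketch illustrates the filtering just described; the format of the judgment records and the vote-aggregation details are illustrative assumptions.

```python
from collections import Counter

def build_training_pairs(judgments):
    """judgments: iterable of (segment_id, sys_a, sys_b, outcome),
    where outcome is 'a', 'b', or 'tie' (a human judged sys_a >, <, or = sys_b)."""
    votes = Counter()
    for seg, a, b, outcome in judgments:
        if outcome == "tie":                     # (i) drop pairs judged equally good
            continue
        key = (seg, *sorted((a, b)))
        winner = a if outcome == "a" else b
        votes[(key, winner)] += 1
    pairs = set()
    for (seg, s1, s2) in {key for (key, _) in votes}:
        wins1 = votes[((seg, s1, s2), s1)]
        wins2 = votes[((seg, s1, s2), s2)]
        if wins1 == wins2:                       # (ii) drop pairs with conflicting evidence
            continue
        better, worse = (s1, s2) if wins1 > wins2 else (s2, s1)
        pairs.add((seg, better, worse))          # repetitions collapse to a single example
    return sorted(pairs)
```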
Note, however, that for testing, we used the official full data sets, where we used all pairwise judgments, including judgments saying that both translations are equally good, ties in the number of wins of translation1 vs. translation2, and repetitions. This is important to make our results fully comparable to previously published work.
As a final analysis on the WMT corpora, we studied the complexity of the discourse trees. Recall that we imposed the limitation of working with discourse structures at the sentence level. If we want the discourse metrics to be impactful, we need to make sure that a significant number of sentences have non-trivial discourse trees.
Figure 4 shows the proportion of sentences by discourse tree depth for the WMT11, WMT12, and WMT13 data sets. We computed these statistics by applying our automatic discourse parser to the reference translations. As can be seen, the three data sets show very similar curves. One relevant observation is that more than 70% of the sentences have a non-trivial discourse tree (depth > 0). Of course, the proportion of sentences decreases quickly with the tree depth: about 20% of the sentences have trees of depth 2, and slightly over 10% have trees of depth 3. The average depth for the three data sets is 1.77, with a minimum of 0 and a maximum of 32. The number of EDUs in those trees averages 2.77, with a minimum of 1 and a maximum of 33. Although the impact of discourse information is potentially higher at the paragraph or document level, these figures show that the sentences in our data sets are complex enough in terms of discourse structure to justify testing the effect of discourse information on MT evaluation metrics.
Figure 4: Distribution of sentences by tree depth, computed based on the reference translations of WMT11, WMT12, and WMT13.
3.2 Learning Interpolation Weights for Metric Combination
To combine our discourse-based measures with other metrics, we use a linear interpolation of the (min-max normalized) individual metric scores. The interpolation weights are learned from the human pairwise judgments described above: each ranking of five system outputs yields ten pairwise judgments, and we train a maximum entropy classifier to prefer, for each pair, the translation judged better by the human annotators; the learned coefficients then serve as interpolation weights. Note that this approach is similar to the one used by PRO for tuning the relative weights of the components of a log-linear SMT model (Hopkins and May, 2011). Unlike PRO, (i) we use human judgments, not automatic scores, and (ii) we train on all pairs, not on a subsample.
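As an illustration of this PRO-like tuning, here is a hedged sketch using scikit-learn's logistic regression as the maximum entropy learner; the data layout, the symmetrization trick, and the hyperparameters are assumptions, not necessarily those of the original implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def tune_weights(better_scores, worse_scores):
    """better_scores, worse_scores: arrays of shape (n_pairs, n_metrics), with rows
    aligned so that row i of `better_scores` belongs to the translation judged better."""
    diffs = better_scores - worse_scores
    # Add each pair in both directions so that no intercept is needed.
    X = np.vstack([diffs, -diffs])
    y = np.concatenate([np.ones(len(diffs)), np.zeros(len(diffs))])
    clf = LogisticRegression(fit_intercept=False, max_iter=1000)
    clf.fit(X, y)
    return clf.coef_.ravel()              # one interpolation weight per metric

def combined_score(weights, metric_scores):
    """Linear interpolation of (already normalized) metric scores for one translation."""
    return float(np.dot(weights, metric_scores))
```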
3.3 Correlation Measures
In our experiments, we only considered translation into English (as we had a discourse parser for English only), and we used the data described in Table 1. For evaluation, we followed the standard set-up of the Metrics task of WMT12 (Callison-Burch et al., 2012). For segment-level evaluation, we used Kendall's τ (Kendall, 1938), which can be calculated directly from the human pairwise judgments. For system-level evaluation, we used Spearman's rank correlation (Spearman, 1904) and, in some cases, also the Pearson correlation (Pearson, 1895), which are appropriate here because at the system level we compare vectors of scores.
We measured the correlation of the evaluation metrics with the human judgments provided by the task organizers. As we explained earlier, the judgments represent rankings of the output of five systems chosen at random, for a particular sentence also chosen at random. From each of those rankings, we produce ten pairwise judgments (see Section 3.2). Then, using those pairwise human judgments, we evaluated the performance of the different MT evaluation metrics at the segment or at the system level.
3.3.1 Segment-Level Evaluation
The value of Kendall's τ ranges between −1 (all pairs are discordant) and 1 (all pairs are concordant), and negative values are worse than positive ones. Note that different sets of systems may be ranked for the different segments, but in the calculations we only use pairs of systems for which we have human judgments. Such direct judgments are available for a particular language pair and for a particular segment. We do not calculate Kendall's τ for each language pair; instead, we consider all pairwise judgments as part of a single set (as implemented in the official WMT scripts).
In the original Kendall's τ (Kendall, 1938), comparisons with human or metric ties are considered neither concordant nor discordant. In the experiments in Section 4, we used the official scorers from the WMT Metrics tasks to compute Kendall's τ. More precisely, in Sections 4.1 and 4.2 we use the WMT12 version of Kendall's τ (Callison-Burch et al., 2012), whereas in Section 4.3 we report results using the WMT14 scorer (Macháček and Bojar, 2014). We used these two different versions of the software to allow a direct comparison to the official results that were reported for the metrics task in WMT12 (Callison-Burch et al., 2012) and WMT14 (Macháček and Bojar, 2014).
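The following is a simplified Python sketch of segment-level Kendall's τ computed from pairwise human judgments. The exact treatment of metric ties differs between the scorer versions (the WMT12 scorer penalizes metric ties, as noted in Section 4.2); in this sketch metric ties are simply skipped, which is a simplifying assumption.

```python
def kendall_tau(human_pairs, metric_score):
    """human_pairs: iterable of (segment_id, better_sys, worse_sys);
    metric_score(segment_id, system) -> float."""
    concordant = discordant = 0
    for seg, better, worse in human_pairs:
        diff = metric_score(seg, better) - metric_score(seg, worse)
        if diff > 0:
            concordant += 1
        elif diff < 0:
            discordant += 1
        # diff == 0: a metric tie, ignored in this simplified version
    total = concordant + discordant
    return (concordant - discordant) / total if total else 0.0
```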
3.3.2 System-Level Evaluation
For the correlation at the system level, we first produce a score for each of the systems according to the quality of their translations based on the evaluation metrics and on the human judgments. Then, we calculate the correlation between the scores for the participating systems using a target metric's scores and the human scores. We do this based on system ranks induced by the scores (using Spearman's rank correlation) or based on the scores themselves (using Pearson correlation). Note that, following WMT, we calculate the correlation score separately for each language pair, and then we average the resulting correlations to obtain the final score.
The Pearson correlation value ranges between −1 and 1, where higher absolute score is better. We used the official WMT14 scoring tool to calculate it.
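A hedged sketch of this system-level procedure follows; the use of the mean segment-level score as the system score and the dictionary-based data layout are simplifying assumptions (individual metrics may define their own system-level aggregation).

```python
import numpy as np
from scipy.stats import spearmanr, pearsonr

def system_level_correlation(metric_scores, human_scores, use_spearman=True):
    """metric_scores, human_scores: dicts mapping lang_pair -> {system: score}.
    Returns the correlation averaged over language pairs, following the WMT set-up."""
    correlations = []
    for lang_pair, human in human_scores.items():
        systems = sorted(human)
        m = np.array([metric_scores[lang_pair][s] for s in systems])
        h = np.array([human[s] for s in systems])
        if use_spearman:                       # correlation of system ranks
            corr = spearmanr(m, h).correlation
        else:                                  # correlation of the scores themselves
            corr = pearsonr(m, h)[0]
        correlations.append(corr)
    return float(np.mean(correlations))
```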
4. Evaluation of the Discourse-Based Metrics
In this section, we show the utility of discourse information for machine translation evaluation. We present the evaluation results at the system level and at the segment level, using our two basic discourse-based metrics, which we refer to as DR and DR-lex (Section 2.2). In our experiments, we combine DR and DR-lex with other evaluation metrics in two different ways: using uniform linear interpolation (at the system level and at the segment level), and using a tuned linear interpolation at the segment level. We only present the average results over all language pairs. For clarity, in our tables we show results divided into three evaluation groups:
Group I contains our discourse-based evaluation metrics, DR, and DR-lex.
Group II includes the publicly available MT evaluation metrics that participated in the WMT12 metrics task, excluding those that did not have results for all language pairs (Callison-Burch et al., 2012). More precisely, they are spede07pP, AMBER, Meteor, TerrorCat, SIMPBLEU, XEnErrCats, WordBlockEC, BlockErrCats, and posF.
Group III contains other important individual evaluation metrics that are commonly used in MT evaluation: BLEU (Papineni et al., 2002), NIST (Doddington, 2002), Rouge (Lin, 2004), and TER (Snover et al., 2006). We calculated the metrics in this group using Asiya. In particular, we used the following Asiya versions of TER and Rouge: TERp-A and ROUGE-W.
For each metric in groups II and III, we present the system-level and segment-level results for the original metric as well as for the linear interpolation of that metric with DR and with DR-lex. The combinations with DR and DR-lex that improve over the original metrics are shown in bold, and those that yield degradation are in italic.
For the segment-level evaluation, we further indicate which interpolated results yield statistically significant improvement over the original metric. Note that testing statistical significance is not trivial in our case because we have a complex correlation score for which the assumptions that standard tests make are not met. We thus resorted to a non-parametric randomization framework (Yeh, 2000), which is commonly used in NLP research.
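For concreteness, here is a hedged sketch of a paired approximate randomization test over the pairwise judgments; the number of trials, the one-sided formulation, and the representation of each judgment as a ±1 concordance flag are our own simplifications.

```python
import random

def randomization_test(prefs_a, prefs_b, n_trials=10000, seed=0):
    """prefs_a, prefs_b: lists of +1 (concordant) / -1 (discordant), one entry per
    human-judged pair, for metric A (the combination) and metric B (the original)."""
    rng = random.Random(seed)

    def tau(prefs):
        return sum(prefs) / len(prefs)         # (concordant - discordant) / total

    observed = tau(prefs_a) - tau(prefs_b)
    hits = 0
    for _ in range(n_trials):
        shuffled_a, shuffled_b = [], []
        for a, b in zip(prefs_a, prefs_b):
            if rng.random() < 0.5:             # randomly swap the two metrics' outcomes
                a, b = b, a
            shuffled_a.append(a)
            shuffled_b.append(b)
        if tau(shuffled_a) - tau(shuffled_b) >= observed:
            hits += 1
    return (hits + 1) / (n_trials + 1)         # one-sided p-value with smoothing
```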
4.1 System-Level Results
Table 2 shows the system-level experimental results for WMT12. We can see that DR is already competitive by itself: On average, it has a correlation of 0.807, which is very close to the BLEU and the TER scores from group II (0.810 and 0.812, respectively). Moreover, DR yields improvements when combined with 13 of the 15 metrics, with a resulting correlation higher than those of the two individual metrics being combined. This fact suggests that DR contains information that is complementary to that used by most of the other metrics.
Table 2: Results on WMT12 at the system level (calculated on 6 systems for cs-en, 16 for de-en, 12 for es-en, and 15 for fr-en). Spearman's correlation with human judgments.
| Group | Metric | Orig. | +DR | +DR-lex |
|---|---|---|---|---|
| I | DR | 0.807 | – | – |
| I | DR-lex | 0.876 | – | – |
| II | SEMPOS | 0.902 | 0.853 | 0.903 |
| II | AMBER | 0.857 | 0.829 | 0.869 |
| II | Meteor | 0.834 | 0.861 | 0.888 |
| II | TerrorCat | 0.831 | 0.854 | 0.889 |
| II | SIMPBLEU | 0.823 | 0.826 | 0.859 |
| II | TER | 0.812 | 0.836 | 0.848 |
| II | BLEU | 0.810 | 0.830 | 0.846 |
| II | posF | 0.754 | 0.841 | 0.857 |
| II | BlockErrCats | 0.751 | 0.859 | 0.855 |
| II | WordBlockEC | 0.738 | 0.822 | 0.843 |
| II | XEnErrCats | 0.735 | 0.819 | 0.843 |
| III | BLEU | 0.791 | 0.880 | 0.859 |
| III | NIST | 0.817 | 0.842 | 0.875 |
| III | Rouge | 0.884 | 0.899 | 0.869 |
| III | TER | 0.908 | 0.926 | 0.920 |
As expected, DR-lex performs better than DR because it is lexicalized (at the unigram level) and also gives partial credit to correct structures. Individually, DR-lex outperforms most of the metrics from group II, and would rank as the second best metric in that group. Furthermore, when combined with individual metrics, DR-lex is able to improve 14 out of the 15 metrics. Averaging over all metrics in the table, the combination with DR improves the average correlation of the individual metrics from 0.816 to 0.852 (+0.035), and DR-lex further improves it to 0.868 (+0.052).
Thus, we can conclude that at the system level, adding discourse information to a metric, even using the simplest of the combination schemes, is a good idea for most of the metrics.
4.2 Segment-Level Results
Table 3 shows the results for WMT12 at the segment level. We can see that DR performs badly, with a strongly negative Kendall's τ of −0.433. This should not be surprising because (i) the discourse tree structure alone does not contain enough information for a good evaluation at the segment level, and (ii) this metric is more sensitive to the quality of the automatically produced discourse tree, which can be wrong or void. Moreover, DR is more likely to produce a high number of ties, which is harshly penalized by WMT12's definition of Kendall's τ. Conversely, ties and incomplete discourse analysis were not a problem at the system level, where evidence from all 3,003 test sentences is aggregated, allowing us to rank systems more precisely. Because of the low score of DR as an individual metric, it fails to yield improvements when uniformly combined with other metrics (see the Untuned +DR column in Table 3).
Table 3: Results on WMT12 at the segment level (calculated on 11,021 pairs for cs-en, 11,934 for de-en, 9,796 for es-en, and 11,594 for fr-en): untuned and tuned versions. Kendall's τ with human judgments. Improvements over the baseline are shown in bold, and statistically significant improvements are marked with ** and * for p-value < 0.01 and p-value < 0.05, respectively.
| Group | Metric | Orig. | Untuned +DR | Untuned +DR-lex | Tuned +DR | Tuned +DR-lex |
|---|---|---|---|---|---|---|
| I | DR | −0.433 | – | – | – | – |
| I | DR-lex | 0.133 | – | – | – | – |
| II | spede07pP | 0.254 | 0.190 | 0.223 | 0.253 | 0.254 |
| II | Meteor | 0.247 | 0.178 | 0.217 | 0.250 | 0.251 |
| II | AMBER | 0.229 | 0.180 | 0.216 | 0.230 | 0.232 |
| II | SIMPBLEU | 0.172 | 0.141 | 0.191** | 0.181** | 0.199** |
| II | XEnErrCats | 0.165 | 0.132 | 0.185** | 0.175** | 0.194** |
| II | posF | 0.154 | 0.125 | 0.201** | 0.160** | 0.201** |
| II | WordBlockEC | 0.153 | 0.122 | 0.181** | 0.161** | 0.189** |
| II | BlockErrCats | 0.074 | 0.068 | 0.151** | 0.087** | 0.150** |
| II | TerrorCat | −0.186 | −0.111 | −0.104** | 0.181** | 0.196** |
| III | BLEU | 0.185 | 0.154 | 0.190 | 0.189 | 0.194* |
| III | NIST | 0.214 | 0.172 | 0.206 | 0.222** | 0.224** |
| III | Rouge | 0.185 | 0.144 | 0.201** | 0.196** | 0.218** |
| III | TER | 0.217 | 0.179 | 0.229** | 0.229** | 0.246** |
Again, DR-lex is better than DR, with a positive τ of 0.133; yet, as an individual metric, it ranks poorly compared with the other metrics in groups II and III. However, when uniformly combined with other metrics (see the Untuned +DR-lex column), DR-lex outperforms 9 of the 13 metrics in Table 3, with statistically significant improvements in 8 of these cases (p-value < 0.01).
Following the learning method described in Section 3.2, we experimented also with tuning the interpolation weights in the metric combinations. We report results for (i) cross-validation on WMT12, and (ii) tuning on WMT12 and testing on WMT11.
Cross-validation on WMT12. For cross-validation on WMT12, we used ten folds of approximately equal sizes, each containing about 300 sentences; we constructed the folds by putting together entire documents, thus not allowing sentences from a document to be split over two different folds. During each cross-validation run, we trained our pairwise ranker using the human judgments corresponding to nine of the ten folds. We then used the remaining fold for evaluation. Note that in this process, we aggregated the data for different language pairs, and we produced a single set of tuning weights for all language pairs.
The results are shown in the last two columns of Table 3 (Tuned). We can see that the tuned combinations with DR-lex improve over all but one of the individual metrics in groups II and III, with statistically significant differences in 10 of those 12 cases. Even more interestingly, the tuned combinations that include the much weaker metric DR now improve over 12 of the 13 individual metrics, with 9 of these differences being statistically significant with p-value < 0.01. This is remarkable given that DR has a strongly negative τ as an individual metric at the sentence level. Again, these results suggest that both DR and DR-lex contain information that is complementary to that of the individual metrics that we experimented with.
Averaging over all 13 cases, DR improves Kendall's τ from 0.159 to 0.193 (+0.035), and DR-lex improves it to 0.211 (+0.053). These sizable improvements highlight the importance of tuning the linear combination when working at the segment level.
Testing on WMT11. To rule out the possibility that the improvement of the tuned metrics on WMT12 could have come from over-fitting, and also in order to verify that the tuned metrics generalize when applied to other sentences, we also tested on an additional data set: WMT11. We tuned the weights for our metric combinations on all WMT12 pairwise judgments (no cross-validation), and we evaluated them on the WMT11 data set. Because the metrics that participated in WMT11 and WMT12 are different (and even when they have the same name, there is no guarantee that they have not changed from 2011 to 2012), this time we only report results for the standard group III metrics, thus ensuring that the metrics in the experiments are consistent for 2011 and 2012.
The results, presented in Table 4, show the same pattern as before: (i) adding DR or DR-lex improves over all individual metrics, with the differences being statistically significant in seven out of the eight cases with p-value < 0.01; and (ii) the contribution of DR-lex is consistently larger than that of DR. Observe that these improvements are very close to those for the WMT12 cross-validation. This shows that the weights learned on WMT12 generalize well, as they are also good for WMT11.
Table 4: Results on WMT11 at the segment level (calculated on 3,695 pairs for cs-en, 8,950 for de-en, 5,974 for es-en, and 6,337 for fr-en): tuning on the entire WMT12 data set. Kendall's τ with human judgments. Improvements over the baseline are shown in bold, and statistically significant improvements are marked with ** for p-value < 0.01.
| Group | Metric | Orig. | Tuned +DR | Tuned +DR-lex |
|---|---|---|---|---|
| I | DR | −0.447 | – | – |
| I | DR-lex | 0.146 | – | – |
| III | BLEU | 0.186 | 0.192 | 0.207** |
| III | NIST | 0.219 | 0.226** | 0.232** |
| III | Rouge | 0.205 | 0.218** | 0.242** |
| III | TER | 0.262 | 0.274** | 0.296** |
4.3 DR-Based Metrics in a Strong MT Evaluation Measure
From the results presented in the previous sections, we can conclude that discourse structure is an important information source, which is not entirely correlated to other information sources considered so far, and thus should be taken into account when designing future metrics for automatic evaluation of machine translation output. In this section we show how the simple combination of DR-based metrics with a selection of other existing strong MT evaluation metrics can lead to a very competitive evaluation metric, DiscoTKparty (Joty et al., 2014), which we presented at the metrics task of WMT14 (Macháček and Bojar, 2014).
Asiya (Giménez and Màrquez, 2010a) is a suite for MT evaluation that provides a large set of metrics using different levels of linguistic information. We used the 12 individual metrics from Asiya's ULC (Giménez and Màrquez, 2010b), which was the best performing metric both at the system level and at the segment level at the WMT08 and WMT09 metrics tasks. From the original ULC, we replaced Meteor by the four newer variants Meteor-ex (exact match), Meteor-st (+stemming), Meteor-sy (+synonymy lookup), and Meteor-pa (+paraphrasing) in Asiya's terminology (Denkowski and Lavie, 2011). We also added to the mix TERp-A (a variant of TER with paraphrasing), BLEU, NIST, and Rouge-W, for a total of 18 individual metrics. The metrics in this set use diverse linguistic information, including lexical-, syntactic-, and semantic-oriented individual metrics.
Regarding the discourse metrics, we used five variants, including DR and DR-lex described in Section 2, and three more constrained variants oriented to match words between trees only if they occur under the same substructure types (e.g., the same nuclearity type). These variants are designed by introducing structural modifications in the discourse trees. A detailed description can be found in Joty et al. (2014).
We tuned the relative weights of the previous 23 individual metrics (18 Asiya + 5 discourse) following the same maximum entropy learning framework described in Section 3.2. As the training set, we used the simple concatenation of WMT11, WMT12, and WMT13.
DiscoTKparty was the best-performing metric at WMT14, both at the segment and at the system level, among a set of 16 and 20 participants, respectively (Macháček and Bojar, 2014). Table 5 shows a comparison at the segment level of our tuned metric DiscoTKparty to the best rivaling metric at WMT14, for each individual language pair, using Kendall's τ. Note that this best rival differs across language pairs: for fr-en, hi-en, and cs-en it is BEER, for de-en it is UPC-STOUT, and for ru-en it is REDcombSent. We can see that our metric outperforms the best rival for four of the language pairs, with statistically significant differences. The only exception is hi-en, where the best rival performs slightly, but not statistically significantly, better.
Table 5: Comparison of our tuned metric to the best rivaling metric at WMT14, for each individual language pair (this best rival differs across language pairs), at the segment level using Kendall's τ. Statistically significant improvements are marked with ** for p-value < 0.01.
| System | fr-en | de-en | hi-en | cs-en | ru-en | Overall |
|---|---|---|---|---|---|---|
| DiscoTKparty | 0.433** | 0.380** | 0.434 | 0.328** | 0.355** | 0.386** |
| Best at WMT14 | 0.417 | 0.345 | 0.438 | 0.284 | 0.336 | 0.364 |
| Δ | +0.016 | +0.035 | −0.004 | +0.044 | +0.019 | +0.024 |
System translations for Hindi–English were of extremely low quality and were very hard to discourse-parse accurately. The linguistically heavy components of DiscoTKparty (discourse parsing, syntactic parsing, semantic role labeling, etc.) may suffer from the common ungrammaticality of the translation hypotheses for hi-en, whereas other, less linguistically heavy metrics seem to be more robust in such cases.
We show in Figure 5 the weights for the individual metrics combined in DiscoTKparty after tuning on the combined WMT11+12+13 data set. The horizontal axis displays all the individual metrics involved in the combination. The first block of metrics (from BLEU to DR-Orp*) consists of the 18 Asiya metrics. The last five (from DR-nolex to DR-lex1.1) metrics are the metric variants based on discourse trees. Note that all metric scores are passed through a min-max normalization step to put them in the same scale before tuning their relative weights.
Figure 5: Absolute coefficient values after tuning the DiscoTKparty metric on the WMT11+12+13 data set.
We can see that most of the metrics involved in the metric combination play a significant role, the most important ones being TERp-A, METEOR-pa (paraphrases), and ROUGE-W. Some metrics accounting for syntactic and semantic information also get assigned relatively high weights (DP-Or*, CP-STM-4, and DR-Orp*). Interestingly, all five variants of our discourse metric received moderately high weights, with the four variants using lexical information (the DR-lex variants) being more important. In particular, DR-lex1 has the fourth highest absolute weight in the overall combination. This confirms again the importance of discourse information in machine translation evaluation.
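For reference, the min-max normalization and the final weighted combination mentioned above can be sketched as follows; the matrix layout (one column per individual metric) is an illustrative assumption.

```python
import numpy as np

def minmax_normalize(scores):
    """Rescale each metric's scores to [0, 1] over the data set.
    scores: array of shape (n_segments, n_metrics)."""
    lo = scores.min(axis=0)
    hi = scores.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)     # avoid division by zero for constant metrics
    return (scores - lo) / span

def combined_metric(weights, normalized_scores):
    """Weighted linear combination of the normalized individual metric scores."""
    return normalized_scores @ weights         # shape (n_segments,)
```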
5. Analysis
When dealing with evaluation metrics based on lexical matching, such as BLEU or NIST, it is easier to understand how and why they work, and what their limitations are. However, if a metric deals with complex structures like discourse trees, it is not straightforward to explain its performance.
In this section, we aim to better understand which parts of the discourse trees have the biggest impact on the performance of the discourse-based measures presented in Section 2. For that purpose, we first conduct an ablation study (see Section 5.1), where we dissect the different components of the discourse trees, and we analyze the impact that the deletion of such components has on the performance of our evaluation metrics. In a second study (see Section 5.2), we analyze which parts of a complete discourse tree are most useful to distinguish between good and bad translations. Overall, the components that we focus on in our analysis are the following: (i) Discourse relations (Elaboration, Attribution, etc.); (ii) Nuclearity statuses (i.e., Nucleus and Satellite); and (iii) Discourse structure (boundaries of the elementary discourse units, depth of the tree, etc.).
The previous two studies focus on quantitative aspects of the discourse trees. Section 5.3 discusses one real example to understand from a more qualitative point of view the contribution of the sentence-level discourse trees in the evaluation of good and bad translations. Finally, in Section 5.4, we discuss the issue of whether discourse trees provide information that is complementary to syntax.
5.1 Ablation Study at the System Level
We analyze the performance of our discourse-based metric DR-lex at the system level. We use DR-lex instead of DR, as it exhibits the most competitive performance, and incorporates both lexical and discourse information. We selected system-level evaluation because the metric is much more stable and accurate at the system level than at the segment level.
In our ablation experiments, we contrast the original DR-lex metric, computed over full RST trees, with variants in which the discourse trees carry less information. When removing a particular element, we replace the corresponding labels with a dummy tag (*). We have the following ablation conditions, which are illustrated in Figure 6 (a code sketch of these transformations is given after the figure):
1. Full: Original DR-lex metric with the full (labeled) RST tree structure.
2. No discourse relations: We replace all relation labels (Attribution, Elaboration, etc.) in the tree by a dummy tag.
3. No nuclearity: We replace all nuclearity statuses (i.e., Nucleus, Satellite) by a dummy tag.
4. No relation and no nuclearity tags: We replace both the relation and the nuclearity labels by dummy tags. This leaves the discourse structure (i.e., the skeleton of the tree) along with the lexical items.
5. No discourse structure: We remove all the discourse structure and only leave the lexical information. Under this representation, the evaluation metric corresponds to unigram lexical matching.
Figure 6: Different discourse trees for the same example translation, "to better link the creation of content for all the titles published," with a decreasing amount of discourse information. The five representations correspond to the ones used in the ablation study.
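Here is the code sketch of the ablation transformations referred to above; it assumes the DR-lex-style (label, children) representation sketched in Section 2.2, and the listed relation labels are only an illustrative subset of the 18 relations.

```python
RELATION_LABELS = {"ELABORATION", "ATTRIBUTION", "JOINT", "CONTRAST", "SAME-UNIT"}  # subset, illustrative
NUCLEARITY_LABELS = {"NUCLEUS", "SATELLITE"}

def ablate(node, drop_relations=False, drop_nuclearity=False):
    """Return a copy of the tree with the selected label types replaced by '*'."""
    label, children = node
    if drop_relations and label in RELATION_LABELS:
        label = "*"
    if drop_nuclearity and label in NUCLEARITY_LABELS:
        label = "*"
    return (label, tuple(ablate(c, drop_relations, drop_nuclearity) for c in children))

def lexical_only(node):
    """The 'no discourse structure' condition: keep only the words (leaves under NGRAM)."""
    label, children = node
    if label == "NGRAM":
        return [child[0] for child in children]
    return [w for c in children for w in lexical_only(c)]

# full            = tree
# no_relations    = ablate(tree, drop_relations=True)
# no_nuclearity   = ablate(tree, drop_nuclearity=True)
# skeleton_only   = ablate(tree, drop_relations=True, drop_nuclearity=True)
# words_only      = lexical_only(tree)
```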
We scored all modified trees using the same tree kernel that we used in DR-lex, and we evaluated the resulting system rankings accordingly. The summarized system-level results for WMT11–13 are shown in Table 6, where we used all into-English language pairs.
Table 6: System-level Spearman (ρ) and Pearson (r) correlation results for the ablation study over the DR-lex metric across the WMT{11,12,13} data sets and overall.
| RST variant | 2011 ρ | 2011 r | 2012 ρ | 2012 r | 2013 ρ | 2013 r | Overall ρ | Overall r |
|---|---|---|---|---|---|---|---|---|
| full | 0.848 | 0.860 | 0.876 | 0.912 | 0.920 | 0.919 | 0.881 | 0.897 |
| no_rel | 0.843 | 0.856 | 0.876 | 0.909 | 0.919 | 0.919 | 0.879 | 0.895 |
| no_nuc | 0.822 | 0.828 | 0.867 | 0.896 | 0.910 | 0.914 | 0.866 | 0.879 |
| no_nuc & no_rel | 0.815 | 0.826 | 0.847 | 0.891 | 0.915 | 0.913 | 0.859 | 0.877 |
| no_discourse | 0.794 | 0.798 | 0.865 | 0.863 | 0.887 | 0.903 | 0.849 | 0.855 |
We can see a clear pattern in Table 6. Starting from the lexical matching (no_discourse), each layer of discourse information helps to improve performance, even if just a little bit. Overall, we observe a cumulative gain from 0.849 to 0.881 in terms of Spearman's ρ. Having only the discourse structure (no_nuc & no_rel) improves the performance over using lexical items only. This means that identifying the boundaries of the discourse units in the translations (i.e., which lexical items correspond to which EDU), and how those units should be linked, already can tell us something about the quality of the translation. Next, by adding nuclearity information (no_rel), we observe further improvement. This means that knowing which discourse unit is the main one and which one is subordinate is helpful for assessing the quality of the translation. Finally, using the discourse structure, the nuclearity, and the relations together yields the best overall performance. The differences are not very large, but the tendency is consistent across data sets.
Interestingly, the nuclearity status is more important than the relation type: removing the relation labels (no_rel) yields only a tiny decrease in performance, whereas removing the nuclearity statuses (no_nuc) causes a much larger drop. Although this might seem counterintuitive at first (because we tend to think that knowing the type of discourse relation is important), it can be attributed to the difficulty of discourse-parsing machine-translated text. As we will observe in the next section, assigning the correct relation can be a much harder problem than predicting the nuclearity statuses. Thus, parsing errors might be undermining the effectiveness of the discourse relation information.
Table 7 presents the results of the same ablation study, but this time broken down per language pair. For each language pair, all years are considered (2011–2013). Overall, we observe the same pattern as in Table 6, namely, that all layers of discourse information help to improve the results, and that the nuclearity information is more important than the discourse relation types.
Table 7: System-level Spearman (ρ) and Pearson (r) correlation results for the ablation study over the DR-lex metric across language pairs for the WMT{11,12,13} data sets.
| RST variant | cs-en ρ | cs-en r | de-en ρ | de-en r | es-en ρ | es-en r | fr-en ρ | fr-en r |
|---|---|---|---|---|---|---|---|---|
| full | 0.890 | 0.893 | 0.782 | 0.840 | 0.970 | 0.952 | 0.943 | 0.940 |
| no_rel | 0.894 | 0.892 | 0.775 | 0.833 | 0.968 | 0.952 | 0.942 | 0.939 |
| no_nuc | 0.899 | 0.885 | 0.739 | 0.802 | 0.958 | 0.949 | 0.935 | 0.925 |
| no_nuc & no_rel | 0.895 | 0.884 | 0.720 | 0.798 | 0.935 | 0.950 | 0.940 | 0.917 |
| no_discourse | 0.833 | 0.861 | 0.738 | 0.743 | 0.942 | 0.930 | 0.936 | 0.919 |
However, some differences are observed depending on the language pair. For example, Spanish–English exhibits larger improvements (ρ goes from 0.942 to 0.970) than French–English (ρ goes from 0.936 to 0.943), even though both language pairs are mature in terms of the expected quality of these systems. On another axis, German–English shows much lower overall correlation than Spanish–English (0.782 vs. 0.970). This may be an effect of the inherent difficulty of this language pair, which requires long-distance reordering, among other challenges. However, note that adding all the discourse layers increases ρ from 0.738 to 0.782. These observations are consistent with our findings in the next section, where we explore the different parts of the discourse trees at a more fine-grained level.
5.2 Discriminating Between Good and Bad Translations
In the previous section, we analyzed how different parts of the discourse tree contribute to the performance of the DR-lex metric. In this section, we take a different approach: We investigate whether the information contained in the discourse trees helps to differentiate good from bad translations.
In order to do so, we analyze the discourse trees generated for three groups of translations: (i) gold, the reference translations; (ii) good, the translations of the two best systems (per language pair); and (iii) bad, the translations of the two worst systems (per language pair). Our hypothesis is that the good-translation discourse trees have characteristics that make them more similar to the gold-translation trees than the bad-translation trees are. The characteristics we analyze here are the following: relation labels, nuclearity labels, tree depth, and number of words. We perform the analysis at the sentence level, by comparing the trees of the gold, good, and bad translations.
5.2.1 Discourse Relations
There are 18 discourse relation labels in our RST parser. We separately computed the label frequency distributions from the RST trees of all gold, good, and bad translation hypotheses. Figure 7 shows the histogram for the ten most frequent classes on the Spanish–English portion of the WMT12 data set. We can see that there are clear differences between the good and the bad distributions, especially in the frequencies of the most common tags (Elaboration and Same-Unit). The good hypotheses have a distribution that is much closer to the human references (gold). For example, the frequency difference for the Elaboration tag between good and gold translation trees is 58, which is smaller than the difference between bad and gold, 323. In other words, the trees for bad translations exhibit a surplus of Elaboration tags.
Figure 7: Distribution of discourse relations for gold, good, and bad automatic translations on WMT12 Spanish–English. We show the ten most frequent discourse relations only.
If we compare the entire frequency distribution across relations for the whole WMT12 data set, we observe that the Kullback–Leibler (KL) divergence (Kullback and Leibler, 1951) between the good and the gold distributions is also smaller than the KL divergence between the bad and the gold ones: 0.0021 vs. 0.0039; a similar tendency holds for WMT13. In other words, good translations have discourse trees whose relation tags better match those of the gold translation trees. This suggests that the relation tags should be an important part of the discourse metric.
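As an illustration of this comparison, here is a minimal sketch that turns relation-label counts into probability distributions and computes the KL divergence. The label counts and the smoothing constant are invented for the example (only the Elaboration differences of 58 and 323 mirror the figures mentioned above); they are not the actual WMT12 statistics.

```python
from collections import Counter
from math import log

# A small sketch of the distribution comparison described above: collect the
# relation labels from a set of RST trees, turn the counts into probability
# distributions, and compute KL(P || Q). The counts below are invented.

def kl_divergence(p_counts, q_counts, smoothing=1e-6):
    """KL(P || Q) over the union of observed relation labels, with smoothing."""
    labels = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values()) + smoothing * len(labels)
    q_total = sum(q_counts.values()) + smoothing * len(labels)
    kl = 0.0
    for label in labels:
        p = (p_counts.get(label, 0) + smoothing) / p_total
        q = (q_counts.get(label, 0) + smoothing) / q_total
        kl += p * log(p / q)
    return kl


if __name__ == "__main__":
    gold = Counter({"Elaboration": 900, "Same-Unit": 400, "Attribution": 250})
    good = Counter({"Elaboration": 958, "Same-Unit": 390, "Attribution": 240})
    bad = Counter({"Elaboration": 1223, "Same-Unit": 310, "Attribution": 210})
    # The good distribution should be closer to gold than the bad one is.
    print(kl_divergence(good, gold), kl_divergence(bad, gold))
```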
In a second step, we computed the micro-averaged F1 score for each relation label, taking the gold translation discourse trees as a reference. Note that computing standard parsing quality metrics over constituents (e.g., F1 score over the constituents) would require the leaves of the two trees to be the same. In our case, we work with two different translations (one gold and one MT-generated), which makes their RST trees not directly comparable. Therefore, we apply an approximation: we measure F1 score over the total number of instances of a specific tag, regardless of their position in the tree, and we consider every instance of a predicted tag to be correct if there is a corresponding tag of the same type in the gold tree. Effectively, this makes the number of true positives for a specific tag equal to the minimum of the number of instances of that tag in the hypothesis and in the gold tree. Although this is a simplification, it gives us an idea of how closely the RST trees for good/bad translations approximate the trees for the references.
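A compact sketch of this count-based approximation is shown below, with a per-tag F1 and a micro-averaged F1 over a set of tags; the relation counts in the example are invented.

```python
from collections import Counter

# A minimal sketch of the simplified, position-independent F1 described above:
# for each relation tag, the number of true positives is the minimum of its
# counts in the hypothesis tree and in the gold tree. The example counts are invented.

def tag_f1(tag, hyp_counts, gold_counts):
    """Simplified F1 for one relation tag: TP = min of the two counts."""
    tp = min(hyp_counts.get(tag, 0), gold_counts.get(tag, 0))
    p = tp / hyp_counts.get(tag, 0) if hyp_counts.get(tag, 0) else 0.0
    r = tp / gold_counts.get(tag, 0) if gold_counts.get(tag, 0) else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0


def micro_f1(tags, hyp_counts, gold_counts):
    """Micro-averaged F1 over a set of tags (e.g., the top-five relations)."""
    tp = sum(min(hyp_counts.get(t, 0), gold_counts.get(t, 0)) for t in tags)
    predicted = sum(hyp_counts.get(t, 0) for t in tags)
    actual = sum(gold_counts.get(t, 0) for t in tags)
    p = tp / predicted if predicted else 0.0
    r = tp / actual if actual else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0


if __name__ == "__main__":
    gold = Counter({"Elaboration": 3, "Attribution": 1, "Joint": 1})
    good_hyp = Counter({"Elaboration": 3, "Attribution": 1})
    bad_hyp = Counter({"Elaboration": 5, "Joint": 2})
    top_tags = ["Elaboration", "Attribution", "Joint"]
    print(tag_f1("Elaboration", good_hyp, gold), tag_f1("Elaboration", bad_hyp, gold))
    print(micro_f1(top_tags, good_hyp, gold), micro_f1(top_tags, bad_hyp, gold))
```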
The results for the five most prevalent relations are shown in Figure 8. We can see systematically higher F1 scores for good-translation trees compared with bad ones, across all relations and all corpora. This supports our hypothesis at the discourse relation level; that is, the discourse trees for good translations contain relation labels that are more similar to those of the reference translation trees. Note, however, that the F1 scores vary across relations and are not very high (the highest is around 70%), indicating that the relations are hard to predict.
Figure 8: F1 score for each of the top-five relations in good- vs. bad-translation trees across the WMT{11,12,13} data sets.
Figure 9 contains the same information, but broken down by language pair. For each language pair and corpus year, we micro-average the results for the five most frequent discourse relations. Again, we observe a clear advantage for the good-translation trees over the bad ones for all language pairs and for all years. Some differences are observed across language pairs, which do not always have an intuitive explanation in terms of the difficulty of the language pair.13 For instance, larger gaps are observed for es-en and de-en than for the rest. This correlates well with the results in Table 7, clearly connecting discourse similarity with the quality of the evaluation metric.
Figure 9: F1 score for each language pair in good- vs. bad-translation trees across the WMT{11,12,13} data sets, micro-averaging the scores of the top-five relations.
5.2.2 Nuclearity and Other Tree Information
Nuclearity describes the role of a discourse unit within a relation, which can be central (Nucleus) or supportive (Satellite). Here, we study the distribution of these labels together with two further elements of the trees: the EDUs and the depth of the discourse tree (Depth). The results are shown in Figure 10. For the number of Nucleus, Satellite, and EDU labels, we compute the simplified F1 scores in the same way as for the relation labels, focusing on the number of instances. For the tree Depth, we compute the micro-averaged root-mean-squared error (RMSE).
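The Depth comparison can be sketched as follows, assuming the same simplified nested-tuple tree encoding as in the earlier sketches; the toy trees are invented.

```python
from math import sqrt

# A small sketch of the Depth comparison: the depth of each hypothesis tree is
# compared with the depth of the corresponding reference tree, and the error is
# aggregated over all sentences with a root-mean-squared error (RMSE).

def depth(tree):
    """Depth of a nested-tuple tree; a leaf (string) has depth 0."""
    if not isinstance(tree, tuple):
        return 0
    return 1 + max(depth(child) for child in tree[1:])


def depth_rmse(hyp_trees, ref_trees):
    """RMSE between hypothesis and reference tree depths over a corpus."""
    errors = [(depth(h) - depth(r)) ** 2 for h, r in zip(hyp_trees, ref_trees)]
    return sqrt(sum(errors) / len(errors))


if __name__ == "__main__":
    ref = ("Elaboration", ("Nucleus", "EDU1"), ("Satellite", "EDU2"))
    flat = ("Joint", "EDU1", "EDU2")   # a flatter (typically worse) hypothesis
    print(depth_rmse([flat, ref], [ref, ref]))
```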
Figure 10: F1 scores for the nuclearity relations and EDUs (upper chart), and RMSE for Depth (lower chart) in good- vs. bad-translation trees, across the WMT{11,12,13} data sets.
As with the discourse relations, we observe better results (higher F1 or lower RMSE) for the nuclearity labels and the other tree elements of the good-translation trees than for the bad ones, consistently across all years. Note that the F1 values for nuclearity labels are considerably higher than those for discourse relations (now in the 0.78–0.82 interval, compared with average F1 scores below 0.60 for discourse relations). This helps to explain the larger impact of the nuclearity elements on the evaluation measure (see Tables 6 and 7). Predicting discourse segments is easier still (F1 values close to 0.89), and the EDU structure also contributes to improving the evaluation measure; this corresponds mainly to the no_nuc & no_rel case in the ablation study (again, see Tables 6 and 7).
Finally, Figure 11 shows the results by language pair: the micro-averaged F1 scores for the nuclearity labels and the EDUs (upper charts), and the RMSE for Depth (lower charts). Once again, the F1 and RMSE results for good translations are better than those for bad ones, sometimes by large margins; the only exception is Depth for fr-en (WMT13). Looking at the overall scores and at the size of the gaps between good and bad, we can see that they are consistent with the per-language results of Table 7, showing once again the direct relation between the matching of discourse elements and the correlation of the discourse-based DR-lex metric with human assessments.
Figure 11: Micro-averaged F1 scores for each language pair for nuclearity and EDU, and RMSE for Depth, in good- vs. bad-translation trees across the WMT{11,12,13} data sets.
The main conclusions from this analysis can be summarized as follows: (i) the similarity between discourse trees is a good predictor of translation quality, according to the human assessments; (ii) the different levels of discourse structure and relations provide different information, which makes a smooth, cumulative contribution to the final correlation score; and (iii) both discourse relations and nuclearity labels have a sizeable impact on the evaluation metric, the latter being more important than the former. The last point emphasizes the appropriateness of RST as a formalism for the discourse structure of texts. Contrary to other discourse theories (e.g., the Discourse Lexicalized Tree Adjoining Grammar [Webber, 2004] used to build the Penn Discourse Treebank [Prasad et al., 2008]), RST treats nuclearity as an important element of the discourse structure.
5.3 Qualitative Analysis of Good and Bad Translations
In the previous two sections, we provided a quantitative analysis of which discourse information has the biggest impact on the performance of our discourse-based measure (DR-lex), and of which parts of the discourse trees help in distinguishing good from bad translations. In this section, we present a qualitative analysis: we inspect a real example of a good vs. a bad translation and show how the discourse trees help to assign similarity scores that distinguish them.
Figure 12 shows a real example with discourse trees for a reference (a) and two alternative translations, one (b) being better than the other (c). The example is extracted from the WMT11 data set (cs-en), and the discourse trees are obtained with our automatic discourse parser. The discourse trees are presented in the unfolded format introduced in Figure 3(b).
Figure 12: Example of discourse trees for good and bad translations in comparison with the reference translation. The example is extracted from WMT11 (cs-en).
Translation 12(b) gets a DR-lex score of 0.88, which is higher than the 0.75 score for translation 12(c). Part of the difference is explained by the fact that translation 12(b) provides a better word-level translation, including complete EDU constituents (e.g., “is the greatest golf hole in Prague”). In addition, 12(b) obtains many more subtree matches with the reference at the level of the discourse structure: it has the same discourse structure and labels as the reference, with the sole exception of the topmost discourse relation (Joint vs. Attribution). This tendency is observed across the data sets, and it was quantitatively verified in Section 5.2 (i.e., good translations tend to share the tree structure and labels with the reference translations).
Translation 12(c), on the other hand, is much less grammatical. This leads to inaccurate parsing, producing a discourse tree that is flatter than the reference discourse tree and that has many more errors at the discourse relation level. Consequently, the tree kernel finds fewer subtree matches, and the similarity score is lower.
Note that the proposed kernel-based similarity assigns the same weight to every subtree match it encounters, so the metric cannot emphasize the features that are most important for distinguishing better from worse translations.14 Its success rests solely on the assumption (verified in Section 5.2) that better translations exhibit discourse trees that are closer to the reference. A natural next step would be to learn, with preference and convolutional kernels, which of these subtree structures (understood as implicit features) help to discriminate better from worse translations. This is the approach followed by Guzmán et al. (2014a), which is also discussed in Section 5.4.
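For concreteness, the following sketch implements a Collins–Duffy-style all-subtree kernel over the simplified nested-tuple trees used in the earlier sketches, with cosine-style normalization so that the similarity lies in [0, 1]. The decay factor, the tree encoding, and the toy trees are illustrative assumptions; the uniform weighting of subtree matches is precisely the property discussed above.

```python
from math import sqrt

# A minimal sketch of a Collins-Duffy-style all-subtree kernel over simplified
# RST trees: nested tuples (label, child1, child2, ...), with leaves as strings.
# The decay factor LAMBDA is an illustrative choice, not the article's setting.

LAMBDA = 0.5  # down-weights larger subtree matches


def nodes(tree):
    """Yield all internal nodes of a nested-tuple tree."""
    if isinstance(tree, tuple):
        yield tree
        for child in tree[1:]:
            yield from nodes(child)


def production(node):
    """A node's 'production': its label plus the sequence of child labels."""
    child_labels = tuple(c[0] if isinstance(c, tuple) else c for c in node[1:])
    return (node[0], child_labels)


def delta(n1, n2):
    """Weighted count of common subtrees rooted at n1 and n2."""
    if production(n1) != production(n2):
        return 0.0
    score = LAMBDA
    for c1, c2 in zip(n1[1:], n2[1:]):
        if isinstance(c1, tuple) and isinstance(c2, tuple):
            score *= 1.0 + delta(c1, c2)
    return score


def tree_kernel(t1, t2):
    """Unnormalized all-subtree kernel: sum of delta over all node pairs."""
    return sum(delta(n1, n2) for n1 in nodes(t1) for n2 in nodes(t2))


def similarity(t1, t2):
    """Normalized kernel in [0, 1], as commonly used for similarity scores."""
    k12 = tree_kernel(t1, t2)
    return k12 / sqrt(tree_kernel(t1, t1) * tree_kernel(t2, t2)) if k12 else 0.0


if __name__ == "__main__":
    # Toy trees: the hypothesis differs from the reference only at the root relation.
    ref = ("Elaboration", ("Nucleus", "EDU1"), ("Satellite", "EDU2"))
    hyp = ("Attribution", ("Nucleus", "EDU1"), ("Satellite", "EDU2"))
    print(similarity(ref, ref), similarity(ref, hyp))
```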
5.4 Does Discourse Provide Relevant Information Beyond Syntax?
Discourse parsing at the sentence level relies heavily on features extracted from the syntactic parse tree. One valid question is therefore whether the sentence-level discourse structure provides any information relevant for MT evaluation beyond the syntactic relations. Note that in Section 4.3 we combined up to 18 metrics with our discourse-based evaluation metrics. Three of them use dependency parsing features (DP-HWCM-c-4, DP-HWCM-r-4, and DP-Or*; cf. Figure 5) and a fourth one uses constituency parse trees (CP-STM-4; Figure 5).15 According to the interpolation weights, the contribution of these metrics is not negligible, but it seems to be lower than that of the DR metrics. Still, this is too indirect a way of approaching the comparison.
Our previous work (Guzmán et al., 2014a) helps to answer the question about the complementarity of the two sources of information in a more direct way. In that paper, we proposed a pairwise setting for learning MT evaluation metrics with preference tree kernels. The setting can incorporate syntactic and discourse information encapsulated in tree-based structures, and the objective is to learn to differentiate better from worse translations by using all subtree structures as implicit features. The discourse parser is the same as the one used in this article, while the syntactic tree is mainly constructed using the Illinois chunker (Punyakanok and Roth, 2001). The kernel used for learning is a preference kernel (Shen and Joshi, 2003; Moschitti, 2008), which decomposes into Partial Tree Kernel (Moschitti, 2006) applications between pairs of enriched tree structures. Word unigram matching is also included in the kernel computation, making the setting quite similar to DR-lex.
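A minimal sketch of the preference-kernel decomposition is given below; the base kernel passed as a parameter stands in for the Partial Tree Kernel over enriched syntactic and discourse structures used in that work, and the toy word-overlap kernel and example strings are invented.

```python
# A sketch of the preference-kernel idea for pairwise MT evaluation: a training
# example is a pair (better, worse) of hypotheses for the same reference, and
# the kernel between two such pairs decomposes into four base-kernel evaluations.

def preference_kernel(pair_a, pair_b, base_kernel):
    """PK((a1, a2), (b1, b2)) = K(a1, b1) + K(a2, b2) - K(a1, b2) - K(a2, b1)."""
    a1, a2 = pair_a
    b1, b2 = pair_b
    return (base_kernel(a1, b1) + base_kernel(a2, b2)
            - base_kernel(a1, b2) - base_kernel(a2, b1))


if __name__ == "__main__":
    # Toy bag-of-words kernel, standing in for a tree kernel over enriched
    # syntactic+discourse structures.
    toy_k = lambda x, y: float(len(set(x.split()) & set(y.split())))
    pair_a = ("the golf hole is difficult", "golf hole difficult very much")
    pair_b = ("the golf hole is long", "very long much golf")
    # A positive value indicates the two pairs order their members consistently.
    print(preference_kernel(pair_a, pair_b, toy_k))
```

In practice, such a kernel can be plugged into a kernel machine (e.g., an SVM with a precomputed kernel matrix) over pairs labeled +1 when the first hypothesis is the better one and -1 otherwise; this is a sketch of the general setting rather than the exact configuration of that work.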
Table 8 shows the results obtained on the same WMT12 data set using only discourse structures, only syntactic structures, or both together. The τ scores of the syntactic and the discourse variants are not very different (with a general advantage for syntax), but putting them together yields a sizeable improvement in correlation for all language pairs and overall. This is clear evidence that the discourse-based features provide additional information that is not captured by syntax.
Table 8: Kendall's τ segment-level correlation with human judgments on WMT12, obtained by the pairwise preference kernel learning. Results are presented for each language pair and overall.

| Structure | cs-en | de-en | es-en | fr-en | Overall |
|---|---|---|---|---|---|
| Syntax | 0.190 | 0.244 | 0.198 | 0.158 | 0.198 |
| Discourse | 0.176 | 0.235 | 0.166 | 0.160 | 0.184 |
| Syntax+Discourse | 0.210 | 0.251 | 0.240 | 0.223 | 0.231 |
6. Related Work
In this section we provide a brief overview of related work on discourse in MT (Section 6.1), followed by work on MT evaluation (Section 6.2). In the latter, we cover MT evaluation in general, and in the context of discourse analysis. We also discuss our previous work on using discourse for MT evaluation.
6.1 Discourse in Machine Translation
The earliest work on using discourse in machine translation that we are aware of dates back to 2000: Marcu, Carlson, and Watanabe (2000) proposed rewriting discourse trees for MT. However, this research direction was largely ignored by the research community as the idea was well ahead of its time: Note that it came even before the current standard phrase-based SMT model was envisaged (Koehn, Och, and Marcu, 2003).
Things have changed since then, and today there is a vibrant research community interested in using discourse for MT, which has started its own biennial Workshop on Discourse in Machine Translation, DiscoMT (Webber et al., 2013, 2015; Webber, Popescu-Belis, and Tiedemann, 2017). The 2015 edition also introduced a shared task on cross-lingual pronoun translation (Hardmeier et al., 2015), which had a continuation at WMT 2016 (Guillou et al., 2016) and is now also featured at DiscoMT 2017. These shared tasks aim to establish the state of the art and to create common data sets that will help future research in this area.
At this point, several discourse-related research problems have been explored in MT:
- consistency in translation (Carpuat, 2009; Carpuat and Simard, 2012; Ture, Oard, and Resnik, 2012; Guillou, 2013);
- lexical and grammatical cohesion and coherence (Tiedemann, 2010a, 2010b; Gong, Zhang, and Zhou, 2011; Hardmeier, Nivre, and Tiedemann, 2012; Voigt and Jurafsky, 2012; Wong and Kit, 2012; Ben et al., 2013; Xiong et al., 2013; Louis and Webber, 2014; Tu, Zhou, and Zong, 2014; Xiong, Zhang, and Wang, 2015);
- word sense disambiguation (Vickrey et al., 2005; Carpuat and Wu, 2007; Chan, Ng, and Chiang, 2007);
- anaphora resolution and pronoun translation (Hardmeier and Federico, 2010; Le Nagard and Koehn, 2010; Guillou, 2012; Popescu-Belis et al., 2012);
- handling discourse connectives (Pitler and Nenkova, 2009; Becher, 2011; Cartoni et al., 2011; Meyer, 2011; Meyer et al., 2012; Meyer and Popescu-Belis, 2012; Hajlaoui and Popescu-Belis, 2012; Popescu-Belis et al., 2012; Meyer and Poláková, 2013; Meyer and Webber, 2013; Li, Carpuat, and Nenkova, 2014; Steele, 2015);
- full discourse-enabled MT (Marcu, Carlson, and Watanabe, 2000; Tu, Zhou, and Zong, 2013).
6.2 Discourse in Machine Translation Evaluation
Despite the research interest, so far most attempts to incorporate discourse-related knowledge in MT have been only moderately successful, at best.16 A common argument is that current automatic evaluation metrics such as BLEU are inadequate to capture discourse-related aspects of translation quality (Hardmeier and Federico, 2010; Meyer et al., 2012; Meyer, 2014). Thus, there is consensus that discourse-informed MT evaluation metrics are needed in order to advance MT research.
The need to consider discourse phenomena in MT evaluation was also emphasized earlier by the Framework for Machine Translation Evaluation in ISLE (FEMTI) (Hovy, King, and Popescu-Belis, 2002), which defines quality models (i.e., desired MT system qualities and their metrics) based on the intended context of use.17 The suitability requirement for an MT system in FEMTI comprises discourse aspects such as readability, comprehensibility, coherence, and cohesion.
In Section 4, we have suggested some simple ways to create such metrics, and we have shown that they yield better correlation with human judgments. Indeed, using linguistic knowledge about discourse structure can improve existing MT evaluation metrics. Moreover, we have proposed a state-of-the-art evaluation metric that incorporates discourse information as one of its information sources.
Research in automatic evaluation for MT is very active, and new metrics are constantly being proposed, especially in the context of MT metric comparisons (Callison-Burch et al., 2007), the metrics shared tasks that ran as part of the Workshop on Machine Translation, or WMT (Callison-Burch et al., 2008, 2009, 2010, 2011, 2012; Macháček and Bojar, 2013, 2014; Stanojević et al., 2015; Bojar et al., 2016), and the NIST Metrics for Machine Translation Challenge, or MetricsMATR.18 For example, at WMT15, 11 research teams submitted 46 metrics to be compared (Stanojević et al., 2015).
Many metrics at these evaluation campaigns explore ways to incorporate syntactic and semantic knowledge. This reflects the general trend in the field. For instance, at the syntactic level, we find metrics that measure the structural similarity between shallow syntactic sequences (Giménez and Màrquez, 2007; Popovic and Ney, 2007) or between constituency trees (Liu and Gildea, 2005). In the semantic case, there are metrics that exploit the similarity over named entities, predicate–argument structures (Giménez and Màrquez, 2007; Lo, Tumuluru, and Wu, 2012), or semantic frames (Lo and Wu, 2011). Finally, there are metrics that combine several lexico-semantic aspects (Giménez and Màrquez, 2010b).
As we mentioned earlier, one problem with discourse-related MT research is that it might need specialized evaluation metrics to measure progress. This is especially true for research focusing on relatively rare discourse-specific phenomena, as getting them right or wrong might be virtually “invisible” to standard MT evaluation measures such as BLEU, even when manual evaluation does show improvements (Meyer et al., 2012; Taira, Sudoh, and Nagata, 2012; Novák, Nedoluzhko, and Žabokrtský, 2013).
Thus, specialized evaluation measures have been proposed, for example, for the translation of discourse connectives (Hajlaoui and Popescu-Belis, 2012; Meyer et al., 2012; Hajlaoui, 2013) and for pronominal anaphora (Hardmeier and Federico, 2010), among others.
In comparison with the syntactic and semantic extensions of MT metrics, there have been very few previous attempts to incorporate discourse information. One example is the semantics-aware metrics of Giménez and Màrquez (2009) and Giménez et al. (2010), which used Discourse Representation Theory (Kamp and Reyle, 1993) and tree-based discourse representation structures (DRS) produced by a semantic parser. They calculated the similarity between the MT output and the references based on DRS subtree matching, as defined in Liu and Gildea (2005), also using DRS lexical overlap and DRS morpho-syntactic overlap. However, they could not improve correlation with human judgments as evaluated on the MetricsMATR data set, which consists of 249 manually assessed segments. Compared with that previous work, here (i) we used a different discourse representation (RST), (ii) we compared discourse parses using all-subtree kernels (Collins and Duffy, 2001), (iii) we evaluated on much larger data sets, for several language pairs and for multiple metrics, and (iv) we did demonstrate better correlation with human judgments.
Recently, other discourse-related extensions of MT metrics (such as BLEU, TER, and Meteor) were proposed (Wong et al., 2011; Wong and Kit, 2012), which use document-level lexical cohesion (Halliday and Hasan, 1976). In that work, lexical cohesion is measured using word repetitions and semantically similar words such as synonyms, hypernyms, and hyponyms. For BLEU and TER, the authors observed improved correlation with human judgments on the MTC4 data set (900 segments) when linearly interpolating these metrics with their lexical cohesion score. However, they ignored a key property of discourse, namely, the coherence structure, which we have effectively exploited in both the tuning and the no-tuning scenarios. Furthermore, we have shown that the similarity between discourse trees can yield improvements for a larger number of existing MT evaluation metrics. Finally, unlike their work, which measured lexical cohesion at the document level, here we are concerned with coherence (rhetorical) structure, primarily at the sentence level.
Finally, we should note our own previous work, on which this article is based. In Guzmán et al. (2014b), we showed that using discourse can improve a number of pre-existing evaluation metrics, and in Joty et al. (2014) we presented our DiscoTK family of discourse-based metrics. In particular, the DiscoTKparty metric (discussed in Section 4.3) combined several variants of a discourse tree representation with other metrics from the Asiya MT evaluation toolkit, and yielded the best-performing metric in the WMT14 Metrics shared task. Compared with those previous publications of ours, here we provide additional detail and extensive analysis, trying to explain why discourse information is helpful for MT evaluation.
In another related publication (Guzmán et al., 2014a), we proposed a pairwise learning-to-rank approach to MT evaluation that learns to differentiate better from worse translations with respect to a given reference. There, we integrated several layers of linguistic information, combining POS tags, shallow syntax, and discourse parses, which we encapsulated in a common tree-based structure.
We used preference re-ranking kernels to learn the features automatically. The evaluation results showed that learning in the proposed framework yields better correlation with human judgments than computing the direct similarity over the same type of structures. We also showed that structural kernel learning can serve as a general framework for MT evaluation, in which syntactic and semantic information can be naturally incorporated.
Unfortunately, learning features with preference kernels is computationally very expensive, both at training and at testing time. Thus, in a subsequent work (Guzmán et al., 2015), we used a pairwise neural network instead, where lexical, syntactic, and semantic information from the reference and the two hypotheses is compacted into small distributed vector representations and fed into a multilayer neural network that models the interaction between each of the hypotheses and the reference, as well as between the two hypotheses. This framework yielded correlation with human judgments that rivals the state of the art. In future work, we plan to incorporate discourse information in this neural framework, which we could not do initially because of the lack of discourse embeddings. However, with the availability of a neural discourse parser like the one proposed by Li, Li, and Hovy (2014), this goal is now easily achievable.
7. Conclusions
We addressed the research question of whether sentence-level discourse structure can help the automatic evaluation of machine translation. To do so, we defined several variants of a simple discourse-aware similarity metric, which use the all-subtree kernel to compute similarity between RST trees. We then used this similarity metric to automatically assess MT quality in the evaluation benchmarks from the WMT metrics shared task. We proposed to take the similarity between the discourse trees for the hypothesis and for the reference translation as an absolute measure of translation quality. The results presented here can be analyzed from several perspectives:
Applicability. The first conclusion after a series of experimental evaluations is that the sentence-level discourse structure can be successfully leveraged to evaluate translation quality. Although discourse-based metrics perform reasonably well on their own, especially at the system level, one interesting finding is that discourse information is complementary to many existing metrics for MT evaluation (e.g., BLEU, TER, Meteor) that encompass different levels of linguistic information. At the system level, this leads to systematic improvements in correlation with human judgments in the majority of the cases where the discourse-based metrics were mixed with other single metrics in a uniformly weighted linear combination. When we further tuned the combination weights via supervised learning from human-assessed pairwise examples, we obtained even better results and observed average relative gains between 22% and 35% in segment-level correlation.
Robustness. Other interesting properties we observed in our experiments have to do with the robustness of the supervised learning combination approach. The results were very stable when training and testing across several WMT data sets from different years. Additionally, the tuned metrics were quite insensitive to the source language in the translation, to the point that it was preferable to train with all training examples together rather than training source-language specific models.
External validation. Exploiting this combination approach to its best, we produced a strong combined MT evaluation metric (DiscoTKparty), composed of 23 individual metrics including five variants of our discourse-based metric, which performed best at the WMT14 metrics shared task, both at the system level and at the segment level. When building the state-of-the-art DiscoTKparty metric, we observed that the discourse-based features are favorably weighted (e.g., DR-lex1 was ranked fourth out of the 23 metrics), with coefficients that are on par with those of features such as BLEU. This tells us that the contribution of discourse-based information is significant even in the presence of such a rich diversity of information sources. In this direction, we also presented evidence that the contribution of the sentence-level discourse information goes beyond what syntactic information provides to the evaluation metrics; in fact, the two linguistic dimensions collaborate well, producing cumulative gains in performance.
Understanding the Contribution of Discourse. In this article, we also presented a more qualitative analysis in order to better understand the contribution of the discourse trees in the new proposed discourse metrics. First, we conducted an ablation study, and we confirmed that all layers of information present in the discourse trees (i.e., hierarchical structure, discourse relations, and nuclearity labels) play a role and make a positive incremental contribution to the final performance. Interestingly, the most relevant piece of information is the nuclearity labels, rather than the relations.
Second, we analyzed the ability of discourse trees to discriminate between good and bad translations in practice. For that, we computed the similarity between the discourse trees of the reference translations and the discourse trees of a set of good translations, and compared it with the similarity between the discourse trees of the reference translations and a set of bad translations. Good and bad translations were selected based on existing human evaluations. This similarity was computed at different levels, including relation labels, nuclearity labels, elementary discourse units, tree depth, and so forth. We observed a systematically higher similarity to the discourse trees of good translations, in all the specific elements tested and across all language pairs. These results confirm the ability of discourse trees to characterize good translations as the ones more similar to the reference.
Limitations and Future Work. An important limitation of our study is that it is restricted to sentence-level discourse parsing. Although it is true that complex sentences with non-trivial discourse structure abound in our corpora, it is reasonable to think that there is more potential in the application of discourse parsing at the paragraph or at the document level. The main challenge in this direction is that there are no corpora available with manual annotations of the translation quality at the document level.
Second, we have applied discourse only for MT evaluation, but we would like to follow a similar path to verify whether discourse can also help machine translation itself. Our first approach will be to use discourse information to re-rank a set of candidate translations. The main challenge here is that one has to establish the links between the discourse structure of the source and that of the translated sentences, trying to promote translations that preserve discourse structure.
Finally, at the level of learning, we are working on how to move from tuning the overall weights of a linear combination of metrics to learning over fine-grained features, for example, the substructures contained in the discourse parse tree and in other linguistic structures (syntax, semantics, etc.). This way, we would be learning which features help to identify better translations compared with worse ones (Guzmán et al., 2014a, 2015). Our vision is to have a model that can learn combined evaluation metrics taking into account different levels of linguistic information, fine-grained features, and pre-existing measures, and that could be applied, with minor variations, to the related problems of MT evaluation, quality estimation, and reranking.
Acknowledgments
The authors would like to thank the reviewers of the previous versions of this article for their constructive and thorough comments and criticism. Thanks to their comments we have improved the article significantly.
Notes
See Carlson and Marcu (2001) for a detailed description of the discourse relations.
A demo of the parser is available at http://alt.qcri.org/demos/Discourse_Parser_Demo/. The source code of the parser is available from http://alt.qcri.org/tools/discourse-parser/.
http://www.statmt.org/wmtYY/results.html, with YY in {11, 12, 13, 14}.
A detailed description of the WMT evaluation setting can be found in Bojar et al. (2014).
When fitting the model, we did not include a bias term, as this was harmful.
In this section we use the term segment instead of sentence as we do in the rest of the article, to be consistent with the terminology used in the MT field.
We use the WMT12 aggregation script. See also Callison-Burch et al. (2012) for a discussion and comparison of several aggregation alternatives.
These variants are described in page 19 of the Asiya manual (http://asiya.lsi.upc.edu/).
We did not apply the significance test at the system level because of the insufficient number of scores available to sample from; there were a total of 49 (language pair, system) scores for the WMT12 data.
Tuning separately for each language pair yielded slightly lower results.
Note that the Hindi–English language pair caused a similar problem for a number of other metrics at the WMT14 shared task competition that relied on linguistic analysis.
Note that the results of Spearman's ρ for cs-en do not follow exactly the same pattern. This instability might be due to the small number of systems for this language pair (see Table 1).
Differences between language pairs can be attributable to the particular MT systems that participated in the competition, which is a variable that we cannot control for in these experiments.
This also explains why translation 12(c) obtains a relatively high score of 0.75. There are many subtree matches coming from the words, and from some very simple and meaningless fragments in the discourse tree (e.g., NUC–Nucleus, NUC–Satellite).
Asiya syntactic metrics are described on pages 21–24 of the manual (http://asiya.lsi.upc.edu/).
A notable exception is the work of Tu, Zhou, and Zong (2013), who report up to 2.3 BLEU points of improvement for Chinese-to-English translation using an RST-based MT framework.
References
Author notes
Applied Machine Learning Group, Facebook. E-mail: [email protected].
HBKU Research Complex B1, P.O. Box 5825, Doha, Qatar. E-mail: [email protected], [email protected].