Abstract
Despite significant improvements in translation quality, context-aware machine translation (MT) models still underperform in many cases. One of the main reasons is that they fail to utilize the correct features from context when the context is too long or the model is overly complex. This can lead to the explain-away effect, wherein the model relies only on the features that most easily explain its predictions, resulting in inaccurate translations. To address this issue, we propose a model that explains its translation decisions by predicting coreference features in the input. We construct the input coreference model on top of an existing MT model, exploiting contextual features from both the input and the translation output representations. We evaluate and analyze our method on the WMT English-German document-level translation task, the English-Russian dataset, and the multilingual TED talk dataset, demonstrating an improvement of over 1.0 BLEU score compared with other context-aware models.
1 Introduction
With the rapid development of machine learning techniques, the Machine Translation (MT) field has witnessed a shift from purely probabilistic models (Brown et al., 1990; Koehn et al., 2003) to neural network based models, from simple Recurrent Neural Network (RNN) based encoder-decoder models (Sutskever et al., 2014) to attention-based models (Bahdanau et al., 2015; Luong et al., 2015), and finally to the current state-of-the-art Transformer model (Vaswani et al., 2017) and its variants.
The quality of MT models, including RNN-based, attention-based, and Transformer models, has been improved by incorporating contextual information (Voita et al., 2018; Wang et al., 2017; and others) or linguistic knowledge (Bugliarello and Okazaki, 2020; Sennrich and Haddow, 2016; and others). Among the former, context-aware methods, many successful approaches focus on selecting context from previous sentences (Jean et al., 2017; Wang et al., 2017), on using multiple steps of translation, in which an additional module refines the translations produced by a context-agnostic MT system to utilize contextual information (Voita et al., 2019; Xiong et al., 2019), or on encoding all context information in an end-to-end framework (Zhang et al., 2020; Bao et al., 2021). Although these methods have demonstrated improved performance, there are still many cases in which their models translate incorrectly, e.g., when handling the ellipsis phenomenon in a long paragraph. One of the reasons is that their models are still unable to select the right features from the context when the context is long or the model is overly complex. The model therefore easily suffers from an explain-away effect (Klein and Manning, 2002; Yu et al., 2017; Shah et al., 2020; Refinetti et al., 2023), in which it learns to use only the features that are most easily exploited for prediction and discards most of the input features.
In order to resolve the problem of selecting the right context features in context-aware MT, we propose a model that explains translation decisions by predicting input features. The input prediction model employs the representations of the translation outputs as additional features to predict contextual features in the inputs. In this work, we employ coreference as the prediction task, since it captures the relations among mentions that are necessary for a context-aware model. The prediction model is constructed on top of an existing MT model without modification, in the same manner as multi-task learning, but it fuses information from the representations used for the translation decisions in the MT model.
Under the same settings of the English-Russian (En-Ru) dataset and the WMT English-German (En-De) document-level translation task, our proposed technique outperforms the standard Transformer-based neural machine translation (NMT) model in both sentence-level and context-aware settings, as well as the state-of-the-art context-aware model, as measured by BLEU (Post, 2018), BARTScore (Yuan et al., 2021), COMET (Rei et al., 2020), and the human-annotated paragraph-level test set (Voita et al., 2019). Additionally, in the multilingual experiments, our method shows results consistent with those on the En-Ru and En-De datasets, proving its versatility across languages.
Further analysis shows that our coreference explanation sub-model consistently enhances translation quality regardless of dataset size. Notably, the model demonstrates consistent improvement when additional context is incorporated, highlighting its effectiveness in handling larger context sizes. The analysis also highlights a strong correlation between the self-attention heat map and coreference clusters, underscoring the significance of our coreference prediction sub-model in capturing coreference information during translation. Moreover, our proposed training method proves to be effective in the coreference prediction task. We also provide a suggestion for tuning the contribution of the sub-model to optimize its impact within the overall MT system. We release our code and hyperparameters at https://github.com/hienvuhuy/TransCOREF.
2 Background
2.1 Transformer-based NMT
2.2 Context-Aware Transformer-based NMT
Several approaches can be used to produce a translated document, e.g., keeping a sliding window of m sentences (Tiedemann and Scherrer, 2017), joining these m sentences as a single input, translating them, and selecting the last sentence as the output (m-to-m) (Zhang et al., 2020), or joining all sentences in a document into a very long sequence and translating this sequence (Bao et al., 2021), among other methods. To simplify the definition of the context-aware NMT model, we opt for the m-to-m method and insert a special token (_eos) between sentences when feeding these m sentences to the model. In this way, the context-aware translation model can still be defined as standard sentence-wise translation as in §2.1.
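As a rough illustration (not the paper's actual preprocessing code; function and variable names are hypothetical), the m-to-m input for one sentence can be built by joining its preceding sentences with the separator token:

```python
def build_m_to_m_input(doc_sentences, i, m=4, sep="_eos"):
    """Join sentence i with its m - 1 preceding sentences, separated by `sep`.
    Missing context at the document start is padded with empty sentences."""
    context = doc_sentences[max(0, i - (m - 1)):i]
    context = [""] * (m - 1 - len(context)) + context
    return f" {sep} ".join(context + [doc_sentences[i]])

doc = ["I met William .", "He is my grandson .", "We talked for hours .", "Then I left ."]
print(build_m_to_m_input(doc, i=3))  # three context sentences followed by the current one
```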
2.3 Coreference Resolution Task
Coreference resolution is the task of identifying and grouping all mentions or references to a particular entity within a given text into a cluster, i.e., a set of spans. The task has progressed significantly from its earlier approaches based on hand-crafted feature systems (McCarthy and Lehnert, 1995; Aone and William, 1995) to more advanced and effective deep learning approaches based on span ranking (Lee et al., 2017, 2018; Kirstain et al., 2021), and has been extended to multilingual settings (Zheng et al., 2023).
3 Context-Aware MT with Coreference Information
Architecture
Training
Inference
4 Experiments
4.1 Dataset
We utilized the En-Ru dataset (Voita et al., 2019) and the widely adopted En-De benchmark dataset IWSLT 2017, as used in Maruf et al. (2019), with details provided in Table 1. We also used the multilingual TED talk dataset (Qi et al., 2018) to assess the efficacy of our proposed method across a variety of language types with different characteristics in pronouns, word order, and gender assignment, with specifics delineated in Table 2.
Statistics of En-De and En-Ru datasets.
|       | Avg. #Coref. Clusters (train/valid/test) | #Samples (train/valid/test) |
|-------|------------------------------------------|-----------------------------|
| En-Ru | 3.1 / 3.0 / 2.9                          | 1.5M / 10k / 10k            |
| En-De | 4.4 / 4.4 / 4.4                          | 206k / 8k / 2k              |
Properties of Languages in Our Experiments: WO (Word Order), PP (Pronouns Politeness), GP (Gendered Pronouns), and GA (Gender Assignment) denote language structural properties. IE (Indo-European), JAP (Japonic), ST (Sino-Tibetan), and AA (Austroasiatic) represent language families. Symbols ♢, ♡, ♣, and ♠ correspond to ‘None’, ‘Binary’, ‘Avoided’, and ‘Multiple’, respectively. The terms 3SG (Third Person Singular), 1/2/3P (First, Second, and Third Person), and 3P (Third Person) are used for pronoun references. SEM and S-F stand for Semantic and Semantic-Formal, respectively, in Gender Assignment.
| Language   | Family | WO      | PP | GP     | GA  |
|------------|--------|---------|----|--------|-----|
| English    | IE     | SVO     | ♢  | 3SG    | SEM |
| Russian    | IE     | SVO     | ♠  | 3SG    | S-F |
| German     | IE     | SOV/SVO | ♡  | 3SG    | S-F |
| Spanish    | IE     | SVO     | ♡  | 1/2/3P | SEM |
| French     | IE     | SVO     | ♡  | 3SG    | S-F |
| Japanese   | JAP    | SOV     | ♣  | 3P     | ♢   |
| Romanian   | IE     | SVO     | ♠  | 3SG    | S-F |
| Mandarin   | ST     | SVO     | ♡  | 3SG    | ♢   |
| Vietnamese | AA     | SVO     | ♠  | ♢      | ♢   |
The En-Ru dataset comes from OpenSubtitles2018 (Lison et al., 2018), sampled as training instances with three context sentences after tokenization; thus, no document boundary information is preserved. In the En-De and multilingual datasets, document boundaries are provided. To maintain consistency in our translation settings during experiments, we tokenize all texts using MeCab for Japanese, Jieba for Chinese, VnCoreNLP (Vu et al., 2018) for Vietnamese, and the SpaCy framework for all other languages. We also apply a sliding window of m sentences (m = 4) to each document to create a format similar to that of the En-Ru dataset. For the first m − 1 sentences of each document, which do not have m − 1 preceding context sentences in the m-to-m translation setting, we pad the beginning with empty sentences, ensuring m − 1 context sentences for every sample in the dataset. For preprocessing, we apply the BPE (Byte-Pair Encoding) technique of Sennrich et al. (2016) with 32K merge operations to all datasets. To identify coreference clusters for the source language, i.e., English, we leveraged the AllenNLP framework and employed the SpanBERT large model (Lee et al., 2018). After generating sub-word units, we adjust the word-wise indices of all members in coreference clusters using the offsets of the sub-word units.
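To make the index adjustment concrete, the following sketch (hypothetical helper names; our released code contains the actual implementation) remaps word-level coreference spans to sub-word positions given how many sub-word units each word produced:

```python
def remap_spans_to_subwords(coref_clusters, n_subwords_per_word):
    """Remap word-level (start, end) spans to sub-word indices,
    given the number of sub-word units each word was split into."""
    # offsets[i] = index of the first sub-word of word i.
    offsets, total = [], 0
    for n in n_subwords_per_word:
        offsets.append(total)
        total += n
    remapped = []
    for cluster in coref_clusters:
        new_cluster = []
        for start, end in cluster:  # inclusive word-level span
            new_start = offsets[start]
            new_end = offsets[end] + n_subwords_per_word[end] - 1
            new_cluster.append((new_start, new_end))
        remapped.append(new_cluster)
    return remapped

# Example: word 2 splits into 3 sub-word units after BPE.
clusters = [[(0, 0), (2, 2)]]
print(remap_spans_to_subwords(clusters, [1, 1, 3, 1]))  # [[(0, 0), (2, 4)]]
```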
4.2 Experiment Settings
Translation Setting
In our experiments, we adopt the context-aware translation settings (m-to-m with m = 4) utilized in previous work (Zhang et al., 2020). For the context-agnostic setting, we translate each sentence individually.
Baselines Systems
We adopt the Transformer model (Vaswani et al., 2017) as our two baselines: Base Sent, which was trained on source and target sentence pairs without context, and Base Doc, which was trained with contexts in the m-to-m setting as described in §2.2. To make a fair comparison with previous work that uses similar context-aware translation settings and enhances the MT system at the encoder side, we employ the G-Transformer (Bao et al., 2021), Hybrid Context (Zheng et al., 2020), and MultiResolution (Sun et al., 2022). We also compare our approach with the CoDoNMT (Lei et al., 2022) model, which also integrates coreference resolution information to improve translation quality. Note that all the aforementioned baselines use the provided open-source code. Additionally, we trained a simple variant of a context-aware Transformer model similar to Base Doc, but differing in that it incorporates a coreference embedding, alongside the existing positional embedding, directly into the encoder side of the model (Trans+C-Embedding). This coreference embedding is derived from the original positional embedding in the encoder, with the modification that all tokens within a coreference cluster share the same value as the left-most token in that cluster. Note that it is intended as a simple baseline for the direct modeling approach discussed in §3.
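As a minimal sketch of the Trans+C-Embedding baseline (illustrative only; names are hypothetical), the coreference position ids can be computed as follows: every token in a cluster is assigned the position of the cluster's left-most token, and these ids index the same position table used for the ordinary positional embedding.

```python
def coref_position_ids(seq_len, clusters):
    """Position ids where every token in a coreference cluster shares the
    position of the cluster's left-most token; other tokens keep their own."""
    pos = list(range(seq_len))
    for cluster in clusters:          # cluster: list of token indices
        leftmost = min(cluster)
        for tok in cluster:
            pos[tok] = leftmost
    return pos

# Tokens 3 and 9 corefer with token 1, so they share position id 1.
print(coref_position_ids(12, [[1, 3, 9]]))
```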
Our Systems
We evaluate our proposed inference methods: the original Transformer inference without reranking (Trans+Coref), and reranking of the N-best list with the score from our coreference resolution sub-model (Trans+Coref+RR), as defined in Equation 13.
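Equation 13 is not reproduced in this section; as a simplified illustration (hypothetical variable names, not our exact scoring code), the reranking can be sketched as a linear combination of the translation log-probability and the β-weighted coreference sub-model score over the N-best list:

```python
def rerank_nbest(nbest, beta):
    """nbest: list of (hypothesis, mt_log_prob, coref_score) tuples.
    Returns hypotheses sorted by the combined reranking score."""
    return sorted(
        nbest,
        key=lambda h: h[1] + beta * h[2],  # MT score plus weighted coref score
        reverse=True,
    )

best_hyp = rerank_nbest([("hyp a", -1.2, -0.4), ("hyp b", -1.1, -0.9)], beta=0.5)[0][0]
```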
Hardware
All models in our experiments were trained on a machine with the following specifications: an AMD EPYC 7313P CPU, 256GB RAM, a single NVIDIA RTX A6000 GPU with 48GB VRAM, and CUDA version 11.3. For the multilingual experiments, we used a machine with a single NVIDIA RTX 3090 GPU, an Intel i9-10940X CPU, 48GB RAM, and CUDA version 12.1.
Hyperparameters
We use the same parameters, including the number of training epochs, learning rate, batch size, etc., for all models in our experiments. Specifically, we train all models for up to 40 epochs, stopping when both the coreference and translation losses on the validation set no longer improve.
For translation tasks, we use the Adam optimizer with β1 = 0.9, β2 = 0.98, and ε = 1e−9, along with an inverse square root learning rate scheduler. All dropout values are set to 0.1, and the learning rate is set to 7e−5. We use batch sizes of 128 and 32 for the experiments on the English-Russian and English-German datasets, respectively. Other parameters follow those in Vaswani et al. (2017).
For coreference tasks, we adopt the parameters from Kirstain et al. (2021), with some modifications to fit our GPU memory. We use the Adam optimizer with β1 = 0.9, β2 = 0.98, and ε = 1e−9, with a learning rate of 7e−5. The dropout value is set to 0.3, top lambda (the percentage of all spans kept after filtering) is set to 0.4, the hidden size is set to 512, and the maximum span length is set to 10. The maximum number of clusters is set to 8 and 20 for the English-Russian and English-German datasets, respectively. To rerank the N-best translations, we use Equation 13 and perform a grid search on the validation set with a step size of 0.0001 to select the optimal value of β from −2 to 2.
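The grid search for β can be sketched as a loop over candidate values, reranking the validation N-best lists and keeping the β with the best BLEU (a sketch only; `bleu_on_valid` is a hypothetical scoring callback, not our exact tuning script):

```python
def search_beta(nbest_lists, references, bleu_on_valid, lo=-2.0, hi=2.0, step=0.0001):
    """Grid search: return the beta in [lo, hi] maximizing validation BLEU after reranking.
    Each element of nbest_lists is a list of (hypothesis, mt_log_prob, coref_score) tuples."""
    best_beta, best_bleu = lo, float("-inf")
    n_steps = int(round((hi - lo) / step))
    for i in range(n_steps + 1):
        beta = lo + i * step
        # Pick the top hypothesis per sentence under the combined score.
        hyps = [max(nbest, key=lambda h: h[1] + beta * h[2])[0] for nbest in nbest_lists]
        bleu = bleu_on_valid(hyps, references)
        if bleu > best_bleu:
            best_beta, best_bleu = beta, bleu
    return best_beta
```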
4.3 Metrics
BLEU
We employ SacreBLEU (Post, 2018) as an automated evaluation metric to assess the quality of translations in our experiments.
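For reference, a minimal way to obtain corpus-level scores with the sacrebleu Python package (file handling and detokenization details omitted):

```python
import sacrebleu

hypotheses = ["The cat sat on the mat ."]
references = [["The cat sat on the mat ."]]  # one inner list per reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)  # corpus-level BLEU
```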
BARTScore
COMET
4.4 Results
The main results of our experiments are presented in Table 3. Our results indicate that training the baseline Transformer model with context (Base Doc) yields better performance than training on individual sentence pairs without context (Base Sent) on the En-Ru dataset. This finding is consistent with those reported by Voita et al. (2019), in which more contextual information helps achieve better translation. However, on the En-De dataset, the Base Doc system performs worse than the Base Sent system. This discrepancy can be explained by the different methodologies used to construct the En-De and En-Ru datasets. For the En-De datasets, both the context-aware and context-agnostic datasets are compiled from the same pool of non-duplicate sentences. However, for the En-Ru datasets, the context-agnostic dataset is created by removing context sentences from the context-aware dataset (Voita et al., 2019), which results in different numbers of non-duplicate sentences between the context-agnostic and context-aware datasets.
The results of all main experiments. BL, BS, and CM are abbreviations for BLEU, BARTScore, and COMET, respectively. The best performance per metric is in bold text.
| System | En-Ru BL ↑ | En-Ru BS ↑ | En-Ru CM ↑ | En-De BL ↑ | En-De BS ↑ | En-De CM ↑ |
|---|---|---|---|---|---|---|
| Base Sent | 29.46 | −9.695 | 82.87 | 22.76 | −6.178 | 68.06 |
| Base Doc | 29.91 | −9.551 | 83.40 | 21.54 | −6.200 | 66.91 |
| Hybrid Context (Zheng et al., 2020) | 29.96 | −9.590 | 83.45 | 22.05 | −6.236 | 66.97 |
| G-Transformer (Bao et al., 2021) | 30.15 | −9.691 | 83.13 | 22.61 | −6.090 | 68.36 |
| MultiResolution (Sun et al., 2022) | 29.85 | −9.763 | 81.76 | 22.09 | −6.099 | 67.99 |
| CoDoNMT (Lei et al., 2022) | 29.92 | −9.552 | 83.03 | 22.55 | −6.197 | 67.93 |
| Trans+C-Embedding | 30.13 | −9.522 | 83.43 | 22.54 | −6.092 | 68.80 |
| Trans+Coref | 30.39* | −9.501† | 83.56• | 23.57** | −6.088† | 69.17◇ |
| Trans+Coref+RR | 30.43* | −9.500† | 83.56• | 23.60** | −6.086† | 69.21◇ |
(*) and (**) indicate statistical significance (Koehn, 2004) at p < 0.02 and p < 0.01, respectively, compared to the Base Doc system and all other baseline systems. (◇), (†), and (•) signify statistical significance at p < 0.05 compared to all baselines; to all except Trans+C-Embedding and G-Transformer; and to all except Trans+C-Embedding, Hybrid Context, and G-Transformer, respectively.
When comparing our systems with the Transformer model (Base Doc), both Trans+Coref and Trans+Coref+RR prove effective in enhancing translation quality by explaining the translation decisions through predicting coreference information. This is demonstrated by the superior BLEU scores (+0.52 in En-Ru and +2.06 in En-De for Trans+Coref+RR), BARTScore, and COMET observed across the different settings and language pairs.
Compared to the G-Transformer system described in Bao et al. (2021), our system shows an improvement with both inference approaches (Trans+Coref and Trans+Coref+RR). On the En-Ru dataset, our system achieves a higher BLEU score by +0.24, while on the En-De dataset it demonstrates a larger improvement of +1.14 in the same metric (Trans+Coref). Additionally, our method outperforms the G-Transformer in terms of BARTScore and COMET on both the En-Ru and En-De datasets. One possible explanation for these results is that the G-Transformer is specifically designed to map each sentence in the source language to only a single sentence in the target language during both training and inference. This design choice helps mitigate issues related to generating very long sequences. However, when the dataset size is small, as in the case of the En-De dataset, the G-Transformer encounters difficulties in selecting useful information. In contrast, our approach effectively selects useful information indirectly through the coreference explanation sub-model, especially when dealing with small datasets, which allows our system to perform better in scenarios with limited dataset size. Our method also surpasses the Transformer model with the additional coreference position embedding (Trans+C-Embedding), which relies on coreference information through a direct modeling approach.
In the results on the multilingual TED talk dataset in Table 4, where we compare our proposed method to the Transformer models and the best baseline from Table 3, our method also surpasses the other baselines by +1.0 to +2.3 BLEU. These findings provide further evidence that our approach is effective in improving translation quality and can be applied to diverse language types.
Results on the multilingual dataset in the BLEU metric. The highest results are in bold text.
| System | Es ↑ | Fr ↑ | Ja ↑ | Ro ↑ | Zh ↑ | Vi ↑ |
|---|---|---|---|---|---|---|
| Base Sent | 37.23 | 37.75 | 12.11 | 24.35 | 12.38 | 31.74 |
| Base Doc | 36.22 | 36.89 | 10.13 | 23.27 | 11.66 | 31.22 |
| G-Transformer | 36.46 | 37.88 | 12.27 | 24.63 | 12.07 | 32.69 |
| Trans+Coref | 38.13* | 39.01* | 12.93* | 25.56* | 13.18* | 33.51* |
(*) indicates statistical significance (Koehn, 2004) at p < 0.01 compared to the other systems.
We provide an example of translations from our systems as well as other baseline systems in Table 5. In this example, the correct translation of the phrase in the last sentence, моей командой (my team), is determined by identifying which word refers to “my”, in this case, i and me. Both the G-Transformer and Trans+C-Embedding systems fail to capture these mentions and consequently produce an incorrect translation, мою команду. Despite correctly translating моей, the Base Doc system’s phrase встретимся в моей команде is grammatically incorrect and deviates from the original English “meet my team”. Conversely, our systems capture this reference accurately, yielding a translation consistent with the reference.
5 Analysis
Contribution of Coreference Explanation
We conducted experiments by adjusting the value of α in Equation 12 during the training of the Trans+Coref without reranking. The experimental results in Table 6 indicate that for medium-sized corpora, selecting a value of α that is either too small or too large negatively impacts translation quality. The optimal range for α is 0.8 ≤ α ≤ 2.
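Equation 12 is not reproduced in this section; assuming it is the weighted sum of the two sub-model losses, consistent with the description of the model, the objective being tuned here can be sketched as:

```python
def joint_loss(translation_loss, coreference_loss, alpha=2.0):
    """Joint objective: translation loss plus alpha times the coreference loss.
    Both terms are computed on the same batch and optimized together;
    alpha = 2.0 corresponds to the best BLEU in Table 6."""
    return translation_loss + alpha * coreference_loss
```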
Ablation results on the En-Ru dataset with different weight α. The highest result is in bold text.
| System | α | BLEU ↑ |
|---|---|---|
| Base Sent | − | 29.46 |
| Base Doc | − | 29.91 |
| Trans+Coref | 0.8 | 30.36 |
| | 1.0 | 30.31 |
| | 2.0 | 30.39 |
| | 3.0 | 30.27 |
| | 4.0 | 30.15 |
| | 10.0 | 30.00 |
Conditioning on Source and Target Language
Evaluation of the Trans+Enc and Trans+Coref systems using the BLEU and MUC metrics on the validation set of the En-De dataset. The highest results are in bold text.
| System | BLEU ↑ | P ↑ | R ↑ | F1 ↑ |
|---|---|---|---|---|
| Trans+Enc | 22.98 | 85.02 | 75.69 | 80.08 |
| Trans+Coref | 23.57 | 82.63 | 78.31 | 80.41 |
The MUC metric counts the changes required to align the system’s entity groupings with the gold-standard, focusing on adjustments to individual references.
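As a concrete illustration of MUC scoring (a simplified sketch, not the evaluation script we used), precision and recall are computed from the number of links missing when one set of clusters is partitioned by the other:

```python
def muc_counts(key_clusters, response_clusters):
    """MUC numerator/denominator: for each key cluster, count its size minus the
    number of partitions induced by the response clusters (i.e., missing links)."""
    num, den = 0, 0
    for k in key_clusters:
        k = set(k)
        # Partitions of k: non-empty intersections with response clusters,
        # plus singletons for mentions that appear in no response cluster.
        partitions = [k & set(r) for r in response_clusters if k & set(r)]
        covered = set().union(*partitions) if partitions else set()
        n_partitions = len(partitions) + len(k - covered)
        num += len(k) - n_partitions
        den += len(k) - 1
    return num, den

def muc_f1(key, response):
    r_num, r_den = muc_counts(key, response)     # recall
    p_num, p_den = muc_counts(response, key)     # precision (roles swapped)
    recall = r_num / r_den if r_den else 0.0
    precision = p_num / p_den if p_den else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# One key cluster perfectly recovered, one split by the response: F1 = 0.8.
print(muc_f1(key=[[1, 2, 3], [4, 5]], response=[[1, 2], [3], [4, 5]]))
```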
Figure 1 displays the entity heat maps, which illustrate the behavior of self-attention of the different systems in the translation sub-model. In the Base Doc system, self-attention primarily concentrates on local sentences while disregarding information between sentences. In contrast, the Trans+Enc system exhibits the ability to focus across sentences. However, when it comes to tokens within coreference clusters, the focused values are incorrect for certain clusters, such as [I_0, William]. On the other hand, the Trans+Coref system not only exhibits inter-sentential focus in its self-attention heat map but also accurately assigns the focused values for tokens within coreference clusters.
Entity heat maps of self-attentions: (a) Base Doc, (b) Trans+Enc and (c) Trans+Coref.
Figure 2 shows the entity heat maps of the coreference sub-model. In the Trans+Enc system, self-attention mainly concentrates on entities within the local scope and immediately adjacent sentences. However, when comparing these high attention values with the links in the coreference clusters, a significant proportion is found to be incorrect, e.g., [Grandma; I_1]. On the other hand, the self-attention in Trans+Coref exhibits a more balanced distribution of focus across all entities within the input. This balanced distribution results in considerably fewer errors than the self-attention in the Trans+Enc system. These findings align with the MUC metric (Vilain et al., 1995), which is based on the minimum number of missing links in the response entities compared to the key entities; details, in particular the F1 scores, are provided in Table 7. Note that we use reference translations to form the target-side representations in Equation (11) for identifying coreference clusters. Additionally, we generate gold-label coreference clusters using the AllenNLP framework, as discussed in §4.1.
Entity heat maps of self-attentions in the coreference resolution sub-model.
Impact of the Context Size
We conducted experiments with the Coref (Trans+Coref) and the Transformer (Base Doc) systems, exploring different context sizes in the m-to-m setting, ranging from m = 2 to m = 4. The experimental results in Figure 3 demonstrate that the translation quality of the Base Doc system drops significantly as the context gets longer, while Trans+Coref consistently achieves gains as more context is incorporated. This result also indicates that the coreference sub-model captures contextual information better than the baseline.
Translation results on En-De datasets with different m-to-m translation settings from m = 2 to m = 4. The result in the m = 1 setting serves as the Base Sent reference. The α in Equation 12 is set to 4.0.
Impact of Coreference Explanation
We conduct experiments by reranking all translation hypotheses with varying beam sizes during inference using Equation (13) to assess the impact of the coreference explanation sub-model on the En-Ru dataset (Voita et al., 2019). Figure 4 illustrates the results of our experiments measured by BLEU score. Our findings indicate that reranking with the sub-model Coref yields improved results, with differences ranging from 0.02 to 0.09. We also report the oracle BLEU score in Figure 5, which is measured by selecting the hypothesis that gives the maximum sentence-BLEU score among the candidates in an N-best list, to verify the potentially correct translations in the list. The results of this experiment, with differences ranging from 0.2 to 0.4, suggest that using the sub-model Coref has more potential to generate correct translations. Despite the relatively minor difference in the oracle BLEU score between the Trans+Coref and the Base Doc systems, indicating a similarity in their candidate spaces, the beam search process yields better results with Trans+Coref than with the Base Doc system. This reflects the differences in BLEU scores between Trans+Coref and Base Doc. The performance gap in the BLEU score between Trans+Coref and Trans+Coref+RR could potentially be widened further by incorporating coreference resolution during beam search, at the expense of more computational cost. We intend to explore this possibility in future research.

The results with N-best variants on the En-Ru dataset (Voita et al., 2019).

The results with N-best variants using the oracle BLEU metric on the En-Ru dataset (Voita et al., 2019).
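The oracle BLEU used in Figure 5 can be sketched as selecting, for each source sentence, the candidate with the highest sentence-level BLEU against its reference (a simplified illustration using the sacrebleu package; variable names are hypothetical):

```python
import sacrebleu

def oracle_hypotheses(nbest_lists, references):
    """For each N-best list, keep the candidate with the highest sentence BLEU."""
    oracle = []
    for candidates, ref in zip(nbest_lists, references):
        best = max(candidates, key=lambda hyp: sacrebleu.sentence_bleu(hyp, [ref]).score)
        oracle.append(best)
    return oracle

# Oracle corpus BLEU over the selected hypotheses:
# score = sacrebleu.corpus_bleu(oracle_hypotheses(nbest, refs), [refs]).score
```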
To further understand the impact of the coreference explanation sub-model on translation results, we perform an experiment on the contrastive test set of Voita et al. (2019), which contains human-labeled sentences for evaluating discourse phenomena and relies on the source text only, to verify whether our method can resolve phenomena at the document level. Table 8 presents the results of this experiment, which indicate that our system outperforms the Base Doc system in all aspects. These results demonstrate the significant contribution of the coreference explanation sub-model to the MT system.
Experimental results on the contrastive test (Voita et al., 2019). D, EI, EV and L are abbreviations for Deixis, Ellipsis Infl, Ellipsis Vp and Lexical Cohesion, respectively. Note that we only utilized the text described in §4.1, while other studies may incorporate additional sentence-level bilingual and monolingual texts associated with Voita et al. (2019).
| System | D ↑ | EI ↑ | EV ↑ | L ↑ |
|---|---|---|---|---|
| Base Doc | 83.32 | 70.20 | 62.20 | 46.0 |
| Trans+Coref | 85.64 | 71.20 | 65.2 | 46.4 |
Impact of Coreference Accuracy
We carried out experiments to assess the impact of varying accuracy of the external coreference framework, which is reported to achieve an F1 score of 80.4% on the MUC metric for the English CoNLL-2012 shared task (Lee et al., 2018), on overall translation quality. This was achieved by randomly omitting members from coreference clusters while ensuring that each valid cluster retained a minimum of two members, e.g., removing you_1 from the cluster [you_0, you_1, I_1] in Figure 1.
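The pruning procedure can be sketched as follows (an illustration with hypothetical names; the rates correspond to the pruning percentages in Table 9):

```python
import random

def prune_clusters(clusters, rate, seed=0):
    """Randomly drop members from each coreference cluster at the given rate,
    keeping at least two members so every cluster remains valid."""
    rng = random.Random(seed)
    pruned = []
    for cluster in clusters:
        members = list(cluster)
        n_drop = min(int(len(members) * rate), max(0, len(members) - 2))
        for member in rng.sample(members, n_drop):
            members.remove(member)
        pruned.append(members)
    return pruned

# Dropping 50% of a 3-member cluster removes at most one member.
print(prune_clusters([["you_0", "you_1", "I_1"]], rate=0.5))
```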
Table 9 presents the outcomes of these experiments, where a slight reduction in translation quality is observed as members of coreference clusters are randomly dropped. Remarkably, even with the omission of up to half of the cluster members, the results continue to exceed the performance of the Base Doc system. This implies that our method could be robust and effective, particularly for languages with limited accuracy in coreference resolution tasks.
Experimental results on dropping coreference clusters on the En-De dataset. RR means reranking with the coreference sub-model using Equation 13.
| System | Pruning (%) | BLEU ↑ (−RR) | BLEU ↑ (+RR) |
|---|---|---|---|
| Base Doc | − | 21.54 | − |
| Trans+Coref | 0 | 23.57 | 23.60 |
| | 10 | 23.43 | 23.44 |
| | 20 | 23.40 | 23.41 |
| | 30 | 23.29 | 23.29 |
| | 50 | 22.86 | 22.86 |
Impact of the Corpus Size
We randomly sampled training instances from the En-Ru dataset, varying the sample sizes to 200,000 (comparable to the En-De dataset), 500,000, and 1,000,000. Subsequently, we evaluated the contribution of the Coref sub-model (Trans+Coref) and the Transformer (Base Doc) on these datasets of different sizes. Figure 6 illustrates the results of these experiments. Our proposed system outperforms the Transformer model (Base Doc) across all sample sizes on the test set. Notably, this improvement is not limited to the small-dataset setting; similar trends are observed for medium-sized datasets. These results indicate that our system consistently outperforms the Transformer model and achieves improved translation quality regardless of the dataset size.
Translation results on the En-Ru dataset (Voita et al., 2019) with different sample sizes.
Remaining Challenges and Unresolved Questions
While our proposed method and existing works enhance translation accuracy for certain linguistic phenomena, challenges persist, particularly in handling deixis. Unlike straightforward scenarios where additional context aids in accurately translating deictic terms (e.g., determining the speakers in a conversation to correctly translate the words I and You), some instances require a comprehensive understanding of the provided text’s content to achieve correct pronoun translation. Consider the following example from the test data of the English-Vietnamese dataset (Qi et al., 2018): ”Oh my god! you’re right! who can we[chúng ta] sue? Now Chris is a really brilliant lawyer, but he knew almost nothing about patent law and certainly nothing about genetics. I knew something about genetics, but I wasn’t even a lawyer, let alone a patent lawyer. So clearly we[chúng tôi] had a lot to learn before we[chúng ta] could file a lawsuit.” In this context, the English word we is translated as either chúng tôi (we[chúng tôi]) or chúng ta (we[chúng ta]), reflecting the exclusion or inclusion of the listener. This example underscores the importance of contextual nuances in translating pronouns like we or us from English to Vietnamese, where the choice between chúng tôi and chúng ta is critical.
Building on the insights from the described example, we extracted all samples presenting similar linguistic challenges, in which a correctly translated sample must ensure that every instance of the word we is accurately translated. Table 10 presents the accuracy of translating the word we into the correct Vietnamese pronoun. While our method surpasses the other baseline models in performance, it still exhibits lower accuracy compared to the deixis-related outcomes of the contrastive test for Russian (Voita et al., 2019). This discrepancy highlights the phenomenon as a significant challenge that warrants further investigation.
Computational Cost
We present a detailed comparison of the parameter counts and the training time per epoch of our proposed method and the other baselines in Table 11. Compared to the G-Transformer, our method uses fewer parameters, takes less time to train, and yet achieves better performance. On the other hand, the Base Doc system trains the quickest and uses fewer parameters than our method, but its results notably underperform.
Number of parameters (in million), training time for one epoch (in seconds), and results of systems (in the BLEU metric) on the En-De dataset.
| System | No. of Params (M) | Training Time (s/epoch) | En-De ↑ |
|---|---|---|---|
| Base Doc | 92.03 | 407 | 21.54 |
| MultiResolution | 92.03 | 610 | 22.09 |
| G-Transformer | 101.48 | 566 | 22.61 |
| Hybrid Context | 65.78 | 1,776 | 22.05 |
| CoDoNMT | 92.03 | 638 | 22.55 |
| Trans+Coref | 98.59 | 503 | 23.57 |
6 Related Work
Multi-task learning has primarily been utilized in MT tasks to integrate external knowledge into MT systems. Luong et al. (2016), Niehues and Cho (2017), and Eriguchi et al. (2017) have employed multi-task learning with different variations of shared weights of encoders, decoders, or attentions between tasks to effectively incorporate parsing knowledge into sequence-to-sequence MT systems.
Several works incorporate coreference cluster information to improve NMT models. Ohtani et al. (2019) integrate coreference cluster information into a graph-based NMT approach to enrich contextual information. Similarly, Xu et al. (2021) use this information to connect words across different sentences and incorporate other parsing information to construct a document-level graph, resulting in improved translation quality. Lei et al. (2022) employ coreference information to construct cohesion maskings and fine-tune sentence-level MT systems to produce more cohesive outputs. On the other hand, Stojanovski and Fraser (2018) and Hwang et al. (2021) leverage coreference cluster information through augmentation steps: they either add noise to construct a coreference-augmented dataset or use coreference information to create a contrastive dataset, and train their MT systems on these enhanced datasets to achieve better translation performance. For context-aware MT, Kuang et al. (2018) and Tu et al. (2018) focus on memory-augmented neural networks, which store and retrieve previously translated parts in NMT systems. These approaches help unify the translation of objects, names, and other elements across different sentences in a paragraph. In contrast, Xiong et al. (2019) and Voita et al. (2019) develop multiple-pass decoding methods inspired by the Deliberation Network (Xia et al., 2017) to address coherence issues, i.e., deixis and ellipsis, in paragraphs. They first translate the source sentences in a first pass and then correct the translations to improve coherence in a second pass. Mansimov et al. (2020) introduce a self-training technique, similar to domain self-adaptation, to develop a document-level NMT system. Meanwhile, various methods aim to encapsulate contextual information, e.g., hierarchical attention (Maruf et al., 2019), a multiple-attention mechanism (Zhang et al., 2020; Bao et al., 2021), and a recurrent memory unit (Feng et al., 2022).7 In a data augmentation approach, Bao et al. (2023) diversify the training data for the target-side language, rather than using only a single human translation for each source document.
Recently, Wang et al. (2023) have shown that state-of-the-art Large Language Models (LLMs), i.e., GPT-4 (OpenAI et al., 2024), outperform traditional translation models in context-aware MT. In other approaches, Wu et al. (2024) and Li et al. (2024) have developed effective fine-tuning and translation methods for lightweight LLMs; however, the efficacy of NMT models can exceed that of lightweight LLMs, varying by language pair.
7 Conclusion
This study presents a context-aware MT model that explains its translation output by predicting coreference clusters on the source side. The model comprises two sub-models, a translation sub-model and a coreference resolution sub-model, with no modifications to the translation model. The coreference resolution sub-model predicts coreference clusters by fusing the representations from both the encoder and the decoder to explicitly capture relations in the two languages. Under the same settings of the En-Ru, En-De, and multilingual datasets, and through analyses of the coreference sub-model's contribution, the impact of context and corpus size, and the type of information utilized in the sub-model, our proposed method has proven effective in enhancing translation quality.
Limitations
In this study, the hidden dimension size in the coreference resolution sub-model is smaller than in typical state-of-the-art systems, i.e., 512 vs. 2048, potentially limiting its accuracy and negatively impacting translation quality. Additionally, this study requires tuning a hyperparameter that combines the coreference resolution sub-model and the translation model to achieve satisfactory results.
Acknowledgments
The authors are grateful to the anonymous reviewers and the action editor, who provided many insightful comments that improved the paper. This work was supported by JSPS KAKENHI grant number JP21H05054.
Notes
COMET-20 model (wmt20-COMET-da).
Feng et al. (2022) provided source code without instructions. We tried to reuse and reimplement their method; however, we could not reproduce their results despite our efforts, and they did not reply to our emails asking for training details. We therefore decided not to include their results in Table 3.
References
Author notes
Action Editor: Emily Pitler