Abstract
Word-level AutoCompletion (WLAC) is a rewarding yet challenging task in Computer-aided Translation. Existing work addresses this task through a classification model based on a neural network that maps the hidden vector of the input context into its corresponding label (i.e., the candidate target word is treated as a label). Since the context hidden vector itself does not take the label into account and is projected to the label through a linear classifier, the model cannot sufficiently leverage valuable information from the source sentence, as verified in our experiments, which eventually hinders its overall performance. To alleviate this issue, this work proposes an energy-based model for WLAC, which enables the context hidden vector to capture crucial information from the source sentence. Unfortunately, training and inference suffer from efficiency and effectiveness challenges; we therefore employ three simple yet effective strategies to put our model into practice. Experiments on four standard benchmarks demonstrate that our reranking-based approach achieves substantial improvements (about 6.07%) over the previous state-of-the-art model. Further analyses show that each strategy of our approach contributes to the final performance.1
1 Introduction
Computer-aided Translation (CAT) (Barrachina et al., 2009; Santy et al., 2019; Huang et al., 2021), which leverages machine translation systems (Bahdanau et al., 2015; Vaswani et al., 2017) to improve the efficiency of the human translation process, has seen increasing interest in recent years. In this work, we study a crucial yet challenging task in CAT: Word-Level AutoCompletion (WLAC) (Li et al., 2021), which aims at yielding word-level suggestions based on context pieces provided by human translators (Figure 1(a)).
(a) Illustration of the WLAC task in De⇒En. Suppose that a user has input a source sentence x, partial translations (cl, cr) and is now typing some characters (s). A well-trained WLAC model is expected to suggest “disease” to complete s. The expected translation for x is “And disease is the common enemy of these desperate people.” (b) Attention weights from “[MASK]” to words in x of the baseline method. (c) Attention weights from “disease” to words in x of our energy-based model. (Color intensity reflects the strength of attention weights.)
Previous research includes statistical methods (Huang et al., 2015) and neural methods (Santy et al., 2019; Li et al., 2021). With the help of word alignment toolkits (Och and Ney, 2003; Dyer et al., 2013), statistical approaches build a translation table and use it to predict the target word. More recently, Li et al. (2021) use a Transformer-based classification model, which first encodes the input context into a hidden vector and then maps the hidden vector to the candidate target word through a linear classifier. This strong baseline method achieves state-of-the-art (SOTA) performance.
In the aforementioned classification paradigm, the hidden vector of the input context inherently does not take the candidate target word into consideration. As a result, it may not effectively leverage valuable information carried by the candidate target word when occurring in the input context, as shown in Figure 1(b). Specifically, given the input context and the human typed character “d”, the user may intend to type “disease” (“Krankheit” in German). However, visualizing the attention weights shows that the baseline method captures more information from “gemeinsame” and “verzweifelten” than from the most informative word “Krankheit” on the source side, which may lead the model to underestimate the score of the ground-truth word “disease” and thereby to an incorrect prediction.
To alleviate the above issue, we formalize the WLAC task with an energy-based model (Ranzato et al., 2006; LeCun et al., 2006) based on Transformer, where the hidden vector is defined on top of both the candidate target word and the input context through a deep energy function. Furthermore, with the help of deep neural networks, the energy-based function is expected to capture sufficient information for each candidate target word through the attention mechanism. In this way, the energy function is able to capture informative context (i.e., “Krankheit”) to evaluate the target word (i.e., “disease”), and thereby the score from the energy-based model is more reliable, as shown in Figure 1(c).
Unfortunately, training and inference with the energy-based model suffer from efficiency and effectiveness challenges due to the normalization term in the model. To alleviate the effect of these barriers, we systematically incorporate three simple yet effective strategies inspired by previous studies: (1) a negative sampling method for efficient training (Ma and Collins, 2018; Li et al., 2019a; Xu et al., 2022), (2) a reranking paradigm as an approximate proxy for efficient inference (Shen et al., 2004; Nogueira and Cho, 2019; Bhattacharyya et al., 2021), and (3) a pre-training method for effective training (Lee et al., 2021a). Experiments on four standard benchmarks demonstrate that the energy-based model is indeed better at capturing informative signals for the prediction of a candidate target word and thereby yields substantial improvements over strong baselines.
To sum up, our contribution is three-fold:
We point out that the previous SOTA model for the WLAC task suffers from an issue, i.e., it cannot sufficiently leverage the valuable information from the source sentence for word prediction.
We propose an energy-based model to alleviate this issue and we employ three simple yet effective strategies to put it into practice.
We comprehensively evaluate our approach on four benchmarks, and our approach achieves substantial improvements (about 6.07%) over the previous SOTA model.
2 Preliminary
In this section, we review the setting of the WLAC task and introduce the state-of-the-art baseline method, which will be reused in Section 3.
2.1 WLAC Task
Notations
Let x = (x1, x2,…, xT) be a source sentence, s = (s1, s2,…, sk) be a sequence of human typed characters and c = (cl, cr) be the translation context, where cl = (cl,1, cl,2,…, cl,m) and cr = (cr,1, cr,2,…, cr,n). cl and cr are on the left- and right-hand side of s, respectively. Figure 1(a) illustrates examples of x, cl, cr, and s.
Task Definition
Given the input tuple (x, c, s), the WLAC task aims at predicting the target word w, which starts with s and is the most appropriate to be placed between cl and cr (Li et al., 2021). In the partial translation consisting of cl, w, and cr, w need not be adjacent to cl,m and cr,1. Figure 1(a) gives an illustrative example. To be more general in real-world scenarios, the WLAC task further assumes that cl and cr can be empty, which leads to the following four translation context types:
Zero-context: both cl and cr are empty;
Prefix: cr is empty;
Suffix: cl is empty;
Bi-context: both cl and cr are not empty.
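The four context types above depend only on which side of the context is empty. A minimal sketch of that case analysis follows; the function name `context_type` and the representation of cl and cr as token lists are assumptions made for illustration, not part of the task definition.

```python
from typing import Sequence


def context_type(c_left: Sequence[str], c_right: Sequence[str]) -> str:
    """Name the translation context type from the emptiness of c_l and c_r."""
    if not c_left and not c_right:
        return "zero-context"  # both sides empty
    if not c_right:
        return "prefix"        # only the left context is given
    if not c_left:
        return "suffix"        # only the right context is given
    return "bi-context"        # both sides are given
```

For example, `context_type(["And"], [])` returns `"prefix"`, which also covers the prefix-decoding scenario when the left context is adjacent to w.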
It is noteworthy that context types described above are general and encompass context of several conventional translation scenarios, such as prefix-decoding for left-to-right interactive machine translation (IMT) (Knowles and Koehn, 2016) and post-editing (Lee et al., 2021b; Yang et al., 2022). To elaborate, in prefix-decoding, the context falls into the special case of prefix, where cr is empty and cl is consecutive to w. In post-editing, the context corresponds to the special case of bi-context, where both cl and cr are consecutive to w.
2.2 Baseline Method
Li et al. (2021) cast WLAC as a word prediction task. Generally, they decompose the WLAC task into two steps: (1) Model the distribution of the target word w using x and c via a Word Prediction Model (WPM); (2) Predict the most appropriate word which starts with s according to the conditional distribution. Their method achieves state-of-the-art performance.
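The second step of this decomposition can be sketched in a few lines. The sketch below assumes the word distribution from step (1) is already available as a dictionary from words to probabilities; the function name `predict_with_prefix` and the toy probabilities are hypothetical, and the prefix check uses plain string matching, which is an assumption about how the constraint s is applied.

```python
from typing import Dict, Optional


def predict_with_prefix(word_probs: Dict[str, float], typed: str) -> Optional[str]:
    """Return the most probable word that starts with the typed characters."""
    candidates = {w: p for w, p in word_probs.items() if w.startswith(typed)}
    if not candidates:
        return None  # no vocabulary word matches the typed characters
    return max(candidates, key=lambda w: candidates[w])


# Toy distribution standing in for the output of a word prediction model.
toy_probs = {"enemy": 0.30, "danger": 0.25, "disease": 0.20, "people": 0.15, "death": 0.10}
```

With these toy numbers, typing "d" selects "danger" while typing "di" selects "disease": the typed characters only filter the distribution, so a ground-truth word that the model scores too low can still lose to another word with the same prefix.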
The comparison between the network architectures for the baseline method WPM (a) and the energy-based model (b). In the baseline model, h[MASK] does not capture the information from “disease” whereas h[disease] does in the energy-based model. Note that “Target Encoder” is a variant of the Transformer decoder which can capture bidirectional information on the target side.
3 Energy-based Model
3.1 Motivation
As shown in Equation (2) in Section 2.2, the baseline WPM essentially maps the hidden vector of the input context (i.e., h[MASK]) into the candidate target word to predict the most appropriate target word for [MASK]. Furthermore, according to the model architecture of the baseline WPM, the context hidden vector h[MASK] does not take the candidate target word into consideration (Liu et al., 2016; Li et al., 2018). Therefore, it might be difficult for h[MASK] to make full use of the information from the source side to accurately predict the ground-truth target word. Intuitively, this issue for the baseline WPM in Equation (1) can be illustrated by the example in Figure 1(b), where we use attention weights to visualize the source words that are used most in h[MASK].2 From this figure, we see that h[MASK] uses more information from “gemeinsame” and “verzweifelten” than from “Krankheit”. Therefore, such a model may underestimate the score for the ground-truth word “disease”, which aligns with “Krankheit” on the source side. Consequently, the baseline WPM may not successfully predict the ground-truth word, leading to sub-optimal performance.
In response to the above issue, this paper proposes an energy-based model in which the hidden vector is defined on top of both the candidate target word and the input context through an energy function. Our intuition is that with the help of deep neural networks (e.g., attention networks), the energy function is expected to capture more valuable information from the source sentence, which makes the model score for w more reliable.
3.2 Model Definition
The energy-based model in Equation (3) is very general, because the energy function S(w, x, c) can be any function. For example, as a special case, if we set S(w, x, c) = Pb(w|x, c), the energy-based model is then reduced to Equation (1) because the normalization term is 1. Since this paper aims to alleviate the insufficient usage of source sentence information for Pb, it seeks another definition of the energy function to define the hidden vector on top of both the candidate target word w and the input context (x, c).
Theoretically, there are many ways to define the energy function S(w, x, c). In practice, we define S(w, x, c) with a model architecture very similar to that of Pb, with minimal modifications and almost the same number of parameters. As a result, any improvement derived from the energy-based model can be attributed mainly to defining the hidden vector on top of both the candidate target word w and the input context (x, c), rather than to a more complex model architecture for S(w, x, c).
We believe that the energy function S can adequately exploit contextual information from (x, c). This belief is exemplified in Figure 1(c).3 In this figure, the visualized attention weights to source words show that the energy function S is able to capture more information from “Krankheit” to evaluate the target word “disease”. Therefore, S(disease, x, c) is more reliable than the baseline score Pb(disease|x, c), which makes inadequate use of the signal from “Krankheit”, as shown in Figure 1(b).
3.3 Challenges
However, it is far from trivial to make the energy-based model achieve the effect as shown in Figure 1(c) and further deliver excellent performance on the WLAC task due to the following efficiency and effectiveness challenges.
Efficiency
The first challenge is efficiency in both training and inference. During training, maximizing the log-likelihood for Equation (3) requires computing the normalization term. During inference, the model needs to enumerate all candidate words in the vocabulary. Unfortunately, the energy function S sacrifices parallel computation over all candidate words: each candidate target word w has to be fed to the network independently. Since the vocabulary is too large, such exhaustive computation is infeasible in practice. Consequently, both training and inference are challenging for the energy-based model.
Effectiveness
Second, in our preliminary experiments, optimizing the energy-based model from scratch does not work well, and its final performance is significantly worse than the baseline Pb. One possible reason is that the energy-based model is more difficult to train. Training it involves an approximate method that shrinks the subset used for the normalization term, and this may induce a risk that informative negative examples are excluded from the shrunk subset (Ma and Collins, 2018; Xu et al., 2022). Therefore, it is easy to get trapped in local optima when training the energy-based model from scratch.
4 Training and Inference
To relieve the aforementioned challenges, we systematically employ three simple yet effective methods inspired by previous studies. First, we employ negative sampling to address the normalization computation during the training (Ma and Collins, 2018; Li et al., 2019a; Xu et al., 2022); similarly, during the inference, we adopt a reranking paradigm, where the energy-based model is used as a reranker over a small subset of candidates (Shen et al., 2004; Nogueira and Cho, 2019; Bhattacharyya et al., 2021). Moreover, we harness a conditional mask bilingual language modeling pre-training strategy for parameter initialization (Lee et al., 2021a).
4.1 Efficient Training and Inference
Efficient Training via Negative Sampling
In this paper, we try different settings for the negative sampling distribution. In the first setting, it is the uniform distribution over the vocabulary. Although sampling from this distribution is efficient and introduces no extra computation, it cannot ensure that hard negatives are sampled with high probability. Thus, it is not a promising way to speed up convergence in our experiments. Hence, in the second setting, the sampling distribution is instantiated by the baseline model Pb. Furthermore, according to our empirical results, better performance is achieved by replacing the sampling operation in Equation (4) with the top-K operation over the distribution Pb(w|x, c).
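The top-K variant of negative sampling can be illustrated with a small sketch. It assumes a softmax-style normalization restricted to the ground-truth word plus its negatives, which is a common form for sampled objectives and not necessarily identical to the paper's Equation (4); the function names and the `energy` callable are hypothetical stand-ins for the trained networks.

```python
import math
from typing import Callable, Dict, List


def top_k_negatives(baseline_probs: Dict[str, float], gold: str, k: int) -> List[str]:
    """Pick the k words the baseline ranks highest, excluding the ground truth."""
    ranked = sorted(baseline_probs, key=lambda w: baseline_probs[w], reverse=True)
    return [w for w in ranked if w != gold][:k]


def sampled_nll(energy: Callable[[str], float], gold: str, negatives: List[str]) -> float:
    """Negative log-likelihood with the normalizer restricted to gold + negatives.

    Assumption: scores are normalized with a softmax over the reduced set.
    """
    scores = [energy(gold)] + [energy(w) for w in negatives]
    m = max(scores)  # subtract the max for a numerically stable log-sum-exp
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[0]
```

Because the negatives are the words the baseline itself finds most plausible, they are more likely to be hard negatives than words drawn uniformly from the vocabulary, while the normalizer only costs k + 1 evaluations of the energy function.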
Efficient Inference via Reranking
As described before, due to the definition of the energy function S(w, x, c), it is too costly to evaluate S(w, x, c) for all w. Thus, it is infeasible to exactly predict the best w such that S(w, x, c) is maximal. Similar to the top-K operation in the training stage, we adopt it in the inference stage as an approximation. Specifically, the inference process by the energy-based model includes the following two steps:
- Obtain the top-K subset, denoted by Ω(s, K), according to Pb(w|x, c), where each element also satisfies the constraint that it starts with s.
- Output the target word in Ω(s, K) that maximizes the energy function:
$$\hat{w} = \arg\max_{w \in \Omega(s, K)} S(w, x, c) \tag{5}$$
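The two inference steps above can be sketched as follows. This is an illustration under stated assumptions: the baseline distribution is given as a dictionary, `energy` is any callable that scores a candidate word (standing in for S(w, x, c) with x and c fixed), the prefix constraint is checked by plain string matching, and all names and toy numbers are hypothetical.

```python
from typing import Callable, Dict, List, Optional


def top_k_candidates(baseline_probs: Dict[str, float], typed: str, k: int) -> List[str]:
    """Step 1: the k most probable baseline words that start with `typed`."""
    matching = [w for w in baseline_probs if w.startswith(typed)]
    matching.sort(key=lambda w: baseline_probs[w], reverse=True)
    return matching[:k]


def rerank(baseline_probs: Dict[str, float], energy: Callable[[str], float],
           typed: str, k: int = 8) -> Optional[str]:
    """Step 2: return the candidate with the highest energy score."""
    candidates = top_k_candidates(baseline_probs, typed, k)
    if not candidates:
        return None  # nothing in the vocabulary matches the typed characters
    return max(candidates, key=energy)


# Toy scores: the baseline prefers "danger", the energy function prefers "disease".
toy_probs = {"danger": 0.4, "disease": 0.3, "death": 0.2, "enemy": 0.1}
toy_energy = {"danger": 1.0, "disease": 2.0, "death": 0.5, "enemy": 3.0}
print(rerank(toy_probs, toy_energy.__getitem__, "d", k=2))  # prints: disease
```

Only k candidates are scored by the energy function, so the cost of inference no longer grows with the vocabulary size. The price is that a ground-truth word outside the top-K subset can never be recovered by the reranker.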
4.2 Weight Initialization via Pre-training
Recently, pre-trained language models have achieved exceptional success in numerous natural language processing tasks (Devlin et al., 2019; Lewis et al., 2020; Ouyang et al., 2022). One of their advantages is that they can learn general and contextual representations that boost downstream tasks (Li et al., 2022, 2023a; Shi et al., 2023). Inspired by this, we propose to use our limited supervised bilingual data to conduct a small-scale pre-training for the energy-based model to yield better weight initialization.
5 Experiments
In this section, we first describe the experimental setup. Then we report the main results and analyze the proposed approach.
5.1 Experimental Setup
Datasets
We experiment on four language pairs: Zh⇒En, En⇒Zh, De⇒En and En⇒De. For training on Zh⇒En and En⇒Zh, we use the training set from the LDC corpus,4 which consists of 1.25M sentence pairs. For training on De⇒En and En⇒De, we use the preprocessed WMT14 dataset by Stanford,5 which consists of 4.5M sentence pairs. We use the standard validation and test sets released by Li et al. (2021).6 Specifically, for Zh⇒En and En⇒Zh, they construct validation set from NIST02 and test set from NIST05 and NIST06. For De⇒En and En⇒De, they extract validation set from newstest13 and test set from newstest14.
In order to construct simulated training data, we follow the same strategy as Li et al. (2021) to sample target words, human typed characters and translation context, which aims at avoiding sampling trivial instances. Statistics of the average length of target words and human typed characters on the validation sets are shown in Table 1. As we can see, in general, target words are long and human typed characters are short, which poses a challenge for the WLAC task. In addition, we conduct a frequency analysis of each word in the training set across the four language pairs. Words are then categorized into ten intervals based on their frequency. Finally, we calculate the proportion of target words in the validation sets that fall into each frequency interval. The result, presented in Figure 3, indicates a non-uniform distribution of target words across different frequency intervals. This data composition basically reflects demands encountered in real-world scenarios, where non-high frequency words are more challenging for WLAC models.
Statistics of average length of target words and human typed characters on Zh⇔En and De⇔En validation sets. T.W. and H.T.C. are short for target words and human typed characters, respectively.
Statistic | Zh ⇒ En | En ⇒ Zh | De ⇒ En | En ⇒ De |
---|---|---|---|---|
T.W. | 6.42 | 2.22 | 6.22 | 7.19 |
H.T.C. | 2.00 | 2.05 | 1.95 | 2.20 |
The proportion of different frequency intervals on Zh⇔En and De⇔En validation datasets. Interval 1 and Interval 10 denote the most frequent interval and the most infrequent interval, respectively.
Baselines
We compare our model with the following baseline models:
TransTable: A statistical method inspired by Huang et al. (2015). They create a word-level translation table with a word alignment toolkit.7 During the inference stage, they use the translation table to obtain translations of all source words and filter out invalid candidate words through human typed characters. Ultimately, they pick the candidate word with the highest frequency as the prediction.
Trans-PE: A Transformer-based baseline inspired by Langlais et al. (2000) and Santy et al. (2019). They first train a vanilla Transformer on the training set. At test time, they only feed the left translation context to the Transformer decoder. Then they conduct a next-word prediction task with human typed characters as hard constraints to get the predicted word.
Trans-NPE: The only difference between this method and Trans-PE is that there is no position encoding layer in the decoder of Trans-NPE. They apply average pooling to the representations of all translation context words and then use the pooled representation to predict the target word.
Pb: The word prediction model defined in Equation (1), which is the state-of-the-art model of the WLAC task.
Trans-BPE: Inspired by De Cao et al. (2021); Yang et al. (2022), we also implement a new Transformer-based baseline over subwords. Specifically, we apply BPE to segment words into subwords. During the inference stage, we adopt Prefix-Constrained Beam Search (De Cao et al., 2021) to generate outputs which start with human typed characters. This model is expected to be capable of defining the hidden vector on top of previously generated subwords and the input context to predict the next subword.
Implementation Details
We implement our energy-based model on top of the Transformer-Base architecture (Vaswani et al., 2017) implemented in the Fairseq toolkit (Ott et al., 2019).8 The source encoder is a stack of 6 Transformer encoder blocks. The target encoder is also composed of 6 blocks, each of which is a Transformer encoder block with an additional cross-attention layer between the multi-head self-attention layer and the feed-forward layer. The vocabulary size is 60K for Chinese, 50K for German, and 50K for English. As for the implementation of Trans-BPE, we adopt the Transformer-Base architecture and make adjustments to the input of the Transformer encoder. Specifically, we feed the concatenation of the source context, target context, and placeholder [MASK] to the Transformer encoder, and adopt segment embeddings to distinguish different languages, as in Yang et al. (2022). The vocabulary size is 32K for both Zh⇔En and De⇔En. For a fair comparison, we also re-implement Pb with the same hyperparameter settings as the energy-based model.
For the above models, we set dmodel = 512, dhidden = 2048, nhead = 8 and pdropout = 0.1. The learning rate is set as 0.0005, and the warmup step is set as 4,000 steps. All models are trained with 4096 tokens per batch for a maximum of 50,000 steps with the Adam optimizer (Kingma and Ba, 2015) on 8 NVIDIA V100 GPUs. We update the model parameters after accumulating 2 gradients for Trans-BPE and 1 gradient for Pb and Ours. Models are selected with the best accuracy on the validation set. We repeat the main experiment 5 times by using different random seeds.
5.2 Main Results
Evaluation on Word Prediction by ACC
Table 2 lists the main results on four language pairs. From the table, we can make three observations. First, the statistical and intuitive Transformer-based methods (#1-3) perform poorly on all language pairs. We speculate that this is because these approaches cannot make full use of the information from the input context (e.g., the source sentence). Second, Trans-BPE outperforms Pb on average accuracy. This could be attributed to Trans-BPE leveraging more valuable source sentence information than Pb, which we will elaborate on in Section 5.4. Third, our energy-based model (#7) improves over the previous SOTA performance by an average of 6.07 accuracy points across all language pairs, which demonstrates its effectiveness. Furthermore, in Table 3 and Table 4, we report the detailed results of different systems on the four translation context types on the Zh⇔En and De⇔En validation sets. We find that our energy-based model improves performance on every translation context type except the De⇒En prefix context, which results in the overall improvements in Table 2.
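The text does not spell out how the 6.07-point average is computed. One reading that reproduces the figure from the Table 2 numbers is sketched below; it assumes the average is taken over the six test sets only (NIST05 and NIST06 for Zh⇔En, NT14 for De⇔En) and that each column is compared against the best previous system in that column. Both choices are assumptions, and the variable names are illustrative. Systems #1-3 are omitted because they score lower than #4-6 in every column.

```python
# Columns: Zh-En NIST05, Zh-En NIST06, En-Zh NIST05, En-Zh NIST06, De-En NT14, En-De NT14.
previous = {
    "Pb (reported)":       [55.54, 55.85, 53.64, 54.25, 56.75, 52.68],
    "Pb (re-implemented)": [55.52, 56.57, 53.89, 54.24, 56.99, 53.80],
    "Trans-BPE":           [57.29, 57.80, 53.82, 55.93, 59.95, 54.80],
}
ours = [65.61, 65.44, 60.43, 61.25, 63.13, 60.24]

# Best previous accuracy per test set, then the per-set gain of the energy-based model.
best_previous = [max(column) for column in zip(*previous.values())]
gains = [o - b for o, b in zip(ours, best_previous)]
average_gain = sum(gains) / len(gains)

print(round(average_gain, 2))  # prints: 6.07
```

Under other readings the number differs (for instance, averaging over all eight columns of Table 2 gives a smaller gain), so this should be taken as a plausible reconstruction rather than the authors' stated procedure.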
The main results of different systems on Zh⇔En and De⇔En datasets. The results in this table are the average accuracy across four translation context types (i.e., zero-context, prefix, suffix and bi-context). ‘†’: results are reported in previous work. ‘*’: results are implemented by ourselves, which is the average of 5 runs with different random seeds. The best and the second-best results are in bold and underlined fonts, respectively.
# | Systems | Zh ⇒ En NIST05 | Zh ⇒ En NIST06 | En ⇒ Zh NIST05 | En ⇒ Zh NIST06 | De ⇒ En NT13 | De ⇒ En NT14 | En ⇒ De NT13 | En ⇒ De NT14 |
---|---|---|---|---|---|---|---|---|---|
1 | TransTable† | 41.40 | 39.78 | 28.00 | 26.99 | 37.43 | 36.64 | 32.99 | 31.12 |
2 | Trans-PE† | 34.51 | 35.50 | 32.23 | 34.88 | 34.45 | 33.02 | 31.51 | 30.65 |
3 | Trans-NPE† | 35.97 | 36.78 | 34.31 | 36.19 | 36.69 | 36.01 | 33.25 | 31.30 |
4 | Pb† | 55.54 | 55.85 | 53.64 | 54.25 | 57.84 | 56.75 | 56.91 | 52.68 |
5 | Pb* | 55.52 | 56.57 | 53.89 | 54.24 | 59.11 | 56.99 | 56.89 | 53.80 |
6 | Trans-BPE* | 57.29 | 57.80 | 53.82 | 55.93 | 61.44 | 59.95 | 55.41 | 54.80 |
7 | Ours* | 65.61 | 65.44 | 60.43 | 61.25 | 64.62 | 63.13 | 62.23 | 60.24 |
The detailed results for each translation context type of different systems on Zh⇔En validation set.
# | Systems | Zh ⇒ En Prefix | Zh ⇒ En Suffix | Zh ⇒ En Zero. | Zh ⇒ En Bi. | Zh ⇒ En Overall | En ⇒ Zh Prefix | En ⇒ Zh Suffix | En ⇒ Zh Zero. | En ⇒ Zh Bi. | En ⇒ Zh Overall |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | TransTable† | 41.91 | 44.99 | 44.19 | 43.28 | 43.59 | 29.73 | 32.80 | 29.73 | 29.61 | 30.46 |
2 | Trans-PE† | 29.84 | 38.61 | 26.08 | 48.06 | 35.64 | 30.64 | 34.97 | 22.67 | 38.95 | 31.80 |
3 | Trans-NPE† | 37.36 | 40.43 | 29.50 | 44.42 | 37.92 | 36.10 | 43.05 | 32.00 | 45.79 | 39.23 |
4 | Pb† | 59.91 | 60.71 | 55.35 | 62.30 | 59.56 | 61.39 | 61.73 | 53.87 | 63.78 | 60.19 |
5 | Pb* | 58.59 | 63.34 | 54.35 | 68.21 | 61.12 | 60.47 | 62.94 | 53.40 | 67.40 | 61.05 |
6 | Trans-BPE* | 60.14 | 64.03 | 55.24 | 69.84 | 62.31 | 61.89 | 62.54 | 55.02 | 69.26 | 62.18 |
7 | Ours* | 68.13 | 70.32 | 66.45 | 75.56 | 70.12 | 68.63 | 69.16 | 59.91 | 71.80 | 67.37 |
The detailed results for each translation context type of different systems on De⇔En validation set.
# | Systems | De ⇒ En Prefix | De ⇒ En Suffix | De ⇒ En Zero. | De ⇒ En Bi. | De ⇒ En Overall | En ⇒ De Prefix | En ⇒ De Suffix | En ⇒ De Zero. | En ⇒ De Bi. | En ⇒ De Overall |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | Pb | 57.52 | 61.59 | 51.01 | 66.32 | 59.11 | 54.63 | 60.83 | 48.51 | 63.58 | 56.89 |
2 | Trans-BPE | 61.88 | 65.35 | 50.68 | 67.84 | 61.44 | 52.25 | 60.94 | 46.60 | 61.85 | 55.41 |
3 | Ours | 61.47 | 68.01 | 58.47 | 70.54 | 64.62 | 57.17 | 67.01 | 56.45 | 68.28 | 62.23 |
Human Evaluation
It is also crucial to assess the actual improvement in effectiveness of our approach via human evaluation. However, performing comprehensive human evaluations can be resource-intensive in terms of labor. As a compromise, we randomly sample 400 examples from the original Zh⇒En and En⇒Zh NIST05 test sets, with 100 instances for each translation context type. We then collect predictions from three models: Pb, Trans-BPE, and Ours. Subsequently, we enlist two professional evaluators to assess the appropriateness of the predictions of these models. The human evaluators are presented with the input context, the human typed characters, and each prediction. The predictions, originating from different models, are anonymized to the evaluators. The evaluators are asked to assign a binary score to each prediction, where ‘1’ indicates appropriateness and ‘0’ signifies inappropriateness. Results of the human evaluation are presented in Table 5. The Cohen’s kappa between the two evaluators is 0.92, which is a relatively high agreement. Table 5 demonstrates that our energy-based model retains an advantage over previous methods under human evaluation. Moreover, one detail worth noting is that, compared to the results in Table 2, all models exhibit an improvement in performance when evaluated manually. This can be attributed to the fact that the accuracy metric only counts a prediction as correct when it matches the single reference word, while other predictions may also be valid. To ensure consistency with prior research, we utilize accuracy as the evaluation metric in the following sections.
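For readers unfamiliar with the agreement statistic, Cohen's kappa for two raters with binary labels can be computed as in the sketch below. The function name and the toy ratings are illustrative only; the paper's actual ratings are not available here, and the handling of the degenerate case is a convention chosen for this sketch.

```python
from typing import Sequence


def cohens_kappa(rater_a: Sequence[int], rater_b: Sequence[int]) -> float:
    """Cohen's kappa for two raters assigning binary (0/1) scores."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items on which the raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal rate of assigning 1.
    p_a = sum(rater_a) / n
    p_b = sum(rater_b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:
        # Both raters give one constant label; kappa is formally undefined,
        # and returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)
```

For instance, ratings `[1, 1, 0, 0]` and `[1, 1, 0, 1]` agree on three of four items, chance agreement is 0.5, and kappa is 0.5. A value of 0.92, as reported above, means the two evaluators agree far more often than chance alone would predict.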
The detailed results of different systems under the Zh⇒En and En⇒Zh human evaluation setting. The results in the table represent the average rating scores from two evaluators.
# | Systems | Zh ⇒ En Prefix | Zh ⇒ En Suffix | Zh ⇒ En Zero. | Zh ⇒ En Bi. | Zh ⇒ En Overall | En ⇒ Zh Prefix | En ⇒ Zh Suffix | En ⇒ Zh Zero. | En ⇒ Zh Bi. | En ⇒ Zh Overall |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | Pb | 81.50 | 82.50 | 87.00 | 83.00 | 83.50 | 79.50 | 84.00 | 86.50 | 83.50 | 83.38 |
2 | Trans-BPE | 80.00 | 84.00 | 86.50 | 94.00 | 86.13 | 86.00 | 84.50 | 89.50 | 80.00 | 85.00 |
3 | Ours | 90.50 | 87.00 | 88.00 | 94.50 | 90.00 | 86.50 | 87.00 | 93.50 | 88.50 | 88.88 |
5.3 Ablation Studies
Negative Sampling for Training
As we state in Section 3, negative sampling in the training stage can affect the performance of the energy-based model. We consider two sampling distributions (the uniform distribution and the distribution of Pb) and three negative sampling strategies, i.e., random sampling, top-p sampling and top-K sampling. We compare them on the Zh⇒En dataset. During the inference stage, we use Pb to recall the top-8 predicted words as candidate target words for the models trained with different negative sampling techniques.
We report the results in Table 7. We observe that random sampling from the uniform distribution is not as effective as the three sampling configurations based on Pb. We conjecture that negative samples drawn at random from the uniform distribution are too trivial and rarely include hard negatives, which may hinder the performance of the energy-based model. In contrast, sampling according to Pb (i.e., the other three strategies) yields hard negatives and facilitates the training of the energy-based model.
K-best Size in Inference
We further analyze the impact of the candidate word set size during inference with the energy-based model. Figure 4 shows that, as K increases, the accuracy improves rapidly from K = 1 to K = 4 and starts to saturate after K = 4. The recall of the ground-truth word follows the same trend as accuracy: it first improves sharply, then increases slowly and reaches a relatively high value. To balance efficiency and effectiveness, we therefore use K = 8 as the candidate word set size in all experiments during inference.
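The relation between K, recall, and accuracy can be sketched as follows. This is a toy illustration under stated assumptions: `baseline_topk` and `energy` are hypothetical stand-ins for Pb and the energy-based model, and a lower energy is assumed to mean a better candidate. Because reranking can only select a word the baseline has recalled, accuracy is bounded above by recall at every K.

```python
def rerank(candidates, energy):
    """Return the candidate with the lowest energy (ties go to the earlier one)."""
    return min(candidates, key=energy)

def evaluate(examples, baseline_topk, energy, k):
    """Return (accuracy after reranking, recall of the gold word in the top-k set)."""
    hits = recalled = 0
    for context, gold in examples:
        cands = baseline_topk(context, k)
        recalled += gold in cands
        hits += rerank(cands, lambda w: energy(context, w)) == gold
    n = len(examples)
    return hits / n, recalled / n

# Hypothetical baseline rankings and energies for two contexts.
toy_ranked = {
    "ctx1": ["suffice", "suffer", "sufficient"],
    "ctx2": ["social", "so", "services"],
}
toy_energy = {("ctx1", "suffer"): -2.0, ("ctx2", "services"): -1.5}
toy_examples = [("ctx1", "suffer"), ("ctx2", "services")]

def toy_topk(context, k):
    return toy_ranked[context][:k]

def toy_energy_fn(context, word):
    # Unlisted (context, word) pairs get a neutral energy of 0.0.
    return toy_energy.get((context, word), 0.0)

for k in (1, 2, 3):
    acc, rec = evaluate(toy_examples, toy_topk, toy_energy_fn, k)
    print(f"K={k}: accuracy={acc:.2f}, recall={rec:.2f}")
```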
Accuracy of our energy-based model and recall of ground-truth word with different K on Zh⇒En NIST02 dataset (a) and De⇒En NT13 dataset (b). Experiments are conducted in the bi-context scenario.
Weight Initialization
Our energy-based model is pre-trained with a CMBLM pre-training strategy. Therefore, its improvements might come from two sources: 1) the energy-based model itself and 2) the better initialization weights and representations learned from the CMBLM pre-training task. Hence, we perform further studies to quantify the contribution of each component of our approach. To this end, we conduct two experiments: (1) we replace the CMBLM pre-training by initializing the weights from the baseline WPM Pb; and (2) we apply the CMBLM pre-training on top of Pb and compare it with the energy-based model with the CMBLM pre-training. We evaluate all these methods on the Zh⇒En and De⇒En datasets and present the results in Table 6.
Performance of weight initialization on Zh⇒En and De⇒En datasets. The results in this table are the average accuracy across four translation context types.
Systems | Zh⇒En NIST05 Acc. | Zh⇒En NIST05 △ | Zh⇒En NIST06 Acc. | Zh⇒En NIST06 △ | De⇒En NT13 Acc. | De⇒En NT13 △ | De⇒En NT14 Acc. | De⇒En NT14 △
---|---|---|---|---|---|---|---|---
Pb | 55.52 | − | 56.57 | − | 59.11 | − | 56.99 | −
Pb w/ CMBLM | 59.45 | +3.93 | 60.67 | +4.10 | 60.83 | +1.72 | 59.33 | +2.34
Ours w/ Pb Init | 58.09 | +2.57 | 58.54 | +1.97 | 60.15 | +1.04 | 58.03 | +1.04
Ours w/ CMBLM | 65.61 | +10.09 | 65.44 | +8.87 | 64.62 | +5.51 | 63.13 | +6.14
The results of different negative sampling strategies on Zh⇒En. The results in this table are the average accuracy across four translation context types.
Dist. | Strategy | NIST02 | NIST05 | NIST06
---|---|---|---|---
Uniform | Random | 66.71 | 62.22 | 62.92
Pb | Random | 69.10 | 64.97 | 64.47
Pb | Top-p | 69.55 | 64.84 | 64.97
Pb | Top-K | 70.12 | 65.61 | 65.44
The results in Table 6 illustrate two points. First, initializing the weights of the energy-based model with Pb is not as effective as initializing with the CMBLM pre-training strategy. Second, although both Pb and our energy-based model benefit from the CMBLM pre-training strategy, the gain for the energy-based model is much larger. These observations demonstrate that a simple pre-training method cannot activate the potential of the energy-based model, whereas the CMBLM pre-training strategy can.
5.4 Analysis
Evaluation on Prefix-Decoding and Post-Editing Settings
Although our work mainly focuses on four translation context types in the WLAC task, we also explore whether the energy-based model still improves performance in two common translation scenarios: prefix-decoding, which is widely used in left-to-right interactive machine translation, and post-editing, as stated in Section 2.1. To this end, we implement Pb, Trans-BPE, and Ours in these two scenarios with the same parameter configuration as in Section 5.1. For the construction of validation and test sets, we adopt the same simulation method as Li et al. (2021), except that the target word must be adjacent to the target context. Table 8 shows the results of Pb, Trans-BPE, and Ours in the prefix-decoding and post-editing scenarios. As we can see, Ours further improves the average accuracy across all language pairs by 3.22 points on prefix-decoding and by 2.68 points on post-editing, demonstrating the effectiveness of our energy-based model.
The main results of different systems on Zh⇔En and De⇔En datasets under prefix-decoding and post-editing settings.
Setting | # | Systems | Zh⇒En NIST05 | Zh⇒En NIST06 | En⇒Zh NIST05 | En⇒Zh NIST06 | De⇒En NT13 | De⇒En NT14 | En⇒De NT13 | En⇒De NT14
---|---|---|---|---|---|---|---|---|---|---
Prefix-Decoding | 1 | Pb | 79.57 | 78.85 | 73.45 | 74.95 | 81.41 | 79.15 | 76.09 | 73.38
Prefix-Decoding | 2 | Trans-BPE | 80.96 | 78.63 | 74.47 | 75.28 | 81.99 | 79.63 | 77.66 | 74.23
Prefix-Decoding | 3 | Ours | 83.73 | 83.21 | 77.34 | 79.10 | 84.13 | 82.60 | 78.68 | 76.73
Post-Editing | 1 | Pb | 85.30 | 86.95 | 80.11 | 80.93 | 86.79 | 83.70 | 83.86 | 79.82
Post-Editing | 2 | Trans-BPE | 85.95 | 87.53 | 81.96 | 80.73 | 87.81 | 84.84 | 85.01 | 80.93
Post-Editing | 3 | Ours | 89.74 | 90.16 | 84.09 | 84.16 | 89.85 | 87.04 | 86.99 | 83.02
Evaluation on Usage of Informative Context
As we have claimed in Section 3, our motivation is that the energy-based model is capable of capturing more informative context for word prediction, which in turn leads to better performance. In addition to the intuitive example in Figure 1(c), we design an automatic metric to verify our motivation. This metric is inspired by the word alignment error rate for the cross-attention in the Transformer (Li et al., 2019b; Garg et al., 2019). Specifically, as shown in Figure 1(c), the metric (alignment recall@n) is defined as the recall rate of the informative source word “Krankheit” by the top-n source words according to the attention scores of the Transformer architecture. For each ground-truth target word, e.g., “disease” in Figure 1(c), the informative source word is defined by the manually annotated word alignment.
We use the human-annotated alignment data on the Zh⇔En NIST05 dataset and conduct experiments in the bi-context scenario. We compare the alignment recall@n of Pb, Trans-BPE, and Ours in Figure 5. As we can see, the alignment recall@1 of Ours is higher than that of Pb by 60 points, and Ours maintains this advantage whenever n is small. Moreover, Trans-BPE also achieves better alignment recall@n than Pb. This may serve as quantitative evidence that introducing subwords or the entire candidate target word into the modeling of hidden vectors with the input context, as implemented in Trans-BPE and Ours, makes more use of informative context than Pb (De Cao et al., 2021). The results in Figure 5 also suggest that our energy-based model might be more effective in leveraging informative context than Trans-BPE.
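The metric can be sketched as follows. This is a minimal illustration, not the authors' evaluation script: the function name and input layout are assumptions, and when a target word is aligned to several source words, each aligned source position is counted separately, which is one possible reading of the definition above.

```python
def alignment_recall_at_n(attention, gold_positions, n):
    """Fraction of gold-aligned source positions found among the n most-attended ones.

    attention: one list of attention weights over source positions per example.
    gold_positions: one set of manually aligned source positions per example.
    """
    hit = total = 0
    for weights, gold in zip(attention, gold_positions):
        # Indices of the n source positions with the largest attention weight.
        top = sorted(range(len(weights)), key=lambda i: -weights[i])[:n]
        hit += len(gold & set(top))
        total += len(gold)
    return hit / total if total else 0.0

# Hypothetical attention weights for two examples with three source words each.
toy_attention = [[0.1, 0.6, 0.3], [0.5, 0.2, 0.3]]
toy_gold = [{1}, {2}]
recall_at_1 = alignment_recall_at_n(toy_attention, toy_gold, 1)
recall_at_2 = alignment_recall_at_n(toy_attention, toy_gold, 2)
```

In the toy data, the first example attends most strongly to its aligned source word while the second does not, so recall@1 covers one of the two gold positions and recall@2 covers both.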
Alignment recall@n on Zh⇔En NIST05 dataset with n ranging from 1 to 8. Experiments are conducted in the bi-context scenario.
Error Analysis
After conducting the human evaluation in Section 5.2, we proceed to inspect incorrect instances of Pb and Ours in Zh⇒En test examples.
Furthermore, we summarize incorrect instances into three distinct categories: (1) Semantic discrepancy error (Type-I): The model erroneously suggests irrelevant words. These words have no semantic relevance to the source sentence beyond starting with the same human-typed characters. (2) Repetition error (Type-II): The model suggests words that convey the semantics of the source sentence but that already appear within the target context. (3) Morphological error (Type-III): The model suggests incorrect cognates of target words.9 In the forthcoming Case Study section, we will present illustrative examples representing each of these three error categories.
In Table 9, we present quantitative results of error occurrences for Pb and Ours. In terms of the total error quantity, Ours exhibits a lower number of errors. Notably, for both methods, the most common error type is the semantic discrepancy error. Ours rectifies 25 instances (31.65%) of Type-I errors, 20 instances (68.97%) of Type-II errors, and 14 instances (70.00%) of Type-III errors that are present in Pb. Furthermore, Ours exhibits significantly fewer repetition and morphological errors. However, it is essential to acknowledge that Ours also introduces new incorrect instances of each type that are not originally observed in Pb.
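The percentages above, and the number of newly introduced errors, follow from the counts in Table 9 by simple arithmetic. The short sketch below reproduces that arithmetic; the count of new errors relies on the assumption that every Pb error not rectified by Ours remains an error of the same type for Ours, which the text does not state explicitly.

```python
# Counts taken from Table 9.
pb_errors = {"Type-I": 79, "Type-II": 29, "Type-III": 20}
ours_errors = {"Type-I": 57, "Type-II": 11, "Type-III": 9}
rectified = {"Type-I": 25, "Type-II": 20, "Type-III": 14}

# Share of Pb errors of each type that Ours rectifies, in percent.
rectification_rate = {
    t: round(100 * rectified[t] / pb_errors[t], 2) for t in pb_errors
}

# Errors of Ours that are not carried over from Pb
# (assumes unrectified Pb errors keep their type under Ours).
newly_introduced = {
    t: ours_errors[t] - (pb_errors[t] - rectified[t]) for t in pb_errors
}

for t in pb_errors:
    print(t, rectification_rate[t], newly_introduced[t])
```

Under that assumption, Ours introduces 3 new Type-I, 2 new Type-II, and 3 new Type-III errors, which is small relative to the number of errors it rectifies.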
Quantitative results of error occurrences between Pb and Ours. The numbers in parentheses represent the quantity of errors, which are initially presented in Pb and subsequently rectified by Ours. Type-I means “semantic discrepancy error”. Type-II means “repetition error”. Type-III means “morphological error”.
Systems . | Type-I . | Type-II . | Type-III . | Total . |
---|---|---|---|---|
Pb | 79 | 29 | 20 | 128 |
Ours | 57 (−25) | 11 (−20) | 9 (−14) | 77 |
Case Study
We provide this case study to better illustrate the advantages of Ours over Pb in utilizing contextual information, thereby leading to enhanced semantic information for word-level autocompletion. Figure 6 presents cases where Pb yields errors while Ours predicts correctly. Furthermore, Figure 7 illustrates their attention weights which depict the connection between the predicted word and the source words.
Three cases of Pb and Ours in Zh⇒En test set. Human typed characters are in underlined fonts.
Attention weights from the predicted word to source words of three cases in Figure 6. denotes source words aligned with the ground-truth target word.
In case 1 (Type-I), Pb suggests “suffice”, which is not consistent with the semantics of the source sentence beyond starting with the human-typed characters “suf”. In contrast, Ours succeeds in completing “suf” to “suffer”. By visualizing the attention weights in Figure 7, we find that Ours leverages more information from the valuable source context (e.g., the aligned word “饱受”). In case 2 (Type-II), Pb completes “so” to “social”, which has already been translated in the target context. By leveraging the interactions between candidate target words and the input context, Ours successfully suggests “services”. In case 3 (Type-III), Pb suggests a cognate of the target word (i.e., “problematic”), whereas Ours, using the information captured by the energy-based model, succeeds in suggesting the noun “problems”, which is more appropriate. Although our model substantially alleviates the aforementioned errors, it is not flawless. One limitation is that, during the inference stage, the effectiveness of Ours is bounded by the recall rate of the baseline.
Running Latency Comparison
Table 10 summarizes the training and inference latency of Pb, Trans-BPE, and Ours on the Zh⇒En validation set. The results indicate that the training and inference latency of Ours is higher than that of Pb (approximately 2.0 times and 1.5 times, respectively). This overhead arises because Ours must first obtain candidate words from Pb and then rerank them, which demands additional computation. Compared with the stronger auto-regressive model, Trans-BPE, Ours exhibits a lower inference latency while delivering better performance. As a result, our approach achieves a desirable balance between performance and processing speed.
Training and inference latency comparison on Zh⇒En validation set. “ms/sample” represents millisecond per sample. The evaluation of inference is based on a single NVIDIA V100 GPU, batch size is set to 1, beam size for Trans-BPE is set to 3 and K-best size for Ours is 8. The training latency of Ours does not include the training time of Pb.
Systems . | Training (hours) . | Inference (ms/sample) . |
---|---|---|
Pb | 4.19 (1.0×) | 30.01 (1.0×) |
Ours | 8.28 (2.0×) | 46.17 (1.5×) |
Trans-BPE | 4.99 (1.2×) | 56.71 (1.9×) |
5.5 Applying WLAC into Human–Computer Interactive Translation
Setup and Evaluation
As stated in the previous sections, one advantage of WLAC is that it can increase the efficiency of human input in interactive machine translation (IMT). To exemplify the usefulness of WLAC, we apply the WLAC models to IMT. Specifically, we first implement a practical IMT model following Huang et al. (2021), which is based on lexically constrained decoding (Hokamp and Liu, 2017) and thus enables flexible input from users. Then, we apply three WLAC models (Pb, Trans-BPE, and Ours) to the IMT model, leading to three IMT systems named IMT-Pb, IMT-Trans-BPE, and IMT-Ours. As a direct baseline, the IMT system without WLAC is denoted by IMT-Raw.
For efficiency evaluation in IMT, the standard metric, the number of keystrokes from a human translator (Nepveu et al., 2004; Bender et al., 2005), is used for all IMT systems. To ensure a fair comparison in efficiency, we enforce that the words input by the human are the same for all IMT systems, so that all these IMT systems yield the same translation outputs. Because IMT experiments require intensive human effort, we randomly select a subset of 200 source sentences from Zh⇒En NIST05 as x. On this subset, the standard NMT obtains 50.13 BLEU points and all IMT systems achieve 56.02 BLEU points thanks to human interactions.
Experiment Results
Table 11 presents the total and average number of keystrokes across different IMT systems. Notably, the use of WLAC systems significantly reduces the number of keystrokes in comparison to the IMT-Raw baseline without WLAC. Furthermore, our proposed IMT-Ours system requires the fewest keystrokes among all systems. This observation is reinforced in Figure 8, which depicts the distribution of the number of keystrokes across different systems. We can see that in most cases Ours requires fewer than 3 keystrokes (approximately 84.5% of cases), leading to a reduction in the number of keystrokes and offering input convenience for users.
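As a quick check of the figures in Table 11, the sketch below recomputes the averages and the relative reduction with respect to IMT-Raw. Dividing each total by 200, the number of sampled source sentences, reproduces the Average column; treating 200 as the denominator is an inference from the reported numbers rather than something the text states, and the relative-reduction figures are derived here, not reported in the paper.

```python
# Total keystrokes taken from Table 11.
total_keystrokes = {
    "IMT-Ours": 478,
    "IMT-Trans-BPE": 686,
    "IMT-Pb": 704,
    "IMT-Raw": 1320,
}
NUM_ITEMS = 200  # assumed denominator: the 200 sampled source sentences

# Average keystrokes per item for each system.
average_keystrokes = {
    name: round(total / NUM_ITEMS, 2) for name, total in total_keystrokes.items()
}

# Relative reduction in keystrokes compared with the system without WLAC, in percent.
raw_total = total_keystrokes["IMT-Raw"]
relative_reduction = {
    name: round(100 * (raw_total - total) / raw_total, 1)
    for name, total in total_keystrokes.items()
    if name != "IMT-Raw"
}

for name in relative_reduction:
    print(name, average_keystrokes[name], relative_reduction[name])
```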
Efficiency for IMT systems with WLAC or not in terms of total and average number of keystrokes. IMT-Raw denotes the IMT system without WLAC function and other systems respectively denote IMT systems with corresponding WLAC models.
Systems | WLAC | Keystrokes (Total) | Keystrokes (Average)
---|---|---|---
IMT-Ours | ✔ | 478 | 2.39
IMT-Trans-BPE | ✔ | 686 | 3.43
IMT-Pb | ✔ | 704 | 3.52
IMT-Raw | ✗ | 1320 | 6.60
Proportion of the number of keystrokes in different IMT systems with and without WLAC models.
6 Related Work
Computer-aided Translation
Computer-aided Translation (CAT) (Langlais et al., 2000; Barrachina et al., 2009; Green et al., 2014; Knowles and Koehn, 2016; Santy et al., 2019; Lee et al., 2021b) leverages machine translation systems to facilitate the human translation process. Word-level AutoCompletion (WLAC) is an important feature of interactive CAT (Casacuberta et al., 2022). Huang et al. (2015) leverage useful source-side knowledge to complete the target word. Li et al. (2021) propose a strong word prediction model (WPM) and try to leverage both source-side and target-side information. However, as stated in Section 1, these methods may still inadequately leverage the valuable information from the source sentence. To fill this gap, we introduce an energy-based model that enables the hidden vector to capture more valuable information.
Reranking
Reranking has long been studied in natural language processing tasks (Shen et al., 2004; Collins and Koo, 2005; Charniak and Johnson, 2005). Recently, the retrieval-then-reranking framework has served as the de facto paradigm (Nogueira and Cho, 2019; Zhang et al., 2022) in text retrieval. To yield high-quality answers, answer reranking is also widely employed in question answering (Wang et al., 2018; Iyer et al., 2021), dialogue systems (Li et al., 2023b), and reasoning (Kazemi et al., 2023; Zhu et al., 2023a, b). In machine translation, with the purpose of alleviating the mismatch between maximum likelihood estimation and the desired metric (e.g., BLEU), Bhattacharyya et al. (2021) and Lee et al. (2021a) propose to train an energy-based model to rerank candidate translations generated by NMT models. Our work is in line with prior findings that reranking is a conceptually simple yet empirically powerful framework. However, we pay more attention to leveraging valuable source sentence information in the WLAC task and to the corresponding training and inference challenges of the energy-based model for reranking.
Input Method
In recent years, with the advance of neural networks, the input method has shown significant progress in being effective (Huang et al., 2018; Zhang et al., 2019; Tan et al., 2022). However, most current research has concentrated on the monolingual scenarios, without sufficient consideration of how to utilize source-side information in bilingual settings (Li, 2012; Huang et al., 2015). Our work, which centers on the word-level autocompletion task to reduce keystrokes, is a new exploration of bilingual input methods. We believe that combining our approach with other input method technologies could significantly enhance the productivity of human translators. We leave this as a potential direction for future research.
7 Conclusion
Word-level AutoCompletion is a critical yet challenging task in Computer-aided Translation. Existing work casts this task as a classification problem. However, this formulation cannot make full use of the information in the input context for its prediction. To alleviate this issue, we introduce a reranking perspective based on an energy-based model, which directly defines the energy function on top of the input context and the candidate target word. Extensive experiments and analyses demonstrate the effectiveness of our proposed approach on four standard benchmarks: it achieves an improvement of about 6.07% over the strongest baseline.
Acknowledgments
This work was partly supported by the National Natural Science Foundation of China (grant no. U1903213) and the Shenzhen Science and Technology Program (JSGG20220831093004008). We extend our thanks to annotators for their substantial contributions to this project. Additionally, we would like to convey our appreciation to the TACL editors and anonymous reviewers for their valuable feedback, which significantly enhanced the paper’s quality.
Notes
Our code is available at https://github.com/yc1999/energy_wlac.
In our preliminary experiments, we also employed other methods to attribute source words that are mostly used (e.g., the prediction difference method [Li et al., 2019b]). The conclusions drawn from these alternative methods align closely with those obtained using attention weights. This suggests that, in the context of the WLAC task, the model’s utilization of source-side information can be consistently reflected through various effective attribution methods. In this paper, we opt to utilize attention weights for easier description.
Note that this example is not cherry-picked and more quantitative analyses will be shown in the later experiments.
The total training set is composed of LDC2002E18, LDC2003E07, LDC2003E14, and part of LDC2004T07-08 and LDC2005T06 from https://www.ldc.upenn.edu.
It is important to note that some instances might involve valid morphological transformations for the target word, which we do not categorize as errors.
References
Author notes
Work done during internship at Tencent AI Lab.
Action Editor: André F. T. Martins