Abstract
Neural machine translation (NMT) has shown great success as a new alternative to traditional statistical machine translation models for multiple languages. Early NMT models are based on sequence-to-sequence learning, which encodes a sequence of source words into a vector space and generates another sequence of target words from the vector. In those NMT models, sentences are simply treated as sequences of words without any internal structure. In this article, we focus on the role of the syntactic structure of source sentences and propose a novel end-to-end syntactic NMT model, which we call a tree-to-sequence NMT model, extending a sequence-to-sequence model with the source-side phrase structure. Our proposed model has an attention mechanism that enables the decoder to generate a translated word while softly aligning it with phrases as well as words of the source sentence. We have empirically compared the proposed model with sequence-to-sequence models in various settings on Chinese-to-Japanese and English-to-Japanese translation tasks. Our experimental results suggest that the use of syntactic structure can be beneficial when the training data set is small, but is not as effective as using a bi-directional encoder. As the size of the training data set increases, the benefits of using a syntactic tree tend to diminish.
1. Introduction
Machine translation has traditionally been one of the most complex language processing tasks, but recent advances in neural machine translation (NMT) make it possible to perform translation using a simple end-to-end architecture. In the Encoder-Decoder model (Cho et al. 2014b; Sutskever, Vinyals, and Le 2014), a recurrent neural network (RNN) called an encoder reads the whole sequence of source words to produce a fixed-length vector, and then another RNN called a decoder generates a sequence of target words from the vector. The Encoder-Decoder model has been extended with an attention mechanism (Bahdanau, Cho, and Bengio 2015; Luong, Pham, and Manning 2015), which allows the model to jointly learn soft alignments between the source words and the target words. Recently, NMT models have achieved state-of-the-art results in a variety of language pairs (Wu et al. 2016; Zhou et al. 2016; Gehring et al. 2017; Vaswani et al. 2017).
In this work, we consider how to incorporate syntactic information into NMT. Figure 1 illustrates the phrase structure of an English sentence, which is represented as a binary tree. Each node of the tree corresponds to a grammatical phrase of the English sentence. Figure 1 also shows its translation in Japanese. The two languages are linguistically distant from each other in many respects; they have different syntactic constructions, and words and phrases are defined in different lexical units. In this example, the Japanese word “” is aligned with the English word “movie.” The indefinite article “a” in English, however, is not explicitly translated into any Japanese words. One way to solve this mismatch problem is to consider the phrase structure of the English sentence and align the phrase “a movie” with the Japanese word “.” The verb phrase of “went to see a movie last night” is also related to the eight-word sequence “.”
Since Yamada and Knight (2001) proposed the first syntax-based alignment model, various approaches to leveraging the syntactic structures have been adopted in statistical machine translation (SMT) models (Liu, Liu, and Lin 2006). In SMT, it is known that incorporating source-side syntactic constituents into the models improves word alignment (Yamada and Knight 2001) and translation accuracy (Liu, Liu, and Lin 2006; Neubig and Duh 2014). However, the aforementioned NMT models do not allow one to perform this kind of alignment.
To take advantage of syntactic information on the source side, we propose a syntactic NMT model. Following the phrase structure of a source sentence, we encode the sentence recursively in a bottom–up fashion to produce a sentence vector by using a tree-structured recursive neural network (RvNN) (Pollack 1990) as well as a sequential RNN (Elman 1990). We also introduce an attention mechanism to let the decoder generate each target word while aligning the input phrases and words with the output.
This article extends our conference paper on tree-to-sequence NMT (Eriguchi, Hashimoto, and Tsuruoka 2016) in two significant ways. In addition to an English-to-Japanese translation task, we have newly experimented with our tree-to-sequence NMT model on a Chinese-to-Japanese translation task, and observed that a bi-directional encoder was more effective than our tree-based encoder in both tasks. We also provide detailed analyses of our model and discuss the differences between the syntax-based and sequence-based NMT models. The article is structured as follows. We explain the basics of sequence-to-sequence NMT models in Section 2 and define our proposed tree-to-sequence NMT model in Section 3. After introducing the experimental design in Section 4, we first conduct small-scale experiments on the two {Chinese, English}-to-Japanese translation tasks and a series of analyses to identify the key components of our proposed method in Section 5. We then report large-scale experimental results and analyses on the English-to-Japanese translation task in Section 6. In Section 7, we survey recent studies related to syntax-based NMT models, and we conclude in Section 8 by summarizing the contributions of our work.
2. Sequence-to-Sequence Neural Machine Translation Model
2.1 Model Description
The sequence-to-sequence NMT models are built based on the idea of an Encoder-Decoder model, where an encoder converts each input sequence x = (x1, x2, ⋯, xn) into a vector space, and a decoder generates an output sequence y = (y1, y2, ⋯, ym) from the vector, following the conditional probability of Pθ(yj|y<j, x). Here n is the input length, m is the output length, and θ denotes the model parameters. Figure 2 shows an illustration of the sequence-to-sequence NMT model.
The RNN units, RNNenc in Equation (1), are often implemented with gated recurrent units (GRUs) (Cho et al. 2014b) or long short-term memory (LSTM) units (Hochreiter and Schmidhuber 1997; Gers, Schmidhuber, and Cummins 2000), and we use LSTM units in this article. Each LSTM unit contains four types of gates and two different types of hidden states, that is, a hidden unit ht ∈ ℝd×1 and a memory cell ct ∈ ℝd×1 at the time step t.
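For concreteness, the following is a minimal sketch of such an encoder-decoder with LSTM units in PyTorch. It is illustrative only: the attention mechanism is omitted, and all names (Seq2Seq, src_emb, and so on) are ours rather than part of the formulation above.

```python
# Minimal sketch of a sequence-to-sequence NMT model with LSTM units.
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, d=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, d)
        self.tgt_emb = nn.Embedding(tgt_vocab, d)
        # Each LSTM keeps a hidden unit h_t and a memory cell c_t of dimension d.
        self.encoder = nn.LSTM(d, d, batch_first=True)
        self.decoder = nn.LSTM(d, d, batch_first=True)
        self.out = nn.Linear(d, tgt_vocab)

    def forward(self, src, tgt_in):
        # Encode the whole source sequence into its final states (h_n, c_n).
        _, (h, c) = self.encoder(self.src_emb(src))
        # Initialize the decoder with the source representation and predict
        # each target word conditioned on the previously generated words.
        dec_out, _ = self.decoder(self.tgt_emb(tgt_in), (h, c))
        return self.out(dec_out)             # logits over the target vocabulary

# Usage: a batch of 2 source sentences (length 5) and target prefixes (length 6).
model = Seq2Seq(src_vocab=1000, tgt_vocab=1200, d=256)
logits = model(torch.randint(0, 1000, (2, 5)), torch.randint(0, 1200, (2, 6)))
print(logits.shape)                          # torch.Size([2, 6, 1200])
```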
2.2 Training Model Parameters
3. Tree-to-Sequence Neural Machine Translation
The tree-to-sequence NMT model is a variant of the Encoder-Decoder model. On the source side, a tree-based encoder is constructed to represent the syntactic structure as well as sequential data. The tree-based encoder is a hybrid of the sequence-based encoder and a binary tree-based encoder modeled, respectively, with the RNN and an RvNN. Whereas the RNN-based encoder computes the uni-directional information on input sentences along the time series and discards the explicit syntactic information, the tree-based encoder directly leverages the syntactic structures of the sentences to compute phrase vectors. The overview of the tree-to-sequence NMT model is shown in Figure 3.
3.1 Tree-Based Encoder
Our proposed tree-based encoder is a natural extension of the conventional sequential encoder because Tree-LSTM is a generalization of the chain-structured LSTM (Tai, Socher, and Manning 2015). Our encoder differs from the original Tree-LSTM in the calculation of the LSTM units for the leaf nodes, where we first compute word-level encoding with a (uni-directional) sequential encoder and then use the encodings as leaf nodes that are the inputs for the Tree-based encoder for phrase nodes. The motivation is to construct the phrase nodes in a context-sensitive way, which, for example, allows the model to compute different representations for multiple occurrences of the same word in a sentence because the sequential word-level encodings are computed in the context of the previous units. This ability contrasts with the original Tree-LSTM units, in which the leaves are composed only of the word embeddings without any contextual information.
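The following is a minimal sketch of this hybrid encoder in PyTorch, illustrative rather than a faithful reimplementation: leaf states are taken from a uni-directional word-level LSTM, and phrase vectors are composed bottom-up by a binary Tree-LSTM cell in the style of Tai, Socher, and Manning (2015). The class and function names, and the use of a single joint gate projection, are our simplifications.

```python
import torch
import torch.nn as nn

class BinaryTreeLSTMCell(nn.Module):
    def __init__(self, d):
        super().__init__()
        # One joint projection producing the input, two forget, output, and
        # candidate gates from the two children's hidden states.
        self.W = nn.Linear(2 * d, 5 * d)

    def forward(self, left, right):
        (h_l, c_l), (h_r, c_r) = left, right
        i, f_l, f_r, o, u = self.W(torch.cat([h_l, h_r], dim=-1)).chunk(5, dim=-1)
        c = torch.sigmoid(i) * torch.tanh(u) \
            + torch.sigmoid(f_l) * c_l + torch.sigmoid(f_r) * c_r
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

def encode_tree(node, leaf_states, cell):
    """node: a leaf index (int) or a pair (left_subtree, right_subtree)."""
    if isinstance(node, int):
        return leaf_states[node]     # context-sensitive leaf from the sequential LSTM
    left = encode_tree(node[0], leaf_states, cell)
    right = encode_tree(node[1], leaf_states, cell)
    return cell(left, right)         # phrase vector for this node

# Usage: a 4-word sentence with the binary tree ((0, 1), (2, 3)).
d = 8
embeddings = torch.randn(4, 1, d)    # word embeddings (length, batch, d)
seq_cell = nn.LSTMCell(d, d)
h, c = torch.zeros(1, d), torch.zeros(1, d)
leaf_states = []
for t in range(4):                   # sequential encoding of the words
    h, c = seq_cell(embeddings[t], (h, c))
    leaf_states.append((h, c))       # hidden state and memory cell feed the leaves
root_h, root_c = encode_tree(((0, 1), (2, 3)), leaf_states, BinaryTreeLSTMCell(d))
```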
3.2 Decoding Method with Tree-Based Encoder
3.3 Sampling-Based Approximation to the NMT Models
The computation in the softmax layer in Equation (6) accounts for most of the training time because its cost increases linearly with the size of the vocabulary. A variety of approaches addressing this problem have been proposed, including negative sampling methods such as BlackOut sampling (Ji et al. 2016) and noise-contrastive estimation (NCE) (Gutmann and Hyvärinen 2012), as well as binary code prediction (Oda et al. 2017). BlackOut has been shown to be effective in training RNN language models even with a one-million-word vocabulary on CPUs.
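To illustrate the idea behind these sampling-based approximations, the sketch below computes a softmax loss over the gold word plus K sampled negatives instead of the full vocabulary. It is a generic sampled-softmax sketch, not the exact BlackOut objective, which additionally reweights the samples with a distorted unigram distribution.

```python
# Generic sampled-softmax sketch (negatives may occasionally collide with the
# gold word; a real implementation would resample or mask them).
import torch
import torch.nn.functional as F

def sampled_softmax_loss(hidden, W_out, target, K, unigram_probs):
    """hidden: (B, d) decoder states; W_out: (V, d) output word vectors;
    target: (B,) gold word ids; unigram_probs: (V,) sampling distribution."""
    B = hidden.size(0)
    negatives = torch.multinomial(unigram_probs, K, replacement=True)   # (K,)
    pos_logit = (hidden * W_out[target]).sum(dim=-1, keepdim=True)      # (B, 1)
    neg_logit = hidden @ W_out[negatives].t()                           # (B, K)
    logits = torch.cat([pos_logit, neg_logit], dim=-1)                  # (B, K+1)
    # The gold word always sits at index 0 of the reduced softmax.
    return F.cross_entropy(logits, torch.zeros(B, dtype=torch.long))

# Usage: 50,000-word vocabulary, 256-dimensional states, K = 2,000 negatives.
V, d, B = 50000, 256, 4
loss = sampled_softmax_loss(torch.randn(B, d), torch.randn(V, d),
                            torch.randint(0, V, (B,)), 2000, torch.ones(V) / V)
```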
4. Experimental Design
4.1 Experimental Settings
Chinese-to-Japanese.
We used the Asian Scientific Paper Excerpt Corpus (ASPEC) (Nakazawa et al. 2016) for the Chinese-to-Japanese translation task provided by the Workshop on Asian Translation 2015 (WAT2015). Following the official preprocessing steps,1 we tokenized the data by using KyTea (Neubig, Nakata, and Mori 2011) as a Japanese segmenter and the Stanford Segmenter as a Chinese word segmenter.2 We discarded a sentence pair when either of the sentences had more than 50 words. We used the Stanford Parser (Levy and Manning 2003)3 as an external parser and parsed the Chinese sentences. We did not use any additional information output by the parsers, such as phrase labels. If a sentence fails to be parsed, it is re-parsed by a simpler probabilistic context-free grammar parser. We used a script4 to convert the parse trees to their corresponding binary trees because our tree-to-sequence model assumes binary trees as inputs. We used the first 100,000 parallel sentences from the training data to investigate the effectiveness of the proposed NMT model. The vocabularies were composed of the words appearing at least N times in the training data. We set N = 2 for Japanese and N = 3 for Chinese. Out-of-vocabulary words were mapped to the special token “unk,” and we inserted another special symbol “EOS” at the end of every sentence. The vocabulary sizes of Chinese and Japanese are 29,011 and 32,640, respectively. Table 1 shows the statistics of the data set for the Chinese-to-Japanese translation task.
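The vocabulary construction described above can be summarized by the following sketch; the toy token lists are placeholders for the segmented ASPEC data.

```python
# Keep words occurring at least min_count times, map the rest to "unk",
# and append "EOS" to every sentence.
from collections import Counter

def build_vocab(sentences, min_count):
    counts = Counter(w for sent in sentences for w in sent)
    vocab = {"unk": 0, "EOS": 1}
    for word, c in counts.items():
        if c >= min_count:
            vocab[word] = len(vocab)
    return vocab

def to_ids(sentence, vocab):
    return [vocab.get(w, vocab["unk"]) for w in sentence] + [vocab["EOS"]]

toy = [["the", "cells", "were", "observed"], ["the", "cells", "were", "fixed"]]
vocab = build_vocab(toy, min_count=2)
print(to_ids(["the", "cells", "were", "counted"], vocab))   # "counted" -> unk
```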
English-to-Japanese.
We used the ASPEC corpus (Nakazawa et al. 2016) for the English-to-Japanese translation task provided by WAT2015. We used Enju (Miyao and Tsujii 2008), a head-driven phrase structure grammar (Sag, Wasow, and Bender 2003) parser for English, and the English tokenization follows the Enju parser.5 The English corpus was lowercased. We followed the official preprocessing for the Japanese corpus as in the Chinese-to-Japanese experimental settings. We used Enju only to obtain a binary phrase structure for each source-side sentence. We removed the sentence pairs in which either of the sentences is longer than 50 words. Enju returns either success or failure after parsing an English sentence. When the Enju parser fails to parse a sentence, the sentence is treated as a sequence of words in the proposed model.6 Table 2 shows the statistics of the data set for the English-to-Japanese translation task.
We built a small training corpus by extracting the first 100,000 parallel sentences from the training data of the English-to-Japanese translation task. The vocabularies were constructed from the words appearing at least N times in the training data, as in the Chinese-to-Japanese translation task. We set N = 2 and N = 5 for the small and large training data sets, respectively. Out-of-vocabulary words were mapped to “unk,” and “EOS” was inserted at the end of each sentence. The vocabulary sizes of English and Japanese are (87,796; 65,680) and (25,456; 23,509) in the large and small training data sets, respectively. When training the models on the large data set, we followed the same parameter settings except that our proposed model has 512-dimensional word embeddings and d-dimensional hidden units (d ∈ {512, 768, 1,024}). K is set to 2,500. When using beam search to generate a target sentence for a source sentence at test time, we selected the optimal beam width found on the development data set.
We evaluated the models by two automatic evaluation metrics, RIBES (Isozaki et al. 2010) and BLEU (Papineni et al. 2002) following WAT 2015. We used the KyTea-based evaluation script for the translation results.7 The RIBES score is a metric based on rank correlation coefficients with word precision, and this score is known to have stronger correlation with human judgments than BLEU in translation between English and Japanese, as discussed in Isozaki et al. (2010). The BLEU score is based on n-gram word precision and a brevity penalty for outputs shorter than the references.
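For reference, the brevity penalty mentioned above can be sketched as follows; this is the standard BLEU formulation, and the full metric additionally combines modified n-gram precisions, which we do not reproduce here.

```python
# Standard BLEU brevity penalty: hypotheses shorter than the reference are
# exponentially penalized; this interacts with beam size in Section 5.2.
import math

def brevity_penalty(hyp_len, ref_len):
    if hyp_len >= ref_len:
        return 1.0
    return math.exp(1.0 - ref_len / hyp_len)

print(brevity_penalty(45, 50))   # ~0.895: a 10% shorter output is penalized
```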
4.2 Training Details
We conduct experiments with the sequence-to-sequence NMT model as a baseline and with our proposed model, described in Sections 2 and 3, respectively. Each model has 256-dimensional hidden units and word embeddings. The biases, softmax weights, and BlackOut weights are initialized with zeros. The hyperparameter β of BlackOut is set to 0.4, as recommended by Ji et al. (2016). The number of negative samples K in BlackOut was set to K ∈ {500, 2,000}. Here, we shared the negative samples across the target words in a sentence at training time, following Hashimoto and Tsuruoka (2017).
Following Józefowicz, Zaremba, and Sutskever (2015), we initialize the forget gate biases of the LSTM and Tree-LSTM with 1.0. The remaining model parameters in the NMT models in our experiments are uniformly initialized in [−0.1, 0.1]. The model parameters are optimized by plain SGD with a mini-batch size of 128. The initial learning rate of SGD is 1.0. When the development loss becomes worse after an epoch, we halve the learning rate from the next epoch onward until training converges. When INF/NAN values appear in a mini-batch during training, we skip that mini-batch. Gradient norms are clipped to 3.0 to avoid exploding gradients (Pascanu, Mikolov, and Bengio 2012).
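The optimization schedule described above can be sketched as follows; `model` is assumed to return a scalar loss for a mini-batch and `dev_loss_fn` to compute the development loss, both placeholders rather than our actual training code.

```python
# Plain SGD with an initial learning rate of 1.0, halved whenever the
# development loss worsens, gradient norms clipped to 3.0, and mini-batches
# with INF/NAN losses skipped.
import torch

def train(model, train_batches, dev_loss_fn, max_epochs=20):
    lr, prev_dev = 1.0, float("inf")
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    for epoch in range(max_epochs):
        for batch in train_batches:
            optimizer.zero_grad()
            loss = model(*batch)                  # assumed to return a scalar loss
            if not torch.isfinite(loss):          # skip INF/NAN mini-batches
                continue
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 3.0)
            optimizer.step()
        dev = dev_loss_fn(model)
        if dev > prev_dev:                        # halve the learning rate
            lr /= 2.0
            for group in optimizer.param_groups:
                group["lr"] = lr
        prev_dev = dev
```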
4.3 Beam Search with Penalized Length
We used beam search to generate a target sentence for a source sentence at test time. We set the beam width to 20, unless otherwise stated.
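As a rough, decoder-agnostic sketch, the pseudocode below normalizes the accumulated log-probability by the hypothesis length so that longer candidates stay competitive; the exact penalized score used in our experiments follows the definition in this section, and the simple length normalization here is only one common instantiation. `step_logprobs(prefix)` is a placeholder returning (word id, log-probability) pairs for the next position.

```python
import heapq

def beam_search(step_logprobs, eos_id, beam_width=20, max_len=100):
    beams = [(0.0, [])]                              # (sum of log-probs, prefix)
    finished = []
    for _ in range(max_len):
        candidates = []
        for logp, prefix in beams:
            if prefix and prefix[-1] == eos_id:      # hypothesis already ended
                finished.append((logp, prefix))
                continue
            for word, lp in step_logprobs(prefix):
                candidates.append((logp + lp, prefix + [word]))
        if not candidates:
            break
        beams = heapq.nlargest(beam_width, candidates, key=lambda x: x[0])
    finished.extend(beams)
    # Length-normalized final score: divide the log-probability by the length.
    return max(finished, key=lambda x: x[0] / max(len(x[1]), 1))[1]
```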
5. Experimental Results and Discussion on Small Data Sets
In this section, we first report the experimental results on small data sets for Chinese-to-Japanese and English-to-Japanese translation tasks. To understand the effectiveness of our proposed methods, we conduct a series of analyses on the proposed beam search, sequential LSTM units for the tree-based encoder, and the model capacity.
5.1 Experimental Results on Small Data Sets
Tables 3 and 4 summarize the experimental results of translation accuracy in the Chinese-to-Japanese and English-to-Japanese translation tasks, respectively. Each table reports the values of perplexity, RIBES, BLEU, and the training time on the development data with the NMT models. We conducted the experiments with our proposed methods and the baseline of the sequence-to-sequence model using BlackOut and the original softmax. We generated each translation by our proposed beam search with a beam size of 20. In both tables, we report the experimental results obtained by feeding the reversed inputs into the sequence-to-sequence NMT models (shown as “w / reverse inputs” in the tables) and by utilizing the bi-directional encoders. We ran the bootstrap re-sampling method (Koehn 2004) and observed a statistically significant difference on both RIBES and BLEU scores between the proposed method and the sequence-to-sequence NMT model with reversed inputs in all the settings. The symbol † indicates that our proposed model significantly outperforms the sequence-to-sequence NMT models both without and with reversed inputs, except for the model “w / bi-directional encoder,” in the corresponding settings of K (p < 0.05).
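The significance test above follows the paired bootstrap re-sampling procedure of Koehn (2004); a minimal sketch is shown below, where `metric` stands in for a corpus-level BLEU or RIBES scorer.

```python
# Resample the test sentences with replacement many times and count how often
# system A beats system B on the corpus-level metric.
import random

def bootstrap_test(hyps_a, hyps_b, refs, metric, n_samples=1000):
    n, wins = len(refs), 0
    for _ in range(n_samples):
        idx = [random.randrange(n) for _ in range(n)]   # resample with replacement
        a = [hyps_a[i] for i in idx]
        b = [hyps_b[i] for i in idx]
        r = [refs[i] for i in idx]
        if metric(a, r) > metric(b, r):
            wins += 1
    return 1.0 - wins / n_samples    # approximate p-value for "A is better than B"
```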
As the negative sample size K increases, both the proposed model (Tree-to-Sequence NMT) and the sequence-to-sequence NMT models (Sequence-to-Sequence NMT, Sequence-to-Sequence NMT w / reverse inputs, and Sequence-to-Sequence NMT w / bi-directional encoder) improve the translation accuracy on both evaluation metrics. The gains in the Chinese-to-Japanese translation task are +1.1 RIBES and +1.4 BLEU for the proposed model, +0.5 RIBES and +0.8 BLEU for the sequence-to-sequence model, +1.3 RIBES and +1.2 BLEU for the reversed-inputs model, and +1.1 RIBES and +0.7 BLEU for the bi-directional NMT model; we observe similar trends in the English-to-Japanese translation task. Because BlackOut sampling is an approximation of the softmax, the models are expected to obtain better accuracy as the number of negative samples (K) increases. Compared with the results obtained by the models using the softmax layer, the tree-to-sequence NMT model with K = 2,000 achieves a competitive score in the Chinese-to-Japanese translation task. We also found that better perplexity does not always lead to better translation scores with BlackOut, as shown in Table 3. One possible reason is that BlackOut distorts the target word distribution through its modified unigram-based negative sampling, where frequent words can be treated as negative samples multiple times at each training step.
As to the results of the sequence-to-sequence NMT models, reversing the word order in the input sentence decreases the scores in Chinese-to-Japanese and English-to-Japanese translation, which contrasts with the results of other language pairs reported in previous work (Sutskever, Vinyals, and Le 2014; Luong, Pham, and Manning 2015).
The rightmost column (Time) in Table 4 reports the training time (min.) per epoch of each NMT model. Although the results with softmax are better than those with BlackOut (K = 500, 2,000), the training time of softmax per epoch is about 13 times longer than that of BlackOut even with the small data set.8 By taking syntactic information into consideration, our proposed model improves the scores, outperforming the sequence-based NMT models with reversed inputs in both tasks. The gains in BLEU are larger in the English-to-Japanese translation task than in the Chinese-to-Japanese task. Extending the sequential encoder to a bi-directional one, however, improves both evaluation scores while requiring only slightly more training time.
5.2 Effects of Beam Search with Length Penalty
Table 5 shows the results on the development data of our proposed method with BlackOut (K = 2,000) using the simple beam search and our proposed beam search. The beam size is set to {10, 15, 20} in the simple beam search, and to 20 in our proposed search. The brevity penalty value in BLEU denotes the ratio of the hypothesis length over the reference length. We can see that our proposed search outperforms the simple beam search in BLEU score without losing the score of RIBES. Unlike RIBES, the BLEU score is sensitive to the beam size and becomes lower as the beam size increases. We found that the brevity penalty had a relatively large impact on the BLEU score in the simple beam search as the beam size increased. Our search method works better than the simple beam search by keeping long sentences among the candidates with a large beam size.
5.3 Discussion on Small Data Sets
Here, we conduct detailed analyses to explore which components are key in the proposed method.
Effects of the Sequential LSTM Units.
We investigated the effects of the sequential LSTM units at the leaf nodes in our proposed tree-based encoder. Table 6 shows the results on the development data of our proposed encoder and of an attention-based tree encoder without sequential LSTM units. For this evaluation, we used the 1,779 sentences that were successfully parsed by Enju, because the encoder without sequential LSTM units always requires a parse tree. The results show that our proposed encoder considerably outperforms the encoder without sequential LSTM units, suggesting that the sequential LSTM units at the leaf nodes contribute to the context-aware construction of the phrase representations in the tree.
|  | RIBES | BLEU |
| --- | --- | --- |
| Chinese-to-Japanese |  |  |
| with sequential LSTM units | 78.0 | 27.0 |
| without sequential LSTM units | 71.0 | 23.0 |
| English-to-Japanese |  |  |
| with sequential LSTM units | 76.4 | 25.7 |
| without sequential LSTM units | 71.7 | 20.8 |
Effects of Model Capacity.
Here, we conduct a controlled experiment on the number of model parameters to investigate whether the proposed tree-to-sequence NMT model simply benefits from having more parameters than the sequence-to-sequence NMT model. We set the hidden and embedding dimension sizes of the sequence-to-sequence NMT model to d = 269 and d = 272 in the Chinese-to-Japanese and English-to-Japanese translation tasks, respectively, to make the number of parameters of the model equal to that of the tree-to-sequence NMT model. In this setup, the models have 25.6M and 21.4M parameters, respectively. Table 7 reports the results on both evaluation scores; we used the proposed beam search method with a width of 20. In the English-to-Japanese translation task, the proposed model obtained better RIBES and BLEU scores than the sequence-to-sequence NMT model both with and without reversed inputs, given the same number of parameters. In the Chinese-to-Japanese translation task, on the other hand, the same significant improvement is observed only over the sequence-to-sequence NMT model with reversed inputs.
Comparison with Dependency-to-Sequence NMT models.
Here, we compare the performance of our model with another syntax-based NMT model that incorporates source-side dependency trees into the encoder. We used the dependency-to-sequence NMT architecture proposed by Hashimoto and Tsuruoka (2017), whose encoder can be viewed as a one-layer graph convolutional neural network over the source text and its predicted dependencies, which is then used to generate a translation. Following the setup of Hashimoto and Tsuruoka (2017), we first pretrained the dependency-to-sequence NMT model on a dependency parsing task and then trained it on the machine translation task. We used the same experimental settings and training procedures described in Section 4. Table 8 reports the performance of the dependency-to-sequence NMT model and the proposed tree-to-sequence NMT model. We generated translations for the development data set using beam search with the length penalty and a beam width of 20. We confirmed that the proposed tree-to-sequence NMT model outperforms the dependency-to-sequence NMT model in RIBES and BLEU scores in both translation tasks.
|  | RIBES | BLEU |
| --- | --- | --- |
| Chinese-to-Japanese |  |  |
| Proposed: Tree-to-Sequence NMT | 77.8 | 26.9 |
| Dependency-to-Sequence NMT (reimplementation of Hashimoto and Tsuruoka 2017) | 77.5 | 25.5 |
| English-to-Japanese |  |  |
| Proposed: Tree-to-Sequence NMT | 76.0 | 23.9 |
| Dependency-to-Sequence NMT (reimplementation of Hashimoto and Tsuruoka 2017) | 74.9 | 22.9 |
6. Experimental Results and Discussion on Large Data Set
6.1 Experimental Results on Large Data Set
Table 9 shows the experimental results of RIBES and BLEU scores achieved by the trained models on the large data set.9 The results of the other systems are the ones reported in Nakazawa et al. (2015). All of our proposed models show similar performance regardless of the value of d. Our ensemble model is composed of the three models with d = 512, 768, and 1,024, and it shows the best RIBES score among all systems. As for the time required for training, our implementation needs about 8 hours to perform one epoch on the large training data set with d = 512. It would take about 8 days without using the BlackOut sampling.10
Comparison with the NMT Models.
Compared with our reimplementation of the sequence-to-sequence NMT model (Luong, Pham, and Manning 2015), we did not observe a significant difference from the tree-to-sequence model (d = 512). The model of Zhu (2015) is an NMT model (Bahdanau, Cho, and Bengio 2015) with a bi-directional LSTM encoder, and uses 1,024-dimensional hidden units and 1,000-dimensional word embeddings. The model of Lee et al. (2015) is also an NMT model with a bi-directional GRU encoder, and uses 1,000-dimensional hidden units and 200-dimensional word embeddings. Both models are sequence-based NMT models. Our single proposed model with d = 512 outperforms Zhu (2015)’s sequence-to-sequence NMT model with a bi-directional LSTM encoder by +1.96 RIBES and by +2.86 BLEU scores.
Comparison with the SMT Models.
The baseline SMT systems are phrase-based, hierarchical phrase-based, and tree-to-string systems in WAT 2015 (Nakazawa et al. 2015), and the system submitted by Neubig, Morishita, and Nakamura (2015) achieves the best performance among the tree-to-string SMT models. Each of our proposed models outperforms the SMT models in RIBES and achieved almost as high performance as Neubig, Morishita, and Nakamura (2015)’s system in BLEU.
6.2 Qualitative Analysis in English-to-Japanese Translation
We illustrate the translations of test data by our model with d = 512 and several attention relations when decoding a sentence. In Figures 4 and 5, an English sentence represented as a binary tree is translated into Japanese, and several attention relations between English words or phrases and Japanese words are shown with the highest attention score α. Additional attention relations are also illustrated for comparison. We see that the target words are softly aligned with source words and phrases.
In Figure 4, the Japanese word “” means “liquid crystal,” and it has a high attention score (α = 0.41) with the English phrase “liquid crystal for active matrix.” This is because the j-th target hidden unit sj has the contextual information about the previous words y<j including “” (“for active matrix” in English). The Japanese word “” is softly aligned with the phrase “the cells” with the highest attention score (α = 0.35). In Japanese, there is no definite article like “the” in English, and it is usually aligned with null, as described in Section 1.
In Figure 5, in the case of the Japanese word “” (“showed” in English), the attention score with the English phrase “showed excellent performance” (α = 0.25) is higher than that with the English word “showed” (α = 0.01). The Japanese word “” (“of” in English) is softly aligned with the phrase “of Si dot MOS capacitor” with the highest attention score (α = 0.30). It is because our attention mechanism takes each word of the previous context of the Japanese phrases “” (“excellent performance” in English) and “” (“Si dot MOS capacitor” in English) into account and softly aligned the target words with the whole phrase when translating the English verb “showed” and the preposition “of.” Our proposed model can thus flexibly learn the attention relations between English and Japanese.
We observed that our model translated the word “active” into “,” a synonym of the reference word “.” We also found similar examples in other sentences, where our model output synonyms of the reference words, for instance, “” and “” (“female” in English), and “NASA” and “” (“National Aeronautics and Space Administration” in English). These translations are penalized in terms of BLEU scores, but this does not necessarily mean that the translations were wrong. This point may be supported by the fact that the NMT models were highly evaluated in WAT 2015 by crowdsourcing (Nakazawa et al. 2015).
6.3 Error Analyses on Sequence-to-Sequence and Tree-to-Sequence NMT Models
To investigate what kinds of errors are harmful in the NMT models, we randomly sampled 100 examples from the development data in the English-to-Japanese translation task and classified the error types found in the translations generated by the sequence-to-sequence and tree-to-sequence NMT models described in Section 6. In Table 10, we can see that both NMT models suffer most from word-level undertranslation and wrong word translation. As for phrase-level undertranslation and wrong translation, the tree-to-sequence NMT model successfully decreased the number of errors. However, the sequence-to-sequence NMT model makes fewer repetition errors than the tree-to-sequence NMT model. One possible reason is that the phrase-based attention mechanism encourages the decoder to repeatedly generate translations of specific phrases, and such phrase-level repetitions are not penalized during teacher-forcing training, where the models are optimized at the word level rather than the phrase level.
|  | Seq-to-Seq NMT (%) | Tree-to-Seq NMT (%) |
| --- | --- | --- |
| Undertranslation (word) | 29 (38.2%) | 29 (42.6%) |
| Undertranslation (phrase) | 16 (21.1%) | 8 (11.8%) |
| Wrong translation (word) | 22 (28.9%) | 17 (25.0%) |
| Wrong translation (phrase) | 2 (2.6%) | 0 (0.0%) |
| Repetition | 7 (9.2%) | 14 (20.6%) |
In Table 11, we show some examples of phrase-level undertranslation and wrong translation output by the sequence-to-sequence NMT model, and investigate how using a syntactic tree helps the proposed model resolve ambiguity in the input sentence, as discussed in Section 3.1. Our proposed tree-to-sequence NMT model correctly translated the original source sentence “First, after static electricity of human body is released a Eathernet card is fixed.” into “, , (“First, after static electricity of human body is released, an Eathernet card was fixed.” in English). The parse tree produced by the English parser clearly identifies that the source sentence is composed of two main phrase components, “after static electricity of human body is released” and “a Eathernet card is fixed.” To investigate whether feeding a more readable source sentence has a positive effect on the sequence-to-sequence NMT model, we manually fixed the original source sentence in the three ways (A, B, and C) described in Table 11. However, the outputs are still not correctly translated, even after manually adding a comma after the word “released” to make the clause boundary clear (source sentence A) or after further modifying the sentence structure (source sentences B and C).
6.4 Analysis of Attention Mechanism in Tree-to-Sequence NMT Model
Chinese-to-Japanese Translation.
Figure 6 is a translation example on the development data set obtained by using our proposed model introduced in Section 5 and shows the attention relations between the source words or phrases and the target words with the highest attention score α. The source Chinese sentence is represented as a binary tree. We let our trained model generate a translated sentence in Japanese. Some target words are related to the source phrases with higher attention scores than the source words; for instance, the first generated target Japanese word “ (energy)” is related to the phrase “ (promotion of energy industry)” with a higher attention score α = 0.66 than the translated word “ (energy)” with α = 0.22. Our attention mechanism also aligns one Japanese word “ (promotion)” with the source word “” with the highest attention score α = 0.95, and we can see that the attention mechanism in our proposed model flexibly learns which of the source words and phrases should be attended to more.
When generating a sentence, the attention mechanism provides attention scores over the source units. We therefore counted the source units receiving the five highest attention scores for each generated word to examine whether source words or source phrases are attended to more frequently in the Chinese-to-Japanese translation task. Table 12 shows the trends of the attention destinations toward either source words or source phrases in both translation tasks. In the Chinese-to-Japanese translation task, a total of 71,422 words are generated on the development data. Here, the source words dominate the attention destinations over the source phrases at every threshold of α by a large margin. Our trained model uses the syntactic information but pays more attention to source words than to source phrases. One possible reason is that Chinese and Japanese share many Chinese characters, and thus it is easy to perform word-by-word translation by aligning words with the same meaning.
|  | Chinese-to-Japanese Wrd (%) | Chinese-to-Japanese Phr (%) | English-to-Japanese (Small Data Set) Wrd (%) | English-to-Japanese (Small Data Set) Phr (%) | English-to-Japanese (Large Data Set) Wrd (%) | English-to-Japanese (Large Data Set) Phr (%) |
| --- | --- | --- | --- | --- | --- | --- |
| α > 0.2 | 18.9 | 5.5 | 11.2 | 10.8 | 9.1 | 8.1 |
| α > 0.4 | 10.6 | 2.4 | 5.3 | 3.6 | 3.4 | 2.3 |
| α > 0.6 | 6.1 | 1.0 | 2.9 | 2.5 | 1.5 | 0.5 |
| α > 0.8 | 2.6 | 0.3 | 1.3 | 0.8 | 0.3 | 0.0 |
We utilized the tree-to-sequence NMT models described in Section 5 with the BlackOut sampling method (K = 2,000) and in Section 6 with K = 2,500 for the English-to-Japanese translation. Here, the total numbers of words generated on the development data are 48,169 and 47,869 by the NMT models trained on the small and the large training data described in Table 2, respectively. Compared with the Chinese-to-Japanese translation task, the percentages for source phrases come within 2% of those for source words at every threshold α > n (n = 0.2, 0.4, 0.6, 0.8) in both cases. The attention mechanism in the English-to-Japanese translation is thus more frequently used to attend to source phrases. As the training data increase, the source phrases receive less attention at every threshold of α.
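The counting procedure behind Table 12 can be sketched as follows; the data structures (per-word lists of attention scores and a word/phrase indicator for each source unit) are illustrative, and the normalization by the number of generated words is our assumption about how the percentages are computed.

```python
# For each generated target word, take the five source units with the highest
# attention scores and tally, at each threshold of alpha, whether the attended
# unit is a word (leaf) or a phrase (internal node).
def count_attention_destinations(attentions, is_phrase,
                                 thresholds=(0.2, 0.4, 0.6, 0.8)):
    """attentions: per generated word, a list of (source_unit_id, alpha);
    is_phrase: dict mapping source_unit_id -> True for phrase nodes."""
    counts = {t: {"word": 0, "phrase": 0} for t in thresholds}
    total = max(len(attentions), 1)
    for scores in attentions:
        top5 = sorted(scores, key=lambda x: x[1], reverse=True)[:5]
        for unit, alpha in top5:
            for t in thresholds:
                if alpha > t:
                    counts[t]["phrase" if is_phrase[unit] else "word"] += 1
    return {t: {k: 100.0 * v / total for k, v in c.items()}
            for t, c in counts.items()}
```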
7. Related Work
Kalchbrenner and Blunsom (2013) were the first to propose an end-to-end NMT model, using convolutional neural networks (CNNs) for a source encoder and RNNs for a target decoder. The Encoder-Decoder model can be seen as an extension of their model, replacing the CNNs with RNNs using GRU units (Cho et al. 2014b) or LSTM units (Sutskever, Vinyals, and Le 2014). Sutskever, Vinyals, and Le (2014) have shown that reversing the input sequences is effective in a French-to-English translation task, and the technique has also proven effective in translation tasks between other European language pairs (Luong, Pham, and Manning 2015). All of the NMT models mentioned here are based on sequential encoders. The attention mechanism (Bahdanau, Cho, and Bengio 2015) has taken NMT models a step further, enabling them to generate longer translated sentences by softly aligning the target words with the source words. Luong, Pham, and Manning (2015) refined the attention model so that it can dynamically focus on local windows rather than the entire sentence. They also proposed a more effective attention path in the computation of the NMT models.
To incorporate structural information into the NMT models, Cho et al. (2014a) proposed to jointly learn structures inherent in source-side languages but did not report improvements in translation performance. These studies motivated us to investigate the role of syntactic structures explicitly given by existing syntactic parsers in the NMT models. Incorporating syntactic structures into NMT models has been actively studied recently. Chen et al. (2017) have extended our work by using bi-directional RNNs and RvNNs. Bastings et al. (2017) apply graph CNNs to encode dependency trees. These approaches rely on syntactic trees provided by parsers and thus have the drawback of introducing parse errors into the models, whereas another trend is to learn and enhance latent structures or relations between input units in source sentences within the NMT models (Bradbury and Socher 2017; Vaswani et al. 2017; Yoon Kim and Rush 2017). Hashimoto and Tsuruoka (2017) and Tran and Bisk (2018) reported the interesting observation that the latent trees learned during training are not always consistent with those obtained from existing parsers, which suggests that the favorable structure of a source sentence may depend on the task of interest.
Although it is relatively easy to encode a source syntactic tree into a vector space, decoding a target sentence together with a parse tree is known to be a challenging task and has recently been explored in sentence generation tasks (Dong and Lapata 2016) and language modeling tasks (Dyer et al. 2016). Wu et al. (2017) and Eriguchi, Tsuruoka, and Cho (2017) have proposed hybrid models that jointly learn to parse and translate by using target-side dependency trees, and Aharoni and Goldberg (2017) serialize a target parse tree into a sequence of units in order to apply it to the sequence-to-sequence NMT model. Given the recent active studies of syntax-based approaches in the MT area, we believe that incorporating structural biases into NLP models is a promising research direction.
8. Conclusion
We have proposed a syntactic approach that extends existing sequence-to-sequence NMT models. We focus on source-side phrase structures and build a tree-based encoder following the parse trees. Our proposed tree-based encoder is a natural extension of the sequential encoder model, where the leaf units of the Tree-LSTM in the encoder work together with the original sequential LSTM encoder. Moreover, the proposed attention mechanism allows the tree-based encoder to align not only the input words but also the input phrases with the output words. Experimental results show that the English-to-Japanese translation task benefits from incorporating syntactic trees more than the Chinese-to-Japanese translation task does, and that using a bi-directional encoder improves the translation accuracy further and achieves the best scores in both tasks. As the training data set becomes larger, we observed that the tree-to-sequence model gives smaller improvements on the WAT 2015 English-to-Japanese translation task. Our analyses of the tree-to-sequence models reveal trends different from those of the sequence-to-sequence models, and we have also shown that our proposed model flexibly learns which source words or phrases should be attended to.
Acknowledgments
This work was supported by JST CREST grants JPMJCR1513 and JSPS KAKENHI grants 15J12597, 16H01715, and 17J09620.
Notes
We used a FULL SVM model of KyTea (jp-0.4.2-utf8-1.mod) and a Chinese Penn Treebank model of the Stanford Segmenter version 2014-06-16, as suggested in WAT 2015.
The Stanford Parser version is 2017-06-09 with a Chinese-Factored model.
We used TreeBinarizer.java available in https://github.com/stanfordnlp/CoreNLP.
Unlike the Stanford Parser, Enju returns a binarized tree.
When the Enju parser fails to parse a sentence because of “sentence length limit exceeded,” we let the sentence be parsed again with an additional option of “-W 200” to increase the limit size of sentences up to 200. We found one sentence in both the development data and test data that is parsed again with the option.
We applied Qiao et al. (2017)’s approach, which effectively uses the cache to speed up the computations of matrices on CPUs.
We found two sentences that end without EOS with d = 512, and we then decoded them again with a beam size of 1,000, following Zhu (2015).
We ran the experiments on multi-core CPUs: 16 threads on an Intel Xeon CPU E5-2667 v3 @ 3.20 GHz.
References
Author notes
The work was done when the first and second authors were at the University of Tokyo.