Abstract
We introduce an Edit-Based TransfOrmer with Repositioning (EDITOR), which makes sequence generation flexible by seamlessly allowing users to specify preferences in output lexical choice. Building on recent models for non-autoregressive sequence generation (Gu et al., 2019), EDITOR generates new sequences by iteratively editing hypotheses. It relies on a novel reposition operation designed to disentangle lexical choice from word positioning decisions, while enabling efficient oracles for imitation learning and parallel edits at decoding time. Empirically, EDITOR uses soft lexical constraints more effectively than the Levenshtein Transformer (Gu et al., 2019) while speeding up decoding dramatically compared to constrained beam search (Post and Vilar, 2018). EDITOR also achieves comparable or better translation quality with faster decoding speed than the Levenshtein Transformer on standard Romanian-English, English-German, and English-Japanese machine translation tasks.
1 Introduction
Neural machine translation (MT) architectures (Bahdanau et al., 2015; Vaswani et al., 2017) make it difficult for users to specify preferences that could be incorporated more easily in statistical MT models (Koehn et al., 2007) and that have been shown to be useful for interactive machine translation (Foster et al., 2002; Barrachina et al., 2009) and domain adaptation (Hokamp and Liu, 2017). Lexical constraints or preferences have previously been incorporated by re-training NMT models with constraints as inputs (Song et al., 2019; Dinu et al., 2019) or with constrained beam search, which drastically slows down decoding (Hokamp and Liu, 2017; Post and Vilar, 2018).
In this work, we introduce a translation model that can seamlessly incorporate users’ lexical choice preferences without increasing the time and computational cost at decoding time, while being trained on regular MT samples. We apply this model to MT tasks with soft lexical constraints. As illustrated in Figure 1, when decoding with soft lexical constraints, user preferences for lexical choice in the output language are provided as an additional input sequence of target words in any order. The goal is to let users encode terminology, domain, or stylistic preferences in target word usage, without strictly enforcing hard constraints that might hamper NMT’s ability to generate fluent outputs.
Our model is an Edit-Based TransfOrmer with Repositioning (EDITOR), which builds on recent progress on non-autoregressive sequence generation (Lee et al., 2018; Ghazvininejad et al., 2019).1 Specifically, the Levenshtein Transformer (Gu et al., 2019) showed that iteratively refining output sequences via insertions and deletions yields a fast and flexible generation process for MT and automatic post-editing tasks. EDITOR replaces the deletion operation with a novel reposition operation to disentangle lexical choice from reordering decisions. As a result, EDITOR exploits lexical constraints more effectively and efficiently than the Levenshtein Transformer, as a single reposition operation can subsume a sequence of deletions and insertions. To train EDITOR via imitation learning, the reposition operation is defined to preserve the ability to use the Levenshtein edit distance (Levenshtein, 1966) as an efficient oracle. We also introduce a dual-path roll-in policy, which lets the reposition and deletion models learn to refine their respective outputs more effectively.
Experiments on Romanian-English, English-German, and English-Japanese MT show that EDITOR achieves comparable or better translation quality with faster decoding speed than the Levenshtein Transformer (Gu et al., 2019) on the standard MT tasks and exploits soft lexical constraints better: It achieves significantly better translation quality and matches more constraints with faster decoding speed than the Levenshtein Transformer. It also drastically speeds up decoding compared with lexically constrained decoding algorithms (Post and Vilar, 2018). Furthermore, results highlight the benefits of soft constraints over hard ones—EDITOR with soft constraints achieves translation quality on par or better than both EDITOR and Levenshtein Transformer with hard constraints (Susanto et al., 2020).
2 Background
Non-Autoregressive MT
Although autoregressive models that decode from left-to-right are the de facto standard for many sequence generation tasks (Cho et al., 2014; Chorowski et al., 2015; Vinyals and Le, 2015), non-autoregressive models offer a promising alternative to speed up decoding by generating a sequence of tokens in parallel (Gu et al., 2018; van den Oord et al., 2018; Ma et al., 2019). However, their output quality suffers due to the large decoding space and strong independence assumptions between target tokens (Ma et al., 2019; Wang et al., 2019). These issues have been addressed via partially parallel decoding (Wang et al., 2018; Stern et al., 2018) or multi-pass decoding (Lee et al., 2018; Ghazvininejad et al., 2019; Gu et al., 2019). This work adopts multi-pass decoding, where the model generates the target sequences by iteratively editing the outputs from previous iterations. Edit operations such as substitution (Ghazvininejad et al., 2019) and insertion-deletion (Gu et al., 2019) have reduced the quality gap between non-autoregressive and autoregressive models. However, we argue that these operations limit the flexibility and efficiency of the resulting models for MT by entangling lexical choice and reordering decisions.
Reordering vs. Lexical Choice
EDITOR’s insertion and reposition operations connect closely with the long-standing view of MT as a combination of a translation or lexical choice model, which selects appropriate translations for source units given their context, and a reordering model, which encourages the generation of a target sequence order appropriate for the target language. This view is reflected in architectures ranging from the word-based IBM models (Brown et al., 1990), to sentence-level models that generate a bag of target words that is then reordered to construct a target sentence (Bangalore et al., 2007), to the Operation Sequence Model (Durrani et al., 2015; Stahlberg et al., 2018), which views translation as a sequence of translation and reordering operations over bilingual minimal units. By contrast, autoregressive NMT models (Bahdanau et al., 2015; Vaswani et al., 2017) do not explicitly separate lexical choice and reordering, and previous non-autoregressive models break up reordering into sequences of other operations. This work introduces the reposition operation, which makes it possible to move words around during the refinement process, as reordering models do. However, we will see that reposition differs from typical reordering so as to enable efficient oracles for training via imitation learning and parallelization of edit operations at decoding time (Section 3).
MT with Soft Lexical Constraints
NMT models lack flexible mechanisms to incorporate users’ preferences in their outputs. Lexical constraints have been incorporated in prior work via 1) constrained training, where NMT models are trained on parallel samples augmented with constraint target phrases in both the source and target sequences (Song et al., 2019; Dinu et al., 2019), or 2) constrained decoding, where beam search is modified to include constraint words or phrases in the output (Hokamp and Liu, 2017; Post and Vilar, 2018). These mechanisms can incorporate domain-specific knowledge and lexicons, which is particularly helpful in low-resource cases (Arthur et al., 2016; Tang et al., 2016). Despite their success at domain adaptation for MT (Hokamp and Liu, 2017) and caption generation (Anderson et al., 2017), they suffer from several issues: Constrained training requires building dedicated models for constrained language generation, while constrained decoding adds significant computational overhead and treats all constraints as hard constraints, which may hurt fluency. In other tasks, various constraint types have been introduced by designing complex architectures tailored to specific content or style constraints (Abu Sheikha and Inkpen, 2011; Mei et al., 2016), or via segment-level “side-constraints” (Sennrich et al., 2016a; Ficler and Goldberg, 2017; Agrawal and Carpuat, 2019), which condition generation on users’ stylistic preferences but do not offer fine-grained control over their realization in the output sequence. We refer the reader to Yvon and Abdul Rauf (2020) for a comprehensive review of the strengths and weaknesses of current techniques to incorporate terminology constraints in NMT.
Our work is closely related to Susanto et al. (2020)’s idea of applying the Levenshtein Transformer to MT with hard terminology constraints. We will see that their technique can directly be used by EDITOR as well (Section 3.3), but this does not offer empirical benefits over the default EDITOR model (Section 4.3).
3 Approach
3.1 The EDITOR Model
We cast both constrained and unconstrained language generation as an iterative sequence refinement problem modeled by a Markov Decision Process, where a state y in the state space corresponds to a sequence of tokens y = (y1, y2, …, yL) from the vocabulary up to length L, and y0 is the initial sequence. For standard sequence generation tasks, y0 is the empty sequence (〈s〉, 〈/s〉). For lexically constrained generation tasks, y0 consists of the words to be used as constraints (〈s〉, c1, …, cm, 〈/s〉).
At the k-th decoding iteration, the model takes as input yk−1, the output from the previous iteration, chooses an action ak to refine the sequence into yk, and receives a reward rk. The policy π maps the input sequence yk−1 to a probability distribution over the action space. Our model is based on the Transformer encoder-decoder (Vaswani et al., 2017), and we extract the decoder representations (h1, …, hn) to make the policy predictions. Each refinement action is based on two basic operations: reposition and insertion.
Reposition
For each position i in the input sequence y1...n, the reposition policy πrps(r | i,y) predicts an index r ∈ [0,n]: If r > 0, we place the r-th input token yr at the i-th output position, otherwise we delete the token at that position (Figure 2). We constrain πrps(1 | 1,y) = πrps(n | n,y) = 1 to maintain sequence boundaries. Note that reposition differs from typical reordering because 1) it makes it possible to delete tokens, and 2) it places tokens at each position independently, which enables parallelization at decoding time. In principle, the same input token can thus be placed at multiple output positions. However, this happens rarely in practice as the policy predictor is trained to follow oracle demonstrations which cannot contain such repetitions by design.2
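To make the reposition semantics concrete, the following minimal Python sketch (illustrative only; the function name and token representation are ours, not the released implementation) applies a vector of predicted reposition indices to a hypothesis:

```python
def apply_reposition(y, r):
    """Apply reposition indices to a token sequence.

    y: list of tokens, with y[0] = "<s>" and y[-1] = "</s>".
    r: list of the same length; r[i] > 0 places the r[i]-th input token
       (1-based) at output position i, while r[i] == 0 deletes the token
       at position i.
    """
    assert len(y) == len(r)
    out = []
    for ri in r:
        if ri > 0:
            out.append(y[ri - 1])  # copy an input token, possibly from another position
        # ri == 0: nothing is emitted, i.e., the token at this position is deleted
    return out

# Swap two words while keeping the sentence boundaries in place.
y = ["<s>", "b", "a", "</s>"]
print(apply_reposition(y, [1, 3, 2, 4]))  # ['<s>', 'a', 'b', '</s>']
```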
Insertion
Following Gu et al. (2019), the insertion operation consists of two phases: (1) placeholder insertion: Given an input sequence y1...n, the placeholder predictor πplh(p | i,y) predicts the number of placeholders p ∈ [0,Kmax] to be inserted between two neighboring tokens (yi,yi+1);3 (2) token prediction: Given the output of the placeholder predictor, the token predictor πtok(t | i,y) replaces each placeholder with an actual token.
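The two-phase insertion can be sketched in the same style (again with illustrative names; the placeholder symbol and toy predictor below are assumptions made for the example):

```python
PLH = "<plh>"

def insert_placeholders(y, p):
    """Insert p[i] placeholders between neighboring tokens y[i] and y[i+1]."""
    assert len(p) == len(y) - 1
    out = [y[0]]
    for i, num in enumerate(p):
        out.extend([PLH] * num)
        out.append(y[i + 1])
    return out

def fill_tokens(y, predict):
    """Replace every placeholder with a token chosen by the token predictor."""
    return [predict(i, y) if tok == PLH else tok for i, tok in enumerate(y)]

# Insert one placeholder after "<s>" and fill it with a toy predictor.
y = insert_placeholders(["<s>", "cat", "</s>"], [1, 0])
print(y)                                     # ['<s>', '<plh>', 'cat', '</s>']
print(fill_tokens(y, lambda i, seq: "the"))  # ['<s>', 'the', 'cat', '</s>']
```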
Action
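One refinement action chains the three operations above: reposition is applied to the current hypothesis, placeholders are then inserted into the repositioned sequence, and tokens are predicted for the placeholders. As a sketch in our own notation, with y′ denoting the sequence after reposition and y″ the sequence after placeholder insertion, the policy factorizes over positions roughly as:

```latex
\pi(a \mid y) \;\approx\;
  \prod_{i} \pi_{\mathrm{rps}}(r_i \mid i, y)\,
  \prod_{i} \pi_{\mathrm{plh}}(p_i \mid i, y')\,
  \prod_{i} \pi_{\mathrm{tok}}(t_i \mid i, y'')
```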
3.2 Dual-Path Imitation Learning
We train EDITOR using imitation learning (Daumé III et al., 2009; Ross et al., 2011; Ross and Bagnell, 2014) to efficiently explore the space of valid action sequences that can reach a reference translation. The key idea is to construct a roll-in policy πin to generate sequences to be refined and a roll-out policy πout to estimate cost-to-go for all possible actions given each input sequence. The model is trained to choose actions that minimize the cost-to-go estimates. We use a search-based oracle policy π* as the roll-out policy and train the model to imitate the optimal actions chosen by the oracle.
Next, we describe how the reposition operation is incorporated in the roll-in policy (Section 3.2.1) and the oracle roll-out policy (Section 3.2.2).
3.2.1 Dual-Path Roll-in Policy
As shown in Figure 3, the roll-in policies for the reposition and insertion policy predictors are stochastic mixtures of the noised reference sequences and the output sequences sampled from their corresponding dual policy predictors. Figure 4 shows an example of creating the roll-in sequences: We first create the initial sequence y0 by applying random word dropping (Gu et al., 2019) and random word shuffling (Lample et al., 2018), with a probability of 0.5 and a maximum shuffle distance of 3, to the reference sequence y*, and then produce the roll-in sequences for each policy predictor as follows:
- Reposition: The roll-in policy is a stochastic mixture of the initial sequence y0 and the sequence obtained by applying one iteration of the oracle placeholder insertion policy p* ∼ π* and the model’s token prediction policy to y0, where the mixture factor β ∈ [0,1] and the random variable u ∼ Uniform(0,1) determine which of the two is selected (Equation (6)).
- Insertion: The roll-in policy is a stochastic mixture of the initial sequence y0 and the sequence obtained by applying one iteration of the model’s reposition policy to y0, where the mixture factor α ∈ [0,1] and the random variable u ∼ Uniform(0,1) determine which of the two is selected (Equation (7)).
While Gu et al. (2019) define roll-in using only the model’s insertion policy, we call our approach dual-path because roll-in creates two distinct intermediate sequences using the model’s reposition or insertion policy. This makes it possible for the reposition and insertion policy predictors to learn to refine one another’s outputs during roll-out, mimicking the iterative refinement process used at inference time.4
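A schematic Python sketch of the noising step and of the two roll-in mixtures follows (function names, the per-word reading of the 0.5 drop probability, and which branch is taken when u falls below the mixture factor are our assumptions; the oracle and model policies are passed in as stand-in callables):

```python
import random

def make_initial_sequence(y_ref, p_drop=0.5, max_shuffle=3):
    """Build y0 from the reference by random word dropping and a bounded
    word shuffle (Lample et al., 2018); sentence boundaries are kept."""
    words = [w for w in y_ref[1:-1] if random.random() > p_drop]
    keys = [i + random.uniform(0, max_shuffle) for i in range(len(words))]
    shuffled = [w for _, w in sorted(zip(keys, words))]
    return [y_ref[0]] + shuffled + [y_ref[-1]]

def rollin_for_reposition(y0, oracle_plh, model_tok, beta=0.5):
    """Mixture of Eq. (6): y0 itself, or y0 after one step of oracle
    placeholder insertion followed by the model's token prediction."""
    return y0 if random.random() < beta else model_tok(oracle_plh(y0))

def rollin_for_insertion(y0, model_rps, alpha=0.5):
    """Mixture of Eq. (7): y0 itself, or y0 after one step of the model's
    reposition policy."""
    return y0 if random.random() < alpha else model_rps(y0)
```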
3.2.2 Oracle Roll-Out Policy
Policy
Algorithm
The reposition and insertion operations used in EDITOR are designed so that the Levenshtein edit distance algorithm (Levenshtein, 1966) can be used as the oracle. The reposition operation (Section 3.1) can be split into two distinct types of operations: (1) deletion and (2) replacing a word with any other word appearing in the input sequence, which is a constrained version of the Levenshtein substitution operation. As a result, we can use dynamic programming to find the optimal action sequence in O(|y||y*|) time. By contrast, the Levenshtein Transformer restricts the oracle and model to insertion and deletion operations only. While in principle substitutions can be performed indirectly by deletion and re-insertion, our results show the benefits of using the reposition variant of the substitution operation.
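For illustration, a minimal Python version of such a dynamic program (a stand-in for the oracle, with our own operation labels) computes the distance and backtraces one optimal operation sequence; in EDITOR, deletions and substitutions are realized through reposition, and insertions through the placeholder and token predictors:

```python
def oracle_edits(y, y_star):
    """Levenshtein dynamic program over insertions, deletions, and
    substitutions in O(|y| * |y*|) time, returning the distance and
    one optimal edit sequence."""
    n, m = len(y), len(y_star)
    # dist[i][j] = cost of turning y[:i] into y_star[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dist[i - 1][j - 1] + (y[i - 1] != y_star[j - 1])
            dist[i][j] = min(sub, dist[i - 1][j] + 1, dist[i][j - 1] + 1)
    # Backtrace one optimal alignment into edit operations.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (y[i - 1] != y_star[j - 1]):
            if y[i - 1] != y_star[j - 1]:
                ops.append(("substitute", i - 1, y_star[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("delete", i - 1))
            i -= 1
        else:
            ops.append(("insert_after", i - 1, y_star[j - 1]))
            j -= 1
    return dist[n][m], list(reversed(ops))

print(oracle_edits(list("kitten"), list("sitting")))
# (3, [('substitute', 0, 's'), ('substitute', 4, 'i'), ('insert_after', 5, 'g')])
```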
3.3 Inference
During inference, we start from the initial sequence y0. For standard sequence generation tasks, y0 is an empty sequence, whereas for lexically constrained generation y0 is a sequence of lexical constraints. Inference then proceeds in the exact same way for constrained and unconstrained tasks. The initial sequence is refined iteratively by applying a sequence of actions (a1,a2,…) = (r1,p1,t1 ; r2,p2,t2 ; …). We greedily select the best action at each iteration given the model policy in Equations (1) to (3). We stop refining if 1) the output sequences from two consecutive iterations are the same (Gu et al., 2019), or 2) the maximum number of decoding steps is reached (Lee et al., 2018; Ghazvininejad et al., 2019).5
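The decoding loop can be summarized by the following Python sketch (the model interface with one greedy step per policy is an assumption made for illustration):

```python
def decode(model, src, constraints=None, max_iter=10):
    """Iterative refinement: start from the constraint sequence (if any)
    or an empty sequence, then greedily apply reposition, placeholder
    insertion, and token prediction until the hypothesis stops changing
    or the iteration limit is reached."""
    y = ["<s>"] + list(constraints or []) + ["</s>"]
    for _ in range(max_iter):
        prev = y
        y = model.reposition(src, y)           # r_k
        y = model.insert_placeholders(src, y)  # p_k
        y = model.fill_tokens(src, y)          # t_k
        if y == prev:  # no edit applied: the output has converged
            break
    return y
```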
Incorporating Soft Constraints
Although EDITOR is trained without lexical constraints, it can be used seamlessly for MT with constraints without any change to the decoding process except using the constraint sequence as the initial sequence.
Incorporating Hard Constraints
We adopt the decoding technique introduced by Susanto et al. (2020) to enforce hard constraints at decoding time: deletion operations on constraint tokens and insertions within a multi-token constraint are prohibited.
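Schematically, this amounts to masking the relevant policy scores before the greedy action selection (an illustrative sketch with our own names; adjacent constraint tokens are assumed to belong to the same multi-token constraint):

```python
NEG_INF = float("-inf")

def mask_for_hard_constraints(rps_scores, plh_scores, is_constraint):
    """Forbid deleting constraint tokens and inserting inside a
    multi-token constraint.

    rps_scores[i][r]: score of placing input token r at position i
                      (r == 0 means delete the token at position i).
    plh_scores[i][p]: score of inserting p placeholders between
                      positions i and i+1.
    is_constraint[i]: whether the token at position i is a constraint token.
    """
    for i, is_c in enumerate(is_constraint):
        if is_c:
            rps_scores[i][0] = NEG_INF  # deletion of a constraint token is forbidden
    for i in range(len(is_constraint) - 1):
        if is_constraint[i] and is_constraint[i + 1]:
            for p in range(1, len(plh_scores[i])):
                plh_scores[i][p] = NEG_INF  # no insertion inside the constraint
    return rps_scores, plh_scores
```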
4 Experiments
We evaluate the EDITOR model on standard (Section 4.2) and lexically constrained machine translation (Sections 4.3–4.4).
4.1 Experimental Settings
Dataset
Following Gu et al. (2019), we experiment on three language pairs spanning different language families and data conditions (Table 1): Romanian-English (Ro-En) from WMT16 (Bojar et al., 2016), English-German (En-De) from WMT14 (Bojar et al., 2014), and English-Japanese (En-Ja) from the WAT2017 Small-NMT Task (Nakazawa et al., 2017). We also evaluate EDITOR on the two En-De test sets with terminology constraints released by Dinu et al. (2019). The test sets are subsets of the WMT17 En-De test set (Bojar et al., 2017) with terminology constraints extracted from Wiktionary and IATE.6 Each test set only contains the sentence pairs in which the exact target terms are used in the reference. The resulting Wiktionary and IATE test sets contain 727 and 414 sentences, respectively. We follow the same preprocessing steps as Gu et al. (2019): We apply normalization, tokenization, true-casing, and BPE (Sennrich et al., 2016b) with 37k and 40k operations for En-De and Ro-En, respectively. For En-Ja, we use the provided subword vocabularies (16,384 BPE units per language from SentencePiece [Kudo and Richardson, 2018]).
Experimental Conditions
We train and evaluate the following models in controlled conditions to thoroughly evaluate EDITOR:
Auto-Regressive Transformers (AR) built using Sockeye (Hieber et al., 2017) and fairseq (Ott et al., 2019). We report AR baselines with both toolkits to enable fair comparisons when using our fairseq-based implementation of EDITOR and the Sockeye-based implementation of lexically constrained decoding algorithms (Post and Vilar, 2018).
Non-Auto-Regressive Transformers (NAR). In addition to EDITOR, we train a Levenshtein Transformer (LevT) with approximately the same number of parameters. Both are implemented using fairseq.
Model and Training Configurations
All models adopt the base Transformer architecture (Vaswani et al., 2017) with dmodel = 512, dhidden = 2048, nheads = 8, nlayers = 6, and pdropout = 0.3. For En-De and Ro-En, the source and target embeddings are tied with the output layer weights (Press and Wolf, 2017; Nguyen and Chiang, 2018). We add dropout to embeddings (0.1) and label smoothing (0.1). AR models are trained with the Adam optimizer (Kingma and Ba, 2015) with a batch size of 4096 tokens. We checkpoint models every 1000 updates. The initial learning rate is 0.0002; it is reduced by 30% after 4 checkpoints without validation perplexity improvement, and training stops after 20 checkpoints without improvement. All NAR models are trained using Adam (Kingma and Ba, 2015) with an initial learning rate of 0.0005 and a batch size of 64,800 tokens for a maximum of 300,000 steps.7 We select the best checkpoint based on validation BLEU (Papineni et al., 2002). All models are trained on 8 NVIDIA V100 Tensor Core GPUs.
Knowledge Distillation
We apply sequence-level knowledge distillation from autoregressive teacher models as widely used in non-autoregressive generation (Gu et al., 2018; Lee et al., 2018; Gu et al., 2019). Specifically, when training the non-autoregressive models, we replace the reference sequences y* in the training data with translation outputs from the AR teacher model (Sockeye, with beam = 4).8 We also report the results when applying knowledge distillation to autoregressive models.
Evaluation
We evaluate translation quality via case-sensitive tokenized BLEU (as in Gu et al. (2019))9 and RIBES (Isozaki et al., 2010), which is more sensitive to word order differences. Before computing the scores, we tokenize the German and English outputs using Moses and Japanese outputs using KyTea.10 For lexically constrained decoding, we report the constraint preservation rate (CPR) in the translation outputs.
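For reference, a simple Python sketch of how such a preservation rate can be computed (an illustrative definition; the exact counting convention is ours):

```python
from collections import Counter

def constraint_preservation_rate(outputs, constraints):
    """Fraction of provided constraint tokens that appear in the output,
    counting each token at most as often as it was requested."""
    preserved = total = 0
    for out_toks, cons_toks in zip(outputs, constraints):
        out_counts = Counter(out_toks)
        for tok, need in Counter(cons_toks).items():
            preserved += min(need, out_counts[tok])
            total += need
    return preserved / total if total else 1.0
```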
We quantify decoding speed using latency per sentence, computed as the average time (in ms) required to translate the test set with a batch size of one (excluding model loading time), divided by the number of sentences in the test set.
4.2 MT Tasks
Because our experiments involve two different toolkits, we first compare the same Transformer AR models built with Sockeye and with fairseq: The AR models achieve comparable decoding speed and translation quality regardless of toolkit—the Sockeye model obtains higher BLEU than the fairseq model on Ro-En and En-De but lower on En-Ja (Table 2). Further comparisons will therefore center on the Sockeye AR model to better compare EDITOR with the lexically constrained decoding algorithm (Post and Vilar, 2018).
| | | Distill | Beam | Params | BLEU ↑ | RIBES ↑ | Latency (ms) ↓ |
|---|---|---|---|---|---|---|---|
| Ro-En | AR (fairseq) | | 4 | 64.5M | 32.0 | 83.8 | 357.14 |
| | AR (sockeye) | | 4 | 64.5M | 32.3 | 83.6 | 369.82 |
| | AR (sockeye) | | 10 | 64.5M | 32.5 | 83.8 | 394.52 |
| | AR (sockeye) | ✓ | 10 | 64.5M | 32.9 | 84.2 | 371.75 |
| | NAR: LevT | ✓ | – | 90.9M | 31.6 | 84.0 | 98.81 |
| | NAR: EDITOR | ✓ | – | 90.9M | 31.9 | 84.0 | 93.20 |
| En-De | AR (fairseq) | | 4 | 64.9M | 27.1 | 80.4 | 363.64 |
| | AR (sockeye) | | 4 | 64.9M | 27.3 | 80.2 | 308.64 |
| | AR (sockeye) | | 10 | 64.9M | 27.4 | 80.3 | 332.73 |
| | AR (sockeye) | ✓ | 10 | 64.9M | 27.6 | 80.5 | 363.52 |
| | NAR: LevT | ✓ | – | 91.1M | 26.9 | 81.0 | 113.12 |
| | NAR: EDITOR | ✓ | – | 91.1M | 26.9 | 80.9 | 105.37 |
| En-Ja | AR (fairseq) | | 4 | 62.4M | 44.9 | 85.7 | 292.40 |
| | AR (sockeye) | | 4 | 62.4M | 43.4 | 85.1 | 286.83 |
| | AR (sockeye) | | 10 | 62.4M | 43.5 | 85.3 | 311.38 |
| | AR (sockeye) | ✓ | 10 | 62.4M | 42.7 | 85.1 | 295.32 |
| | NAR: LevT | ✓ | – | 106.1M | 42.4 | 84.5 | 143.88 |
| | NAR: EDITOR | ✓ | – | 106.1M | 42.3 | 85.1 | 96.62 |
Table 2 also shows that knowledge distillation has a small and inconsistent impact on AR models (Sockeye): It yields higher BLEU on Ro-En, close BLEU on En-De, and lower BLEU on En-Ja.11 Thus, we use the AR models trained without distillation in further experiments.
Next, we compare the NAR models against the AR (Sockeye) baseline. As expected, both EDITOR and LevT achieve translation quality close to that of their AR teachers with a 2–4 times speedup. BLEU differences are small (Δ < 1.1), as in prior work (Gu et al., 2019). The RIBES trends are more surprising: Both NAR models significantly outperform the AR models (Sockeye) on RIBES, except on En-Ja, where EDITOR and the AR models significantly outperform LevT. This illustrates the strength of EDITOR in word reordering.
Finally, results confirm the benefits of EDITOR’s reposition operation over LevT: Decoding with EDITOR is 6–7% faster than LevT on Ro-En and En-De, and 33% faster on En-Ja, a more distant language pair that requires more reordering but no inflection changes on reordered words, with no statistically significant difference in BLEU or RIBES except on En-Ja, where EDITOR significantly outperforms LevT on RIBES. Overall, EDITOR is a good alternative to LevT on standard machine translation tasks and can also replace AR models in settings where decoding speed matters more than small differences in translation quality.
4.3 MT with Lexical Constraints
We now turn to the main evaluation of EDITOR on machine translation with lexical constraints.
Experimental Conditions
We conduct a controlled comparison of the following approaches:
NAR models: EDITOR and LevT view the lexical constraints as soft constraints, provided via the initial target sequence. We also explore the decoding technique introduced in Susanto et al. (2020) to support hard constraints.
AR models: They use the provided target words as hard constraints enforced at decoding time by an efficient form of constrained beam search, dynamic beam allocation (DBA) (Post and Vilar, 2018).12
Crucially, all models, including EDITOR, are the exact same models evaluated on the standard MT tasks above, and do not need to be trained specifically to incorporate constraints.
We define lexical constraints as in Post and Vilar (2018): For each source sentence, we randomly select one to four words from the reference as lexical constraints. We then randomly shuffle the constraints and apply BPE to the constraint sequence. Unlike the terminology test sets of Dinu et al. (2019), which contain only a few hundred sentences with mostly nominal constraints, our constructed test sets are larger and include lexical constraints of all types.
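A minimal Python sketch of this test-set construction (the bpe_encode callable stands in for the subword segmenter; names are ours):

```python
import random

def make_constraints(reference_tokens, bpe_encode, max_constraints=4):
    """Sample one to four reference words as soft constraints, shuffle
    them, and apply BPE to the resulting constraint sequence."""
    k = min(random.randint(1, max_constraints), len(reference_tokens))
    constraints = random.sample(reference_tokens, k)
    random.shuffle(constraints)
    return [piece for word in constraints for piece in bpe_encode(word)]

# Toy usage with an identity "BPE"
print(make_constraints("the cat sat on the mat".split(), lambda w: [w]))
```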
Main Results
Table 3 shows that EDITOR exploits the soft constraints to strike a better balance between translation quality and decoding speed than other models. Compared to LevT, EDITOR preserves 7–17% more constraints and achieves significantly higher translation quality (+1.1–2.5 on BLEU and +1.6–1.8 on RIBES) and faster decoding speed. Compared to the AR model with beam = 4, EDITOR yields significantly higher BLEU (+1.0–2.2) and RIBES (+4.1–6.9) with 3–4 times decoding speedup. After increasing the beam to 10, EDITOR obtains lower BLEU but comparable RIBES with 6–7 times decoding speedup.13 Note that AR models treat provided words as hard constraints and therefore achieve over 99% CPR by design, while NAR models treat them as soft constraints.
| | | Distill | Beam | BLEU ↑ | RIBES ↑ | CPR ↑ | Latency (ms) ↓ |
|---|---|---|---|---|---|---|---|
| Ro-En | AR + DBA (sockeye) | | 4 | 31.0 | 79.5 | 99.7 | 436.26 |
| | AR + DBA (sockeye) | | 10 | 34.6 | 84.5 | 99.5 | 696.68 |
| | NAR: LevT | ✓ | – | 31.6 | 83.4 | 80.3 | 121.80 |
| | + hard constraints | ✓ | – | 27.7 | 78.4 | 99.9 | 140.79 |
| | NAR: EDITOR | ✓ | – | 33.1 | 85.0 | 86.8 | 108.98 |
| | + hard constraints | ✓ | – | 28.8 | 81.2 | 95.0 | 136.78 |
| En-De | AR + DBA (sockeye) | | 4 | 26.1 | 74.7 | 99.7 | 434.41 |
| | AR + DBA (sockeye) | | 10 | 30.5 | 81.9 | 99.5 | 896.60 |
| | NAR: LevT | ✓ | – | 27.1 | 80.0 | 75.6 | 127.00 |
| | + hard constraints | ✓ | – | 24.9 | 74.1 | 100.0 | 134.10 |
| | NAR: EDITOR | ✓ | – | 28.2 | 81.6 | 88.4 | 121.65 |
| | + hard constraints | ✓ | – | 25.8 | 77.2 | 96.8 | 134.10 |
| En-Ja | AR + DBA (sockeye) | | 4 | 44.3 | 81.6 | 100.0 | 418.71 |
| | AR + DBA (sockeye) | | 10 | 48.0 | 85.9 | 100.0 | 736.92 |
| | NAR: LevT | ✓ | – | 42.8 | 84.0 | 74.3 | 161.17 |
| | + hard constraints | ✓ | – | 39.7 | 77.4 | 99.9 | 159.27 |
| | NAR: EDITOR | ✓ | – | 45.3 | 85.7 | 91.3 | 109.50 |
| | + hard constraints | ✓ | – | 43.7 | 82.6 | 96.4 | 132.71 |
Results confirm that enforcing hard constraints increases CPR but degrades translation quality compared to the same model using soft constraints: For LevT, it degrades BLEU by 2.2–3.9 and RIBES by 5.0–6.6. For EDITOR, it degrades BLEU by 1.6–4.3 and RIBES by 3.1–4.4 (Table 3). By contrast, EDITOR with soft constraints strikes a better balance between translation quality and constraint preservation.
The strengths of EDITOR hold when varying the number of constraints (Figure 5). For all tasks and models, adding constraints helps BLEU up to a certain point, ranging from 4 to 10 words. When excluding the slower AR model (beam = 10), EDITOR consistently reaches the highest BLEU score with 2–10 constraints, outperforming both LevT and the AR model with beam = 4. Consistent with Post and Vilar (2018), as the number of constraints increases, the AR model needs larger beams to reach good performance. When the number of constraints increases to 10, EDITOR yields higher BLEU than the AR model on En-Ja and Ro-En, even after incurring the cost of increasing the AR beam to 10.
Are EDITOR improvements limited to preserving constraints better? We verify that this is not the case by computing the target word F1 binned by frequency (Neubig et al., 2019). Figure 6 shows that EDITOR improves over LevT across all test frequency classes and closes the gap between NAR and AR models: The largest improvements are obtained for low and medium frequency words—on En-De and En-Ja, the largest improvements are on words with frequency between 5 and 1000, while on Ro-En, EDITOR improves more on words with frequency between 5 and 100. EDITOR also improves F1 on rare words (frequency in [0,5]), but not as much as for more frequent words.
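For completeness, a sketch of frequency-binned target word F1 in the spirit of compare-mt (Neubig et al., 2019); the bin boundaries and counting details below are illustrative assumptions:

```python
from collections import Counter

def f1_by_frequency(refs, hyps, train_freq,
                    bins=((0, 5), (5, 100), (100, 1000), (1000, float("inf")))):
    """Target word F1 bucketed by training-set frequency.
    refs/hyps are lists of token lists; train_freq maps a word to its
    training-corpus count."""
    stats = {b: [0, 0, 0] for b in bins}  # matched, in hypothesis, in reference
    for ref, hyp in zip(refs, hyps):
        ref_c, hyp_c = Counter(ref), Counter(hyp)
        for word in set(ref) | set(hyp):
            freq = train_freq.get(word, 0)
            for lo, hi in bins:
                if lo <= freq < hi:
                    stats[(lo, hi)][0] += min(ref_c[word], hyp_c[word])
                    stats[(lo, hi)][1] += hyp_c[word]
                    stats[(lo, hi)][2] += ref_c[word]
    f1 = {}
    for b, (match, in_hyp, in_ref) in stats.items():
        p = match / in_hyp if in_hyp else 0.0
        r = match / in_ref if in_ref else 0.0
        f1[b] = 2 * p * r / (p + r) if p + r else 0.0
    return f1
```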
We now conduct further analysis to better understand the factors that contribute to EDITOR’s advantages over LevT.
Impact of Reposition
We compare the average number of basic edit operations (Section 3.1) of each type performed by EDITOR and LevT per test sentence (averaged over the 5 runs): reposition (excluding deletion, for a controlled comparison with LevT), deletion, and insertion at decoding time. Table 4 shows that LevT deletes tokens 2–3 times more often than EDITOR, which explains its lower CPR. LevT also inserts tokens 1.2–1.6 times more often than EDITOR and performs 1.4 times more edit operations on En-De and En-Ja. On Ro-En, LevT performs 4% fewer edit operations in total than EDITOR yet is overall slower, since multiple operations can be done in parallel at each action step. Overall, EDITOR takes 3–40% fewer decoding iterations than LevT. These results suggest that reposition successfully reduces redundancy in edit operations and makes decoding more efficient by replacing sequences of insertions and deletions with a single repositioning step.
| | Repos. | Del. | Ins. | Total | Iter. |
|---|---|---|---|---|---|
| Ro-En | | | | | |
| LevT | 0.00 | 4.61 | 33.05 | 37.67 | 2.01 |
| EDITOR | 8.13 | 2.50 | 28.68 | 39.31 | 1.81 |
| En-De | | | | | |
| LevT | 0.00 | 7.13 | 45.45 | 52.58 | 2.14 |
| EDITOR | 5.85 | 4.01 | 28.75 | 38.61 | 2.07 |
| En-Ja | | | | | |
| LevT | 0.00 | 5.24 | 32.83 | 38.07 | 2.93 |
| EDITOR | 4.73 | 1.69 | 21.64 | 28.06 | 1.76 |
Furthermore, Figure 7 illustrates how reposition increases flexibility in exploiting lexical constraints, even when they are provided in the wrong order. While LevT generates an incorrect output by using constraints in the provided order, EDITOR’s reposition operation helps generate a more fluent and adequate translation.
Impact of Dual-Path Roll-In
Ablation experiments (Table 5) show that EDITOR benefits greatly from dual-path roll-in. When it is replaced with the simpler roll-in policy used in Gu et al. (2019), translation quality drops significantly (by 0.9–1.3 BLEU and 0.6–1.9 RIBES), fewer constraints are preserved, and decoding is slower. The ablated model still achieves better translation quality than LevT thanks to the reposition operation: It yields significantly higher BLEU and RIBES on Ro-En, comparable BLEU and significantly higher RIBES on En-De, and comparable RIBES and significantly higher BLEU on En-Ja.
| | BLEU↑ | RIBES↑ | CPR↑ | Lat. ↓ |
|---|---|---|---|---|
| Ro-En | | | | |
| EDITOR | 33.1 | 85.0 | 86.8 | 108.98 |
| - dual-path | 32.2 | 84.4 | 74.8 | 119.61 |
| LevT | 31.6 | 83.4 | 80.3 | 121.80 |
| En-De | | | | |
| EDITOR | 28.2 | 81.6 | 88.4 | 121.65 |
| - dual-path | 27.2 | 80.4 | 78.7 | 130.85 |
| LevT | 27.1 | 80.0 | 75.6 | 127.00 |
| En-Ja | | | | |
| EDITOR | 45.3 | 85.7 | 91.3 | 109.50 |
| - dual-path | 44.0 | 83.9 | 80.0 | 154.10 |
| LevT | 42.8 | 84.0 | 74.3 | 161.17 |
4.4 MT with Terminology Constraints
We evaluate EDITOR on the terminology test sets released by Dinu et al. (2019) to test its ability to incorporate terminology constraints and to further compare it with prior work (Dinu et al., 2019; Post and Vilar, 2018; Susanto et al., 2020).
Compared to Post and Vilar (2018) and Dinu et al. (2019), EDITOR with soft constraints achieves higher absolute BLEU, and higher BLEU improvements over its counterpart without constraints (Table 6). Consistent with previous findings by Susanto et al. (2020), incorporating soft constraints in LevT improves BLEU by +0.3 on Wiktionary and by +0.4 on IATE. Enforcing hard constraints as in Susanto et al. (2020) increases the term usage by +8–10% and improves BLEU by +0.3–0.6 over LevT using soft constraints.14 For EDITOR, adding soft constraints improves BLEU by +0.5 on Wiktionary and +0.9 on IATE, with very high term usages (96.8% and 97.1% respectively). EDITOR thus correctly uses the provided terms almost all the time when they are provided as soft constraints, so there is little benefit to enforcing hard constraints instead: They help close the small gap to reach 100% term usage and do not improve BLEU. Overall, EDITOR achieves on par or higher BLEU than LevT with hard constraints.
| | Term%↑ (Wiktionary) | BLEU↑ (Wiktionary) | Term%↑ (IATE) | BLEU↑ (IATE) |
|---|---|---|---|---|
| Prior Results | | | | |
| Base Trans. | 76.9 | 26.0 | 76.3 | 25.8 |
| Post18 | 99.5 | 25.8 | 82.0 | 25.3 |
| Dinu19 | 93.4 | 26.3 | 94.5 | 26.0 |
| Base LevT | 81.1 | 30.2 | 80.3 | 29.0 |
| Susanto20 | 100.0 | 31.2 | 100.0 | 30.1 |
| Our Results | | | | |
| LevT | 84.3 | 28.2 | 83.9 | 27.9 |
| + soft constraints | 90.5 | 28.5 | 92.5 | 28.3 |
| + hard constraints | 100.0 | 28.8 | 100.0 | 28.9 |
| EDITOR | 83.5 | 28.8 | 83.0 | 27.9 |
| + soft constraints | 96.8 | 29.3 | 97.1 | 28.8 |
| + hard constraints | 99.8 | 29.3 | 100.0 | 28.9 |
Results also suggest that EDITOR can handle phrasal constraints even though it relies on token-level edit operations, since it achieves above 99% term usage on the terminology test sets where 26–27% of the constraints are multi-token.
5 Conclusion
We introduce EDITOR, a non-autoregressive transformer model that iteratively edits hypotheses using a novel reposition operation. Reposition, combined with a new dual-path imitation learning strategy, helps EDITOR generate output sequences that flexibly incorporate users’ lexical choice preferences. Extensive experiments show that EDITOR exploits soft lexical constraints more effectively than the Levenshtein Transformer (Gu et al., 2019) while speeding up decoding dramatically compared to constrained beam search (Post and Vilar, 2018). Results also confirm the benefits of using soft constraints over hard ones in terms of translation quality. EDITOR also achieves comparable or better translation quality with faster decoding speed than the Levenshtein Transformer on three standard MT tasks. These promising results open several avenues for future work, including applying EDITOR to generation tasks other than MT and investigating its ability to incorporate more diverse constraint types into the decoding process.
Acknowledgments
We thank Sweta Agrawal, Kianté Brantley, Eleftheria Briakou, Hal Daumé III, Aquia Richburg, François Yvon, the TACL reviewers, and the CLIP lab at UMD for their helpful and constructive comments. This research is supported in part by an Amazon Web Services Machine Learning Research Award and by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via contract #FA8650-17-C-9117. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.
Notes
2. Empirically, fewer than 1% of tokens are repositioned to more than one output position.
3. In our implementation, we set Kmax = 255.
4. Different from the inference process, we generate the roll-in sequences by applying the model’s reposition or insertion policy for only one iteration.
5. Following Stern et al. (2019), we also experiment with adding a penalty for inserting “empty” placeholders during inference by subtracting a penalty score γ ∈ [0, 3] from the logits of zero in Equation (2) to avoid overly short outputs. However, preliminary experiments show that a zero penalty score achieves the best performance.
6. Available at https://www.wiktionary.org/ and https://iate.europa.eu.
7. Our preliminary experiments and prior work show that NAR models require larger training batches than AR models.
8. This teacher model was selected for a fairer comparison on MT with lexical constraints.
11. Kasai et al. (2020) found that AR models can benefit from knowledge distillation, but with a Transformer large model as a teacher, while we use the Transformer base model.
12. Although the beam pruning option in Post and Vilar (2018) is not used here (since it is no longer supported in Sockeye), other Sockeye updates improve efficiency. Constrained decoding with DBA is 1.8–2.7 times slower than unconstrained decoding here, while DBA is 3 times slower when beam = 10 in Post and Vilar (2018).
13. Post and Vilar (2018) show that the optimal beam size for DBA is 20. Our experiment on En-De shows that increasing the beam size from 10 to 20 improves BLEU by 0.7 at the cost of doubling the decoding time.