EDITOR: an Edit-Based Transformer with Repositioning for Neural Machine Translation with Soft Lexical Constraints

We introduce an Edit-Based Transformer with Repositioning (EDITOR), which makes sequence generation flexible by seamlessly allowing users to specify preferences in output lexical choice. Building on recent models for non-autoregressive sequence generation (Gu et al., 2019), EDITOR generates new sequences by iteratively editing hypotheses. It relies on a novel reposition operation designed to disentangle lexical choice from word positioning decisions, while enabling efficient oracles for imitation learning and parallel edits at decoding time. Empirically, EDITOR uses soft lexical constraints more effectively than the Levenshtein Transformer (Gu et al., 2019) while speeding up decoding dramatically compared to constrained beam search (Post and Vilar, 2018). EDITOR also achieves comparable or better translation quality with faster decoding speed than the Levenshtein Transformer on standard Romanian-English, English-German, and English-Japanese machine translation tasks.


Introduction
Neural machine translation (NMT) architectures (Vaswani et al., 2017) make it difficult for users to specify preferences that could be incorporated more easily in statistical MT models (Koehn et al., 2007) and that have been shown to be useful for interactive machine translation (Foster et al., 2002; Barrachina et al., 2009) and domain adaptation (Hokamp and Liu, 2017). Lexical constraints or preferences have previously been incorporated by re-training NMT models with constraints as inputs (Song et al., 2019; Dinu et al., 2019) or with constrained beam search, which drastically slows down decoding (Hokamp and Liu, 2017; Post and Vilar, 2018).
In this work, we introduce a translation model that can seamlessly incorporate users' lexical choice preferences without increasing the time and computational cost at decoding time, while being trained on regular MT samples. We apply this model to MT tasks with soft lexical constraints. As illustrated in Figure 1, when decoding with soft lexical constraints, user preferences for lexical choice in the output language are provided as an additional input sequence of target words in any order. The goal is to let users encode terminology, domain or stylistic preferences in target word usage, without strictly enforcing hard constraints that might hamper NMT's ability to generate fluent outputs.

Figure 1: Romanian to English MT example (constraint words: "plague", "ankle").
reference: The 29-year-old has been plagued with a troublesome ankle for two years.
unconstrained MT output: The 29-year-old has struggled for two years with problems in the bullying.
hard-constrained MT output: The 29-year-old has been plague for two years with problems in the ankle.
soft-constrained MT output: The 29-year-old has struggled for two years with problems in the ankle.
Unconstrained MT incorrectly translates "gleznȃ" to "bullying". Given constraint words "plague" and "ankle", soft-constrained MT correctly uses "ankle" and avoids disfluencies introduced by using "plague" as a hard constraint in its exact form.
Our model is an Edit-Based TransfOrmer with Repositioning (EDITOR), which builds on recent progress on non-autoregressive sequence generation (Lee et al., 2018; Ghazvininejad et al., 2019). 1 Specifically, the Levenshtein Transformer (Gu et al., 2019) showed that iteratively refining output sequences via insertions and deletions yields a fast and flexible generation process for MT and automatic post-editing tasks. EDITOR replaces the deletion operation with a novel reposition operation to disentangle lexical choice from reordering decisions. As a result, EDITOR exploits lexical constraints more effectively and efficiently than the Levenshtein Transformer, as a single reposition operation can subsume a sequence of deletions and insertions. To train EDITOR via imitation learning, the reposition operation is defined to preserve the ability to use the Levenshtein edit distance (Levenshtein, 1966) as an efficient oracle. We also introduce a dual-path roll-in policy which lets the reposition and deletion models learn to refine their respective outputs more effectively.
Experiments on Romanian-English, English-German, and English-Japanese MT show that EDITOR achieves comparable or better translation quality with faster decoding speed than the Levenshtein Transformer (Gu et al., 2019) on the standard MT tasks and exploits soft lexical constraints better: it achieves significantly better translation quality and matches more constraints with faster decoding speed than the Levenshtein Transformer. It also drastically speeds up decoding compared to lexically constrained decoding algorithms (Post and Vilar, 2018). Furthermore, results highlight the benefits of soft constraints over hard ones: EDITOR with soft constraints achieves translation quality on par with or better than both EDITOR and the Levenshtein Transformer with hard constraints (Susanto et al., 2020).

Background
Non-Autoregressive MT While autoregressive models that decode from left-to-right are the de facto standard for many sequence generation tasks (Cho et al., 2014; Chorowski et al., 2015; Vinyals and Le, 2015), non-autoregressive models offer a promising alternative to speed up decoding by generating a sequence of tokens in parallel (Gu et al., 2018; van den Oord et al., 2018; Ma et al., 2019). However, their output quality suffers due to the large decoding space and strong independence assumptions between target tokens (Ma et al., 2019). These issues have been addressed via partially parallel decoding (Wang et al., 2018; Stern et al., 2018) or multi-pass decoding (Lee et al., 2018; Ghazvininejad et al., 2019; Gu et al., 2019). This work adopts multi-pass decoding, where the model generates the target sequences by iteratively editing the outputs from previous iterations. Edit operations such as substitution (Ghazvininejad et al., 2019) and insertion-deletion (Gu et al., 2019) have reduced the quality gap between non-autoregressive and autoregressive models. However, we argue that these operations limit the flexibility and efficiency of the resulting models for MT by entangling lexical choice and reordering decisions.
Reordering vs. Lexical Choice EDITOR's insertion and reposition operations connect closely with the long-standing view of MT as a combination of a translation or lexical choice model, which selects appropriate translations for source units given their context, and a reordering model, which encourages the generation of a target sequence order appropriate for the target language. This view is reflected in architectures ranging from the word-based IBM models (Brown et al., 1990), to sentence-level models that generate a bag of target words that is reordered to construct a target sentence (Bangalore et al., 2007), to the Operation Sequence Model (Durrani et al., 2015; Stahlberg et al., 2018), which views translation as a sequence of translation and reordering operations over bilingual minimal units. By contrast, autoregressive NMT models (Vaswani et al., 2017) do not explicitly separate lexical choice and reordering, and previous non-autoregressive models break up reordering into sequences of other operations. This work introduces the reposition operation which makes it possible to move words around during the refinement process, as reordering models do. However, we will see that reposition differs from typical reordering to enable efficient oracles for training via imitation learning, and parallelization of edit operations at decoding time (Section 3).
MT with Soft Lexical Constraints NMT models lack flexible mechanisms to incorporate users' preferences in their outputs. Lexical constraints have been incorporated in prior work via 1) constrained training, where NMT models are trained on parallel samples augmented with constraint target phrases in both the source and target sequences (Song et al., 2019; Dinu et al., 2019), or 2) constrained decoding, where beam search is modified to include constraint words or phrases in the output (Hokamp and Liu, 2017; Post and Vilar, 2018). These mechanisms can incorporate domain-specific knowledge and lexicons, which is particularly helpful in low-resource cases (Arthur et al., 2016; Tang et al., 2016). Despite their success at domain adaptation for MT (Hokamp and Liu, 2017) and caption generation (Anderson et al., 2017), they suffer from several issues: constrained training requires building dedicated models for constrained language generation, while constrained decoding adds significant computational overhead and treats all constraints as hard constraints, which may hurt fluency. In other tasks, various constraint types have been introduced by designing complex architectures tailored to specific content or style constraints (Abu Sheikha and Inkpen, 2011; Mei et al., 2016), or via segment-level "side-constraints" (Sennrich et al., 2016a; Ficler and Goldberg, 2017; Scarton and Specia, 2018), which condition generation on users' stylistic preferences but do not offer fine-grained control over their realization in the output sequence. We refer the reader to Yvon and Abdul Rauf (2020) for a comprehensive review of the strengths and weaknesses of current techniques to incorporate terminology constraints in NMT.
Our work is closely related to Susanto et al. (2020)'s idea of applying the Levenshtein Transformer to MT with hard terminology constraints. We will see that their technique can directly be used by EDITOR as well (Section 3.3), but this does not offer empirical benefits over the default EDITOR model (Section 4.3).

The EDITOR Model
We cast both constrained and unconstrained language generation as an iterative sequence refinement problem modeled by a Markov Decision Process $(\mathcal{Y}, \mathcal{A}, \mathcal{E}, \mathcal{R}, y^0)$, where a state $y$ in the state space $\mathcal{Y}$ corresponds to a sequence of tokens $y = (y_1, y_2, \ldots, y_L)$ from the vocabulary $\mathcal{V}$ up to length $L$, and $y^0 \in \mathcal{Y}$ is the initial sequence. For standard sequence generation tasks, $y^0$ is the empty sequence $(\langle s\rangle, \langle/s\rangle)$. For lexically constrained generation tasks, $y^0$ consists of the words to be used as constraints $(\langle s\rangle, c_1, \ldots, c_m, \langle/s\rangle)$.
At the $k$-th decoding iteration, the model takes as input $y^{k-1}$, the output from the previous iteration, chooses an action $a^k \in \mathcal{A}$ to refine the sequence into $y^k = \mathcal{E}(y^{k-1}, a^k)$, and receives a reward $r^k = \mathcal{R}(y^k)$. The policy $\pi$ maps the input sequence $y^{k-1}$ to a probability distribution $P(\mathcal{A})$ over the action space $\mathcal{A}$. Our model is based on the Transformer encoder-decoder (Vaswani et al., 2017), and we extract the decoder representations $(h_1, \ldots, h_n)$ to make the policy predictions. Each refinement action is based on two basic operations: reposition and insertion.

Figure 2: Applying the reposition operation $r$ to input $y$: $r_i > 0$ is the 1-based index of the input token placed at output position $i$; $y_i$ is deleted if $r_i = 0$.
Reposition For each position $i$ in the input sequence $y_{1 \ldots n}$, the reposition policy $\pi_{rps}(r \mid i, y)$ predicts an index $r \in [0, n]$: if $r > 0$, we place the $r$-th input token $y_r$ at the $i$-th output position; otherwise we delete the token at that position (Figure 2). We constrain $\pi_{rps}(1 \mid 1, y) = 1$ and $\pi_{rps}(n \mid n, y) = 1$ to maintain sequence boundaries. Note that reposition differs from typical reordering since 1) it makes it possible to delete tokens, and 2) it places tokens at each position independently, which enables parallelization at decoding time.
In principle, the same input token can thus be placed at multiple output positions. However, this happens rarely in practice as the policy predictor is trained to follow oracle demonstrations which cannot contain such repetitions by design. 2 The reposition classifier gives a categorical distribution over the index of the input token to be placed at each output position:

$$\pi_{rps}(r \mid i, y) = \text{softmax}\big(h_i \cdot [b; e_1; \ldots; e_n]\big) \quad (1)$$

where $e_j$ is the embedding of the $j$-th token in the input sequence, and $b \in \mathbb{R}^{d_{model}}$ is used to predict whether to delete the token. The dot product in the softmax function captures the similarity between the hidden state $h_i$ and each input embedding $e_j$ or the deletion vector $b$.
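To make the reposition semantics concrete, here is a minimal sketch (ours, not the authors' released code) of applying a predicted reposition vector to a token sequence; the helper name `apply_reposition` is our own.

```python
# Minimal sketch of applying a reposition vector r to an input sequence y:
# r[i] == 0 deletes position i; r[i] == j > 0 places the j-th input token
# (1-based) at output position i.
def apply_reposition(y, r):
    assert len(r) == len(y), "one reposition decision per input position"
    return [y[ri - 1] for ri in r if ri > 0]

y = ["<s>", "A", "B", "C", "</s>"]
r = [1, 3, 2, 0, 5]             # swap A and B, delete C, keep boundaries
print(apply_reposition(y, r))   # ['<s>', 'B', 'A', '</s>']
```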
Insertion Following Gu et al. (2019), the insertion operation consists of two phases: (1) placeholder insertion: given an input sequence $y_{1 \ldots n}$, the placeholder predictor $\pi_{plh}(p \mid i, y)$ predicts the number of placeholders $p \in [0, K_{max}]$ to be inserted between two neighboring tokens $(y_i, y_{i+1})$; 3 (2) token prediction: given the output of the placeholder predictor, the token predictor $\pi_{tok}(t \mid i, y)$ replaces each placeholder with an actual token.
The placeholder insertion classifier gives a categorical distribution over the number of placeholders to be inserted between every two consecutive positions:

$$\pi_{plh}(p \mid i, y) = \text{softmax}\big([h_i; h_{i+1}] \cdot W_{plh}\big) \quad (2)$$

The token prediction classifier predicts the identity of the token to fill in each placeholder:

$$\pi_{tok}(t \mid i, y) = \text{softmax}\big(h_i \cdot W_{tok}\big) \quad (3)$$

Action Given an input sequence $y_{1 \ldots n}$, an action consists of repositioning tokens, then inserting and replacing placeholders. Formally, we define an action as a sequence of reposition ($r$), placeholder insertion ($p$), and token prediction ($t$) operations: $a = (r, p, t)$. $r$, $p$, and $t$ are applied in this order so that non-empty initial sequences are adjusted via reposition before new tokens are inserted. Each of $r$, $p$, and $t$ consists of a set of basic operations that can be applied in parallel:

$$r = \{r_1, \ldots, r_n\}, \quad p = \{p_1, \ldots, p_{m-1}\}, \quad t = \{t_1, \ldots, t_l\}$$

where $m = \sum_{i=1}^{n} \mathbb{1}(r_i > 0)$ and $l = \sum_{i=1}^{m-1} p_i$. We define the policy as

$$\pi(a \mid y) = \prod_{r_i \in r} \pi_{rps}(r_i \mid i, y) \cdot \prod_{p_i \in p} \pi_{plh}(p_i \mid i, y') \cdot \prod_{t_i \in t} \pi_{tok}(t_i \mid i, y'')$$

with intermediate outputs $y' = \mathcal{E}(y, r)$ and $y'' = \mathcal{E}(y', p)$.
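The following sketch (ours, under the definitions above) composes the three operations into one action; the placeholder symbol `<plh>` and all helper names are illustrative, not the paper's implementation.

```python
PLH = "<plh>"  # placeholder symbol; the name is ours, for illustration

def apply_reposition(y, r):
    return [y[ri - 1] for ri in r if ri > 0]           # y'  = E(y, r)

def insert_placeholders(y, p):
    """p[i] placeholders are inserted between tokens (y[i], y[i+1])."""
    assert len(p) == len(y) - 1
    out = [y[0]]
    for gap, nxt in zip(p, y[1:]):
        out.extend([PLH] * gap)
        out.append(nxt)
    return out                                          # y'' = E(y', p)

def fill_tokens(y, t):
    """Replace each placeholder with the next predicted token from t."""
    it = iter(t)
    return [next(it) if tok == PLH else tok for tok in y]

def apply_action(y, r, p, t):
    y1 = apply_reposition(y, r)       # reposition first ...
    y2 = insert_placeholders(y1, p)   # ... then insert placeholders ...
    return fill_tokens(y2, t)         # ... then predict their tokens

# Example: start from a constraint sequence and build a sentence around it.
y0 = ["<s>", "ankle", "</s>"]
print(apply_action(y0, r=[1, 2, 3], p=[1, 0], t=["the"]))
# ['<s>', 'the', 'ankle', '</s>']
```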

Dual-Path Imitation Learning
We train EDITOR using imitation learning (Daumé III et al., 2009; Ross et al., 2011; Ross and Bagnell, 2014) to efficiently explore the space of valid action sequences that can reach a reference translation. The key idea is to construct a roll-in policy $\pi^{in}$ to generate sequences to be refined and a roll-out policy $\pi^{out}$ to estimate the cost-to-go for all possible actions given each input sequence. The model is trained to choose actions that minimize the cost-to-go estimates. We use a search-based oracle policy $\pi^*$ as the roll-out policy and train the model to imitate the optimal actions chosen by the oracle.
Formally, let $d_{\pi^{in}_{rps}}$ and $d_{\pi^{in}_{ins}}$ denote the distributions of sequences induced by running the roll-in policies $\pi^{in}_{rps}$ and $\pi^{in}_{ins}$ respectively. We update the model policy $\pi = \pi_{rps} \cdot \pi_{plh} \cdot \pi_{tok}$ to minimize the expected cost $C(\pi; y, \pi^*)$ by comparing the model policy against the cost-to-go estimates under the oracle policy $\pi^*$ given input sequences $y$:

$$\mathcal{L} = \mathbb{E}_{y \sim d_{\pi^{in}_{rps}}}\big[C(\pi; y, \pi^*)\big] + \mathbb{E}_{y \sim d_{\pi^{in}_{ins}}}\big[C(\pi; y, \pi^*)\big] \quad (4)$$

The cost function compares the model vs. oracle actions. As prior work suggests that cost functions close to the cross-entropy loss are better suited to deep neural models than the squared error (Leblond et al., 2018; Cheng et al., 2018), we define the cost function as the KL divergence between the action distributions given by the model policy and by the oracle (Welleck et al., 2019):

$$C(\pi; y, \pi^*) = D_{KL}\big(\pi^*(\cdot \mid y, y^*) \,\|\, \pi(\cdot \mid y)\big) \quad (5)$$

where the oracle has additional access to the reference sequence $y^*$. By minimizing the cost function, the model learns to imitate the oracle policy without access to the reference sequence.
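As a minimal sketch of this cost (ours, assuming PyTorch tensors; `kl_cost` is a hypothetical name, not the paper's code), the per-position KL between oracle and model action distributions can be computed as follows.

```python
import torch
import torch.nn.functional as F

def kl_cost(model_logits, oracle_probs):
    """KL(oracle || model) per position, averaged over the sequence.
    model_logits: [seq_len, num_actions] unnormalized model scores.
    oracle_probs: [seq_len, num_actions]; with a one-hot oracle this
    reduces (up to the clamp) to cross-entropy against the oracle action."""
    log_p = F.log_softmax(model_logits, dim=-1)
    q = oracle_probs.clamp_min(1e-9)          # avoid log(0)
    return (q * (q.log() - log_p)).sum(-1).mean()
```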
Next, we describe how the reposition operation is incorporated in the roll-in policy (Section 3.2.1) and the oracle roll-out policy (Section 3.2.2).

Dual-Path Roll-in Policy
As shown in Figure 3, the roll-in policies $\pi^{in}_{ins}$ and $\pi^{in}_{rps}$ for the reposition and insertion policy predictors are stochastic mixtures of the noised reference sequences and the output sequences sampled from their corresponding dual policy predictors. Figure 4 shows an example of creating the roll-in sequences: we first create the initial sequence $y^0$ by applying random word dropping (Gu et al., 2019) and random word shuffle (Lample et al., 2018), with probability 0.5 and maximum shuffle distance 3, to the reference sequence $y^*$, and then produce the roll-in sequences for each policy predictor as follows (a code sketch follows the list):

Figure 3: Our dual-path imitation learning process uses both the reposition and insertion policies during roll-in so that they can be trained to refine each other's outputs: given an initial sequence $y^0$, created by noising the reference $y^*$, the roll-in policy stochastically generates intermediate sequences $y^{ins}$ and $y^{rps}$ via reposition and insertion respectively. The policy predictors are trained to minimize the costs of reaching $y^*$ from $y^{ins}$ and $y^{rps}$, as estimated by the oracle policy $\pi^*$.

Figure 4 (example with $y^*$ = "<s> I know how you feel </s>" and noised $y^0$ = "<s> how feel you </s>"): The roll-in sequence for the insertion predictor is a stochastic mixture of the noised reference $y^0$ and the output of applying the model's reposition policy $\pi_{rps}$ to $y^0$. The roll-in sequence for the reposition predictor is a stochastic mixture of $y^0$ and the output of applying the oracle placeholder insertion policy $\pi^*_{plh}$ and the model's token prediction policy $\pi_{tok}$ to $y^0$.
1. Reposition: the roll-in policy $\pi^{in}_{rps}$ is a stochastic mixture of the initial sequence $y^0$ and the output sequence obtained by applying one iteration of the oracle placeholder insertion policy $p^* \sim \pi^*_{plh}$ and the model's token prediction policy $\hat{t} \sim \pi_{tok}$ to $y^0$:

$$y^{rps} = \begin{cases} y^0 & \text{if } u < \beta \\ \mathcal{E}(\mathcal{E}(y^0, p^*), \hat{t}) & \text{otherwise} \end{cases} \quad (6)$$

where the mixture factor $\beta \in [0, 1]$ and random variable $u \sim \text{Uniform}(0, 1)$.

2. Insertion: the roll-in policy $\pi^{in}_{ins}$ is a stochastic mixture of the initial sequence $y^0$ and the output sequence obtained by applying one iteration of the model's reposition policy $\hat{r} \sim \pi_{rps}$ to $y^0$:

$$y^{ins} = \begin{cases} y^0 & \text{if } u < \alpha \\ \mathcal{E}(y^0, \hat{r}) & \text{otherwise} \end{cases} \quad (7)$$

where the mixture factor $\alpha \in [0, 1]$ and random variable $u \sim \text{Uniform}(0, 1)$.
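The sketch below (ours) mirrors Eqs. (6) and (7), reusing `insert_placeholders`, `fill_tokens`, and `apply_reposition` from the action sketch above; the policy-function arguments are hypothetical interfaces standing in for the oracle and model predictors.

```python
import random

def rollin_for_reposition(y0, oracle_plh, model_tok, beta=0.5):
    """Eq. (6): keep the noised reference y0 with probability beta,
    otherwise apply oracle placeholder insertion followed by the
    model's token predictions."""
    if random.random() < beta:
        return list(y0)
    y = insert_placeholders(y0, oracle_plh(y0))  # p* ~ pi*_plh
    return fill_tokens(y, model_tok(y))          # t^ ~ pi_tok

def rollin_for_insertion(y0, model_rps, alpha=0.5):
    """Eq. (7): keep y0 with probability alpha, otherwise apply one
    iteration of the model's reposition policy."""
    if random.random() < alpha:
        return list(y0)
    return apply_reposition(y0, model_rps(y0))   # r^ ~ pi_rps
```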
While Gu et al. (2019) define roll-in using only the model's insertion policy, we call our approach dual-path because roll-in creates two distinct intermediate sequences using the model's reposition or insertion policy. This makes it possible for the reposition and insertion policy predictors to learn to refine one another's outputs during roll-out, mimicking the iterative refinement process used at inference time. 4

Oracle Roll-Out Policy
Policy Given an input sequence $y$ and a reference sequence $y^*$, the oracle algorithm finds the optimal action to transform $y$ into $y^*$ with the minimum number of basic edit operations:

$$\text{Oracle}(y, y^*) = \arg\min_{a} \text{NumOps}(y, y^* \mid a) \quad (8)$$

The associated oracle policy places all probability mass on the optimal action:

$$\pi^*(a \mid y, y^*) = \mathbb{1}\big\{a = \text{Oracle}(y, y^*)\big\} \quad (9)$$

Algorithm The reposition and insertion operations used in EDITOR are designed so that the Levenshtein edit distance algorithm (Levenshtein, 1966) can be used as the oracle. The reposition operation (Section 3.1) can be split into two distinct types of operations: (1) deletion and (2) replacing a word with any other word appearing in the input sequence, which is a constrained version of the Levenshtein substitution operation. As a result, we can use dynamic programming to find the optimal action sequence in $O(|y||y^*|)$ time (a sketch of the underlying edit-distance recursion follows). By contrast, the Levenshtein Transformer restricts the oracle and model to insertion and deletion operations only. While in principle substitutions can be performed indirectly by deletion and re-insertion, our results show the benefits of using the reposition variant of the substitution operation.
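For reference, this is the textbook edit-distance dynamic program underlying the oracle (our sketch; the actual oracle also backtracks through the table to emit reposition and insertion actions, which is omitted here).

```python
def levenshtein_distance(y, y_star):
    """Standard O(|y||y*|) edit-distance DP. The EDITOR oracle maps
    deletions and (input-constrained) substitutions in the backtrace
    to reposition operations, and insertions to placeholder insertion
    plus token prediction; only the distance is computed here."""
    n, m = len(y), len(y_star)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i                               # delete all of y[:i]
    for j in range(1, m + 1):
        d[0][j] = j                               # insert all of y*[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if y[i - 1] == y_star[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # delete y_i
                          d[i][j - 1] + 1,        # insert y*_j
                          d[i - 1][j - 1] + cost) # keep or substitute
    return d[n][m]

print(levenshtein_distance("kitten", "sitting"))  # 3
```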

Inference
During inference, we start from the initial sequence $y^0$. For standard sequence generation tasks, $y^0$ is an empty sequence, whereas for lexically constrained generation $y^0$ is a sequence of lexical constraints. Inference then proceeds in the exact same way for constrained and unconstrained tasks. The initial sequence is refined iteratively by applying a sequence of actions $(a^1, a^2, \ldots) = (r^1, p^1, t^1; r^2, p^2, t^2; \ldots)$. We greedily select the best action at each iteration given the model policy in Eqs. (1) to (3). We stop refining if 1) the output sequences from two consecutive iterations are the same (Gu et al., 2019), or 2) the maximum number of decoding steps is reached (Lee et al., 2018; Ghazvininejad et al., 2019). 5

Incorporating Soft Constraints Although EDITOR is trained without lexical constraints, it can be used seamlessly for MT with constraints without any change to the decoding process except using the constraint sequence as the initial sequence.
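The decoding loop can be sketched as follows (ours, reusing `apply_action` from above; `best_action` is a hypothetical callable returning the greedy argmax of Eqs. (1)-(3)). Passing a constraint list yields soft-constrained decoding; passing none yields standard decoding.

```python
def iterative_refine(best_action, x, constraints=None, max_iters=10):
    """Greedy iterative refinement for source sentence x, starting from
    the constraint sequence (soft constraints) or the empty sequence."""
    y = ["<s>"] + list(constraints or []) + ["</s>"]
    for _ in range(max_iters):                 # stop criterion 2
        r, p, t = best_action(x, y)
        y_next = apply_action(y, r, p, t)
        if y_next == y:                        # stop criterion 1: converged
            break
        y = y_next
    return y
```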

Incorporating Hard Constraints
We adopt the decoding technique introduced by Susanto et al. (2020) to enforce hard constraints at decoding time by prohibiting deletion operations on constraint tokens and insertions within a multi-token constraint.
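A minimal sketch of how such a prohibition can be realized (ours, not Susanto et al. (2020)'s code): mask the deletion entry of the reposition distribution for constraint tokens; insertions inside multi-token constraints can be blocked analogously on the placeholder logits.

```python
import math

def mask_for_hard_constraints(rps_logits, is_constraint):
    """Disallow deleting constraint tokens by zeroing out the probability
    of reposition index 0 (deletion) after the softmax.
    rps_logits: per-position lists of scores over reposition indices."""
    for i, keep in enumerate(is_constraint):
        if keep:
            rps_logits[i][0] = -math.inf   # deletion becomes impossible
    return rps_logits
```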
5 Following Stern et al. (2019), we also experiment with adding a penalty for inserting "empty" placeholders during inference, subtracting a penalty score $\gamma \in [0, 3]$ from the logits of zero in Eq. (2) to avoid overly short outputs. However, preliminary experiments show that a zero penalty score achieves the best performance.

Experiments
We evaluate the EDITOR model on standard (Section 4.2) and lexically constrained machine translation (Sections 4.3-4.4).

Experimental Settings
Dataset Following Gu et al. (2019), we experiment on three language pairs spanning different language families and data conditions: Romanian-English (Ro-En), English-German (En-De), and English-Japanese (En-Ja).

Experimental Conditions
We train and evaluate the following models in controlled conditions to thoroughly evaluate EDITOR:
• Auto-Regressive Transformers (AR) built using Sockeye (Hieber et al., 2017) and fairseq (Ott et al., 2019). We report AR baselines with both toolkits to enable fair comparisons when using our fairseq-based implementation of EDITOR and the Sockeye-based implementation of lexically constrained decoding algorithms (Post and Vilar, 2018).

• Non-Auto-Regressive Transformers (NAR): in addition to EDITOR, we train a Levenshtein Transformer (LevT) with approximately the same number of parameters. Both are implemented using fairseq.

Model & Training Configurations
All models adopt the base Transformer architecture (Vaswani et al., 2017) with $d_{model} = 512$, $d_{hidden} = 2048$, $n_{heads} = 8$, $n_{layers} = 6$, and $p_{dropout} = 0.1$. For En-De and Ro-En, the source and target embeddings are tied with the output layer weights (Press and Wolf, 2017; Nguyen and Chiang, 2018). We add dropout to embeddings (0.1) and label smoothing (0.1). AR models are trained with the Adam optimizer (Kingma and Ba, 2015) with a batch size of 4096 tokens. We checkpoint models every 1000 updates. The initial learning rate is 0.0002, and it is reduced by 30% after 4 checkpoints without validation perplexity improvement. Training stops after 20 checkpoints without improvement. All NAR models are trained using Adam (Kingma and Ba, 2015) with an initial learning rate of 0.0005 and a batch size of 64,800 tokens for a maximum of 300,000 steps. 7 We select the best checkpoint based on validation BLEU (Papineni et al., 2002). All models are trained on 8 NVIDIA V100 Tensor Core GPUs.

Knowledge Distillation
We apply sequence-level knowledge distillation from autoregressive teacher models, as widely used in non-autoregressive generation (Gu et al., 2018; Lee et al., 2018; Gu et al., 2019). Specifically, when training the non-autoregressive models, we replace the reference sequences y* in the training data with translation outputs from the AR teacher model (Sockeye, with beam = 4). 8 We also report the results when applying knowledge distillation to autoregressive models.
Evaluation We evaluate translation quality via case-sensitive tokenized BLEU (as in Gu et al. (2019)) 9 and RIBES (Isozaki et al., 2010), which is more sensitive to word order differences. Before computing the scores, we tokenize the German and English outputs using Moses and the Japanese outputs using KyTea. 10 For lexically constrained decoding, we report the constraint preservation rate (CPR) in the translation outputs. We quantify decoding speed using latency per sentence, computed as the average time (in ms) required to translate the test set with a batch size of one (excluding model loading time) divided by the number of sentences in the test set.
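For concreteness, these two metrics can be computed as in the following sketch (ours; function names and the string-based interface are illustrative assumptions, not the paper's evaluation scripts).

```python
import time

def constraint_preservation_rate(hyps, constraints):
    """CPR: fraction of provided constraint words appearing in the output."""
    kept = total = 0
    for hyp, cons in zip(hyps, constraints):
        toks = set(hyp.split())
        kept += sum(c in toks for c in cons)
        total += len(cons)
    return kept / total if total else 0.0

def latency_per_sentence_ms(translate, test_set):
    """Average wall-clock milliseconds per sentence at batch size one,
    excluding model loading (assumed to happen before this call)."""
    start = time.perf_counter()
    for src in test_set:
        translate(src)
    return 1000 * (time.perf_counter() - start) / len(test_set)
```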

MT Tasks
Since our experiments involve two different toolkits, we first compare the same Transformer AR models built with Sockeye and with fairseq: the AR models achieve comparable decoding speed and translation quality regardless of toolkit -the Sockeye model obtains higher BLEU than the fairseq model on Ro-En and En-De but lower on En-Ja (Table 2). Further comparisons will therefore center on the Sockeye AR model to better compare EDITOR with the lexically constrained decoding algorithm (Post and Vilar, 2018).
Table 2 also shows that knowledge distillation has a small and inconsistent impact on AR models (Sockeye): it yields higher BLEU on Ro-En, close BLEU on En-De, and lower BLEU on En-Ja. 11 Thus, we use the AR models trained without distillation in further experiments.
Next, we compare the NAR models against the AR (Sockeye) baseline. As expected, both EDITOR and LevT achieve translation quality close to their AR teachers with a 2-4 times speedup. BLEU differences are small (∆ < 1.1), as in prior work (Gu et al., 2019). The RIBES trends are more surprising: both NAR models significantly outperform the AR models (Sockeye) on RIBES, except for En-Ja, where EDITOR and the AR models significantly outperform LevT. This illustrates the strength of EDITOR in word reordering.
Finally, results confirm the benefits of EDITOR's reposition operation over LevT: decoding with EDITOR is 6-7% faster than LevT on Ro-En and En-De, and 33% faster on En-Ja (a more distant language pair which requires more reordering but no inflection changes on reordered words), with no statistically significant difference in BLEU or RIBES, except for En-Ja, where EDITOR significantly outperforms LevT on RIBES. Overall, EDITOR is a good alternative to LevT on standard machine translation tasks and can also replace AR models in settings where decoding speed matters more than small differences in translation quality.

MT with Soft Lexical Constraints
We now turn to the main evaluation of EDITOR on machine translation with soft lexical constraints.

Experimental Conditions
We conduct a controlled comparison of the following approaches:
• NAR models: EDITOR and LevT treat the lexical constraints as soft constraints, provided via the initial target sequence. We also explore the decoding technique introduced in Susanto et al. (2020) to support hard constraints.
• AR models: they use the provided target words as hard constraints enforced at decoding time by an efficient form of constrained beam search: dynamic beam allocation (DBA) (Post and Vilar, 2018). 12
Crucially, all models, including EDITOR, are the exact same models evaluated on the standard MT tasks above, and do not need to be trained specifically to incorporate constraints.
We define lexical constraints following Post and Vilar (2018): for each source sentence, we randomly select one to four words from the reference as lexical constraints. We then randomly shuffle the constraints and apply BPE to the constraint sequence (a sketch of this procedure follows). Unlike the terminology test sets of Dinu et al. (2019), which contain only a few hundred sentences with mostly nominal constraints, our constructed test sets are larger and include lexical constraints of all types.
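A minimal sketch of this constraint-construction procedure (ours; `bpe_encode` stands in for any word-to-subword encoder and is a hypothetical interface):

```python
import random

def sample_constraints(reference, bpe_encode, max_words=4):
    """Sample one to four reference words, shuffle them, and apply BPE,
    mirroring the setup described above."""
    words = reference.split()
    k = random.randint(1, min(max_words, len(words)))
    chosen = random.sample(words, k)
    random.shuffle(chosen)
    return [piece for w in chosen for piece in bpe_encode(w)]
```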

Main Results

Table 3 shows that EDITOR exploits the soft constraints to strike a better balance between translation quality and decoding speed than other models. Compared to LevT, EDITOR preserves 7-17% more constraints and achieves significantly higher translation quality (+1.1-2.5 BLEU and +1.6-1.8 RIBES) with faster decoding. Compared to the AR model with beam = 4, EDITOR yields significantly higher BLEU (+1.0-2.2) and RIBES (+4.1-6.9) with a 3-4 times decoding speedup. After increasing the beam to 10, EDITOR obtains lower BLEU but comparable RIBES with a 6-7 times decoding speedup. 13 Note that AR models treat provided words as hard constraints and therefore achieve over 99% CPR by design, while NAR models treat them as soft constraints. Results confirm that enforcing hard constraints increases CPR but degrades translation quality compared to the same model using soft constraints: for LevT, it degrades BLEU by 2.2-3.9 and RIBES by 5.0-6.6; for EDITOR, it degrades BLEU by 1.6-4.3 and RIBES by 3.1-4.4 (Table 3). By contrast, EDITOR with soft constraints strikes a better balance between translation quality and constraint preservation.

Table 3: Machine translation with soft lexical constraints (averages over 5 runs). For each metric, we underline the top scores among all models and boldface the top scores among NAR models based on the independent Student's t-test with p < 0.05. EDITOR exploits constraints better than LevT. It also achieves comparable RIBES to the best AR model with a 6-7 times decoding speedup.
The strengths of EDITOR hold when varying the number of constraints (Figure 5). For all tasks and models, adding constraints helps BLEU up to a certain point, ranging from 4 to 10 words. When excluding the slower AR model (beam = 10), EDITOR consistently reaches the highest BLEU score with 2-10 constraints: EDITOR outperforms LevT and the AR model with beam = 4. Consistent with Post and Vilar (2018), as the number of constraints increases, the AR model needs larger beams to reach good performance. When the number of constraints increases to 10, EDITOR yields higher BLEU than the AR model on En-Ja and Ro-En, even after incurring the cost of increasing the AR beam to 10.

13 Post and Vilar (2018) show that the optimal beam size for DBA is 20. Our experiment on En-De shows that increasing the beam size from 10 to 20 improves BLEU by 0.7 at the cost of doubling the decoding time.
Are EDITOR improvements limited to preserving constraints better? We verify that this is not the case by computing the target word F1 binned by frequency (see the sketch below). Figure 6 shows that EDITOR improves over LevT across all test frequency classes and closes the gap between NAR and AR models: the largest improvements are obtained for low- and medium-frequency words. On En-De and En-Ja, the largest improvements are on words with frequency between 5 and 1000, while on Ro-En, EDITOR improves more on words with frequency between 5 and 100. EDITOR also improves F1 on rare words (frequency in [0, 5)), but not as much as for more frequent words. We now conduct further analysis to better understand the factors that contribute to EDITOR's advantages over LevT.
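A sketch of frequency-binned target-word F1 (ours; the bin edges and the `freq` interface, a word-to-frequency map, are illustrative assumptions):

```python
from collections import Counter

def binned_target_f1(hyps, refs, freq, bins=((0, 5), (5, 100), (100, 1000))):
    """Target-word F1 per frequency bin, matching words by count overlap."""
    stats = {b: [0, 0, 0] for b in bins}           # matched, |hyp|, |ref|
    for hyp, ref in zip(hyps, refs):
        h, r = Counter(hyp.split()), Counter(ref.split())
        for w in set(h) | set(r):
            for lo, hi in bins:
                if lo <= freq.get(w, 0) < hi:
                    stats[(lo, hi)][0] += min(h[w], r[w])
                    stats[(lo, hi)][1] += h[w]
                    stats[(lo, hi)][2] += r[w]
    out = {}
    for b, (m, nh, nr) in stats.items():
        p = m / nh if nh else 0.0
        rc = m / nr if nr else 0.0
        out[b] = 2 * p * rc / (p + rc) if p + rc else 0.0
    return out
```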

Impact of Reposition
We compare the average number of basic edit operations (Section 3.1) of different types used by EDITOR and LevT on each test sentence (averaged over the 5 runs): reposition (excluding deletion, for a controlled comparison with LevT), deletion, and insertion performed at decoding time. Table 4 shows that LevT deletes tokens 2-3 times more often than EDITOR, which explains its lower CPR. LevT also inserts tokens 1.2-1.6 times more often than EDITOR and performs 1.4 times more edit operations on En-De and En-Ja. On Ro-En, LevT performs 4% fewer edit operations in total than EDITOR but is overall slower, since multiple operations can be done in parallel at each action step. Overall, EDITOR takes 3-40% fewer decoding iterations than LevT. These results suggest that reposition successfully reduces redundancy in edit operations and makes decoding more efficient by replacing sequences of insertions and deletions with a single repositioning step.

Figure 6: Target word F1 score binned by word test set frequency: EDITOR improves over LevT the most for words of low or medium frequency. AR achieves higher F1 than EDITOR for words of low or medium frequency at the cost of much longer decoding time.

Furthermore, Figure 7 illustrates how reposition increases flexibility in exploiting lexical constraints, even when they are provided in the wrong order. While LevT generates an incorrect output by using constraints in the provided order, EDITOR's reposition operation helps generate a more fluent and adequate translation.
Impact of Dual-Path Roll-In Ablation experiments (Table 5) show that EDITOR benefits greatly from dual-path roll-in. When dual-path roll-in is replaced with the simpler roll-in policy used in Gu et al. (2019), the model's translation quality drops significantly (by 0.9-1.3 BLEU and 0.6-1.9 RIBES), with fewer constraints preserved and slower decoding. It still achieves better translation quality than LevT thanks to the reposition operation: specifically, it yields significantly higher BLEU and RIBES on Ro-En, comparable BLEU and significantly higher RIBES on En-De, and comparable RIBES and significantly higher BLEU on En-Ja compared to LevT.

Table 6: Term usage percentage (Term%) and BLEU scores of En-De models on terminology test sets (Dinu et al., 2019) provided with correct terminology entries (exact matches on both source and target sides). EDITOR with soft constraints achieves higher BLEU than LevT with soft constraints, and on par or higher BLEU than LevT with hard constraints.

MT with Terminology Constraints
We evaluate EDITOR on the terminology test sets released by Dinu et al. (2019) to test its ability to incorporate terminology constraints and to further compare it with prior work (Dinu et al., 2019; Post and Vilar, 2018; Susanto et al., 2020). Compared to Post and Vilar (2018) and Dinu et al. (2019), EDITOR with soft constraints achieves higher absolute BLEU, and higher BLEU improvements over its counterpart without constraints (Table 6). Consistent with previous findings by Susanto et al. (2020), incorporating soft constraints in LevT improves BLEU by +0.3 on Wiktionary and by +0.4 on IATE. Enforcing hard constraints as in Susanto et al. (2020) increases term usage by 8-10% and improves BLEU by +0.3-0.6 over LevT using soft constraints. 14 For EDITOR, adding soft constraints improves BLEU by +0.5 on Wiktionary and +0.9 on IATE, with very high term usage (96.8% and 97.1% respectively). EDITOR thus correctly uses the provided terms almost all the time when they are provided as soft constraints, so there is little benefit to enforcing hard constraints instead: they help close the small gap to reach 100% term usage and do not improve BLEU. Overall, EDITOR achieves BLEU on par with or higher than LevT with hard constraints.
Results also suggest that EDITOR can handle phrasal constraints even though it relies on token-level edit operations, since it achieves above 99% term usage on the terminology test sets, where 26-27% of the constraints are multi-token.

Conclusion
We introduced EDITOR, a non-autoregressive transformer model that iteratively edits hypotheses using a novel reposition operation. Reposition, combined with a new dual-path imitation learning strategy, helps EDITOR generate output sequences that flexibly incorporate users' lexical choice preferences. Extensive experiments showed that EDITOR exploits soft lexical constraints more effectively than the Levenshtein Transformer (Gu et al., 2019) while speeding up decoding dramatically compared to constrained beam search (Post and Vilar, 2018). Results also confirm the benefits of using soft constraints over hard ones in terms of translation quality. EDITOR also achieves comparable or better translation quality with faster decoding speed than the Levenshtein Transformer on three standard MT tasks. These promising results open several avenues for future work, including using EDITOR for generation tasks other than MT and investigating its ability to incorporate more diverse constraint types into the decoding process.