Abstract
Conventional neural autoregressive decoding commonly assumes a fixed left-to-right generation order, which may be sub-optimal. In this work, we propose a novel decoding algorithm—InDIGO—which supports flexible sequence generation in arbitrary orders through insertion operations. We extend Transformer, a state-of-the-art sequence generation model, to efficiently implement the proposed approach, enabling it to be trained with either a pre-defined generation order or adaptive orders obtained from beam-search. Experiments on four real-world tasks, including word order recovery, machine translation, image captioning, and code generation, demonstrate that our algorithm can generate sequences following arbitrary orders, while achieving competitive or even better performance compared with the conventional left-to-right generation. The generated sequences show that InDIGO adopts adaptive generation orders based on input information.
1 Introduction
Neural autoregressive models have become the de facto standard in a wide range of sequence generation tasks, such as machine translation (Bahdanau et al., 2015), summarization (Rush et al., 2015), and dialogue systems (Vinyals and Le, 2015). In these studies, a sequence is modeled autoregressively with the left-to-right generation order, which raises the question of whether generation in an arbitrary order is worth considering (Vinyals et al., 2016; Ford et al., 2018). Nevertheless, previous studies on generation orders mostly resort to a fixed set of generation orders, showing particular choices of ordering are helpful (Wu et al., 2018; Ford et al., 2018; Mehri and Sigal, 2018), without providing an efficient algorithm for finding adaptive generation orders, or restrict the problem scope to n-gram segment generation (Vinyals et al., 2016).
In this paper, we propose a novel decoding algorithm, Insertion-based Decoding with Inferred Generation Order (InDIGO), which models generation orders as latent variables and automatically infers the generation orders by simultaneously predicting a word and its position to be inserted at each decoding step. Given that absolute positions are unknown before generating the whole sequence, we use a relative-position-based representation to capture generation orders. We show that decoding consists of a series of insertion operations with a demonstration shown in Figure 1.
Figure 1: An example of InDIGO. At each step, we simultaneously predict the next token and its (relative) position to be inserted. The final output sequence is obtained by mapping the words based on their positions.
We extend Transformer (Vaswani et al., 2017) for supporting insertion operations, where the generation order is directly captured as relative positions through self-attention inspired by Shaw et al. (2018). For learning, we maximize the evidence lower-bound (ELBO) of the maximum likelihood objective, and study two approximate posterior distributions of generation orders based on a pre-defined generation order and adaptive orders obtained from beam-search, respectively.
Experimental results on word order recovery, machine translation, code generation, and image captioning demonstrate that our algorithm can generate sequences with arbitrary orders, while achieving competitive or even better performance compared to the conventional left-to-right generation. Case studies show that the proposed method adopts adaptive orders based on input information. The code will be released as part of the official repository of Fairseq (https://github.com/pytorch/fairseq).
2 Neural Autoregressive Decoding
Learning
A neural autoregressive model is commonly learned by maximizing the conditional likelihood $\log p_\theta(y \mid x) = \sum_{t=0}^{T} \log p_\theta(y_{t+1} \mid y_{0:t}, x_{1:T'})$ given a set of parallel examples.
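For concreteness, a minimal sketch of this objective for a single training pair is given below; it is our own illustration (not the released Fairseq code), and `model.step`, which returns next-token logits given the source and the current prefix, is an assumed interface.

```python
import torch.nn.functional as F

def l2r_log_likelihood(model, src_tokens, tgt_tokens):
    """Sum of log p(y_{t+1} | y_{0:t}, x) over one target sequence.

    `model.step(src_tokens, prefix)` is an assumed interface returning the
    next-token logits (a 1-D tensor over the vocabulary) given the prefix.
    `tgt_tokens` is assumed to start with <s> and end with </s>.
    """
    total = 0.0
    for t in range(len(tgt_tokens) - 1):
        logits = F.log_softmax(model.step(src_tokens, tgt_tokens[: t + 1]), dim=-1)
        total = total + logits[tgt_tokens[t + 1]]  # log-prob of the gold next token
    return total
```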
Decoding
A common way to decode a sequence from a trained model is to make use of the autoregressive nature that allows us to predict one word at each step. Given any source x, we essentially follow the order of factorization to generate tokens sequentially using some heuristic-based algorithms such as greedy decoding and beam-search.
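As a reference point for the conventional setting, greedy decoding under this factorization can be sketched as follows (same assumed `model.step` interface as above; beam-search instead keeps the top-B prefixes at every step):

```python
def greedy_decode(model, src_tokens, bos, eos, max_len=200):
    """Greedy left-to-right decoding: repeatedly append the argmax next token."""
    prefix = [bos]
    for _ in range(max_len):
        logits = model.step(src_tokens, prefix)   # next-token logits (1-D tensor)
        next_token = int(logits.argmax())
        prefix.append(next_token)
        if next_token == eos:                     # stop once </s> is produced
            break
    return prefix
```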
3 Insertion-based Decoding with Inferred Generation Order (InDIGO)
Equation 1 explicitly assumes a left-to-right (L2R) generation order of the sequence y. In principle, we can factorize the sequence probability in any permutation and train a model for each permutation separately. As long as we have an infinite amount of data with proper optimization performed, all these models are equivalent. Nevertheless, Vinyals et al. (2016) have shown that the generation order of a sequence actually matters in many real-world tasks, e.g., language modeling.
Although the L2R order is a strong inductive bias, as it is natural for most human beings to read and write sequences from left to right, L2R is not necessarily the optimal option for generating sequences. For instance, people sometimes tend to think of central phrases first before building up a whole sentence. For programming languages, it is beneficial to generate code based on abstract syntax trees (Yin and Neubig, 2017).
Therefore, a natural question arises: how can we decode a sequence in its best order?
3.1 Orders as Latent Variables
At decoding time, the factorization allows us to decode autoregressively by predicting word yt+1 and its position zt+1 step by step. The generation order is automatically inferred during decoding.
3.2 Relative Representation of Positions
It is difficult and inefficient to predict the absolute positions $z_t$ without knowing the actual length $T$. One solution is directly using the absolute positions $z^t_0, \ldots, z^t_t$ of the partial sequence $y_{0:t}$ at each autoregressive step $t$. For example, the absolute positions for the sequence (〈s〉, 〈/s〉, dream, I) are $(z^3_0 = 0, z^3_1 = 3, z^3_2 = 2, z^3_3 = 1)$ in Figure 1 at step $t = 3$. It is, however, inefficient to model such explicit positions using a single neural network without recomputing the hidden states for the entire partial sequence, as some positions are changed at every step (as shown in Figure 1).
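To make the contrast concrete, the toy sketch below recovers absolute positions from a ternary relative-position matrix for the partial sequence above; the sign convention (−1 meaning "left of", +1 meaning "right of") is our assumption for illustration and need not match the exact definition used in Equation (3).

```python
def absolute_positions(rel):
    """Recover absolute positions from a ternary relative-position matrix.

    rel[i][j] in {-1, 0, +1}; we assume +1 means "token i is to the right of
    token j", so the absolute position of token i is simply the number of
    tokens to its left.
    """
    n = len(rel)
    return [sum(1 for j in range(n) if rel[i][j] == +1) for i in range(n)]

# Partial sequence (<s>, </s>, dream, I) from Figure 1 at step t = 3,
# with an assumed relative matrix consistent with the order <s> I dream </s>:
R3 = [
    [ 0, -1, -1, -1],   # <s>   is to the left of everything
    [+1,  0, +1, +1],   # </s>  is to the right of everything
    [+1, -1,  0, +1],   # dream is right of <s> and I, left of </s>
    [+1, -1, -1,  0],   # I     is right of <s>, left of dream and </s>
]
print(absolute_positions(R3))   # -> [0, 3, 2, 1]
```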
Relative Positions
3.3 Insertion-based Decoding
4 Model
We present Transformer-InDIGO, an extension of Transformer (Vaswani et al., 2017), supporting insertion-based decoding. The overall framework is shown in Figure 2.
Figure 2: The overall framework of the proposed Transformer-InDIGO, which includes (a) the word & position prediction module; (b) one-step decoding with position updating; (c) the final decoding output by reordering. The black-white blocks represent the relative position matrix.
4.1 Network Design
We extend the decoder of Transformer with relative-position-based self-attention, joint word and position prediction, and position updating modules.
Self-Attention
One of the major challenges that prevents the vanilla Transformer from generating sequences following arbitrary orders is that the absolute-position-based positional encodings are inefficient (as mentioned in Section 3.2), in that absolute positions are changed during decoding, invalidating the previous hidden states. In contrast, we adapt Shaw et al. (2018) to use relative positions in self-attention. Different from Shaw et al. (2018), in which a clipping distance d (usually d ≥ 2) is set for relative positions, our relative-position representations only preserve d = 1 relations (Equation (3)).
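A rough single-head sketch of such a layer is shown below. It is written from the description above rather than from the released code; the way the relation embeddings enter the attention logits follows Shaw et al. (2018) in spirit, and masking of not-yet-generated tokens is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelativeSelfAttention(nn.Module):
    """Single-head self-attention over ternary relative positions {-1, 0, +1}."""

    def __init__(self, d_model):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        # one key-side embedding per relation: left (-1), self (0), right (+1)
        self.rel_k = nn.Embedding(3, d_model)
        self.scale = d_model ** -0.5

    def forward(self, x, rel):
        # x:   (T, d_model) hidden states of the partial sequence
        # rel: (T, T) LongTensor relative-position matrix, entries in {-1, 0, +1}
        q, k, v = self.q(x), self.k(x), self.v(x)
        a_k = self.rel_k(rel + 1)                          # (T, T, d_model)
        # content term plus relative-position term, as in Shaw et al. (2018)
        logits = q @ k.t() + torch.einsum('id,ijd->ij', q, a_k)
        attn = F.softmax(logits * self.scale, dim=-1)
        return attn @ v
```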
Word and Position Prediction
Position Updating
As mentioned in Sec. 3.1, we update the relative position representation $R_t$ with the predicted $r_{t+1}$. Because updating the relative positions does not change the pre-computed relative-position representations, Transformer-InDIGO can reuse the previous hidden states in the next decoding step, just as the vanilla Transformer does.
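The append-only update itself amounts to a few lines of bookkeeping, sketched below with the same (assumed) sign convention as the earlier toy example; `r_new[j]` is the predicted relation of the inserted token to each existing token j.

```python
def update_relative_matrix(rel, r_new):
    """Append one row and one column to the relative-position matrix R_t.

    rel:   (t+1) x (t+1) list of lists with entries in {-1, 0, +1}
    r_new: length-(t+1) vector; r_new[j] = +1 if the inserted token lands to
           the right of existing token j, -1 if to its left (assumed convention).
    Existing entries are never modified, so cached hidden states remain valid.
    """
    for i, row in enumerate(rel):
        row.append(-r_new[i])         # relation of existing token i to the new token
    rel.append(list(r_new) + [0])     # the new token's own row (0 for itself)
    return rel
```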
4.2 Learning
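As noted in Section 1, we maximize an evidence lower-bound over the latent order π; it takes the standard Jensen form, restated generically here (the per-step decomposition of $\log p_\theta(y_\pi \mid x)$ into word and position probabilities follows the factorization of Section 3 and is omitted):

$$\log p_\theta(y \mid x) \;=\; \log \sum_{\pi \in \mathcal{P}_T} p_\theta(y_\pi \mid x) \;\ge\; \mathbb{E}_{\pi \sim q(\pi \mid x, y)}\big[\log p_\theta(y_\pi \mid x)\big] \;+\; \mathcal{H}\big(q(\pi \mid x, y)\big).$$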
Here, we study two types of q(π|x, y):
Pre-defined Order
If we already possess some prior knowledge about the sequence, e.g., the L2R order is proven to be a strong baseline in many scenarios, we assume a Dirac-delta distribution q(π|x, y) = δ(π = π*(x, y)), where π*(x, y) is a pre-defined order. In this work, we study a set of pre-defined orders, listed in Table 1, to evaluate their effect on generation.
Table 1: Pre-defined orders studied in this work.

| Pre-defined Order | Description |
|---|---|
| Left-to-right (L2R) | Generate words from left to right. (Wu et al., 2018) |
| Right-to-left (R2L) | Generate words from right to left. (Wu et al., 2018) |
| Odd-Even (ODD) | Generate words at odd positions from left to right, then generate even positions. (Ford et al., 2018) |
| Balanced-tree (BLT) | Generate words with a top-down left-to-right order from a balanced binary tree. (Stern et al., 2019) |
| Syntax-tree (SYN) | Generate words with a top-down left-to-right order from the dependency tree. (Wang et al., 2018b) |
| Common-First (CF) | Generate all common words first from left to right, and then generate the others. (Ford et al., 2018) |
| Rare-First (RF) | Generate all rare words first from left to right, and then generate the remaining. (Ford et al., 2018) |
| Random (RND) | Generate words in a random order shuffled every time the example is loaded. |
Searched Adaptive Order (SAO)
We choose the approximate posterior q as the point estimation that maximizes log pθ(yπ|x), which can also be seen as the maximum-a-posteriori (MAP) estimation of the latent order π. In practice, we approximate these generation orders π through beam-search (Pal et al., 2006). Unlike the original beam-search for autoregressive decoding, which searches in the sequence space for the sequence maximizing the probability shown in Equation 1, we search in the space of all permutations of the target sequence for the π maximizing Equation 2, as all the target tokens are known in advance during training.
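A highly simplified sketch of this search is given below; it reflects our reading of the procedure rather than the exact training code, and `score_fn`, which returns the model's log-probability of inserting a given target token next (word and position terms combined), is an assumed interface.

```python
import heapq

def search_adaptive_order(score_fn, target_tokens, beam_size=8):
    """Beam-search over insertion orders of a known target sequence.

    score_fn(order_so_far, token_index) is an assumed callback returning the
    model's log-probability of inserting target_tokens[token_index] next,
    given the tokens already placed (listed in insertion order).
    Returns the highest-scoring complete order as a list of token indices.
    """
    beam = [(0.0, [])]                       # (cumulative log-prob, partial order)
    for _ in range(len(target_tokens)):
        candidates = []
        for logp, order in beam:
            remaining = set(range(len(target_tokens))) - set(order)
            for idx in remaining:
                candidates.append((logp + score_fn(order, idx), order + [idx]))
        beam = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
    return max(beam, key=lambda c: c[0])[1]
```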
Beam-Search with Dropout
The goal of beam-search is to approximately find the most likely generation orders, which prevents learning from exploring other generation orders that may not look favorable at the current stage but may ultimately be better. Prior research (Vijayakumar et al., 2016) also pointed out that the search space of the standard beam-search is restricted. We encourage exploration by injecting noise during beam-search (Cho, 2016). In particular, we found it effective to simply keep dropout on during the search (e.g., dropout = 0.1).
Bootstrapping from a Pre-defined Order
During preliminary experiments, sequences returned by beam-search often degenerated into always predicting common or functional words (“the”, “,”, etc.) as the first several tokens, leading to inferior performance. We conjecture that this is due to the fact that the position prediction module learns much faster than the word prediction module, and it quickly captures spurious correlations induced by a poorly initialized model. It is essential to balance the learning progress of these two modules. To do so, we bootstrap learning by pre-training the model with a pre-defined order (e.g., L2R) before training with beam-searched orders.
4.3 Decoding
For decoding, we directly follow Algorithm 1 to sample or decode greedily from the proposed model. In practice, however, beam-search is important for exploring the output space of neural autoregressive models. In our implementation, we perform beam-search for InDIGO as a two-step search: given a beam size B, at each step we first do beam-search over word predictions, and then, for each searched word, try out all possible positions and select the top-B sub-sequences. In preliminary experiments, we also tried beam-search over words and positions simultaneously using their joint probability, but it did not seem helpful.
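One decoding step of this two-step search can be sketched as follows; `word_logprobs`, `position_logprobs`, and `state.insert` are assumed interfaces, and the real implementation batches these computations on the GPU.

```python
def two_step_beam_search_step(hypotheses, word_logprobs, position_logprobs, B):
    """Expand a beam of at most B hypotheses by one insertion.

    hypotheses: list of (score, state) pairs.
    word_logprobs(state)            -> list of (token, logp) candidates.
    position_logprobs(state, token) -> list of (slot, logp) over the positions
                                       where the token could be inserted.
    """
    candidates = []
    for score, state in hypotheses:
        # Step 1: beam-search over the next word.
        top_words = sorted(word_logprobs(state), key=lambda x: -x[1])[:B]
        for token, w_logp in top_words:
            # Step 2: try out all possible insertion positions for that word.
            for slot, p_logp in position_logprobs(state, token):
                candidates.append((score + w_logp + p_logp,
                                   state.insert(token, slot)))
    # Keep the top-B sub-sequences ranked by the joint score.
    return sorted(candidates, key=lambda x: -x[0])[:B]
```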
5 Experiments
We evaluate InDIGO extensively on four challenging sequence generation tasks: word order recovery, machine translation, natural language to code generation (NL2Code, Ling et al., 2016) and image captioning. We compare our model trained with the pre-defined orders and the adaptive orders obtained by beam-search. We use the same architecture for all orders including the standard L2R order.
5.1 Experimental Settings
Dataset
The machine translation experiments are conducted on three language pairs to study how the decoding order influences the translation quality of languages with diverse characteristics: WMT'16 Romanian-English (Ro-En), WMT'18 English-Turkish (En-Tr), and KFTT English-Japanese (En-Ja, Neubig, 2011). The English part of the Ro-En dataset is used for the word order recovery task. We use the Django dataset (Oda et al., 2015) for the NL2Code task and MS COCO (Lin et al., 2014) with the standard split (Karpathy and Fei-Fei, 2015) for image captioning. The dataset statistics are shown in Table 2.
Table 2: Dataset statistics.

| Dataset | Train | Dev | Test | Avg. Length |
|---|---|---|---|---|
| WMT16 Ro-En | 620k | 2000 | 2000 | 26.48 |
| WMT18 En-Tr | 207k | 3007 | 3000 | 25.81 |
| KFTT En-Ja | 405k | 1166 | 1160 | 27.51 |
| Django | 16k | 1000 | 1801 | 8.87 |
| MS-COCO | 567k | 5000 | 5000 | 12.52 |
Preprocessing
We apply the Moses tokenization and normalization on all the text datasets except for code. We perform 32,000 joint BPE (Sennrich et al., 2016) operations for the MT datasets, while using all the unique words as the vocabulary for NL2Code. For image captioning, we follow the same procedure as described by Lee et al. (2018): we use 49 image feature vectors of 512 dimensions each (extracted from a pretrained ResNet-18; He et al., 2016) as the input to the Transformer encoder. The image features are fixed during training.
Models
We set dmodel = 512, dhidden = 2048, nheads = 8, nlayers = 6, lrmax = 0.0005, warmup = 4000, and dropout = 0.1 throughout all the experiments. The source and target embedding matrices are shared except for En-Ja, as our preliminary experiments showed that not sharing the embeddings significantly improves the translation quality. Both the encoder and decoder use relative positions during self-attention, except for the word order recovery experiments (where the position embedding is removed in the encoder, as there is no ground-truth position information in the input). We do not introduce task-specific modules such as a copying mechanism (Gu et al., 2016).
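For reference, the hyperparameters above can be collected into a single configuration dictionary; the key names below are our own shorthand, not fairseq command-line flags.

```python
config = {
    "d_model": 512,
    "d_hidden": 2048,
    "n_heads": 8,
    "n_layers": 6,
    "lr_max": 5e-4,
    "warmup_steps": 4000,
    "dropout": 0.1,
    "share_src_tgt_embeddings": True,     # not shared for En-Ja
    "relative_position_attention": True,  # encoder and decoder; encoder position
                                          # embedding removed for word order recovery
}
```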
Training
When training with the pre-defined orders, we reorder the words of each training sequence in advance accordingly, which provides supervision for the ground-truth position at which each word should be inserted. We test the pre-defined orders listed in Table 1. The SYN orders are generated according to the dependency parse obtained with the dependency parser from spaCy (Honnibal and Montani, 2017), following a parent-to-children left-to-right order. The CF and RF orders are obtained based on a vocabulary cut-off such that the number of common words and the number of rare words are approximately the same (Ford et al., 2018). We also consider sampling a random order on the fly for each sentence as a baseline (RND). When using L2R as the pre-defined order, Transformer-InDIGO is almost equivalent to the vanilla Transformer, as the position prediction simply learns to predict the next position as the left of the 〈/s〉 symbol. The only difference is that it enhances the vanilla Transformer with a small number of additional parameters for position prediction.
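To illustrate this preprocessing step, the sketch below reorders a tokenized training sentence according to two of the pre-defined orders in Table 1 (odd-even and rare-first); the exact tie-breaking and vocabulary cut-off used in the paper may differ.

```python
def odd_even_order(tokens):
    """ODD: words at odd positions left-to-right, then the even positions."""
    return tokens[::2] + tokens[1::2]

def rare_first_order(tokens, common_vocab):
    """RF: all rare words first (left-to-right), then the remaining common words."""
    rare = [w for w in tokens if w not in common_vocab]
    common = [w for w in tokens if w in common_vocab]
    return rare + common

sentence = "the cat sat on the mat".split()
print(odd_even_order(sentence))                   # ['the', 'sat', 'the', 'cat', 'on', 'mat']
print(rare_first_order(sentence, {"the", "on"}))  # ['cat', 'sat', 'mat', 'the', 'on', 'the']
```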
We also train Transformer-InDIGO using SAO, where we set the beam size to 8. By default, models trained with SAO are bootstrapped from a slightly pre-trained (6,000 steps) model in the L2R order.
Inference
At test time, we perform beam-search as described in Sec. 4.3. We observe from our preliminary experiments that models trained with different orders (either pre-defined or SAO) have very different optimal beam sizes for decoding. Therefore, we perform sensitivity studies in which the beam size varies from 1 to 20, and pick the beam size with the highest BLEU score on the validation set for each particular model.
5.2 Results and Analysis
Word Order Recovery
Word order recovery takes a bag of words as input and recovers its original word order, which is challenging as the search space is factorial. We do not restrict the vocabulary of the input words. For this task, we compare our model trained with the L2R order and with SAO from beam-search (beam size 8). The BLEU scores over various beam sizes are shown in Figure 3. The model trained with SAO leads to higher BLEU scores than the one trained with L2R, with a gain of up to 3 BLEU points. Furthermore, increasing the beam size brings larger improvements for SAO than for L2R, suggesting that InDIGO produces more diverse predictions and thus has a higher chance of recovering the correct order.
Figure 3: The BLEU scores on the test set for word order recovery with various decoding beam sizes.
Machine Translation
As shown in Table 3, we compare our model trained with pre-defined orders and with SAO under varying setups. We use four evaluation metrics, including BLEU (Papineni et al., 2002), Ribes (Isozaki et al., 2010), Meteor (Banerjee and Lavie, 2005), and TER (Snover et al., 2006), to avoid relying on a single metric that might favor a particular generation order. Most of the pre-defined orders (except for the random order and the balanced-tree [BLT] order) perform reasonably well with InDIGO on the three language pairs. Among the pre-defined orders, the best scores are reached by the L2R order, except for En-Ja, where the R2L order works slightly better according to Ribes. This indicates that monotonic orders are reasonable choices for machine translation. ODD, CF, and RF show similar performance, below the L2R and R2L orders by around 2 BLEU points. The tree-based orders, such as SYN and BLT, do not perform well, indicating that predicting words along a syntactic path is not preferable. On the other hand, Table 3 shows that the model with SAO achieves competitive, and in some cases statistically significant, improvements over the L2R order. The improvements are larger for Turkish and Japanese, indicating that a flexible generation order may improve the translation quality for languages whose syntactic structures differ from English.
Table 3: Machine translation results (BLEU, Ribes, Meteor, TER) on WMT16 Ro→En, WMT18 En→Tr, and KFTT En→Ja.

| Order | Ro→En BLEU | Ro→En Ribes | Ro→En Meteor | Ro→En TER | En→Tr BLEU | En→Tr Ribes | En→Tr Meteor | En→Tr TER | En→Ja BLEU | En→Ja Ribes | En→Ja Meteor | En→Ja TER |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RND | 20.20 | 79.35 | 41.00 | 63.20 | 3.04 | 55.45 | 19.12 | 90.60 | 17.09 | 70.89 | 35.24 | 70.11 |
| L2R | 31.82 | 83.37 | 52.19 | 50.62 | 14.85 | 69.20 | 33.90 | 71.56 | 30.87 | 77.72 | 48.57 | 59.92 |
| R2L | 31.62 | 83.18 | 52.09 | 50.20 | 14.38 | 68.87 | 33.33 | 71.91 | 30.44 | 77.95 | 47.91 | 61.09 |
| ODD | 30.11 | 83.09 | 50.68 | 50.79 | 13.64 | 68.85 | 32.48 | 72.84 | 28.59 | 77.01 | 46.28 | 60.12 |
| BLT | 24.38 | 81.70 | 45.67 | 55.38 | 8.72 | 65.70 | 27.40 | 77.76 | 21.50 | 73.97 | 40.23 | 64.39 |
| SYN | 29.62 | 82.65 | 50.25 | 52.14 | — | — | — | — | — | — | — | — |
| CF | 30.25 | 83.22 | 50.71 | 50.72 | 12.04 | 67.61 | 31.18 | 74.75 | 28.91 | 77.06 | 46.46 | 61.56 |
| RF | 30.23 | 83.29 | 50.72 | 51.73 | 12.10 | 67.44 | 30.72 | 73.40 | 27.35 | 76.40 | 45.15 | 62.14 |
| SAO | 32.47 | 84.10 | 53.00 | 49.02 | 15.18 | 70.06 | 34.60 | 71.56 | 31.91 | 77.56 | 49.66 | 59.80 |
Code Generation
The goal of this task is to generate Python code based on a natural language description, which can be achieved by using a standard sequence-to-sequence generation framework such as the proposed Transformer-InDIGO. As shown in Table 4, SAO works significantly better than the L2R order in terms of both BLEU and accuracy. This shows that flexible generation orders are preferable in code generation.
Table 4: Results on code generation (Django) and image captioning (MS-COCO).

| Model | Django BLEU | Django Accuracy | MS-COCO BLEU | MS-COCO CIDEr-D |
|---|---|---|---|---|
| L2R | 36.74 | 13.6% | 22.12 | 68.88 |
| SAO | 42.33 | 16.3% | 22.58 | 69.42 |
Image Captioning
For the captioning task, one caption is generated per image and is compared against five human-created captions during testing. As shown in Table 4, SAO obtains higher BLEU and CIDEr-D (Vedantam et al., 2015) scores than the L2R order, which implies that better captions are generated with flexible orders.
5.3 Ablation Study
Model Variants
Table 5 shows the results of the ablation studies using the machine translation task. SAO without bootstrapping or beam-search noise degrades by approximately 1 BLEU point on Ro-En, demonstrating the effectiveness of these two methods. We also test SAO bootstrapped from a model trained with the R2L order as well as the SYN order, which obtains slightly worse yet comparable results compared to bootstrapping from L2R. This suggests that the SAO algorithm is quite robust to different bootstrapping methods, with L2R bootstrapping performing the best. In addition, we re-implement a recent work (Stern et al., 2019) that adopts a similar idea of generating sequences through insertion operations for machine translation. We use the best settings of their algorithm, i.e., training with binary-tree/uniform slot-losses and slot-termination, while removing knowledge distillation for a fair comparison with ours. Our model obtains better performance than Stern et al. (2019) on WMT16 Ro-En.
Table 5: Ablation study on WMT16 Ro-En (BLEU on the dev and test sets).

| Model Variants | dev | test |
|---|---|---|
| Baseline L2R | 32.53 | 31.82 |
| SAO default | 33.60 | 32.47 |
| no bootstrap | 32.86 | 31.88 |
| no bootstrap, no noise | 32.64 | 31.72 |
| bootstrap from R2L order | 33.12 | 32.02 |
| bootstrap from SYN order | 33.09 | 31.93 |
| Stern et al. (2019) - Uniform | 29.99 | 28.52 |
| Stern et al. (2019) - Binary | 32.27 | 30.66 |
Running Time
As shown in Table 6, InDIGO decodes sentences as efficiently as the standard L2R autoregressive models. However, training with SAO as the supervision is slower, as additional effort is needed to search for the generation orders, and the search is difficult to parallelize. SAO with beam sizes 1 and 8 is 3.8 and 7.2 times slower than L2R, respectively. Note that enlarging the beam size during training does not affect the decoding time, as searching for the best orders only happens at training time. We will investigate off-line search methods to speed up SAO training and make InDIGO more scalable in future work.
Table 6: Training and decoding speed.

| Model | Training (batches/s) | Decoding (ms/sentence) |
|---|---|---|
| L2R | 4.21 | 12.3 |
| SAO (b = 1) | 1.12 | 12.5 |
| SAO (b = 8) | 0.58 | 12.8 |
5.4 Visualization
Relative-Position Matrix
In Figure 4, we show an example produced by InDIGO, randomly sampled from the validation set of the KFTT En-Ja dataset. The relative-position matrices (Rt) and their corresponding absolute positions (zt) are shown at each step. We argue that relative-position matrices are a flexible way to encode position information, and their append-only property enables InDIGO to reuse previous hidden states.
Figure 4: A concrete example of the decoding process using InDIGO, sampled from the En-Ja translation dataset. The final output is reordered based on the predicted relative-position matrix.
Case Study
We demonstrate how InDIGO works by uniformly sampling examples from the validation sets for machine translation (Ro-En), image captioning, and code generation. As shown in Figure 5, the proposed model generates sequences in different orders depending on the order used for learning (either pre-defined or SAO). For instance, the model generates tokens approximately following the dependency parse when trained with the SYN order for the machine translation task. On the other hand, the model trained with the RF order learns to produce verbs and nouns first, before filling up the sequence with the remaining functional words.
Figure 5: Examples randomly sampled from the three tasks, decoded by InDIGO with various learned generation orders. The words in red and underlined are the tokens inserted at each step. For visual convenience, we reorder all the partial sequences to their correct positions at each decoding step.
We observe several key characteristics of the orders inferred by SAO by analyzing the model's output for each task: (1) For machine translation, the generation order of an output sequence does not deviate too much from L2R. Instead, the sequences are shuffled in chunks, and words within each chunk are generated in an L2R order; (2) In the examples of image captioning and code generation, the model tends to generate most of the words in the L2R order and insert a few words afterward in certain locations. We provide more examples in the appendix.
6 Related Work
Decoding for Neural Models
Neural autoregressive modelling has become one of the most successful approaches for generating sequences (Sutskever et al., 2011; Mikolov, 2012), which has been widely used in a range of applications, such as machine translation (Sutskever et al., 2014), dialogue response generation (Vinyals and Le, 2015), image captioning (Karpathy and Fei-Fei, 2015), and speech recognition (Chorowski et al., 2015). Another stream of work focuses on generating a sequence of tokens in a non-autoregressive fashion (Gu et al., 2018; Lee et al., 2018; van den Oord et al., 2018), in which the discrete tokens are generated in parallel. Semi-autoregressive modelling (Stern et al., 2018; Wang et al., 2018a) is a mixture of the two approaches, while largely adhering to left-to-right generation. Our method is different from these approaches as we support flexible generation orders, while decoding autoregressively.
Non-L2R Orders
Previous studies on the generation order of sequences mostly resort to a fixed set of generation orders. Wu et al. (2018) empirically show that R2L generation outperforms its L2R counterpart in a few tasks. Ford et al. (2018) devise a two-pass approach that first produces partially-filled sentence "templates" and then fills in the missing tokens. Zhu et al. (2019) similarly propose to first predict a text template and then infill the sentence, in a more general way. Mehri and Sigal (2018) propose a middle-out decoder that first predicts a middle word and then expands the sequence in both directions simultaneously. Other studies have focused on decoding in a bidirectional fashion (Sun et al., 2017; Zhou et al., 2019a,b). Another line of work models sequence generation based on syntax structures (Yamada and Knight, 2001; Charniak et al., 2003; Chiang, 2005; Emami and Jelinek, 2005; Zhang et al., 2016; Dyer et al., 2016; Aharoni and Goldberg, 2017; Wang et al., 2018b; Eriguchi et al., 2017). In contrast, Transformer-InDIGO supports fully flexible generation orders during decoding.
There are two concurrent papers (Welleck et al., 2019; Stern et al., 2019) that study sequence generation in a non-L2R order. Welleck et al. (2019) propose a tree-like generation algorithm. Unlike our insertion-based model, their tree-based generation order covers only a subset of all possible generation orders. Further, Welleck et al. (2019) find that L2R is superior to their learned orders on machine translation tasks, whereas Transformer-InDIGO with searched adaptive orders achieves better performance. Stern et al. (2019) propose a very similar idea of using insertion operations in Transformer for machine translation. The major difference is that they directly use absolute positions, whereas we utilize relative positions. As a result, their model needs to re-encode the partial sequence at every step, which is computationally more expensive, whereas our approach does not require re-encoding the entire sentence during generation. In addition, knowledge distillation was necessary to achieve good performance in Stern et al. (2019), while our model is able to match the performance of L2R even without bootstrapping.
7 Conclusion
We have presented a novel approach—InDIGO— that supports flexible sequence generation. Our model was trained with either pre-defined orders or searched adaptive orders. In contrast to conventional neural autoregressive models that often generate from left to right, our model can flexibly generate a sequence following an arbitrary order. Experiments show that our method achieved competitive or even better performance compared with the conventional left-to-right generation on four tasks, including machine translation, word order recovery, code generation and image captioning.
For future work, it is worth exploring a trainable inference model to directly predict the permutation (Mena et al., 2018) instead of beam-search. Also, the proposed InDIGO could be extended for post-editing tasks such as automatic post-editing for machine translation and grammatical error correction by introducing additional operations such as “deletion” and “substitution”.
Acknowledgments
We especially thank our action editor Alexandra Birch and all the reviewers for their great efforts in reviewing the draft. We also would like to thank Douwe Kiela, Marc’Aurelio Ranzato, Jake Zhao, and our colleagues at FAIR for the valuable feedback, discussions, and technical assistance. This work was partly supported by Samsung Advanced Institute of Technology (Next Generation Deep Learning: From Pattern Recognition to AI) and Samsung Electronics (Improving Deep Learning Using Latent Structure). KC thanks eBay and Nvidia for their support.
Notes
$\mathcal{P}_T$ is the set of all permutations of $(1, \ldots, T)$.
References
Author notes
This work was completed while the author worked as an AI resident at Facebook AI Research.