Pretraining the Noisy Channel Model for Task-Oriented Dialogue

Direct decoding for task-oriented dialogue is known to suffer from the explaining-away effect, manifested in models that prefer short and generic responses. Here we argue for the use of Bayes' theorem to factorize the dialogue task into two models, the distribution of the context given the response, and the prior for the response itself. This approach, an instantiation of the noisy channel model, both mitigates the explaining-away effect and allows the principled incorporation of large pretrained models for the response prior. We present extensive experiments showing that a noisy channel model decodes better responses than direct decoding, and that a two-stage pretraining strategy, employing both open-domain and task-oriented dialogue data, improves over randomly initialized models.


Introduction
Task-oriented dialogue agents provide a conversational interface to assist users in accomplishing specific goals, such as finding a restaurant or booking a hotel (Seneff and Polifroni, 2000; Raux et al., 2005; Budzianowski et al., 2018; Peng et al., 2020a). Increasing demand from industry for natural language assistants and scalable customer service solutions has recently been driving a renaissance in the development of task-oriented dialogue models. In addition, the specification of explicit dialogue agent goals, afforded by the task-oriented paradigm, makes such research easier to ground and evaluate than open-domain chatbots.
Current research on task-oriented dialogue is dominated by monolithic sequence-to-sequence models that directly parameterize the conditional distribution of the response given the prior dialogue context. However, this monolithic approach conflates the task-specific and language-general aspects of dialogue, and adversely favors short and generic responses (Bao et al., 2020) due to the explaining-away effect (Klein and Manning, 2002).

* Work completed during an internship at DeepMind.
Here we pursue an alternative to the direct model. Employing Bayes' rule allows us to factorize the probability of the response given the context, p(R|C), into a language model p(R) and a context model p(C|R). Within natural language processing (NLP), this approach is traditionally known as the noisy channel model (Shannon, 1948), and has recently seen renewed interest with its successful application to neural machine translation (Yu et al., 2017; Yee et al., 2019).
We hypothesize that the noisy channel reformulation is advantageous for dialogue because the factorization enables each sub-module to specialize in a dialogue sub-task. In particular, the context conditional model can help to discount short and generic responses and mitigate the explaining-away effect, while the language model helps ensure that responses are natural. We find that a noisy channel model with the same number of parameters as a direct model achieves better accuracy on three task-oriented dialogue datasets. Moreover, a larger noisy channel model can be trained with the same hardware, by training the sub-modules separately, yielding additional improvements.
It has become common in recent years to pretrain dialogue models on large text data, either general text (Peng et al., 2020b; Budzianowski and Vulić, 2019; Wu et al., 2020a) or dialogue-structured data (Roller et al., 2020; Adiwardana et al., 2020), such as tweets and Reddit posts. We utilise a similar strategy with Reddit data and find that the benefits of pretraining to the noisy channel model are similar to those for the direct model. Further, we evaluate transfer across task-oriented dialogue datasets by implementing a second pretraining stage using Taskmaster (Byrne et al., 2019) and Schema-Guided Dialogue (Rastogi et al., 2020) as training data, before fine-tuning on our final tasks.
We evaluate the algorithm on three datasets, MultiWOZ 2.0 (Budzianowski et al., 2018), CamRest676 (Wen et al., 2017a) and SMCalFlow (Andreas et al., 2020), demonstrating that the noisy channel approach is robust to the different dialogue schema annotations used across datasets. Further analysis demonstrates that the noisy channel models decode responses with lengths and Zipf scores similar to those of ground-truth responses, and reduce the likelihood of falling into repetition loops (Holtzman et al., 2019).

A Seq-to-Seq Dialogue Model
In this section, we introduce a discriminative sequence-to-sequence model for task-oriented dialogue. The traditional sequence of steps needed to produce a system turn in a task-directed dialogue is shown in Figure 1, with an example from MultiWOZ 2.0 (Budzianowski et al., 2018). Given a dialogue context containing previous user and system utterances, the dialogue system first predicts a belief state, consisting of a set of slot-value pairs (e.g. destination: Cambridge), to capture user intent. To ground the system with external information, the belief state can be converted into a database query in order to retrieve relevant information, such as the number of matches and booking information. Next, the system predicts a set of dialogue acts, representing the abstract meaning of the proposed dialogue response (Austin, 1975). Finally, a delexicalized dialogue response is generated, where slot values are replaced by generic placeholders, such as value_time for a train departure time, in order to reduce lexical variation. The delexicalized response can be converted to a lexicalized response in post-processing by filling in the slot values based on belief states and database information.
We use the MultiWOZ schema for illustration in Sections 2 and 3, but our models generalize readily to different schema annotations (e.g. datasets without annotated dialogue acts (Andreas et al., 2020)).
Since it is well known that pipelined models tend to suffer from error propagation, many NLP tasks have been reformulated in recent years as end-to-end text-to-text transformations (Raffel et al., 2020; Brown et al., 2020). State-of-the-art task-oriented dialogue systems have followed this approach (Hosseini-Asl et al., 2020; Peng et al., 2020b). We represent the example from Figure 1 as a single text sequence, serializing turns and using special start and end tokens to encapsulate each data field. Given this text representation, the direct discriminative approach models p(B, A, R|C), where C, B, A, and R represent dialogue context, belief state, dialogue act, and delexicalized response, respectively. We use the serialized text of the dialogue context as input, and the concatenation of belief state, dialogue act, and response as target output, making the task amenable to the application of an autoregressive sequence-to-sequence model. B, A and R can be generated sequentially with direct decoding methods, such as greedy decoding and beam search. We use a sequence-to-sequence Transformer (Vaswani et al., 2017) to implement p(B, A, R|C). This distribution will also be used to build the noisy channel model in Section 3.
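As an illustration, the serialization described above might be sketched as follows. The bracketed special tokens ([b], [a], [r] and their closing counterparts) follow the start/end-token scheme used in the paper ([a] and [/r] appear in Section 3); the exact field formatting is a hypothetical sketch, not the paper's implementation.

```python
def serialize(context_utterances, belief, acts, response):
    """Build a (source, target) pair for the seq2seq direct model.

    The target concatenates belief state, dialogue acts, and the
    delexicalized response, each wrapped in special start/end tokens.
    """
    source = " ".join(context_utterances)  # serialized dialogue context C
    target = (f"[b] {belief} [/b] "
              f"[a] {acts} [/a] "
              f"[r] {response} [/r]")
    return source, target
```

At decoding time, the generated sequence is split back into B, A and R at the same special tokens.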

Noisy Channel Model for Dialogue
While direct decoding is an effective approach for decoding belief states (Hosseini-Asl et al., 2020), it may be sub-optimal for generating responses. First, it favors short and generic responses (Bao et al., 2020). As a result, the decoded responses are bland and lack diversity (Li et al., 2016). Second, it suffers from the explaining-away effect (Klein and Manning, 2002), where inputs are "explained-away" by highly predictive output prefixes. For example, if there is one hotel matching the user's intent as encoded in the belief state, the model is nevertheless prone to decoding "no" given the output prefix "there is", ignoring the input information.
In this work, we propose using the neural noisy channel model (Yu et al., 2017) to mitigate the above problems for response generation. Given an input sequence x and output sequence y, the noisy channel formulation (Shannon, 1948) uses Bayes' rule to rewrite the model p(y|x) as p(x|y)p(y)/p(x) ∝ p(x|y)p(y). It was originally applied to speech recognition, where p(y|x) is a conditional model of the source text given a noisy observation. The channel model p(x|y) estimates the probability of the observation given the source, while p(y) is an unconditional language model (or source model), which can be trained on unpaired data. More recently it has been applied to machine translation, where y is a translation of input text x.
Abstracting away from belief states and dialogue acts, for task-oriented dialogue we want to estimate p(R|C), the probability of a response given a context. The channel model p(C|R), given a response, predicts a distribution over contexts which might have elicited that response. The source model p(R) is an unconditional language model. In this extension of the noisy channel approach to task-oriented dialogue, the "channel" can be understood as connecting dialogue contexts with suitable responses.
For the full task, we develop a noisy channel model for p(B, A, R|C). Using the chain rule, p(B, A, R|C) = p(B|C) · p(A, R|C, B). Following Hosseini-Asl et al. (2020), we use the direct model described in Section 2 to parameterize p(B|C) and decode B, which our preliminary experiments confirmed to be advantageous.
We use the noisy channel formulation to parameterize p(A, R|C, B). Using Bayes' rule, p(A, R|C, B) ∝ p(C, B|A, R) · p(A, R). The channel model p(C, B|A, R) and source model p(A, R) are implemented as Transformers.
We choose to use the noisy channel formulation for decoding A based on preliminary experiments which showed improved overall accuracy over direct decoding, possibly because poor dialogue act prediction by the direct model led to worse quality responses. The serialized text of A and R are concatenated during training, and the decoded sequence is split into A and R at the special start/end tokens during decoding.
We suggest that the noisy channel model has three advantages over the direct model for response generation: (1) The channel model can penalize short and generic responses. Such responses can be mapped to a large number of contexts, resulting in a flat distribution over contexts, and thus a lower channel model score for any particular observed context. (2) The channel model ensures that (A, R) must explain the corresponding (C, B), alleviating the explaining-away effect (Yu et al., 2017). (3) The source model, an unconditional distribution over A and R, can make use of abundant non-dialogue textual data for pretraining, further improving the fluency of generated sequences (Brants et al., 2007). We leave exploration of this last advantage for future work, as we pretrain all sub-modules with the same data.
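A toy calculation illustrates advantage (1): if a generic response is compatible with many contexts, the channel model's probability mass over contexts is spread thin, so the score it assigns to the one observed context is low. The context counts below are invented purely for illustration.

```python
import math

# Hypothetical numbers: a generic response ("You're welcome!") could
# plausibly follow any of 10,000 contexts; a specific response fits
# only 10. With mass spread uniformly over compatible contexts, the
# channel log-score of the single observed context is far lower for
# the generic response.
logp_context_given_generic = math.log(1 / 10_000)
logp_context_given_specific = math.log(1 / 10)
assert logp_context_given_specific > logp_context_given_generic
```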

Decoding
Since exact decoding from the noisy channel model, arg max_{A,R} p(C, B|A, R) · p(A, R), is computationally intractable, we experiment with two approximation methods: noisy channel reranking and noisy channel online decoding.

Algorithm 1: Online decoding for the noisy channel.
  Input: context C
  Output: belief, act and response (B, A, R)
  Decode B given C with p(B|C)
  Initialize the beam S with the act start token [a]
  while not end(S): expand each partial sequence in S, then prune S using Eq. 1
  Select the O ∈ S with the largest score under Eq. 1 and return (B, A, R)

Since these methods rely on p(A, R|C, B) as a proposal distribution for approximation, and both p(A, R|C, B) and p(B|C) are parameterized with the direct model introduced in Section 2, our noisy channel model has three sub-modules: a direct model p(B, A, R|C), a channel model p(C, B|A, R), and a source model p(A, R).

Noisy channel reranking: Noisy channel reranking first decodes B, then continues decoding a list S of (A, R) pairs by beam search with the direct model, before using the noisy channel model to rerank the (A, R) pairs. In particular, during beam search, partial sequences are expanded and pruned with p(A, R|C, B) (from the direct model in Section 2). The decoded pairs are reranked using the following model combination:

score(A, R) = log p(A, R|C, B) + λ1 log p(C, B|A, R) + λ2 log p(A, R) + λ3 |A, R|   (1)

where |A, R| denotes the length of (A, R), and λ1, λ2 and λ3 are hyperparameters. Besides the channel model p(C, B|A, R) and the source model p(A, R), we additionally use the direct model p(A, R|C, B) and a length bias |A, R|, to encourage responses with high direct model likelihood and to discourage short responses, respectively.
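Reading Eq. 1 as a log-linear combination of the four scores, reranking can be sketched as below. The function and field names are ours, not the paper's, and each candidate is assumed to come with its pre-computed log-probabilities under the three sub-modules.

```python
def channel_score(log_p_direct, log_p_channel, log_p_source, length,
                  lam1, lam2, lam3):
    """Model combination of Eq. 1 (log-linear reading): direct model
    plus weighted channel model, source model, and length bias."""
    return (log_p_direct
            + lam1 * log_p_channel   # log p(C, B | A, R)
            + lam2 * log_p_source    # log p(A, R)
            + lam3 * length)         # length bias |A, R|

def rerank(candidates, **lams):
    """Pick the (A, R) candidate with the highest combined score.
    `candidates` is a list of dicts holding the four statistics."""
    return max(candidates, key=lambda c: channel_score(
        c["direct"], c["channel"], c["source"], c["length"], **lams))
```

The weights λ1, λ2, λ3 are tuned on development data.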
Noisy channel online decoding: In contrast to reranking, online decoding applies the noisy channel model during beam search for pruning partial sequences, thus exploring a larger search space.
As shown in Algorithm 1, we first decode the belief state with p(B|C), which comes from the direct model in Section 2. Then, starting with a beam S containing the single sequence [a] (the dialogue act start token), we repeatedly expand the sequences in S until end(S) is met, i.e. all sequences in S either end with [/r] or exceed the maximum length l.
In each iteration, we first expand the sequences in the beam, then prune the expanded beam. To expand a partial act and response sequence (denoted O in Algorithm 1), a naive approach would use the noisy channel model to score all |V| possible single-token expansions (where |V| is the vocabulary size), which is computationally expensive. Instead, we use the next-token probability p(O_{|O|+1} | C, B, O) (where |O| denotes the length of O) to select k1 candidates to be scored by the noisy channel model. This next-token probability comes from the direct model introduced in Section 2. A straightforward way to select the k1 expansions from p(O_{|O|+1} | C, B, O) is top-k maximization, but we can also take advantage of advances in sampling from categorical distributions for text generation (e.g. top-k sampling (Fan et al., 2018) and nucleus sampling (Holtzman et al., 2019)). After the expansion, we prune the expanded beam S' to obtain a smaller beam with k2 partial sequences based on the model combination in Eq. 1. Compared to noisy channel reranking, online decoding applies the noisy channel model during beam search, and is therefore potentially less biased towards the direct model.
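One expand-and-prune iteration of online decoding might look as follows, with the direct model's next-token distribution and the Eq. 1 scorer abstracted as callables. This is a simplified sketch: the real models operate on token ids in batches on TPUs, and the helper names are ours.

```python
import heapq

def online_decode_step(beam, k1, k2, next_token_logprobs, eq1_score):
    """One iteration of noisy channel online decoding.

    beam: list of partial (A, R) token tuples.
    next_token_logprobs(seq) -> {token: logprob} from the direct model
        (the proposal); top-k selection shown here, though top-k or
        nucleus sampling are alternatives mentioned in the text.
    eq1_score(seq) -> combined noisy channel score (Eq. 1).
    """
    expanded = []
    for seq in beam:
        probs = next_token_logprobs(seq)
        top = heapq.nlargest(k1, probs.items(), key=lambda kv: kv[1])
        for tok, _ in top:
            expanded.append(seq + (tok,))
    # Prune: keep the k2 partial sequences with the best Eq. 1 score.
    return heapq.nlargest(k2, expanded, key=eq1_score)
```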
In summary, we note that both beam search for the direct model and online decoding for our noisy channel model decode (B, A, R) autoregressively; both are therefore end-to-end models for task-oriented dialogue. The key difference is that noisy channel online decoding uses Eq. 1 for pruning, while the direct model uses p(A, R|C, B).

Model and Pretraining
We use three Transformer (Vaswani et al., 2017) networks to parameterize the direct model p(B, A, R|C), the channel model p(C, B|A, R) and the source model p(A, R), respectively. The input to each Transformer is the sum of four embeddings: word embeddings, position embeddings, role embeddings (user/system), and turn embeddings (each word corresponds to a turn number). Cross entropy is used as the loss function.
Given training samples (C, B, A, R), if we train the channel model using complete (A, R) pairs as input, a significant discrepancy arises between training and decoding for noisy channel online decoding. Since the channel model is used to score partial act and response pairs, i.e. p(C, B|O) in Algorithm 1, a channel model trained only with complete (A, R) pairs is unsuited to scoring partial sequences. To create partial sequences during training that better match online decoding, we truncate the (A, R) pairs with a truncation length uniformly sampled from 1 to the sequence length (inclusive). The direct model and the source model are trained with complete sequences, as partial sequences occur naturally in their standard autoregressive training procedure.
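The truncation scheme can be sketched directly; as described above, the truncation length is sampled uniformly from 1 to the sequence length, inclusive.

```python
import random

def truncate_target(tokens, rng=random):
    """Truncate an (A, R) token sequence for channel model training.

    The kept prefix length is uniform over 1..len(tokens) (inclusive),
    so the channel model learns to score partial hypotheses like the
    ones it encounters during online decoding.
    """
    n = rng.randint(1, len(tokens))  # randint is inclusive on both ends
    return tokens[:n]
```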
As in-domain dialogue data are usually scarce, we use a two-stage pretraining strategy to enhance the noisy channel model. Although the effectiveness of pretraining with Reddit data has been validated for open-domain dialogue (Bao et al., 2019; Adiwardana et al., 2020), relatively little work has applied such data to task-oriented dialogue. In the first stage, we explore Reddit pretraining (where the Reddit data is preprocessed into (C, R), i.e. context-response, pairs as described below). In the second stage, we use two task-oriented dialogue datasets, Taskmaster (Byrne et al., 2019) and Schema-Guided Dialogue (Rastogi et al., 2020), to specialize the Reddit-pretrained models. Since the Reddit data consists of open-domain-style dialogues (where belief states and dialogue acts are missing), pretraining on these task-oriented datasets familiarizes the models with the sequence-to-sequence representation of task-oriented dialogue. Three models, a context-to-response model, a response-to-context model and a response language model, are pretrained to initialize the direct model, the channel model and the source model, respectively.

Implementation Details
Models: All models are implemented with JAX (Bradbury et al., 2018) and Haiku (Hennigan et al., 2020). For the direct model introduced in Section 2, we use a Transformer with hidden size 512, 12 encoder-decoder layers, and 16 self-attention heads; the model has 114M parameters. For the noisy channel model, we use a base setting and a large setting. The base setting reduces the number of layers to 5, the hidden size to 384 and the number of self-attention heads to 12. Its sub-modules, a direct model, a reverse model and a language model, have 43M, 43M and 30M parameters, respectively. We employ the base setting for a fair comparison with a single direct model using roughly the same number of parameters (116M vs. 114M). For the large setting, we use the same hyperparameters as the direct model (114M), so that its sub-modules, a direct model, a reverse model and a language model, have 114M, 114M and 64M parameters, respectively. We use this large setting to explore the limits of the noisy channel model.

The large noisy channel model (292M) is 2.56 times larger than the direct model (114M), which illustrates another advantage of the noisy channel model during training. While training a direct model with 292M parameters would overflow the memory of 16GB TPUs (v3) without model parallelism, the sub-modules of the large noisy channel model fit easily, as they are trained independently, with no need to load all three modules at once. This enables us to train a noisy channel model with more parameters than a direct model on the same hardware. For inference, we still need to load the sub-modules onto a TPU; since gradients are not required during inference, the three sub-modules of the large noisy channel model (292M) fit on a single TPU with 16GB memory for decoding, although the large model still consumes more memory than the direct model (114M) at inference time.
Pretraining settings: The maximum sequence length l is set to 1024, and longer sequences are truncated. We reuse the vocabulary from GPT-2 (Radford et al., 2019), which contains 50,257 BPE tokens. We use PreNorm (Nguyen and Salazar, 2019) for faster convergence. GELU (Hendrycks and Gimpel, 2016) is applied as the activation function. Following ALBERT (Lan et al., 2020), dropout is disabled during pretraining. We use the normal distribution truncated to the range [-0.01, 0.01] to initialize the input embeddings, while other parameters are initialized using a zero-mean normal distribution with standard deviation 0.1. The batch size is set to 256. The LAMB optimizer (You et al., 2020) (β1 = 0.9 and β2 = 0.999) is employed for optimization. The initial learning rate is 1e-7, and we apply 4000 warmup steps to increase the learning rate to 1e-3, before utilizing cosine annealing to decay the learning rate. Gradient clipping with clipping value 1 is applied to avoid gradient explosion. We use gradient accumulation with accumulation step 20.

Table 1: Statistics of task-oriented dialogue datasets. We define a multi-task dialogue as a dialogue involving multiple tasks, e.g. hotel and restaurant booking, while its counterpart handles a single task, e.g. hotel booking. Taskmaster and CamRest676 do not contain any multi-task dialogues.
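Under the stated settings, the learning-rate schedule can be sketched as below. The final decay target (0 here) is our assumption, as the text specifies only warmup followed by cosine annealing.

```python
import math

def learning_rate(step, total_steps, warmup=4000, lr_init=1e-7, lr_peak=1e-3):
    """Warmup-then-cosine schedule: linear increase from lr_init to
    lr_peak over `warmup` steps, then cosine annealing (to 0 here,
    an assumption) over the remaining steps."""
    if step < warmup:
        return lr_init + (lr_peak - lr_init) * step / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return lr_peak * 0.5 * (1 + math.cos(math.pi * progress))
```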
Pretraining: For Reddit pretraining, we download a Reddit dump (with posts ranging from 2005-12 to 2019-09) from PushShift. Since the comments of a Reddit post are organized into a tree, we extract paths from the tree as dialogue turns. The last comment of each comment path is treated as the response, while the others form the dialogue context. We pretrain each model for 400,000 steps, consuming 102,400,000 (400,000 × 256) comment paths in total. For the task-oriented pretraining, we combine the two datasets, Taskmaster and Schema-Guided Dialogue, and pretrain for 100,000 steps. Statistics of the task-oriented dialogue datasets are shown in Table 1.
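The comment-tree flattening can be sketched as follows, using a toy dict-of-children representation of a thread; the real pipeline parses the PushShift dump, and the representation and function names here are ours.

```python
def extract_dialogues(tree, root="post"):
    """Enumerate root-to-leaf comment paths in a thread.

    `tree` maps a node to its list of replies (toy representation).
    For each path, the last comment is the response and the preceding
    comments form the dialogue context, yielding (context, response).
    """
    def paths(node, prefix):
        children = tree.get(node, [])
        if not children:
            yield prefix + [node]
        for child in children:
            yield from paths(child, prefix + [node])

    examples = []
    for path in paths(root, []):
        if len(path) >= 2:  # need at least one context turn
            examples.append((path[:-1], path[-1]))
    return examples
```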
We train each model using 64 TPU chips with 16GB memory each. The pretraining takes around 4 days to complete.

Datasets
MultiWOZ is a multi-domain dataset consisting of dialogues annotated with C, B, A, R in the following seven domains: attraction, hotel, hospital, police, restaurant, train, and taxi. Since its release, MultiWOZ has been one of the most commonly used task-oriented dialogue datasets.
CamRest676 is annotated similarly to MultiWOZ and consists of dialogues in a single domain: restaurant reservations. Though CamRest676 is smaller than MultiWOZ and predates it, it still provides a widely used benchmark for evaluating task-oriented dialogue models.
SMCalFlow consists of dialogues in four domains: calendar, weather, places, and people. Unlike MultiWOZ and CamRest676, SMCalFlow uses dataflow graphs instead of slot-value pairs to represent belief states and does not annotate dialogue acts. We refer readers to Andreas et al. (2020) for a detailed description of the dataflow representation. We follow Andreas et al. (2020) to convert dataflow graphs into sequences to apply seq2seq models. This dataset is newer and offers fewer prior models to compare with, but we use this dataset to study the robustness of the noisy channel model under different annotation schemas.
We use the public splits for these datasets, where MultiWOZ, CamRest676 and SMCalFlow are split into 8438/1000/1000, 404/136/136 and 32647/3649/5211 dialogues for training, development and testing, respectively. However, since SMCalFlow's test set has not been publicly released, we randomly select 500 dialogues from its training set to tune hyperparameters and use its development set for testing.
Preprocessing: We use the standard preprocessing procedures for each dataset in order to facilitate comparison with prior work.

Fine-Tuning
We apply label smoothing with parameter 0.1. Dropout is used on input embeddings and hidden representations, with dropout rate 0.1. The Adam optimizer (Kingma and Ba, 2015) (β1 = 0.9 and β2 = 0.999) is adopted. We use a fixed learning rate of 1e-4 with gradient clipping for fine-tuning.

Evaluation Metrics
For MultiWOZ and CamRest676, following previous work, we adopt three automatic evaluation metrics: inform, success and BLEU score. Peng et al. (2020a) showed that these metrics are well correlated with human evaluation. The evaluators provided with the datasets are used for calculating these metrics. To calculate the inform score for a dialogue, the evaluator first checks whether certain placeholders (e.g. [restaurant_name]) appear in decoded responses. If so, decoded belief states are converted to database queries to retrieve database records. These records are compared with the records retrieved with ground-truth belief states. The inform score is one if the two sets of database records match. The success score takes all the requestable slots (e.g. postcode, phone number and address) from a decoded response and compares them with the ones in the ground-truth response. The success score is one if the generated requestable slots coincide with the ground-truth ones. BLEU score (BLEU-4) compares the n-grams of generated responses and human responses, and is a widely used metric in NLP for evaluating text quality. Following Budzianowski et al. (2018), we also calculate a combined score: (Inform + Success) / 2 + BLEU. For SMCalFlow, inform and success scores are not applicable, since their calculation relies on delexicalization placeholders and this dataset does not use delexicalization. We instead use SacreBLEU and TER to directly measure the quality of responses. As prior work on this dataset has focused on belief tracking rather than end-to-end response generation, we are the first to use these metrics on this dataset. We perform significance tests, using the t-test for inform, success and TER scores and the permutation test for BLEU.
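The combined score is a direct formula; a one-line sketch, with inform and success given in percent:

```python
def combined_score(inform, success, bleu):
    """MultiWOZ combined score: (Inform + Success) / 2 + BLEU
    (Budzianowski et al., 2018), all in percentage points."""
    return (inform + success) / 2 + bleu
```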

Results
MultiWOZ: Results on the MultiWOZ test set are shown in Table 2. We observe several trends. First, the base noisy channel model (116M) performs better than direct decoding (114M), despite having a similar number of parameters, showing that the noisy channel factorization is beneficial for task-oriented dialogue. The large noisy channel setting improves further over the base setting. Second, Reddit pretraining provides benefits over random initialization, validating the use of large open-domain dialogue-genre pretraining for task-oriented dialogue, while the models with a second stage of task-oriented pretraining obtain further improvements. This effect is consistent across both direct and noisy channel decoding. Finally, we observe that online decoding consistently outperforms reranking, indicating the benefits of tighter model integration during decoding.
Our model performs better on combined score than SOLOIST (Peng et al., 2020a), a closely related baseline which pretrains a GPT2-initialized Transformer with Taskmaster and Schema-Guided Dialogue and decodes with nucleus sampling.
CamRest676: Results on the CamRest676 test set are shown in Table 3. We observe that the base noisy channel model (116M) obtains better results compared to direct decoding (114M), again demonstrating the effectiveness of the noisy channel model. Reddit pretraining again provides a large benefit over random initialization for both direct decoding and noisy channel decoding, while task-oriented pretraining provides a further boost. Our model again performs better than SOLOIST.
SMCalFlow: Results on the SMCalFlow development set are shown in Table 4. As end-to-end models have not previously been tested on this dataset, we use it to demonstrate that the noisy channel model, which we developed primarily on MultiWOZ, continues to be effective on task-oriented dialogue datasets with different annotation schemas. The results are consistent with MultiWOZ and CamRest676. The noisy channel model outperforms the direct model by a large margin, demonstrating that dialogue act annotations are not essential for the noisy channel model, and that it remains effective across diverse dialogue representations. Reddit pretraining confers a similar large benefit on SMCalFlow as on the other datasets, but we observe that task-oriented pretraining brings only marginal further improvements. This may be due to differences in domain or format between our pretraining datasets and SMCalFlow. Alternatively, task-oriented pretraining may help more on task-specific metrics, such as inform and success scores, than on text quality metrics such as BLEU and TER. This hypothesis is further supported by the MultiWOZ results in Table 2.

SacreBLEU: https://cutt.ly/BkuU7dL; TER: https://pypi.org/project/pyter/

Analysis
In this section, we use MultiWOZ and CamRest676 to perform ablation studies on the effects of model combination, large-scale pretraining, and sample efficiency; as well as analyzing the runtime requirements of our model and the reasons for its success.

Ablation on Model Combination
Noisy channel decoding involves a combination of four components, as in Eq. 1: the direct model, the channel model, the source model, and the length bias. Ablating each component in turn confirms the effectiveness of the noisy channel factorization and the importance of each model component.

Effect of Pretraining Scale
We investigate the importance of scale for both our pretraining stages. We select different checkpoints for Reddit pretraining, and truncate the two task-oriented dialogue datasets for task-oriented pretraining. We fine-tune these models using the full training data of CamRest676 or MultiWOZ. The results of three decoding methods (with the large noisy channel model) on the development sets are shown in Figure 2. In Figure 2 (a) and (c), the combined scores of all three decoding methods improve with more Reddit pretraining steps, demonstrating the advantage of increasing amounts of data in the open-domain dialogue pretraining stage. In Figure 2 (b) and (d), the combined scores further increase with more task-oriented data, confirming that additional task-oriented pretraining data is useful.

Sample Efficiency of Fine-Tuning
We investigate whether pretraining can improve sample efficiency during fine-tuning. We gradually increase the amount of fine-tuning data and evaluate the randomly-initialized, Reddit pretrained and task-oriented pretrained models. The results on the development sets are shown in Figure 3. Combined scores increase with more training data under all conditions. Crucially, Reddit pretrained models show better performance with a smaller amount of fine-tuning data than randomly initialized models, and task-oriented pretrained models better still. We conclude that both our pretraining stages can improve sample efficiency, which is especially important when the target task has little training data.

Decoding Runtime
In Table 6, we report the average clock time for decoding one turn (including its belief state, dialogue act and response). Noisy channel reranking is slightly slower compared to direct decoding, with overhead due to the reranking step in Eq. 1. Noisy channel online decoding is significantly slower, since it needs to apply Eq. 1 at each beam search step. In future work we will investigate ways to improve the efficiency of online decoding.

Decoding Properties
In this section we analyze why the noisy channel model performs better than direct decoding.
Length: In Table 7 we show the average length of generated responses. Direct decoding produces shorter responses than the ground truth, confirming that the direct model prefers short and generic responses. Adding a length bias to direct decoding (with λ tuned on the development sets) produces responses longer than the ground truth, which may be a disadvantage. The noisy channel models produce responses with average length closest to the ground truth.
Zipf: Table 8 shows the Zipf scores of responses. We find that the word distributions of responses generated by the noisy channel models are closer to the word distribution of ground-truth responses.
Repetition: In Table 9 we examine the likelihood of falling into repetition loops (Holtzman et al., 2019) for different decoding methods. Repetition loops are rare for all decoding methods, but noisy channel decoding can further decrease their likelihood. The channel model can discount a sequence with a repetition loop, since it conveys less information than a natural sequence of the same length, making it harder to "explain" the context.
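A repetition loop can be detected mechanically, e.g. by checking for a phrase repeated several times in a row; the thresholds below are illustrative and not taken from Holtzman et al. (2019) or the paper.

```python
def has_repetition_loop(tokens, min_phrase=2, min_repeats=3):
    """Flag a degenerate repetition loop: some phrase of at least
    `min_phrase` tokens repeated `min_repeats` or more times
    consecutively. Thresholds are illustrative choices."""
    n = len(tokens)
    for size in range(min_phrase, n // min_repeats + 1):
        for start in range(n - size * min_repeats + 1):
            phrase = tokens[start:start + size]
            if all(tokens[start + k * size:start + (k + 1) * size] == phrase
                   for k in range(min_repeats)):
                return True
    return False
```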
Examples: Some example responses are shown in Table 10. We observe that the noisy channel models decode longer responses than direct decoding, and that these responses reflect their dialogue contexts well, meeting users' requirements.

Related Work
Task-oriented dialogue models: Most task-oriented dialogue systems break down the task into three components: belief tracking (Henderson et al., 2013; Mrkšić et al., 2016; Rastogi et al., 2017; Nouri and Hosseini-Asl, 2018; Wu et al., 2019a; Zhou and Small, 2019; Heck et al., 2020), dialogue act prediction (Wen et al., 2017a; Tanaka et al., 2019) and response generation (Budzianowski et al., 2018; Lippe et al., 2020). Traditionally, a modular approach is adopted, where these components are optimized independently (i.e. a pipeline design) or learned via multi-task learning (i.e. some parameters are shared among the components) (Wen et al., 2017b; Zhao et al., 2019; Mehri et al., 2019; Tseng et al., 2020). However, it is known that improvements in one component do not necessarily lead to overall performance improvements (Ham et al., 2020), and the modular approach suffers from error propagation in practice (Liu and Lane, 2018). These observations gave rise to the sequence-to-sequence approach (Lei et al., 2018; Budzianowski and Vulić, 2019; Wu et al., 2019b; Zhang et al., 2020a; Ham et al., 2020; Hosseini-Asl et al., 2020; Peng et al., 2020a; Yang et al., 2021), where dialogue beliefs and acts are represented as text spans, and a sequence-to-sequence model is applied to subsume the three components. Our work is situated within this general approach. In contrast to previous work, which uses a direct model for decoding, we introduce the noisy channel model to improve task-oriented dialogue.

Table 10 (excerpt): Example responses.
Ground truth: Yes, [attraction_name] is on [attraction_address] and is in the [value_area] side of town. Is there anything else you need to know? (-)
Direct decoding: [attraction_name] is located in the [value_area] part of town and has free admission. (27.53)
Reranking: [attraction_name] is located in the [value_area] of town at [attraction_address]. The entrance fee is free. Can I help you with anything else? (41.66)
Online decoding: [attraction_name] is located in the [value_area] part of town at [attraction_address]. Can I help you with anything else? (42.38)
Pretraining models for dialogue: Recent work has applied pretraining (Peters et al., 2018; Devlin et al., 2019; Radford et al., 2019) to dialogue. For open-domain dialogue, DialoGPT and CGRG (Wu et al., 2020b) extend GPT-2 (Radford et al., 2019) for response generation. PLATO (Bao et al., 2019) and PLATO-2 (Bao et al., 2020) pretrain a latent variable model with social media data for diversified response generation. Meena (Adiwardana et al., 2020) collects a large-scale social media corpus for pretraining and proposes a metric named sensibleness and specificity average for evaluation. Roller et al. (2020) study various strategies for building an open-domain chatbot with Reddit for pretraining. For task-oriented dialogue, ToD-BERT (Wu et al., 2020a) fine-tunes BERT (Devlin et al., 2019) for four tasks, including intention detection, belief tracking, dialogue act prediction, and response selection. SC-GPT (Peng et al., 2020b) fine-tunes GPT-2 for few-shot response generation with given dialogue acts. Ham et al. (2020) fine-tune GPT-2 for belief tracking and context-to-response generation. SimpleTOD (Hosseini-Asl et al., 2020) proposes a method to serialize dialogue beliefs and acts into text spans and fine-tunes GPT-2 for end-to-end dialogue modeling. SOLOIST (Peng et al., 2020a) uses a series of task-oriented dialogue datasets to further pretrain GPT-2 before fine-tuning it on final tasks for evaluation. Unlike these BERT- or GPT-initialized task-oriented dialogue models, which are essentially pretrained with general text, such as Wikipedia and BookCorpus, we use a Reddit dump to pretrain our models to learn from open-domain dialogues.

Conclusion
We introduced two noisy channel models, noisy channel reranking and noisy channel online decoding, for task-oriented dialogue. Large-scale pretraining was further adopted to tackle data scarcity in downstream tasks. Extensive experiments on MultiWOZ, CamRest676 and SMCalFlow demonstrated that (1) the noisy channel models significantly outperform direct decoding; (2) models with pretraining improve over randomly-initialized models; (3) the models are robust to different dialogue schema annotations; (4) the noisy channel models can decode responses closer to ground-truth responses than direct decoding.