Better Document-Level Machine Translation with Bayes’ Rule

Abstract We show that Bayes’ rule provides an effective mechanism for creating document translation models that can be learned from only parallel sentences and monolingual documents, a compelling benefit because parallel documents are not always available. In our formulation, the posterior probability of a candidate translation is the product of the unconditional (prior) probability of the candidate output document and the “reverse translation probability” of translating the candidate output back into the source language. Our proposed model uses a powerful autoregressive language model as the prior on target language documents, but it assumes that each sentence is translated independently from the target to the source language. Crucially, at test time, when a source document is observed, the document language model prior induces dependencies between the translations of the source sentences in the posterior. The model’s independence assumption not only enables efficient use of available data, but it additionally admits a practical left-to-right beam-search algorithm for carrying out inference. Experiments show that our model benefits from using cross-sentence context in the language model, and it outperforms existing document translation approaches.


Introduction
There have been many recent demonstrations that neural language models based on transformers (Vaswani et al., 2017) are capable of learning to generate remarkably coherent documents with few (Zellers et al., 2019) or no (Radford et al., 2019) conditioning variables. Despite this apparent generation ability, in practical applications, unconditional language models are most often used to provide representations for natural language understanding applications (Devlin et al., 2019; Peters et al., 2018), and how to use them for conditional generation applications remains an open question.
Our hypothesis in this work is that Bayes' rule provides an effective way to leverage powerful unconditional document language models to improve a conditional task: machine translation. The application of Bayes' rule to transform the translation modeling problem p(y | x), where y is the target-language text and x is the source-language text, has a long tradition and was the dominant paradigm in speech and language processing for many years (Brown et al., 1993), where it is often called a "noisy channel" decomposition, by analogy to an information-theoretic conception of Bayes' rule.
Whereas several recent papers have demonstrated that the noisy channel decomposition has benefits when translating sentences one-by-one (Yu et al., 2017; Ng et al., 2019), in this paper we show that this decomposition is particularly suited to tackling the problem of translating complete documents. Although using cross-sentence context and maintaining cross-document consistency have long been recognized as essential to the translation problem (Tiedemann and Scherrer, 2017; Bawden et al., 2018, inter alia), operationalizing this in models has been challenging for several reasons. Most prosaically, parallel documents are not generally available (whereas parallel sentences are much more numerous), making direct estimation of document translation probabilities challenging. More subtly, documents are considerably more diverse than sentences, and models must be carefully biased so as not to pick up spurious correlations.
Our Bayes' rule decomposition (§2) permits several innovations that enable us to solve these problems. Rather than directly modeling the conditional distribution, we rewrite it as p(y | x) ∝ p(y) × p(x | y). This changes the learning problem from estimating a single complex conditional distribution to learning two different distributions: a language model p(y), which provides unconditional estimates of the output (in this paper, documents); and p(x | y), which provides the probability of translating a candidate output y into the (observed) source document x.
As we will discuss subsequently, although the problems of estimating p(y | x) and p(x | y) are formally similar, independence assumptions made in p(x | y) are less statistically costly than they might otherwise be since, at test time, we will be conditioning on x and reasoning about a posterior distribution over y, which will be jointly dependent on all (conditionally independent) parts of x. This statistical fact (which is the same trick that gives naïve Bayes classifiers their expressiveness and ease of estimation) permits us to assume independence between sentence translations in the reverse translation model, and therefore to use parallel sentences (rather than parallel documents) to train it. In the posterior, we thus have an implicit estimate of a document-level translation system, even though we made no use of parallel documents when estimating the prior or likelihood models. This is particularly useful because parallel sentences are much more readily available than parallel documents. A second benefit of our approach is that the unconditional language model can be estimated from nonparallel data, which exists in vast quantities.
Although the noisy channel model is ideal for exploiting the data resources that naturally exist in the world (large corpora of parallel but independent sentences, and large corpora of monolingual documents), we are faced with a much harder decoding problem (§3). To address this problem, we propose a new beam-search algorithm, exploiting the fact that our document language model operates left-to-right, and our reverse translation model treats sentences independently. The search is guided by a proposal distribution that provides candidate continuations of a document prefix, and these are reranked according to the posterior distribution. In particular, we compare two proposal models: one based on estimates of independent sentence translations (Vaswani et al., 2017) and one that conditions on the source document context. Although closely related, our algorithm is much simpler and faster than that proposed in Yu et al. (2017). Rather than using a specially designed channel model (Yu et al., 2016), which is limited in processing long sequences like documents, our conditional sentence independence assumptions allow us to use any sequence-to-sequence model as the channel model, making it a better option for document-level translation.
To explore the performance of our proposed model, we focus on Chinese-English translation, following a series of papers on document translation (Werlen et al., 2018; Tu et al., 2018). Although in general it is unreasonable to expect that independent translations of sentences would lead to coherent translations of documents, the task of translating Chinese into English poses some particularly acute challenges. Because Chinese makes fewer inflectional distinctions than English does, and because the relevant clues for predicting, for example, what tense an English verb should be in, or whether an English noun should have singular or plural morphology, may be spread throughout a document, it is crucial that extra-sentential context is used.
Our experiments (§4) explore: (1) different approaches to reranking, (2) different independence assumptions when modeling documents (i.e., whether sentences are generated independently or not), (3) different amounts of language modeling data, and (4) different proposal models. Briefly summarized, we find that document-context language models significantly improve the translation quality obtained with our system, both in terms of BLEU scores and in terms of a human evaluation. Targeted error analysis demonstrates that the document prior is capable of enforcing consistency of tense, number, and lexical choice across documents.

Model Description
We define $x = (x^1, x^2, \ldots, x^I)$ as the source document with $I$ sentences and, similarly, $y = (y^1, y^2, \ldots, y^J)$ as the target document with $J$ sentences. During the (human) translation process, translators may split or recombine sentences, but we will assume that $I = J$. Let $x^i = (x^i_1, x^i_2, \ldots, x^i_M)$ represent the $i$th sentence in the source document, consisting of $M$ words; likewise, let $y^i = (y^i_1, y^i_2, \ldots, y^i_N)$ denote the $i$th sentence in the target document, containing $N$ words.
The translation of a document $x$ is determined by finding the document $\hat{y}$ that maximizes $p(y \mid x)$:
$$\hat{y} = \arg\max_{y} p(y \mid x). \tag{1}$$
Instead of modeling the probability $p(y \mid x)$ directly, we factorize it using Bayes' rule:
$$\hat{y} = \arg\max_{y} \frac{p(x \mid y) \times p(y)}{p(x)} = \arg\max_{y} p(x \mid y) \times p(y). \tag{2}$$
We further assume that sentences are independently translated, and that the sentences are generated by a left-to-right factorization according to the chain rule. Therefore, we have
$$\hat{y} = \arg\max_{y} \prod_{i=1}^{|x|} p(x^i \mid y^i) \times p(y^i \mid y^{<i}), \tag{3}$$
where $y^{<i} = (y^1, \ldots, y^{i-1})$ denotes a document prefix consisting of the first $i - 1$ target sentences. Thus conceived, this is a generative model of parallel documents that makes a particular independence assumption; we illustrate the corresponding graphical model on the top of Figure 1.
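As a minimal sketch of how the factorized objective can be evaluated in log space, the following Python function sums a sentence-level channel log-probability and a document-level language model log-probability for each sentence. The function names (`doc_log_score`, `toy_channel`, `toy_lm`) and the toy probabilities are hypothetical and serve only as an illustration; in practice the two callables would wrap trained neural models.

```python
import math
from typing import Callable, Sequence


def doc_log_score(
    source_doc: Sequence[str],
    target_doc: Sequence[str],
    channel_logprob: Callable[[str, str], float],
    lm_logprob: Callable[[str, Sequence[str]], float],
) -> float:
    """Return sum_i [log p(x^i | y^i) + log p(y^i | y^{<i})]."""
    if len(source_doc) != len(target_doc):
        raise ValueError("the model assumes one target sentence per source sentence")
    total = 0.0
    for i, (x_i, y_i) in enumerate(zip(source_doc, target_doc)):
        total += channel_logprob(x_i, y_i)        # sentence-level channel model
        total += lm_logprob(y_i, target_doc[:i])  # language model sees the target prefix
    return total


# Toy stand-ins (assumptions for illustration, not trained models).
def toy_channel(x_i: str, y_i: str) -> float:
    return math.log(0.5)  # every candidate explains the source equally well


def toy_lm(y_i: str, prefix: Sequence[str]) -> float:
    # Prefers a sentence that shares a word with the previous target sentence.
    if prefix and set(y_i.split()) & set(prefix[-1].split()):
        return math.log(0.8)
    return math.log(0.2)
```

With these toy models, a target document whose sentences share a subject scores higher than one that switches subject, even though the channel model treats every sentence independently.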

Impact of the Conditional Independence Assumption
At first glance, the conditional independence assumption we have made might seem to be the very independence assumption that bedevils conventional sentence-based approaches to document translation: translations of sentence $i$ appear to be uninfluenced by the translation of any sentence $j \neq i$. However, although this is the case during training, it is not the case at test time.
At test time, we will be conditioning on the $x^i$'s (the source language sentences) and reasoning about the posterior distribution over the "underlying" $y^i$'s. By conditioning on the child variables, conditional dependencies between all $y^i$'s and between each $y^i$ and all $x^i$'s are created (Shachter, 1998). The (in)dependencies that are present in the posterior distribution are shown in the bottom of Figure 1. Thus, although modeling $p(y \mid x)$ or $p(x \mid y)$ would appear to be superficially similar, the statistical impact of making a conditional independence assumption is quite different. This is fortunate, as it makes it straightforward to use parallel sentences, rather than assuming we have parallel documents, which are less often available (Voita et al., 2019b; Maruf et al., 2019, inter alia). Finally, because we only need to learn to model the likelihood of sentence translations (rather than document translations), the challenges of guiding the learners to make robust generalizations in direct document translation models (Voita et al., 2019b; Maruf et al., 2019, inter alia) are neatly avoided.
Figure 1: Our noisy channel model, where $y^i$ indicates the $i$th target language sentence and $x^i$ indicates the $i$th source language sentence. In the prior (top), the target sentences (the $y^i$'s) only influence the corresponding source sentence and therefore can be learned and modeled independently, but at test time (bottom), when the target is not observed, each $y^i$ depends on every $x^i$.
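The effect of conditioning can be checked numerically on a toy example. The sketch below is an illustration under invented numbers, not part of the experiments: each of two target sentences is reduced to a binary choice (0 for past tense, 1 for present tense), the prior favors tense-consistent documents, and each source sentence depends only on its own target sentence. All probabilities and the function name `posterior_y1_is_past` are hypothetical.

```python
from itertools import product

# Document prior p(y1, y2): consistent tense choices are more probable.
prior = {(0, 0): 0.4, (1, 1): 0.4, (0, 1): 0.1, (1, 0): 0.1}

# Channel model p(x | y), indexed as channel[y][x]; each row sums to 1.
# "amb" is a source sentence that carries no tense information.
channel = {
    0: {"amb": 0.5, "past": 0.45, "pres": 0.05},
    1: {"amb": 0.5, "past": 0.05, "pres": 0.45},
}


def posterior_y1_is_past(x1: str, x2: str) -> float:
    """Return p(y1 = 0 | x1, x2) by enumerating the four joint assignments."""
    joint = {
        (y1, y2): prior[(y1, y2)] * channel[y1][x1] * channel[y2][x2]
        for y1, y2 in product((0, 1), repeat=2)
    }
    normalizer = sum(joint.values())
    return (joint[(0, 0)] + joint[(0, 1)]) / normalizer
```

In this toy setting the first source sentence is ambiguous, yet the posterior probability that the first target sentence is in the past tense is 0.74 when the second source sentence signals past tense and 0.26 when it signals present tense. The translation of one sentence therefore depends on the other source sentence, even though the channel model factorizes over sentences.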

Learning
We can parameterize the channel probability $p(x^i \mid y^i)$ using any sequence-to-sequence model and parameterize the language model $p(y^i \mid y^{<i})$ using any language model. It is straightforward to learn our model: we simply optimize the channel model and the language model separately on parallel data and monolingual data, respectively. We remark that it is a significant practical advantage of this parameterization that we can retrain the channel and language models independently, for example, if we acquire more monolingual data, or use different language models with the same channel model conditioned on the domain of the source text.
Figure 2: The decoding process. In Phase 1, the auxiliary proposal model generates candidate translations (3 candidates in the diagram) for each sentence in the document (containing 4 sentences). In Phase 2, beam search is employed to search for the best path from the candidate translations.

Decoding
Because of the global dependencies in the posterior distribution, decoding in our model is computationally complex. On the one hand, similar to the decoding problem faced in standard sequence-to-sequence models, we must search over the space of all possible outputs with a model that makes no Markov assumptions. On the other hand, unlike traditional models, we have to have a complete $y^i$ before we can compute $p(x^i \mid y^i)$, making greedy and near-greedy algorithms ineffective. To address this issue, we use an auxiliary proposal model $q(y \mid x)$, which approximates the posterior distribution using a direct model, to focus our search on promising parts of the output space.
Because of the autoregressive factorization of the language model ($p_{\mathrm{LM}}$) and the independent sentence translation assumption in the channel model ($p_{\mathrm{TM}}$), we can carry out the reranking process using a left-to-right beam search strategy with the aid of our proposal model ($q$). Figure 2 illustrates the decoding process. For an input document of $\ell$ sentences, we let the proposal model propose $K$ candidate translations for each sentence in the document. We then search for the best document path through this lattice, or confusion network (Mangu et al., 2000), of candidate sentence translations. To do so, we maintain a beam of the $B$ active hypotheses (i.e., when considering the $i$th sentence, each hypothesis is a prefix consisting of $i - 1$ sentences), and we consider the proposal's $K$ one-sentence extensions (which we write $y^i$). We retain $B$ partial translations from the $K \times B$ candidates according to the following linear objective,
$$O(x, y^{<i}, y^i) = \lambda_1 \log q(y^i \mid x) + \log p_{\mathrm{LM}}(y^i \mid y^{<i}) + \lambda_2 \log p_{\mathrm{TM}}(x^i \mid y^i) + \lambda_3 |y^i| + O(x, y^{<i-1}, y^{i-1}), \tag{4}$$
where $|y|$ denotes the number of tokens in the sentence $y$, and where the base case $O(x, y^{<0}, y^0) = 0$. Note that Eq. 4 is a generalization of Eq. 3 in log space: if we set $\lambda_1 = \lambda_3 = 0$ and $\lambda_2 = 1$ and take the log of Eq. 3, the two objectives are equivalent. The extra factors (the proposal probability and the length of the output) provide improvements (e.g., by calibrating the expected length of the output) and can be incorporated at no cost in the model; they are widely used in prior work (Koehn et al., 2007; Yu et al., 2017; Ng et al., 2019). The elements on the beam after considering the $\ell$th sentence are reranked one final time by adding $\log p_{\mathrm{LM}}(\text{STOP} \mid y^{\leq \ell})$ to the final score; this accounts for the language model's assessment that the candidate document has been appropriately ended.
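The search procedure described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions rather than the implementation used in the experiments: the scoring callables, the function name `rerank_document`, whitespace tokenization for the length term, and the toy probabilities are all hypothetical. Each step adds a weighted proposal log-probability, the language model log-probability of the sentence given the target prefix, a weighted channel log-probability, and a weighted length term to the accumulated score of the prefix.

```python
import math
from typing import Callable, List, Sequence, Tuple


def rerank_document(
    source_doc: Sequence[str],
    candidates: Sequence[Sequence[str]],  # K proposed translations per source sentence
    proposal_lp: Callable[[str, Sequence[str]], float],  # log q(y^i | x)
    lm_lp: Callable[[str, Tuple[str, ...]], float],      # log p_LM(y^i | y^{<i})
    channel_lp: Callable[[str, str], float],             # log p_TM(x^i | y^i)
    stop_lp: Callable[[Tuple[str, ...]], float],         # log p_LM(STOP | y)
    lambdas: Tuple[float, float, float] = (1.0, 1.0, 0.0),
    beam_size: int = 5,
) -> List[str]:
    lam1, lam2, lam3 = lambdas
    beam: List[Tuple[float, Tuple[str, ...]]] = [(0.0, ())]  # base case: score 0
    for x_i, cands_i in zip(source_doc, candidates):
        expanded = []
        for score, prefix in beam:
            for y_i in cands_i:
                step = (
                    lam1 * proposal_lp(y_i, source_doc)
                    + lm_lp(y_i, prefix)
                    + lam2 * channel_lp(x_i, y_i)
                    + lam3 * len(y_i.split())  # length in whitespace tokens
                )
                expanded.append((score + step, prefix + (y_i,)))
        # Keep the B best of the K x B extended prefixes.
        expanded.sort(key=lambda hyp: hyp[0], reverse=True)
        beam = expanded[:beam_size]
    # Final rerank: add the language model's end-of-document score.
    best = max(beam, key=lambda hyp: hyp[0] + stop_lp(hyp[1]))
    return list(best[1])


# Toy models (invented numbers, for illustration only).
PROPOSAL = {"he went home": 0.6, "they go home": 0.4, "he slept": 0.45, "they sleep": 0.55}


def toy_proposal(y_i, source_doc):
    return math.log(PROPOSAL[y_i])


def toy_lm(y_i, prefix):
    if not prefix:
        return math.log(0.5)
    same_subject = y_i.split()[0] == prefix[-1].split()[0]
    return math.log(0.9 if same_subject else 0.1)


def toy_channel(x_i, y_i):
    return math.log(0.5)


def toy_stop(doc):
    return 0.0


best_doc = rerank_document(
    ["src 1", "src 2"],
    [["he went home", "they go home"], ["he slept", "they sleep"]],
    toy_proposal, toy_lm, toy_channel, toy_stop,
    lambdas=(1.0, 1.0, 0.0), beam_size=2,
)
```

In this toy run the proposal model alone would pick "he went home" followed by "they sleep", whereas the search returns the subject-consistent pair "he went home", "he slept", because the language model term scores each sentence against the target prefix.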

Experiments
We evaluate our model on two translation tasks, the NIST Open MT Chinese-English task and the WMT19 Chinese-English news translation task. On both tasks, we use the standard parallel training data, and compare our model with a strong transformer baseline, as well as related models from prior work.

Dataset Description
The NIST training data is composed of LDC-distributed news articles and broadcast transcripts and consists of 1.5M sentence pairs. The document-level parallel corpus is a subset of the full training set, including 55K documents with 1.2M sentences. Following prior work, we use the MT06 dataset as the validation set and MT03, MT04, MT05, and MT08 as test sets. There are 79 documents and 1,649 sentences in the validation set and in total 509 documents and 5,146 sentences in the test set. On average, documents in the test set have 10 sentences, and 250 words and 330 words on the Chinese and English sides, respectively. We preprocess the dataset by doing punctuation normalization, tokenization, and lower-casing. We use byte pair encoding (Sennrich et al., 2016b) with 32K merges to segment words into sub-word units for both Chinese and English. The evaluation metric is case-insensitive BLEU calculated using multi-bleu.perl, which is consistent with prior work on this task.
The training data for the WMT19 Chinese-English task includes the UN corpus, CWMT, and news commentary. The total number of sentence pairs is 18M after filtering the data by removing duplicate sentences and sentences longer than 250 words. The validation sets that we use in the experiment are newstest2017 and newstest2018, which contain 169 documents with 2,001 sentences and 275 documents with 3,981 sentences, respectively. The test set is newstest2019, containing 163 documents and 2,000 sentences. On average, documents in the test set have 12 sentences, and 360 words and 500 words on the Chinese and English sides, respectively. The dataset is preprocessed by segmenting Chinese sentences and normalizing punctuation, tokenizing, and truecasing English sentences. As for NIST, we learn a byte pair encoding (Sennrich et al., 2016b) with 32K merges to segment words into sub-word units for both Chinese and English. The evaluation metric is sacreBLEU (Post, 2018).

Model Configuration
For NIST, we use the transformer (Vaswani et al., 2017) as the channel model and the document transformer as the proposal model. The hyperparameters for training the transformer are the same as transformer base (Vaswani et al., 2017), that is, 512 hidden size, 2,048 filter size, 8 attention heads, and 6 layers for both the encoder and decoder. We follow the configuration of its authors to train the document transformer: context length is set to 2 and all other hyperparameters are the same as transformer base. Both models are optimized using Adam (Kingma and Ba, 2015) with approximately 24K BPE tokens per mini-batch. For the language model, we train the transformer-XL on a combination of the English side of the NIST training data and three sections of Gigaword (XIN, AFP, and APW), resulting in a total of 7.3M documents and 115M sentences. We use an architecture with 24 layers, 16 attention heads, and embeddings of dimension 1024. The input sequences to the language model are encoded into bytes using the byte-level encoder provided by GPT2 (Radford et al., 2019).
For WMT19, we use the transformer as both the channel and proposal model. The hyperparameters for training the transformer are the same as transformer big (Vaswani et al., 2017), namely, 1,024 hidden size, 4,096 filter size, 16 attention heads, and 6 layers. The model is trained on 8 GPUs with batch size of 4,096. The setup for the language model is the same as that of NIST except that the training data is the English side of the parallel training data and Gigaword.
Table 1: Comparison with prior work on the NIST Chinese-English translation task. The evaluation metric is tokenized case-insensitive BLEU. The first three rows are numbers reported in the papers of prior work. The first two baselines are the results that we obtained by running the transformer (Vaswani et al., 2017) and the document transformer on the NIST dataset. The sent-reranker is a variation of our model in which sentences in documents are assumed to be independent. The backtranslation baseline is obtained by training the document transformer using additional synthetic parallel documents generated by backtranslation.
For both tasks, the weights λ are selected using grid search, from [0.8, 1., 1.5, 2., 2.2, 2.5, 3.] for the weights of channel model λ 2 and proposal model λ 1 , and from [0.2, 0.5, 0.8, 1.] for the length penalty λ 3 . The size of the n-best list used in the reranker is set to K = 50. The beam size in the document decoding algorithm is B = 5.
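The weight selection described above amounts to an exhaustive search over 7 × 7 × 4 = 196 weight triples. A minimal sketch follows; the function `dev_score` is a hypothetical stand-in for decoding the validation set with the given weights and returning its BLEU score, and the toy scoring function below is an arbitrary smooth surrogate used only to make the sketch runnable.

```python
from itertools import product
from typing import Callable, Optional, Tuple

WEIGHT_GRID = [0.8, 1.0, 1.5, 2.0, 2.2, 2.5, 3.0]  # proposal (lambda_1) and channel (lambda_2)
LENGTH_GRID = [0.2, 0.5, 0.8, 1.0]                 # length penalty (lambda_3)

Weights = Tuple[float, float, float]


def grid_search(dev_score: Callable[[float, float, float], float]) -> Tuple[float, Weights]:
    """Return (best score, (lambda_1, lambda_2, lambda_3)) over the full grid."""
    best: Optional[Tuple[float, Weights]] = None
    for lam1, lam2, lam3 in product(WEIGHT_GRID, WEIGHT_GRID, LENGTH_GRID):
        score = dev_score(lam1, lam2, lam3)
        if best is None or score > best[0]:
            best = (score, (lam1, lam2, lam3))
    assert best is not None  # the grids are non-empty
    return best


# Toy surrogate (assumption): peaks at lambda_1 = 1.5, lambda_2 = 2.0, lambda_3 = 0.5.
def toy_dev_score(lam1: float, lam2: float, lam3: float) -> float:
    return -((lam1 - 1.5) ** 2 + (lam2 - 2.0) ** 2 + (lam3 - 0.5) ** 2)


best_score, best_weights = grid_search(toy_dev_score)
```

Because each evaluation of `dev_score` would require rescoring the validation set, caching the per-candidate model scores and recombining them under each weight triple is one reasonable design choice for keeping such a search cheap.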
The running time for our decoding algorithm (Section 3) depends heavily on the speed at which the language model computes probabilities of partial documents. Using the transformer-XL language model with the aforementioned configuration, it takes approximately 90 seconds to decode a document on a Google Cloud TPU v3. We leave systematic exploration of inference algorithms for better solving the decoding problem to future work.

Experimental Results
Table 1 presents the best result from our model (doc-reranker) in comparison with prior work on the NIST Chinese-English translation task. The first three rows are numbers reported in prior work. Wang et al. (2017) incorporate document context by introducing a hierarchical RNN to an LSTM sequence-to-sequence model. Kuang et al. (2017) use a cache to store previously translated words across sentences, which they then use in sequence-to-sequence models. The third prior system extends the transformer model with an extra context encoder to capture information from previous source sentences. Apart from prior work, we also compare our doc-reranker with four baselines: the transformer (Vaswani et al., 2017), the document transformer, the sentence-level reranker (sent-reranker), and the document transformer with backtranslation.
In the sent-reranker, we assume sentences in the document are independent (formulation $\hat{y} = \arg\max_{y} \prod_{i=1}^{|x|} p(x^i \mid y^i) \times p(y^i)$), and therefore we train a sentence-level language model and rerank each sentence independently. This sent-reranker setup is close to prior work (e.g., Ng et al., 2019), with the difference that rather than using a language model trained on documents we use a language model trained on sentences, which is more statistically consistent.
Table 2: BLEU scores on NIST dev set MT06 from rerankers that incorporate various language models. In the language model column, X: Y means the language model X is trained on dataset Y. A bigger language model improves the doc-reranker but does not help the sent-reranker.
[...] about the reliability of using BLEU for assessing cross-sentential consistency (Voita et al., 2019b).
To compare the effectiveness of leveraging monolingual data between backtranslation (Edunov et al., 2018; Sennrich et al., 2016a) and our model, we train the document transformer using additional synthetic parallel documents generated by backtranslation ($q'$). For fair comparison we use the same monolingual data for both models. As shown in Table 1, although both techniques improve translation, backtranslation is less effective than our model. Because we have a new model $q'$, we can use it as a proposal model for our doc-reranker, effectively using the monolingual data twice. We find that this improves results even further, indicating that the effect of both approaches is additive.
To understand the rerankers better, we investigate the effect of different proposal models, different language models, and various numbers of candidates in the n-best list. Table 2 and Figure 3 show that better proposal models and bigger n-best lists lead to consistently better reranking results. This is an appealing behavior showing that our reranker is able to pick better translations from higher quality and more diverse candidate pools generated by better proposal models and bigger n-best lists. To compare the effect of language models, we train an LSTM language model (Merity et al., 2018b,a) and a transformer-XL language model on the English side of NIST parallel training data in addition to the transformer-XL trained on NIST and Gigaword. Table 3 lists the perplexity per word on the NIST validation set for different language models. Given the same training data, the transformer-XL performs significantly better than the LSTM-based language model, which in turn results in a higher BLEU score from the doc-reranker. By adding more training data, the transformer-XL language model achieves even lower perplexity, which gives a further boost to the performance of the doc-reranker. Notably, when the strong transformer-XL language model is incorporated into the doc-reranker, the best weight ratio of the channel and language model is 1:1, indicating that the doc-reranker depends heavily on the language model. By contrast, if a weak language model is incorporated, the best ratio is approximately 2:1. A further observation is that although a larger-scale language model improves the doc-reranker, it does not help the sent-reranker.
We perform an ablation study to explore what each component of the doc-reranker contributes to the overall performance. Table 4 shows BLEU scores on the NIST validation set for the optimal interpolation of various component models. No gains are observed if the language model is combined with the proposal model (a probabilistically unsound combination, although one that often worked in pre-neural approaches to statistical translation). We find that as we increase the weight of the language model, the results become worse. The interpolation of the proposal model and channel model slightly outperforms the proposal model baseline but considerably underperforms the interpolation of the proposal model, channel model, and the language model. This difference indicates the key role that the language model plays in the doc-reranker. When the channel model is combined with the language model, the performance of the doc-reranker is comparable to that with all three components included. We conclude from the ablation study that both the channel and language models are indispensable for the doc-reranker, indicating that Bayes' rule provides reliable estimates of translation probabilities.
Table 5 presents the results of our model together with baselines on the WMT19 Chinese-English translation task. We find that the results follow the same pattern as those on NIST: a better language model leads to better translation results, and overall the reranker outperforms the transformer-big by approximately 2.5 BLEU.
The two best systems submitted to the WMT19 Chinese-English translation task are Microsoft Research Asia's system (Xia et al., 2019) and Baidu's system (Sun et al., 2019), both of which use multiple techniques to improve upon the transformer big model. Here, we mainly compare our results with those from Xia et al. (2019) because we use the same evaluation metric, SacreBLEU (Post, 2018), and the same validation and test sets. Using extra parallel training data and the techniques of masked sequence-to-sequence pretraining (Song et al., 2019), sequence-level knowledge distillation (Kim and Rush, 2016), and backtranslation (Edunov et al., 2018), the best model from Xia et al. (2019) achieves 30.8, 30.9, and 39.3 on newstest2017, newstest2018, and newstest2019, respectively. Although our best results are lower than this, it is notable that our model achieves comparable results to their model, which was trained on 56M sentences of parallel data, over two times more training data than we use. Moreover, our method is orthogonal to these works and can be combined with other techniques to yield further improvements.

Analysis
In this section, we present quantitative and qualitative analyses of our models. The analysis is performed on the NIST experiments.

Quantitative Analysis
We perform oracle experiments in order to assess our models' ability to select good translation candidates. We create our candidate pool by mixing the proposals generated from the transformer model (Vaswani et al., 2017) and the four references. We then count, over the entire validation dataset, the cases in which each model (the proposal model, sent-reranker, and doc-reranker) assigns its highest score to a reference translation. As shown in Figure 4, while the proposal model selects one of the references as the best candidate for 22% of the sentences in the validation dataset, both rerankers double the ratio and the doc-reranker achieves 2% higher accuracy than the sent-reranker. This observation provides further evidence that if we improve the quality of the candidate pool, our model will generate better translations.
We also assess the diversity of the candidate pool and investigate the effect of this diversity on our model's performance. Table 6 lists pairwise-BLEU scores (Shen et al., 2019) of different candidate pools (of size 50) and their corresponding BLEU scores from the doc-reranker. We use the document transformer trained with additional backtranslated synthetic documents as the proposal model ($q'$ in Table 1) in the doc-reranker. Table 6 shows that the candidates generated from our proposal model (by taking the 50 best sentences from the beam search) are much less diverse than human translations. We conjecture that the lack of diversity in the candidate pool may harm the performance of our model.
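The oracle experiment described above can be sketched as follows. This is an illustration under assumptions, not the evaluation code used for Figure 4: the function name `reference_selection_rate`, the `score` callable, and the toy scores are hypothetical. For each sentence, the references are mixed into the proposed candidates, and we record whether the model's top-scoring candidate is one of the references.

```python
from typing import Callable, Sequence


def reference_selection_rate(
    pools: Sequence[Sequence[str]],       # proposed candidates for each sentence
    references: Sequence[Sequence[str]],  # reference translations for each sentence
    score: Callable[[int, str], float],   # model score of a candidate for sentence i
) -> float:
    """Fraction of sentences whose top-scoring candidate is a reference."""
    if len(pools) != len(references) or not pools:
        raise ValueError("need one non-empty list of references per candidate pool")
    hits = 0
    for i, (cands, refs) in enumerate(zip(pools, references)):
        mixed = list(cands) + list(refs)
        best = max(mixed, key=lambda cand: score(i, cand))
        hits += best in refs
    return hits / len(pools)


# Toy model scores (invented numbers, for illustration only).
TOY_SCORES = {"a b": -1.0, "a c": -2.0, "a ref": -0.5, "d e": -1.0, "d ref": -3.0}

rate = reference_selection_rate(
    pools=[["a b", "a c"], ["d e"]],
    references=[["a ref"], ["d ref"]],
    score=lambda i, cand: TOY_SCORES[cand],
)
```

In the toy data the model prefers the reference for the first sentence but not for the second, so the selection rate is one half.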
To increase the diversity of candidate translations, we create candidate pools by composing translations generated from different "experts", which are simply document transformer models trained from different random initializations. As illustrated in Table 6, we find that a candidate pool from more experts results in more diverse translations (quantified by pairwise-BLEU) and better reranking results (quantified by BLEU). Pairwise-BLEU (Shen et al., 2019) is a metric for measuring the similarity of candidate translations. The lower the pairwise-BLEU, the more diverse the candidate translations are. We refer the reader to Shen et al. (2019) for the definition of the metric.
Table 6: [...] with backtranslation with different random initialization. The size of the candidate pool is 50. The experts for the human proposal baseline are the reference translations.
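To make the notion of pool diversity concrete, the sketch below averages a similarity score over all ordered pairs of candidates, so that a lower value indicates a more diverse pool. Note the swapped-in technique: for brevity it uses clipped unigram precision as the similarity function, which is a simple stand-in and not BLEU, so its values are not comparable to the pairwise-BLEU scores in Table 6; the metric itself is defined in Shen et al. (2019). The function names are hypothetical.

```python
from collections import Counter
from itertools import permutations
from typing import Callable, Sequence


def unigram_precision(hyp: str, ref: str) -> float:
    """Clipped unigram precision of hyp against ref (a stand-in for BLEU)."""
    hyp_counts, ref_counts = Counter(hyp.split()), Counter(ref.split())
    overlap = sum(min(count, ref_counts[word]) for word, count in hyp_counts.items())
    total = sum(hyp_counts.values())
    return overlap / total if total else 0.0


def pairwise_similarity(
    candidates: Sequence[str],
    sim: Callable[[str, str], float] = unigram_precision,
) -> float:
    """Average similarity over all ordered pairs of distinct candidates."""
    pairs = list(permutations(candidates, 2))
    if not pairs:
        raise ValueError("need at least two candidates")
    return sum(sim(hyp, ref) for hyp, ref in pairs) / len(pairs)


similar_pool = ["the cat sat", "the cat sits"]
diverse_pool = ["the cat sat", "a dog ran"]
```

Passing a sentence-level BLEU implementation as `sim` would bring the sketch closer to the metric used in the paper, under the assumption that the metric averages over candidate pairs in this way.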

Qualitative Analysis
To investigate how the rerankers improve translation quality, we analyze the output from different models: the document transformer (our proposal model), the sent-reranker, and the doc-reranker. We observe that in general the doc-reranker improves adequacy of translations and can generate more fluent and natural sentences compared with the document transformer. More importantly, our doc-reranker shows its superiority over the others in terms of exploiting context, improving consistency of tense, number, and lexical choice across entire articles. Tables 7 and 8 in Appendix A present example output from the aforementioned systems.
In Example 1, the pronoun "he" is omitted in the Chinese sentence. While the document transformer misses this pronoun, resulting in a translation of completely different meaning, the doc-reranker is able to recover it. Likewise, in Example 6, "them" is dropped in the source sentence and this pronoun can only be inferred from previous context. Although both rerankers recover a pronoun, only the doc-reranker gets it right, by relying on cross-sentential context. Example 2 shows that the doc-reranker is better at generating adequate translations than the proposal model: the document transformer ignores the phrase "with these people", but the doc-reranker covers it. Chinese does not mark nouns for number, so number has to be inferred from context to generate accurate English translations. It is not possible for a sentence-level MT system to capture this information if the relevant context is not in the current sentence. In Examples 3 and 5, the plurals "problems" and "identities" can only be inferred from previous sentences (the immediately preceding sentence in Example 3 and a sentence 4-5 sentences away from the current one in Example 5). While neither the document transformer nor the sent-reranker makes the right prediction in both examples, the doc-reranker translates correctly, indicating its strength in capturing extra-sentential information. In addition to making inferences across sentences, the doc-reranker is also capable of maintaining consistency of tense and lexical choice, as demonstrated in Examples 4, 7, and 9. Furthermore, it improves the consistency of writing style. To illustrate, in Example 8, the context is that of a list of bullet points starting with "continue". The doc-reranker follows this style by starting the translation with the verb "continue". However, the sent-reranker starts the sentence with "we should continue". Although both translations are reasonable, the former is more natural within the document since it preserves stylistic consistency.

Related Work
Our work is closely related to three lines of research: context-aware neural machine translation, large-scale language models for language understanding, and semi-supervised machine translation. Recent studies (Tiedemann and Scherrer, 2017; Bawden et al., 2018, inter alia) have shown that exploiting document-level context improves translation performance, and in particular improves lexical consistency and coherence of the translated text. Existing work in the area of context-aware NMT typically adapts the MT system to take additional context as input, either a few previous sentences (Jean et al., 2017; Wang et al., 2017; Tu et al., 2018; Voita et al., 2018; Werlen et al., 2018) or the full document (Haffari and Maruf, 2018; Maruf et al., 2019). These methods vary in the method of encoding the additional context and the way of integrating the context with the existing sequence-to-sequence models. For example, Werlen et al. (2018) encode the context with a separate transformer encoder (Vaswani et al., 2017) and use a hierarchical attention model to integrate the context into the rest of the transformer model. Other work introduces an extra self-attention layer in the encoder to attend over the context.
Strategies for exploiting monolingual document-level data have been explored in two recent studies (Voita et al., 2019a; Junczys-Dowmunt, 2019). Both use backtranslation (Edunov et al., 2018; Sennrich et al., 2016a) to create synthetic parallel documents as additional training data. In contrast, we train a large-scale language model and use it to refine the consistency between sentences under a noisy channel framework. Our model has two advantages over backtranslation: 1) the language model is portable across domains and language pairs; 2) our model involves straightforward training procedures. Specifically, for backtranslation to succeed, the monolingual data to be backtranslated must be carefully selected, and the ratio of backtranslated data to original data must be balanced carefully. While techniques for doing this are fairly well established for single-sentence models, no such established techniques exist for documents.
More generally, strategies for using monolingual data in neural MT systems are an active research area (Gülçehre et al., 2015; Cheng et al., 2016, inter alia). Backtranslation (Edunov et al., 2018; Sennrich et al., 2016a), originally invented for semi-supervised MT, has been used as a standard approach for unsupervised MT (Lample et al., 2018a,b; Artetxe et al., 2019, 2018). Noisy channel decompositions have been a standard approach in statistical machine translation (Brown et al., 1993; Koehn et al., 2007) and have recently been applied to neural models (Yu et al., 2017; Ng et al., 2019). Unlike prior work, we adopt noisy channel models for document-level MT. While the model from Yu et al. (2017) could be used on documents by concatenating their sentences to form a single long sequence, this would not let us use the conditional sentence independence assumption that gives our model the flexibility to use just parallel sentences. In addition, their inference algorithm is specialized to their channel model and has quadratic complexity, which would be prohibitive for sequences longer than a single sentence; in practice our inference technique is much faster.
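The conditional sentence independence assumption mentioned above can be written out as follows. This is a sketch in assumed notation (a document of I aligned sentence pairs, superscripts indexing sentences, and y^{<i} denoting the target sentences preceding sentence i), which may differ from the symbols used elsewhere in the paper:

```latex
p(y \mid x) \;\propto\; p(y)\, p(x \mid y)
\;=\; \underbrace{\prod_{i=1}^{I} p\bigl(y^{i} \mid y^{<i}\bigr)}_{\text{document language model (prior)}}
\;\times\;
\underbrace{\prod_{i=1}^{I} p\bigl(x^{i} \mid y^{i}\bigr)}_{\text{sentence-level channel model}}
```

The second equality holds only under the model's independence assumption. Because each channel factor involves a single sentence pair, the channel model can be estimated from parallel sentences alone, while the prior is trained on monolingual documents.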
Large-scale pretrained language models have achieved success in improving language understanding systems, leading to state-of-the-art results on a wide range of tasks (Peters et al., 2018; Devlin et al., 2019; Radford et al., 2018; McCann et al., 2017; Chronopoulou et al., 2019; Lample and Conneau, 2019). Language generation is another area where pretrained language models have been applied, with existing work focusing on fine-tuning to repurpose an unconditional language model (Zhang et al., 2019; Song et al., 2019; Dong et al., 2019; Ziegler et al., 2019; de Oliveira and Rodrigo, 2019). In contrast to our work, which uses probabilities from language models, that work uses model-internal representations.

Conclusion
We have presented a noisy channel reranker and empirically validated it on Chinese-English document-level translation tasks. The noisy channel formulation requires only parallel sentences (rather than parallel documents), while abundant monolingual documents can be used to train the language model component. Experiments show that our proposed model considerably improves translation quality: it scores approximately 2.5 BLEU points higher than transformer baselines. Subjective evaluation further confirms that the language model helps enforce consistency of tense, number, and lexical choice across documents.

A.1 Human Evaluation
We selected 50 translation triplets (reference translation, translation from the doc-reranker, translation from the sent-reranker) sampled from the validation and test sets of NIST for evaluation by four native English speakers. The samples were selected by taking the triplets for which the translation edit rate (Snover et al., 2006) between the sent-reranker output and the doc-reranker output was above 17.5%.
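The selection step above can be illustrated with a minimal sketch. It uses a simplified stand-in for translation edit rate: a word-level edit distance normalized by the length of one side, without the phrase-shift operation of the full metric of Snover et al. (2006). The function names, the whitespace tokenization, and the choice of which output serves as the reference side are all assumptions made for illustration, not details taken from our evaluation setup.

```python
def word_edit_distance(hyp_tokens, ref_tokens):
    """Word-level Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(ref_tokens) + 1))
    for i, h in enumerate(hyp_tokens, start=1):
        curr = [i]
        for j, r in enumerate(ref_tokens, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete h
                            curr[j - 1] + 1,      # insert r
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def approx_ter(hyp, ref):
    """Simplified edit rate: edits / reference length (no shifts; an assumption)."""
    hyp_tokens, ref_tokens = hyp.split(), ref.split()
    if not ref_tokens:
        return 0.0 if not hyp_tokens else float("inf")
    return word_edit_distance(hyp_tokens, ref_tokens) / len(ref_tokens)


def select_divergent(pairs, threshold=0.175):
    """Keep (sent_out, doc_out) pairs whose approximate edit rate exceeds threshold."""
    return [(s, d) for s, d in pairs if approx_ter(s, d) > threshold]
```

With this filter, pairs of nearly identical outputs are discarded, so that evaluators only see cases where the two systems differ noticeably.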
Each of these documents was presented with a reference translation and with two translations, labeled A and B, one generated by the doc-reranker and one generated by the sent-reranker. The evaluators were tasked with indicating which of the two they found better overall, considering fluency, idiomaticity, and correctness (relative to the reference).
Each of the human evaluators preferred the doc-reranker translation in a majority of cases. When aggregated for each document by majority vote, the doc-reranker translations were considered better in 25 documents, worse in 13, and tied in 12. This is a statistically significant preference at p < 0.05 according to an exact one-tailed binomial test (n = 38).
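The significance claim can be checked with a few lines of arithmetic. The sketch below assumes the null hypothesis that, among the 38 non-tied documents, each system is equally likely to be preferred (p = 0.5); the function name is illustrative.

```python
from math import comb


def one_tailed_binomial_p(successes, n, p=0.5):
    """Exact P(X >= successes) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(successes, n + 1))


better, worse, tied = 25, 13, 12
n = better + worse  # ties are excluded, giving n = 38
p_value = one_tailed_binomial_p(better, n)
print(f"n = {n}, p = {p_value:.4f}")
```

Under these assumptions the p-value comes out at roughly 0.036, which is below the 0.05 threshold.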

A.2 Comparison of Output from Different Systems
To investigate how the rerankers improve translation quality, we manually inspect the output from three different systems: the document transformer, the sent-reranker, and the doc-reranker. Tables 7 and 8 present the comparison between the output from the document transformer and the sent-reranker and between the output from the sent-reranker and the doc-reranker, respectively. In general, we find that the doc-reranker outperforms the other systems in terms of maintaining consistency of tense, number, and lexical choice across documents. For detailed analysis, we refer readers to §5.2.
Example 1
src: 霍夫曼在接受美国哥伦比亚广播公司新闻杂志「六十分钟」访问时轻叹,那段时期为了得到毒品和酒,真是不择手段。
ref: in an interview on us cbs news magazine 60 minutes, hoffman softly sighed that in such period he would truly do anything to get drugs and alcohol.
out1: in an interview with the cbs news magazine "60 minutes", hoffmann sighed that those days were really unscrupulous in getting drugs and alcohol.
out2: in an interview with the cbs news magazine "60 minutes", hoffmann sighed that at that time in order to obtain drugs and alcohol, he was really unscrupulous.

Example 2
ref: in the meantime, more than 10 chinese personnel working in the same place with these people have been called back to karachi. at present they are emotionally stabilized.
out1: at the same time, more than ten chinese personnel working at the same site have also withdrawn to karachi. their sentiments are now stable.
out2: at the same time, more than ten chinese personnel working with these people on the same site have also withdrawn to karachi. at present, their sentiments are stable.

Example 6
src: 现在又要平安的送到家里。
cxt: . . . when the plane carrying the three survivors and 11 other personnel arrived in Hefei, people waiting at the airport heaved a long sigh of relief. . . . after the incident occurred, it made proper arrangements for them.
ref: now they will also be escorted home safely.
out1: now they have to send it home safely.
out2: now they want to send them safely to their homes.

Example 7
cxt: . . . a traffic accident occurred at the 58 kilometer point of the beijing-harbin highway, with a spill from an oil tanker leading to the closure of a section of the highway. . . . it was learned that the oil tanker contained waste oil from charcoal production. . . .
ref: the section of the highway from harbin to shuangcheng was closed, with many vehicles detoured.
out1: part of the roads heading towards shuangcheng in harbin are closed, and many vehicles are bypassing.
out2: part of the road from harbin to shuangcheng was closed , and many vehicles were bypassing.

Example 8
cxt: . . . with regard to coalmine safety this year, saws will effectively carry out the following three tasks: -continue to effectively tackle the tough issue of controlling methane. . . .
ref: -continue to effectively tackle the tough issue of restructuring and shutting down.
out1: -we should continue to make a success of the rectification and closure battle.
out2: -continue to fight the battle of rectification and closure.

Example 9
cxt: . . . first, such abuse of "quota" restricts the thorough implementation of world trade organization's free trade principle. on one hand, u.s. is talking in high-sounding tone about "free trade". on the other hand, it re-establishes trade barriers and stabs your back at will with "quotas". does it appear too arbitrary and unfair?
ref: second, "quota" limits the nice growth trend in sino-america trade relation.
out1: second, the "restriction" restricts the good development momentum of sino-us economic and trade relations.
out2: second, the "quota" restricts the good development momentum of sino-us economic and trade relations.