Leveraging Pre-trained Checkpoints for Sequence Generation Tasks

Unsupervised pre-training of large neural models has recently revolutionized Natural Language Processing. By warm-starting from the publicly released checkpoints, NLP practitioners have pushed the state-of-the-art on multiple benchmarks while saving significant amounts of compute time. So far the focus has been mainly on the Natural Language Understanding tasks. In this paper, we demonstrate the efficacy of pre-trained checkpoints for Sequence Generation. We developed a Transformer-based sequence-to-sequence model that is compatible with publicly available pre-trained BERT, GPT-2 and RoBERTa checkpoints and conducted an extensive empirical study on the utility of initializing our model, both encoder and decoder, with these checkpoints. Our models result in new state-of-the-art results on Machine Translation, Text Summarization, Sentence Splitting, and Sentence Fusion.


Introduction
Unsupervised and self-supervised pre-training methods, such as ELMo (Peters et al., 2018), ULMFiT (Howard and Ruder, 2018), and more recently BERT (Devlin et al., 2019), GPT and GPT-2 (Radford et al., 2018, 2019), XLNet, and RoBERTa, have established a qualitatively new level of baseline performance for many widely used Natural Language Understanding (NLU) benchmarks, including some of the most popular, like GLUE (Williams et al., 2018) and SQuAD (Rajpurkar et al., 2018).
The most appealing part about this massive shift towards using large architectures pre-trained on large collections of texts is that the pre-trained checkpoints along with the inference code are made freely available. This saves hundreds of TPU/GPU hours, as warm-starting a model from a pre-trained checkpoint typically requires orders of magnitude fewer fine-tuning steps while delivering significant performance boosts. More importantly, the ability to bootstrap from a state-of-the-art performing model such as BERT (Devlin et al., 2019) motivates the community to greatly speed up the progress towards developing better and easily reusable NLU systems.
While we continue to observe an increasing number of papers building on top of BERT and/or GPT models reporting encouraging improvements on GLUE, SQuAD, and other similar benchmarks, very little attention has been paid to using these pre-trained models to warm-start sequence-to-sequence (seq2seq) models. It has been argued that the pre-training objective used by BERT is not well suited for tasks that require decoding texts, e.g., conditional text generation in machine translation and summarization. Nevertheless, it remains unclear to what extent employing such large models pre-trained on large collections of text can be beneficial to warm-start sequence-to-sequence generation models.
In this paper, we have developed a Transformer-based sequence-to-sequence model that is compatible with publicly available pre-trained BERT, GPT-2 and RoBERTa checkpoints. We aim to provide an empirical answer to the following research question: what is the best way to leverage publicly available pre-trained checkpoints for warm-starting sequence generation models? For example, one could imagine using a BERT checkpoint to initialize the encoder for better input understanding and choosing a GPT-2 model as the decoder for better text generation. One of the main contributions of this paper is that we rigorously experiment with a large number of different settings to combine BERT, GPT and RoBERTa pre-trained checkpoints to initialize our Transformer-based model. We report results on three canonical conditional text generation tasks of increasing complexity: sentence-level fusion (DiscoFuse, Geva et al., 2019) and splitting (WikiSplit, Botha et al., 2018), WMT14 En↔De machine translation using the most common eval sets, newstest2014 and newstest2016, and abstractive summarization using three datasets: Gigaword (Napoles et al., 2012), CNN and DailyMail (Hermann et al., 2015) and BBC extreme (Narayan et al., 2018a).
Our models report significant improvements over randomly initialized models demonstrating the benefit of leveraging unsupervised pre-trained models. More importantly, this simple strategy results in new state-of-the-art results on Machine Translation, Text Summarization, Sentence Splitting, and Sentence Fusion. Our results also demonstrate that a pre-trained encoder is an essential component for sequence generation tasks and often these tasks benefit from sharing the weights between the encoder and the decoder. Overall, we have run over 300 experiments spending thousands of TPU v3 hours to better accommodate the language modeling and understanding capabilities of these pre-trained models for text generation. We believe that NLP researchers and practitioners will derive actionable insights from our findings when tackling various seq2seq tasks.
The code to query our models and predictions on various benchmarks will be available at https://github.com/google-research/google-research/tree/master/bertseq2seq.

Models and Pre-trained Checkpoints
BERT was primarily developed for encoding text representations for NLU tasks (encoder-only architecture), whereas GPT-2 (Radford et al., 2019) was developed as a decoder-only architecture for language modeling. Our model uses a seq2seq architecture with encoder and decoder both composed of Transformer layers (Vaswani et al., 2017). For the encoder, we inherit the BERT Transformer layer implementations (Devlin et al., 2019), which differ slightly from the canonical Transformer layer (Vaswani et al., 2017); BERT uses a GELU activation (Hendrycks and Gimpel, 2016) rather than the standard ReLU. If not stated otherwise, the implementation of the decoder layers is also identical to the BERT implementation, with two adjustments. First, the self-attention mechanism is masked to look only at the left context. Second, we add an encoder-decoder attention mechanism. Note that when the model was randomly initialized, we found no difference between a BERT-compatible decoder and a GPT-2-compatible decoder.
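The first decoder adjustment, masking self-attention so that each position sees only its left context, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `causal_attention_weights` and the plain-list representation of the score matrix are assumptions made for illustration, and a real model would apply the mask to batched tensors inside each attention head.

```python
import math


def causal_attention_weights(scores):
    """Row-wise softmax over a square score matrix, left context only.

    scores[i][j] is the unnormalized attention score of query position i
    for key position j. Entries with j > i (future positions) are masked.
    """
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]              # positions 0..i are visible
        peak = max(visible)                 # subtract max for stability
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        # Future positions receive exactly zero attention weight.
        masked_tail = [0.0] * (len(row) - i - 1)
        weights.append([e / total for e in exps] + masked_tail)
    return weights


if __name__ == "__main__":
    for row in causal_attention_weights([[1.0, 2.0, 3.0],
                                         [0.5, 0.5, 9.0],
                                         [0.0, 0.0, 0.0]]):
        print(row)
```

Because the mask only removes attention to future positions, the same pre-trained self-attention weights can be loaded into the decoder unchanged.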
Most of the models use the base checkpoint and therefore have 12 layers, a hidden size of 768, filter size of 3072, and 12 attention heads. We chose the best-performing model and also collect numbers using larger pre-trained checkpoints. These models have 24 layers, a hidden size of 1024, filter size of 4096, and 16 attention heads.
All models were fine-tuned on the target task using Adam with a learning rate of 0.05. We used a linear learning rate warmup with 40k steps, normalization by the square root of the hidden size, and a square root decay. We did not perform any tuning of these hyperparameters (except for §5). The batch size and the number of training steps will be reported for each task individually.
BERT Checkpoints. We tokenize our text using WordPiece (Wu et al., 2016) to match the BERT pre-trained vocabulary. Depending on the experiment, we use one of the following publicly available checkpoints: BERT-Base Cased, BERT-Base Uncased, BERT-Base Multilingual Cased (Devlin et al., 2019). The first two checkpoints have a vocabulary size of ∼30k wordpieces, whereas the multilingual checkpoint has a much larger vocabulary size of ∼110k. BERT also trains positional embeddings for up to 512 positions, which is the maximum input and output length in all experiments.
GPT-2 Checkpoints. We tokenize our text using SentencePiece (Kudo and Richardson, 2018) to match the GPT-2 pre-trained vocabulary. Note that, while the available checkpoint is frequently called 117M, which suggests the same number of parameters, we count 125M parameters in the checkpoint. This is the smallest architecture they trained, and the number of layers, hidden size, and filter size are comparable to BERT-Base. The model was trained mainly on English data but does contain some foreign-language text. The vocabulary size is ∼50k. While GPT-2 has positional embeddings for up to 1024 positions, we only use the first 512 to make the results comparable with BERT.
RoBERTa Checkpoints. For RoBERTa, the learned parameters are fully compatible with the existing TensorFlow BERT architectures with some minor adjustments (see footnote 3). The vocabulary treatment in RoBERTa is compatible with the SentencePiece tokenization in GPT-2. As the conceptual differences between BERT and RoBERTa are minor, we may use BERT as a hypernym to address both pre-training methods in this paper.
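The learning rate schedule described at the start of this section (linear warmup over 40k steps, normalization by the square root of the hidden size, square root decay) can be sketched as below. The exact functional form is an assumption modeled on the schedule of Vaswani et al. (2017); the function name `learning_rate` and the way the base rate of 0.05 enters as a multiplier are illustrative and not taken from the released code.

```python
def learning_rate(step, base_lr=0.05, hidden_size=768, warmup_steps=40_000):
    """Assumed schedule: linear warmup followed by inverse square root decay.

    The rate is scaled by base_lr / sqrt(hidden_size). It grows linearly
    until `warmup_steps` and decays proportionally to 1 / sqrt(step) after.
    """
    step = max(step, 1)                       # avoid 0 ** -0.5 at step 0
    scale = base_lr * hidden_size ** -0.5     # hidden-size normalization
    warmup = step * warmup_steps ** -1.5      # linear ramp
    decay = step ** -0.5                      # square root decay
    return scale * min(warmup, decay)


if __name__ == "__main__":
    for s in (1_000, 20_000, 40_000, 160_000, 300_000):
        print(s, learning_rate(s))
```

Under this assumed form the two branches meet at step 40k, so the effective peak rate is far smaller than the nominal 0.05.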

Investigated Model Variants
In this section, we describe several combinations of model initialization. The number of total trainable parameters, the number of embedding parameters and the number of parameters initialized from the checkpoint vs. randomly are shown in Table 1.
RND2RND: A Transformer encoder-decoder architecture with all weights initialized randomly.
BERT2RND: A BERT-initialized encoder paired with a randomly initialized decoder. Encoder and decoder share the embedding matrix initialized from a checkpoint.
RND2BERT: A randomly initialized encoder paired with a BERT-initialized decoder. To perform autoregressive decoding, we mask the bidirectional self-attention mechanism of BERT to look only at the left context.
BERT2BERT: A BERT-initialized encoder paired with a BERT-initialized decoder. All weights are initialized from a public BERT checkpoint. The only variable that is initialized randomly is the encoder-decoder attention.
BERTSHARE: Like BERT2BERT, but the parameters between encoder and decoder are shared. This greatly reduces the memory footprint of the model (136M vs. 221M parameters). Additionally, we experimented with a layer-wise attention mechanism (He et al., 2018), but got nearly identical numbers on most tasks.
ROBERTASHARE: Same as BERTSHARE, but the shared encoder and decoder are initialized with the public RoBERTa checkpoint.
GPT: A decoder-only architecture. We treat the input as a conditioning prefix of a language model. The decoder is warm-started with a public GPT-2 checkpoint. Similarly to BERTSHARE and ROBERTASHARE, the memory footprint of this model is smaller compared to an encoder-decoder setup (125M parameters).
RND2GPT: A randomly initialized encoder paired with a GPT-2-compatible decoder. We warm-start the decoder and the embedding matrix with a public GPT-2 checkpoint.
BERT2GPT: A BERT-compatible encoder paired with a GPT-2-compatible decoder. We warm-start both sides with the two separate public checkpoints, BERT and GPT-2. We use the BERT vocabulary for the input and the GPT-2 vocabulary for the output.
ROBERTA2GPT: Same as BERT2GPT, but we use a public RoBERTa checkpoint to warm-start the encoder. RoBERTa was trained using the GPT-2 vocabulary, so we can use it for input and output. Note that while the vocabulary is shared, this model still has two embedding matrices, one for the input and one for the output.
Footnote 3: More specifically: a) the variable names have to be adjusted; b) the weight and bias variables of the attention mechanism have to be split into query, key, and values; c) all variables except the embedding matrices have to be transposed.
Footnote 4: RoBERTa checkpoints are available at https://github.com/pytorch/fairseq.
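The variants above differ only in which component is warm-started from which checkpoint and whether encoder and decoder share parameters. The following sketch encodes that as a lookup table. The names `InitPlan`, `VARIANTS`, and `randomly_initialized_parts` are hypothetical, and the rule that encoder-decoder attention is random in every encoder-decoder variant is our reading of the descriptions (it is stated explicitly only for BERT2BERT).

```python
from typing import NamedTuple, Optional


class InitPlan(NamedTuple):
    encoder: Optional[str]  # "random", a checkpoint name, or None (no encoder)
    decoder: str            # "random" or a checkpoint name
    shared: bool            # encoder and decoder share parameters


VARIANTS = {
    "RND2RND": InitPlan("random", "random", False),
    "BERT2RND": InitPlan("bert", "random", False),
    "RND2BERT": InitPlan("random", "bert", False),
    "BERT2BERT": InitPlan("bert", "bert", False),
    "BERTSHARE": InitPlan("bert", "bert", True),
    "ROBERTASHARE": InitPlan("roberta", "roberta", True),
    "GPT": InitPlan(None, "gpt2", False),
    "RND2GPT": InitPlan("random", "gpt2", False),
    "BERT2GPT": InitPlan("bert", "gpt2", False),
    "ROBERTA2GPT": InitPlan("roberta", "gpt2", False),
}


def randomly_initialized_parts(name):
    """List the components of a variant that do not come from a checkpoint."""
    plan = VARIANTS[name]
    parts = []
    if plan.encoder == "random":
        parts.append("encoder")
    if plan.decoder == "random":
        parts.append("decoder")
    # Assumption: no public checkpoint contains encoder-decoder attention,
    # so it is random whenever the model has an encoder at all.
    if plan.encoder is not None:
        parts.append("encoder-decoder attention")
    return parts


if __name__ == "__main__":
    for variant in VARIANTS:
        print(variant, randomly_initialized_parts(variant))
```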
The pre-training objective in the BERT models learns to predict a masked token using the bidirectional representation of the input text (Devlin et al., 2019). Our decoder, even when initialized with the BERT or RoBERTa checkpoints, always generates the output text in an autoregressive fashion, as in Transformers (Vaswani et al., 2017) and GPT-2 (Radford et al., 2019).
We performed the bulk of our experiments on the 12-layer checkpoints of BERT, GPT-2, and RoBERTa, assuming that the findings will also hold for the 24-layer checkpoints. We chose BERTSHARE and ROBERTASHARE to also report numbers using the 24-layer public pre-trained checkpoints. We also experimented with the GPT setup with 24 layers and 345M parameters, but as it did not achieve better results, we excluded it from the paper.

Sentence Fusion
Sentence Fusion is the problem of combining multiple sentences into a single coherent sentence. We use the "balanced Wikipedia" portion of the DiscoFuse dataset (Geva et al., 2019) for our experiments, with 4.5M fusion examples in the training set. The evaluation set has 50k examples. Due to the size of this evaluation set, even small changes are statistically significant. For this reason, we use only this dataset for the additional experiments described at the end of the paper. Training was done for 300k steps with a global batch size of 256. The input and output are padded to a length of 128, which covers 100% of the training, evaluation and test data. We report SARI (Xu et al., 2016; see footnote 5) and the exact match accuracy. The results can be seen in Table 2.
Previous state-of-the-art results by Geva et al. (2019) used the vanilla Transformer model by Vaswani et al. (2017), with only 7 layers. All models with initialized encoders outperform the baseline by a large margin, with a SARI score of 89.3 compared to 86.9 (BERT2RND vs. RND2RND). To measure the effect on smaller training sets, we randomly subsample the training data down to 10% and 1%, i.e., 450k and 45k training examples, respectively. First, we notice that performance comparable to the baseline is achieved even when training on only 10% of the training data (RND2RND vs. ROBERTASHARE). Second, when using only 1% of the training data, setups with fewer randomly initialized parameters (BERT2BERT vs. BERT2RND) perform better. The best-performing 12-layer setup is ROBERTA2GPT with a SARI score of 89.9, outperformed only by the 24-layer ROBERTASHARE setup with a SARI score of 90.3.
Footnote 5: SARI is a lexical similarity metric which compares the model's output to multiple references and the input in order to assess the model's ability to add, delete and keep an n-gram. Its implementation is available at: https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/utils/sari_hook.py.
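The low-resource setting above amounts to drawing a fixed random subset of the training examples. A minimal sketch, assuming uniform sampling without replacement and a fixed seed for reproducibility; the paper states only that the data was randomly subsampled, and the function name `subsample` is illustrative.

```python
import random


def subsample(examples, fraction, seed=0):
    """Return a random subset containing `fraction` of the examples."""
    k = round(len(examples) * fraction)
    # A dedicated Random instance keeps the draw reproducible and leaves
    # the global random state untouched.
    return random.Random(seed).sample(examples, k)


if __name__ == "__main__":
    train = list(range(4_500))            # stand-in for the training set
    print(len(subsample(train, 0.10)))    # 10% split
    print(len(subsample(train, 0.01)))    # 1% split
```

With the 4.5M DiscoFuse training examples, the same fractions give the 450k and 45k subsets reported above.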

Split and Rephrase
The reverse task of sentence fusion is the split-and-rephrase task, which requires rewriting a long sentence into two or more coherent short sentences (Narayan et al., 2017). We use the WikiSplit dataset (Botha et al., 2018), which consists of 1M examples of sentence splits extracted from the Wikipedia edit history, and follow the training/test split suggested by the authors. Training was done for 300k steps with a global batch size of 256. The input and output are padded to a length of 128, which covers 100% of the training, evaluation and test data. As in Botha et al. (2018), we report corpus-level BLEU (see footnote 6), the exact match accuracy, and SARI score. Previous state-of-the-art results by Botha et al. (2018) used a bi-directional LSTM with a copy mechanism (Aharoni and Goldberg, 2018). Analogous to the DiscoFuse task, we observe that initializing the encoder improves the model the most (Table 3). The shared encoder-decoder setup of BERTSHARE outperforms all other setups. For the larger models with 24 layers, we observed slight overfitting after 100k steps (~25 epochs) and therefore stopped the training early. BERTSHARE and ROBERTASHARE perform on par, and both outperform their 12-layer counterparts.
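The "~25 epochs" figure follows directly from the step count, the global batch size, and the dataset size. A small worked example, treating the WikiSplit training set as exactly 1M examples (the text gives only the rounded figure); the helper name `epochs_seen` is illustrative.

```python
def epochs_seen(steps, global_batch_size, num_examples):
    """Number of passes over the training set after `steps` updates."""
    return steps * global_batch_size / num_examples


if __name__ == "__main__":
    # WikiSplit: early stopping point of the 24-layer models.
    print(epochs_seen(100_000, 256, 1_000_000))
    # DiscoFuse: full 300k-step run over 4.5M examples.
    print(epochs_seen(300_000, 256, 4_500_000))
```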

Machine Translation
We test our setups on the most common benchmark in machine translation, the WMT 2014 English ↔ German task, using the newstest2014 and newstest2016 eval sets. We use the same hyperparameter settings as in the previous experiments. We limit the input and output lengths to 128 tokens each. We used a global batch size of 256 and trained for 30 epochs. Decoding was done with a beam size of 4, and the sentence length penalty was set to its default value of α = 0.6. We report uncased BLEU-4 scores (see footnote 7). In Table 4, we first report the baseline scores for the original Transformer model (Vaswani et al., 2017) and our Transformer implementation (see footnote 8) with the same hyperparameters. In both cases, we use an encoder and decoder with 6 layers and the 32k wordpiece vocabulary extracted from the WMT14 training set. Our implementation obtains slightly higher scores than the original implementation.
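To make the role of α concrete, the sketch below rescores finished beam hypotheses with a length penalty. The text gives only the beam size and α = 0.6; the specific formula, ((5 + |Y|) / 6)^α from Wu et al. (2016), is an assumption about which penalty is meant, and the function names are illustrative.

```python
def length_penalty(length, alpha=0.6):
    """Length penalty of Wu et al. (2016), assumed here; equals 1 at length 1."""
    return ((5.0 + length) / 6.0) ** alpha


def best_hypothesis(hypotheses, alpha=0.6):
    """Pick the hypothesis with the highest length-normalized log-probability.

    hypotheses: list of (tokens, log_prob) pairs for finished beams.
    """
    return max(hypotheses, key=lambda h: h[1] / length_penalty(len(h[0]), alpha))


if __name__ == "__main__":
    beams = [(["Hallo", "Welt"], -4.0),
             (["Hallo"] * 10, -5.0)]
    print(best_hypothesis(beams, alpha=0.0)[0])   # no length normalization
    print(best_hypothesis(beams, alpha=0.6)[0])   # with length normalization
```

Without normalization (α = 0) the shorter hypothesis wins on raw log-probability; with α = 0.6 the longer one is preferred, which is the bias the penalty is meant to correct.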
The middle section of Table 4 reports the results for various initialization schemes using BERT and GPT-2 pre-trained checkpoints. Note that here all models have encoders and decoders with 12 layers. For BERT models, we use the BERT-Base Multilingual Cased checkpoint to initialize the encoder or the decoder or both, as the task involves one non-English language. This checkpoint has been pre-trained on 104 languages using a multilingual Wikipedia dump with a vocabulary of 110k wordpieces. First, we observe that initializing the model with the BERT checkpoint is most beneficial on the encoder side; our observation is in line with Yang et al. (2019). Furthermore, models initialized with the BERT checkpoint receive a significant boost: BERT2RND scores higher than the no-initialization RND2RND setup by +4 points on En→De and +3.6 points on De→En on newstest2014. Contrary to the WikiSplit and DiscoFuse tasks, sharing the encoder and decoder variables did not give an additional boost. This is most likely because a) model capacity is an important factor in MT and b) encoder and decoder have to deal with different grammar and vocabulary.
Footnote 6: We use NLTK v3.2.2 with case-sensitive scoring to estimate BLEU scores.
Footnote 7: We use a script from the TensorFlow Official Transformer implementation, https://github.com/tensorflow/models/tree/master/official/nlp/transformer. Note that, unlike the tensor2tensor/utils/get_ende_bleu.sh script used by Vaswani et al. (2017), this script does not split noun compounds, but we normalize UTF-8 quotes to ASCII quotes, as we noted that our pre-processed training set contains only ASCII quotes.
Footnote 8: We use Transformer layers from the official BERT implementation, which have small differences from Vaswani et al. (2017).
GPT-based models (RND2GPT, GPT, and BERT2GPT) do not perform nearly as well, especially when GPT is used as the decoder and the target language is German. This is because the GPT model comes with an English vocabulary and has been pre-trained mainly on English text. Hence, we report the scores for GPT in the En→De setting in gray.
Customized BERT checkpoint. For this experiment we did not include RoBERTa, as the public checkpoint is available for English only. Instead, we train our own checkpoint. We also observe that our implementation of the baseline Transformer, as well as the RND2RND setup, which uses no initialization, performs worse on newstest2014 than the Transformer baselines (with 6 layers and the 32k wordpiece vocabulary) that we report in the top section of Table 4. We conjecture that the difference is due to the larger 110k wordpiece vocabulary, which was trained to handle 104 languages from the Wikipedia dump and is suboptimal for WMT14 data. To verify this conjecture, we perform the following experiment: we use the 32k wordpiece vocabulary extracted from the WMT14 En ↔ De training set (the same as used in the top section of Table 4) and pre-train a BERT model on the English and German subset of the Wikipedia dump in the same way as the multilingual BERT checkpoint was obtained. We initialize our best-performing setups, BERT2RND and BERTSHARE, with this checkpoint (the third block of Table 4). This pro- [...]
Edunov et al. (2018) report better results when they augment the training set with a massive amount of back-translated sentence pairs. To the best of our knowledge, among the approaches that only leverage parallel data from WMT14, our results are state-of-the-art on both newstest2014 and newstest2016.

Abstractive Summarization
Document summarization is the task of producing a short version of a document while preserving its salient information content. We evaluate our setups on three different summarization datasets of varying characteristics: Gigaword (Napoles et al., 2012), CNN and DailyMail (Hermann et al., 2015), and BBC extreme (Narayan et al., 2018a). The Gigaword dataset focuses on abstractive sentence summarization with a total of 3.8M sentence-summary training pairs. The other two datasets focus on single-document summarization: the CNN/DailyMail dataset consists of 287k document-summary pairs, whereas the BBC dataset consists of 204k document-summary pairs. The CNN/DailyMail summaries are in the form of bullet-point story highlights and exhibit a high degree of extraction, requiring the models to learn to copy from the source documents. The BBC summaries, on the other hand, are extreme, in that the documents are summarized into single-sentence summaries. These summaries demonstrate a high level of abstractiveness, and generating them automatically requires document-level inference, abstraction, and paraphrasing.
In all three cases, we did not anonymize entities. We worked on the original cased versions of the CNN/DailyMail and BBC datasets. For Gigaword we used the lowercased version to match the requirements of the publicly available lowercased [...], 2003); in particular, we report on ROUGE-1 and ROUGE-2 for informativeness and ROUGE-L for fluency in Table 5.
Document understanding. All BERT encoder based setups (i.e., BERT2RND, BERTSHARE, ROBERTASHARE, and BERT2BERT) outperform the baseline RND2RND by a large margin. The improvements of the RND2BERT setup, where only the decoder is initialized, are narrow. These results overall validate the significance of document representation in the encoder-decoder framework for summarization. On the BBC extreme summarization in particular, these four models achieve on average a +6.85 point improvement in ROUGE-L compared to the RND2RND setup. Our results demonstrate that the models with better document representations are better at generating extreme summaries that require document-level inference and abstraction. For the extractive highlights in the CNN/DailyMail dataset, these models show an improvement of +3.53 ROUGE-L points over the RND2RND baseline. For Gigaword, where the input is a single sentence, the improvements are minimal (average of +1.02 ROUGE-L points).
The BERTSHARE setup with shared encoder and decoder parameters achieves better performance than BERT2BERT on all three datasets. The gains are larger on the BBC dataset than on the Gigaword and CNN/DailyMail datasets. This is probably because the BBC summary sentences follow a distribution that is similar to that of the sentences in the document, whereas this is not necessarily the case for the Gigaword headlines and the CNN/DailyMail bullet-point highlights. ROBERTASHARE outperforms BERTSHARE on the CNN/DailyMail and BBC datasets, and performs competitively with BERTSHARE on the Gigaword dataset, where the task is to summarize sentences.
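For readers unfamiliar with the metric, the n-gram overlap behind ROUGE-1 and ROUGE-2 can be sketched as follows. This is a simplified illustration only: it uses whitespace tokenization, a single reference, and no stemming, and it is not the official scorer used to produce the numbers in Table 5. The function name `rouge_n` is illustrative.

```python
from collections import Counter


def rouge_n(candidate, reference, n=1):
    """Simplified ROUGE-N F1 between a candidate and one reference summary."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand = ngrams(candidate.split())
    ref = ngrams(reference.split())
    # Clipped overlap: each n-gram counts at most as often as in the reference.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    gold = "five people have been shot at a california nightclub"
    pred = "five people have been shot dead at a concert in california"
    print(rouge_n(pred, gold, n=1), rouge_n(pred, gold, n=2))
```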
Summarization with GPT checkpoints. GPT (decoder-only) performs better than RND2GPT, BERT2GPT or ROBERTA2GPT (encoder-decoder models) by a large margin for generating CNN/DailyMail extracts, but worse for generating BBC abstracts. The encoder-decoder architecture, where the input document is modeled separately, is better equipped for document-level abstraction than the decoder-only architecture, where the input document is a conditioning prefix of a language model. Initialization with different checkpoints, e.g., encoder with BERT and decoder with GPT in BERT2GPT, is not effective for document summarization; BERT2GPT and ROBERTA2GPT are inferior to RND2GPT on the BBC dataset, and BERT2GPT is inferior to RND2GPT on the CNN/DailyMail dataset. However, this is not the case with the Gigaword dataset, which has 3.8M training instances; there, BERT2GPT and ROBERTA2GPT perform better than RND2GPT.
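The decoder-only setting, where the input document is a conditioning prefix of a language model, can be sketched as a simple data layout. The details here are assumptions for illustration: the paper does not specify separator tokens, the truncation policy, or whether the loss is restricted to the target positions, and the name `build_prefix_lm_example` is hypothetical.

```python
def build_prefix_lm_example(source_ids, target_ids, max_len=512, pad_id=0):
    """Concatenate source and target into one sequence for a decoder-only LM.

    Returns (ids, loss_mask). The mask is 1 on target positions and 0 on the
    source prefix and on padding, so that (by assumption) only the summary
    contributes to the training loss.
    """
    ids = (source_ids + target_ids)[:max_len]     # truncate on the right
    n_source = min(len(source_ids), max_len)
    loss_mask = [0] * n_source + [1] * (len(ids) - n_source)
    padding = max_len - len(ids)
    return ids + [pad_id] * padding, loss_mask + [0] * padding


if __name__ == "__main__":
    ids, mask = build_prefix_lm_example([5, 6, 7], [8, 9], max_len=8)
    print(ids)
    print(mask)
```

In this layout the document and the summary compete for the same 512 positions, whereas an encoder-decoder model gives each side its own budget.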
ROBERTASHARE performs the best and is on par with the current state-of-the-art MASS model (Song et al., 2019) on the Gigaword dataset. The MASS model has the advantage of pre-training the encoder-decoder attention from scratch, whereas our proposed models use the publicly available pre-trained checkpoints and are only fine-tuned on the target task.
It is not obvious how the masked seq2seq pre-training objective for sentence generation in the MASS model will be beneficial for tasks like document summarization. Our proposed models provide a generic alternative and can be easily adapted to various text generation tasks. The ROBERTASHARE setup sets a new state-of-the-art, outperforming all existing baselines by a large margin on the BBC extreme summarization task. The best model on the CNN/DailyMail dataset outperforms the Pointer Generator network (See et al., 2017) and the pre-trained single-decoder model with TransformerLM (Khandelwal et al., 2019). Our model, however, lags behind the Bottom-Up system (Gehrmann et al., 2018), which has a task-specific module for content selection along with the copy mechanism (Gu et al., 2016), and the UniLM model (Dong et al., 2019), which uses BERT-Large pre-trained for bidirectional, unidirectional and seq2seq language modeling objectives. The UniLM model is also fine-tuned with an additional extractive summarization objective to predict relevant sentences in the document; this objective could be beneficial for generating the CNN/DailyMail extracts.

Discussion on Ablation Studies
Combining Different Checkpoints. Combining BERT and GPT-2 into a single model (BERT2GPT) did not work and often underperformed a randomly initialized baseline. This is presumably because the model has to learn two different vocabularies. This argument is backed by the fact that the BERT2GPT setup performed well for De→En MT. For this task the vocabulary setting works in the model's favor: two vocabularies have to be learned anyway, and the output is English, the language GPT-2 was trained on. Since RoBERTa and GPT-2 share the same vocabulary, combining both into a single model (ROBERTA2GPT) showed strong results on several tasks but did not outperform a setup where RoBERTa is used in the encoder and decoder.
Tuning GPT-2 Based Models. We were surprised that setups using the GPT-2 checkpoint performed relatively poorly given that it is trained as a language model on a large corpus; our intuition was that GPT-2-initialized decoders would be strong natural language generators. To ensure that this was not due to an unfortunate choice of hyperparameters, we tuned the learning rate, the warmup steps, and the optimizer ∈ {Adam, Adafactor} for the GPT-2 based setups (RND2GPT, GPT, BERT2GPT) on the DiscoFuse dataset. Naturally, this gave us slightly higher numbers, but not by a magnitude that would suggest a previously suboptimal setting. Specifically, we got a SARI score of 88.8 compared to 88.4 for BERT2GPT, 88.1 compared to 88.0 for GPT, and 87.7 compared to 86.5 for RND2GPT.
Initializing only Embeddings. We want to investigate the impact of the non-contextualized BERT and GPT-2 embeddings. This means we are initializing the Transformer model with only the embedding matrices. The advantage of this setup would be that we could freely choose the model architecture and size and adapt it to a specific task. We found almost no improvement over the fully randomly initialized model RND2RND. Concretely, we compute a SARI score of 87.1 using the BERT embeddings and 87.0 using the GPT-2 embeddings, compared to 86.9 for the RND2RND baseline. We observe slightly higher improvements of up to 2 percentage points when training on only 10% of the training data.
Initializing only Layers. In contrast to the previous paragraph, we want to investigate the effect of initializing everything but the word embedding matrix. The embedding matrix accounts for only 10-31% of all learnable parameters, and sometimes the vocabulary given by a public checkpoint might not be optimal for a certain task. In these cases, it would be nice to redefine the vocabulary while still leveraging the checkpoint. First, we remove the embedding matrices from the warm-started variables and observe a drop of 1.7 points using the BERTSHARE setup and 11 points using the GPT setup (Table 6). The latter is probably due to the large vocabulary of the GPT-2 model, which now remains randomly initialized. We then train a new BPE model with 16k tokens using the DiscoFuse training data (Kudo and Richardson, 2018; Sennrich et al., 2016). We observe almost no change on BERTSHARE, suggesting that the BERT vocabulary was already optimal for DiscoFuse. GPT, however, showed a significant improvement using this much smaller vocabulary but is still behind the fully initialized setup. Finally, we experimented with a more careful way of training the model: we fix all warm-started variables for 100k steps. During this pre-training phase, we only train the new word embeddings. After the pre-training, we fine-tune the entire model for another 300k steps. This training scheme resulted in an improvement of 0.5 for the BERTSHARE setup, but overall the number is still far behind the fully initialized setup. For GPT, this training scheme did not result in a satisfactory training curve.
Initializing a Subset of Layers. Motivated by the results of using 24 layers, we want to investigate whether only a subset of these 24 layers can be used. To account for the larger hidden layer size (1024 vs. 768) and filter size (4096 vs. 3072), we limit ourselves to using only 10 layers and the embedding matrix of this model. This model still has more parameters than the base model (324M vs. 221M for BERT2BERT, 198M vs. 136M for BERTSHARE) but can be trained with the same batch size in a comparable amount of time (3 min/1000 iterations). As an initial experiment, we used the first 10 layers of the large BERT checkpoint to initialize the BERTSHARE setup. This gave us a SARI score of 88.2 on DiscoFuse, compared to 89.3 when using the base checkpoint and 87.0 when using the embeddings only (see "Initializing only Embeddings"). We then performed a hyperparameter search on the evaluation set using CMA-ES (Hansen, 2016) to find an optimal subset of layers to use. The best setup used the following layers: 9, 10, 13-18, 23, 24; and achieved a SARI score of 89.1. While this is a remarkable improvement over using the first 10 layers, this setup is still outperformed by the base BERT model.
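Initializing a 10-layer model from a subset of the 24 layers amounts to renaming the selected checkpoint layers to consecutive positions and dropping the rest. The sketch below assumes variable names that contain a 0-based `layer_<k>` component and treats the layer list reported above as 1-based (it includes 24). The names `remap_layers` and `BEST_LAYERS` and the toy variable names are illustrative, not the authors' code.

```python
import re

# Best subset found by the search, as reported in the text (1-based).
BEST_LAYERS = [9, 10, 13, 14, 15, 16, 17, 18, 23, 24]


def remap_layers(variables, selected_layers):
    """Keep only the selected layers and renumber them consecutively.

    variables: dict mapping variable name -> value, where per-layer names
        contain "layer_<k>" with a 0-based index k (an assumed convention).
    selected_layers: 1-based source layer numbers, in target order.
    """
    new_index = {src - 1: dst for dst, src in enumerate(selected_layers)}
    remapped = {}
    for name, value in variables.items():
        match = re.search(r"layer_(\d+)", name)
        if match is None:
            remapped[name] = value        # e.g. embeddings are kept as-is
            continue
        src = int(match.group(1))
        if src in new_index:              # unselected layers are dropped
            new_name = (name[:match.start()]
                        + f"layer_{new_index[src]}"
                        + name[match.end():])
            remapped[new_name] = value
    return remapped


if __name__ == "__main__":
    ckpt = {"embeddings/word": "E"}
    ckpt.update({f"encoder/layer_{i}/w": i for i in range(24)})
    for name, value in sorted(remap_layers(ckpt, BEST_LAYERS).items()):
        print(name, value)
```

Passing `list(range(1, 11))` instead of `BEST_LAYERS` reproduces the initial experiment that uses the first 10 layers.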

Analysis of Abstractive Summaries
Finally, we present a qualitative analysis of these models for text generation. In particular, we focused on extreme summarization, which assesses a model's ability to do document-level inference and abstraction. We evaluated summaries from a randomly initialized model (RND2RND) and from the best-performing models initialized with GPT checkpoints (RND2GPT), BERT checkpoints (BERTSHARE) and RoBERTa checkpoints (ROBERTASHARE). We also included GOLD summaries in our evaluation. Results are presented in Table 7.

RND2RND
The Queen has celebrated her 90th birthday with a message on social media about her 90th birthday.

RND2GPT
The Queen has celebrated her 90th birthday with a birthday celebration in Buckingham Palace.

BERTSHARE
The Queen has paid tribute to the Queen by sending a tweet saying she was "unwittingly unwittingly unwittingly.

ROBERTASHARE
The Queen has sent a twitter message for her 90th birthday on twitter. GOLD The Queen has tweeted her thanks to people who sent her 90th birthday messages on social media.

RND2RND
Sir Bradley Wiggins says he is "proud" of being involved in the use of a banned steroid against Sir Bradley Wiggins.

RND2GPT
Team Sky boss Sir Dave Brailsford says he is "disappointed" after team Sky agreed to change their contracts with team Sky.

BERTSHARE
Team Sky boss Sir Dave Brailsford says he is "proud" of his team's handling of doping in cycling.

ROBERTASHARE
Team Sky boss Dave Brailsford says he is "not proud" of his team's handling of allegations of wrongdoing in the sport. GOLD Team Sky boss Sir Dave Brailsford has said that his handling of the media following allegations against his team has made things a "damn sight worse".

RND2RND
A 19-year-old American singer has been shot dead by police in San Francisco.

RND2GPT
Police are investigating a shooting in the grounds of a music venue in Los Angeles. BERTSHARE US singer Chris Brown has been shot and wounded at a gig in the US state of California. ROBERTASHARE Five people have been shot dead in a shooting at a concert in California. GOLD Five people have been shot at a California nightclub while Chris Brown was performing.

RND2RND
A council has asked people not to keep their toilets in a bid to save money.

RND2GPT
People are being urged to use a "ladies' toilet" in Skye in Skye in Skye by their own councillor.

BERTSHARE
Complaints about the availability of public toilets on Skye and the isle of Skye is being investigated by highland council.

ROBERTASHARE
Highland council has commissioned a review of public toilets and public toilets on Skye.

GOLD
Islanders on Skye have demanded greater availability of public toilets after complaints some visitors to the Isle are relieving themselves outside.

RND2RND
A man has been jailed for six years for posting offensive comments on Facebook about an Aberdeen teenager who was later found dead.

RND2GPT
A man who admitted killing his six-year-old friend in a disturbance in Aberdeen has been jailed.

BERTSHARE
A man who admitted murdering a toddler after posting offensive comments about him on Facebook has been jailed for three years.

ROBERTASHARE
A man has been jailed for three months for posting "vile" abuse on Facebook about a missing toddler found dead in his Aberdeenshire home.

GOLD
A man who admitted posting offensive comments on Facebook about an Edinburgh boy beaten to death by his mother has been jailed for 12 months.

Human Assessment of Summary Quality. The study was conducted on the Amazon Mechanical Turk platform using Best-Worst Scaling, a less labor-intensive alternative to paired comparisons (Louviere and Woodworth, 1991; Louviere et al., 2015). Our participants were presented with a document and summaries generated from two out of five systems (four models and gold summaries) and were asked to decide which summary was better than the other in terms of informativeness (does the summary capture important information in the document correctly and concisely?) and fluency (is the summary written in well-formed English?). We randomly selected 40 documents from the XSum test set. We collected judgments from three different participants for each comparison. The order of summaries was randomized per document and the order of documents per participant. The score of a system was computed as the percentage of times it was chosen as best minus the percentage of times it was selected as worst. The scores range from -1 (worst) to 1 (best).
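The Best-Worst Scaling score described above can be illustrated with a minimal sketch. The function name, the tuple layout of a judgment, and the toy system names are assumptions made for illustration; this is not the evaluation code used in the study.

```python
from collections import Counter


def best_worst_scores(judgments):
    """Compute a Best-Worst Scaling score per system.

    `judgments` is an iterable of (shown, best, worst) tuples, where
    `shown` lists the systems presented in one comparison (hypothetical
    layout). The score is the fraction of times a system was chosen as
    best minus the fraction of times it was chosen as worst, so it
    lies in [-1, 1].
    """
    best, worst, shown_count = Counter(), Counter(), Counter()
    for shown, best_system, worst_system in judgments:
        for system in shown:
            shown_count[system] += 1
        best[best_system] += 1
        worst[worst_system] += 1
    return {
        system: (best[system] - worst[system]) / shown_count[system]
        for system in shown_count
    }


# Toy judgments with made-up system names, one tuple per participant decision.
toy_judgments = [
    (("A", "B"), "A", "B"),
    (("A", "C"), "A", "C"),
    (("B", "C"), "C", "B"),
]
toy_scores = best_worst_scores(toy_judgments)
```

With only two summaries per comparison, as in the setup above, choosing the better summary implicitly marks the other one as worst, so each judgment contributes one "best" and one "worst" vote.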
See Figure 1 for a few sample predictions that were used in our human evaluation. Our participants found the ROBERTASHARE summaries to be the best in terms of their overall quality; the BERTSHARE summaries ranked second after ROBERTASHARE. We further carried out pairwise comparisons between all models to assess whether system differences are statistically significant. We did not observe [...]
Finally, we estimated the percentage of summaries with at least one repetition of rare or content words. We discarded the 500 most common words from the model-generated and reference summaries; the rest were considered rare or content words. BERTSHARE and ROBERTASHARE summaries improve over the RND2RND summaries, but have more repetitions than the RND2GPT summaries. See examples in Figure 1 for redundant repeated spans marked in orange.
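The repetition estimate can be sketched as follows. This is a minimal illustration under stated assumptions: lowercased whitespace tokenization, and word frequencies computed over the same pool of summaries being scored. The text does not specify the tokenizer or how the model-generated and reference summaries were pooled, and the function name is hypothetical.

```python
from collections import Counter


def repetition_rate(summaries, num_common=500):
    """Percentage of summaries that repeat at least one rare/content word.

    Assumptions (not from the paper): lowercased whitespace tokens, and
    the `num_common` most frequent words are estimated from `summaries`
    themselves and treated as non-content words.
    """
    tokenized = [summary.lower().split() for summary in summaries]
    frequencies = Counter(token for tokens in tokenized for token in tokens)
    common_words = {word for word, _ in frequencies.most_common(num_common)}

    with_repetition = 0
    for tokens in tokenized:
        content_counts = Counter(t for t in tokens if t not in common_words)
        if any(count > 1 for count in content_counts.values()):
            with_repetition += 1
    if not tokenized:
        return 0.0
    return 100.0 * with_repetition / len(tokenized)


# Toy example: only "the" is discarded as a common word.
toy_rate = repetition_rate(
    ["the cat saw the cat", "the dog saw the bird"], num_common=1
)
```

In the toy call, the first summary repeats the content word "cat" and the second repeats only the discarded word "the", so one of the two summaries counts as containing a repetition.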
Overall, BERTSHARE and ROBERTASHARE summaries are unequivocally better than RND2GPT summaries in terms of both automatic evaluations (assessing ROUGE) and human evaluations (assessing summary quality). However, there is still room for improvement in these models (Dong et al., 2019; Song et al., 2019).

Related Work
Representation learning. Starting around 2013, word embeddings like word2vec (Mikolov et al., 2013) or GloVe (Pennington et al., 2014) became popular as they were easy to train in an unsupervised fashion on raw text and they improved several downstream tasks when used as features.

These word embeddings are invariant to the context the word is in. There had been earlier work to contextualize these embeddings, mainly to account for synonyms (e.g., Huang et al., 2012; Rothe and Schütze, 2015), but only in 2018 did training of contextualized embeddings using large deep neural networks and an unsupervised training scheme become popular.
While ELMo (Peters et al., 2018) and ULMFiT (Howard and Ruder, 2018) are based on LSTMs (Hochreiter and Schmidhuber, 1997), BERT and GPT are based on the transformer architecture (Vaswani et al., 2017). This architecture outperforms LSTMs on several NLP tasks and we therefore concentrated on these two pre-trained models. The contextualized embedding for each input token is given by the corresponding output of the last encoder layer.
Pre-training models. One can also see these models as pre-trained models (Dai and Le, 2015), which are then fine-tuned for a downstream task. This is the conceptual view we adopted for this paper. Why unsupervised pre-training helps deep learning was investigated by Erhan et al. (2010). While the unsupervised pre-training strategies are different from those used in our paper, we expect the findings to still hold.
Erhan et al. (2010) show that unsupervised pre-training is not simply a way of getting a good initial marginal distribution, that classical regularization techniques cannot achieve the same performance as unsupervised pre-training, and that the effect of unsupervised pre-training does not go away with more training data. An extensive study of pre-training was done by Wang et al. (2019a). This study compares single sentence classification, sentence pair classification, sequence-to-sequence, and language modeling tasks for pre-training and measures the effect on GLUE. The primary results support the use of language modeling. Peters et al. (2019) explore whether it is preferable to fine-tune the entire model on a specific task or to use the learned representations as features, i.e., freezing the pre-trained model. Their results suggest that the relative performance of fine-tuning vs. feature extraction depends on the similarity between the pre-training and the target tasks. Wang et al. (2019b) propose a combination of both, where the model is first trained with the BERT parameters frozen and then the entire model is fine-tuned. This is the training scheme we used in the "Initializing only Layers" study.
Pre-training for sequence generation. Pre-training for seq2seq learning was first done by Ramachandran et al. (2017). They used a language model to pre-train the encoder and decoder of an RNN seq2seq model. Their method improved BLEU scores on newstest2014 by 3 points and ROUGE-L on CNN/Dailymail also by 3 points. However, their BLEU score of 24.7 on newstest2014 En→De, compared to 30.6 in this work, and their ROUGE-L of 29.4 on CNN/Dailymail, compared to 36.33, also show the superiority of the transformer model as well as the masked language model objective of BERT. MASS (Song et al., 2019) is a BERT-inspired method of pre-training sequence-to-sequence models. One advantage of this method is that, in contrast to our setups (except for GPT), the encoder-decoder attention mechanism is also pre-trained.
The downside of this approach is that the pre-trained model is task-specific and not as general as BERT or GPT-2. UniLM (Dong et al., 2019) also unifies bidirectional, unidirectional, and sequence-to-sequence language modeling. At the time of writing, no public checkpoint was available to us. We compare our work with their results in Table 5. To overcome the issue that the encoder-decoder attention is not pre-trained, Khandelwal et al. (2019) pre-trained a single transformer language model that encodes the source and generates the target. This setup matches our GPT setup. Conneau and Lample (2019) pre-train their model using causal language modeling (like GPT), masked language modeling (like BERT), and a third new objective called translation language modeling to improve cross-lingual pre-training.

Leveraging public checkpoints. BERT has been used for various NLP tasks, such as question answering on the SQuAD dataset (Rajpurkar et al., 2018). It also achieved new state-of-the-art results on the GLUE benchmark (Williams et al., 2018) and grounded commonsense inference (SWAG, Zellers et al., 2018). All of these tasks are a form of classification or regression. Liu (2019) fine-tuned BERT for extractive summarization.
An analysis of different layers of the BERT model was performed by Tenney et al. (2019). They found that the classical NLP pipeline appears in the expected sequence. In the context of our experiments in "Initializing a Subset of Layers", this would mean that the DiscoFuse task profits the most from pre-trained information about POS, constituents, dependencies and semantic roles. A similar study by Jawahar et al. (2019) found that BERT captures phrase-level information in the lower layers and linguistic information in intermediate layers, with surface features at the bottom, syntactic features in the middle and semantic features at the top.
GPT was also evaluated on natural language inference tasks. In the extended version of GPT-2, the model was evaluated on more general natural language processing tasks, like machine translation, reading comprehension, summarization, and language modeling. GPT-2 achieved new state-of-the-art results on several language modeling datasets. On the other tasks, GPT-2 outperformed some unsupervised baselines but is still far behind supervised or task-specific approaches.
After we performed the majority of our experiments, XLNet, an autoregressive pre-training method based on Transformer XL (Dai et al., 2019), was released. XLNet achieved new state-of-the-art results on several NLP tasks. We leave the experiments with their public checkpoint for future work.

Conclusion
We performed an extensive study on leveraging pre-trained checkpoints for sequence generation. Our findings show that a pre-trained encoder is an essential component. Most tasks also profit from sharing the weights between the encoder and the decoder, which additionally decreases the memory footprint. While combining BERT and GPT-2 into a single model often underperformed a randomly initialized baseline, combining RoBERTa and GPT-2 achieves strong results and shows the importance of sharing the vocabulary. Training a language-specific BERT model also improves performance over using the multilingual version.