Table 1: The number of total trainable parameters, embedding parameters, and parameters initialized from the checkpoint vs. randomly. The BERT/GPT-2 embeddings have 23M/39M parameters. The encoder-decoder attention accounts for 26M parameters.

                total   embed.   init.   random
rnd2rnd         221M    23M      -       221M
bert2rnd        221M    23M      109M    112M
rnd2bert        221M    23M      109M    26M
bert2bert       221M    23M      195M    26M
bertShare       136M    23M      109M    26M
robertaShare    152M    39M      125M    26M
gpt             125M    39M      125M    -
rnd2gpt         238M    39M      125M    114M
bert2gpt        260M    62M      234M    26M
roberta2gpt     276M    78M      250M    26M
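As a back-of-the-envelope check (an illustrative decomposition, assuming the 109M BERT-base and 125M GPT-2/RoBERTa figures in the init. column already include their respective embedding matrices), several totals follow directly from the component sizes given in the caption:

bert2bert:  109M + 109M - 23M (shared embeddings) + 26M (cross-attention) = 221M
bertShare:  109M + 26M (cross-attention) = 135M ≈ 136M after rounding
bert2gpt:   109M + 125M + 26M (cross-attention) = 260M, with 23M + 39M = 62M embeddings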