Table 2: Results on language modeling on the WikiText-103 dataset. Local Transformer refers to the Transformer (Vaswani et al., 2017) with relative position encoding (Shaw et al., 2018) together with local attention. Perplexity is reported on the test set.
Model                                            Layers   Heads   Perplexity
LSTMs (Grave et al., 2017)                       −        −       40.8
QRNNs (Merity et al., 2018)                      −        −       33.0
Adaptive Transformer (Sukhbaatar et al., 2019)   36               20.6
Local Transformer                                16       16      19.8
Adaptive Input (Baevski and Auli, 2019)          16       16      18.7
Transformer-XL (Dai et al., 2019)                18       16      18.3
Routing Transformer                              10       16      15.8