Table 5: Results of language modeling on the PG-19 data set. Local Transformer refers to the Transformer (Vaswani et al., 2017) with relative position encoding (Shaw et al., 2018) together with local attention. Perplexity is normalized by the number of tokens reported in Rae et al. (2020) and is reported on the test set.

Model                                        Layers  Heads  Perplexity
Local Transformer                            24             39.3
TransformerXL (Dai et al., 2019)             36      −      36.3
Compressive Transformer (Rae et al., 2020)   36      −      33.6
Routing Transformer                          22             33.2
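The caption's normalization can be read as evaluating perplexity with a fixed reference token count in the exponent, so that models trained with different subword vocabularies are scored on a common scale. The sketch below is a minimal illustration of that convention, assuming the reference count is the PG-19 test-set token count reported in Rae et al. (2020); the variable names and numbers are hypothetical and for illustration only.

```python
import math

def normalized_perplexity(total_nll_nats: float, num_reference_tokens: int) -> float:
    """Perplexity whose exponent is normalized by a shared reference token
    count (e.g., the PG-19 test-set count from Rae et al., 2020), so that
    models with different tokenizers remain comparable."""
    return math.exp(total_nll_nats / num_reference_tokens)

# Hypothetical values: summed test-set negative log-likelihood in nats
# (over the model's own subword units) and the reference token count.
total_nll = 3.5e7          # assumed summed NLL in nats
reference_tokens = 1.0e7   # assumed reference token count
print(f"normalized test perplexity: "
      f"{normalized_perplexity(total_nll, reference_tokens):.1f}")
```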