Table 5:
Language modeling results on the PG-19 dataset. Local Transformer refers to a Transformer (Vaswani et al., 2017) with relative position encoding (Shaw et al., 2018) together with local attention. Perplexity is normalized by the number of tokens reported in Rae et al. (2020) and is reported on the test set.
Model                                        Layers   Perplexity
Local Transformer                            24       39.3
TransformerXL (Dai et al., 2019)             36       36.3
Compressive Transformer (Rae et al., 2020)   36       33.6
Routing Transformer                          22       33.2
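The caption states that perplexity is normalized by the token counts reported in Rae et al. (2020), which makes numbers comparable across models that use different tokenizers. A minimal sketch of that conversion (the function name and the numeric inputs are illustrative, not from the paper):

```python
import math

def normalized_perplexity(total_nll_nats, num_reported_tokens):
    """Convert a summed negative log-likelihood (in nats) over the test set
    into per-token perplexity, dividing by an externally reported token
    count so that models with different vocabularies are comparable."""
    return math.exp(total_nll_nats / num_reported_tokens)

# Illustrative numbers only: a total NLL of 3.5e6 nats over 1e6 reported
# tokens gives exp(3.5), roughly 33.1.
ppl = normalized_perplexity(3.5e6, 1e6)
```

Because every model's total NLL is divided by the same reported token count, a lower perplexity here reflects a genuinely better fit rather than a coarser tokenization.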