Table 1

Comparison according to several criteria: Ref. Point = Reference point (Absolute, Relative, or Both); Inject. Met. = Injection method (ape or mam); Learnable = Are the position representations learned during training?; Recurring = Is position information recurring at each layer vs. only before the first layer?; Unbound = Can the position model generalize to longer inputs than a fixed value?; #Param = Number of parameters introduced by the position model (notation follows the main text of the paper: d hidden dimension, h number of attention heads, n vocabulary size, tmax longest sequence length, l number of layers). The – symbol means that an entry does not fit into our categories. Note that a model as a whole can combine different position models, while this comparison focuses on the respective novel part(s).

| Data Structure | Model | Ref. Point | Inject. Met. | Learnable | Recurring | Unbound | #Param |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Sequence | Transformer w/ emb. (Vaswani et al. 2017) | | ape | ✔ | ✖ | ✖ | tmax·d |
| | BERT (Devlin et al. 2019) | | | | | | |
| | Reformer (Kitaev, Kaiser, and Levskaya 2020) | | | | | | (d − d1)·tmax/t1 + d1·t1 |
| | FLOATER (Liu et al. 2020) | | ape | ✔ | ✔ | ✔ | 0 or more |
| | Shortformer (Press, Smith, and Lewis 2021) | | ape | ✖ | ✔ | ✔ | |
| | Wang et al. (2020) | | – | ✔ | ✖ | ✔ | 2·n·d |
| | Shaw, Uszkoreit, and Vaswani (2018) (abs) | | mam | ✔ | ✔ | ✖ | 2·tmax²·d·l |
| | Shaw, Uszkoreit, and Vaswani (2018) (rel) | | mam | ✔ | ✔ | ✖ | 2(2tmax − 1)·d·l |
| | T5 (Raffel et al. 2020) | | | | | | (2tmax − 1)·h |
| | Huang et al. (2020) | | | | | | d·l·h·(2tmax − 1) |
| | DeBERTa (He et al. 2021) | Both | | ✔ | ✔ | ✖ | 3·tmax·d |
| | Transformer XL (Dai et al. 2019) | | mam | ✔ | ✔ | ✔ | 2d + d²·l·h |
| | TENER (Yan et al. 2019) | | | | | | 2·d·l·h |
| | DA-Transformer (Wu, Wu, and Huang 2021) | | | | | | 2h |
| | TUPE (Ke, He, and Liu 2021) | | mam | ✔ | ✖ | ✖ | 2d² + tmax(d + 2) |
| | RNN-Transf. (Neishi and Yoshinaga 2019) | | – | ✔ | ✖ | ✔ | 6d² + 3d |
| | SPE (Liutkus et al. 2021) | | mam | ✔ | ✔ | ✖ | 3K·d·h + l·d |
| | Transformer w/ sin. (Vaswani et al. 2017) | | ape | ✖ | ✖ | ✔ | |
| | Li et al. (2019) | | | | | | |
| | Takase and Okazaki (2019) | | | | | | |
| | Oka et al. (2020) | | | | | | |
| | Universal Transf. (Dehghani et al. 2019) | | ape | ✖ | ✔ | ✔ | |
| | DiSAN (Shen et al. 2018) | | mam | ✖ | ✔ | ✔ | |
| | Rotary (Su et al. 2021) | | | | | | |
| Tree | SPR-abs (Wang et al. 2019) | | ape | ✖ | ✖ | ✔ | |
| | SPR-rel (Wang et al. 2019) | | mam | ✔ | ✖ | ✖ | 2(2tmax + 1)·d |
| | TPE (Shiv and Quirk 2019) | | ape | ✔ | ✖ | ✖ | d·Dmax |
| Graph | Struct. Transformer (Zhu et al. 2019) | | mam | ✔ | ✔ | ✔ | 5d² + (d + 1)·dr |
| | Graph Transformer (Cai and Lam 2020) | | | | | | 7d² + 3d |
| | Graformer (Schmitt et al. 2021) | | mam | ✔ | ✔ | ✖ | 2(Dmax + 1)·h |
| | Graph Transformer (Dwivedi and Bresson 2020) | | ape | ✖ | ✖ | ✔ | |
| | graph-bert (Zhang et al. 2020) | | | | | | |
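As a concrete illustration of the Inject. Met., Learnable, Unbound, and #Param columns, the following is a minimal PyTorch sketch (not taken from any of the cited implementations; the class names LearnedAPE and SinusoidalAPE are illustrative only) of the two ape variants from Vaswani et al. (2017): learned absolute position embeddings are learnable, bounded by tmax, and add tmax·d parameters, whereas the sinusoidal encoding is fixed, unbound, and adds no parameters.

```python
# Illustrative sketch only; class names and hyperparameters are assumptions,
# not the original implementations of the cited models.
import torch
import torch.nn as nn


class LearnedAPE(nn.Module):
    """Adds a learned embedding per absolute position (tmax * d parameters)."""

    def __init__(self, t_max: int, d: int):
        super().__init__()
        self.pos_emb = nn.Embedding(t_max, d)  # the tmax * d learnable parameters

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, t, d); fails for t > t_max, i.e. the model is not "unbound"
        t = x.size(1)
        positions = torch.arange(t, device=x.device)
        return x + self.pos_emb(positions)


class SinusoidalAPE(nn.Module):
    """Adds fixed sin/cos features of the absolute position (0 parameters)."""

    def __init__(self, d: int):
        super().__init__()
        self.d = d

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, t, d); works for any t, i.e. the model is "unbound"
        t = x.size(1)
        pos = torch.arange(t, device=x.device, dtype=torch.float32).unsqueeze(1)
        i = torch.arange(0, self.d, 2, device=x.device, dtype=torch.float32)
        angles = pos / torch.pow(torch.tensor(10000.0), i / self.d)  # (t, d/2)
        enc = torch.zeros(t, self.d, device=x.device)
        enc[:, 0::2] = torch.sin(angles)
        enc[:, 1::2] = torch.cos(angles)
        return x + enc


if __name__ == "__main__":
    d, t_max = 64, 128
    x = torch.randn(2, 16, d)
    learned, sinusoidal = LearnedAPE(t_max, d), SinusoidalAPE(d)
    print(sum(p.numel() for p in learned.parameters()))     # t_max * d = 8192
    print(sum(p.numel() for p in sinusoidal.parameters()))  # 0
    print(learned(x).shape, sinusoidal(x).shape)
```

The mam-style entries can be read the same way; for instance, Shaw, Uszkoreit, and Vaswani (2018) (rel) learns two d-dimensional vectors (for keys and values) for each of the 2tmax − 1 possible relative distances in each of the l layers, giving the 2(2tmax − 1)·d·l parameters listed above.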