Comparison according to several criteria: Ref. Point = reference point (absolute, relative, or both); Inject. Met. = injection method (APE or MAM); Learnable = are the position representations learned during training?; Recurring = is position information recurring at each layer (vs. only before the first layer)?; Unbound = can the position model generalize to longer inputs than a fixed value?; #Param = number of parameters introduced by the position model (notation follows the main text: d = hidden dimension, h = number of attention heads, n = vocabulary size, t_max = longest sequence length, l = number of layers). The – symbol means that an entry does not fit into our categories. Blank cells repeat the entry of the row above within a group. Note that a model as a whole can combine different position models, while this comparison focuses on the respective novel part(s).
| Data Structure | Model | Ref. Point | Inject. Met. | Learnable | Recurring | Unbound | #Param |
|---|---|---|---|---|---|---|---|
| Sequence | Transformer w/ emb. (Vaswani et al. 2017) | A | APE | ✔ | ✖ | ✖ | |
| | BERT (Devlin et al. 2019) | | | | | | |
| | Reformer (Kitaev, Kaiser, and Levskaya 2020) | | | | | | |
| | FLOATER (Liu et al. 2020) | A | APE | ✔ | ✔ | ✔ | 0 or more |
| | Shortformer (Press, Smith, and Lewis 2021) | A | APE | ✖ | ✔ | ✔ | 0 |
| | Wang et al. (2020) | A | – | ✔ | ✖ | ✔ | 2·n·d |
| | Shaw, Uszkoreit, and Vaswani (2018) (abs) | A | MAM | ✔ | ✔ | ✖ | 2·t_max²·d·l |
| | Shaw, Uszkoreit, and Vaswani (2018) (rel) | R | MAM | ✔ | ✔ | ✖ | 2(2·t_max − 1)·d·l |
| | T5 (Raffel et al. 2020) | | | | | | (2·t_max − 1)·h |
| | Huang et al. (2020) | | | | | | d·l·h·(2·t_max − 1) |
| | DeBERTa (He et al. 2021) | B | Both | ✔ | ✔ | ✖ | 3·t_max·d |
| | Transformer XL (Dai et al. 2019) | R | MAM | ✔ | ✔ | ✔ | 2d + d²·l·h |
| | TENER (Yan et al. 2019) | | | | | | 2·d·l·h |
| | DA-Transformer (Wu, Wu, and Huang 2021) | | | | | | 2h |
| | TUPE (Ke, He, and Liu 2021) | B | MAM | ✔ | ✖ | ✖ | 2d² + t_max(d + 2) |
| | RNN-Transf. (Neishi and Yoshinaga 2019) | R | – | ✔ | ✖ | ✔ | 6d² + 3d |
| | SPE (Liutkus et al. 2021) | R | MAM | ✔ | ✔ | ✖ | 3·K·d·h + l·d |
| | Transformer w/ sin. (Vaswani et al. 2017) | A | APE | ✖ | ✖ | ✔ | 0 |
| | Li et al. (2019) | | | | | | |
| | Takase and Okazaki (2019) | | | | | | |
| | Oka et al. (2020) | | | | | | |
| | Universal Transf. (Dehghani et al. 2019) | A | APE | ✖ | ✔ | ✔ | 0 |
| | DiSAN (Shen et al. 2018) | R | MAM | ✖ | ✔ | ✔ | 0 |
| | Rotary (Su et al. 2021) | | | | | | |
| Tree | SPR-abs (Wang et al. 2019) | A | APE | ✖ | ✖ | ✔ | 0 |
| | SPR-rel (Wang et al. 2019) | R | MAM | ✔ | ✖ | ✖ | 2(2·t_max + 1)·d |
| | TPE (Shiv and Quirk 2019) | A | APE | ✔ | ✖ | ✖ | |
| Graph | Struct. Transformer (Zhu et al. 2019) | R | MAM | ✔ | ✔ | ✔ | 5d² + (d + 1)·d_r |
| | Graph Transformer (Cai and Lam 2020) | | | | | | 7d² + 3d |
| | Graformer (Schmitt et al. 2021) | R | MAM | ✔ | ✔ | ✖ | 2(D_max + 1)·h |
| | Graph Transformer (Dwivedi and Bresson 2020) | A | APE | ✖ | ✖ | ✔ | 0 |
| | graph-bert (Zhang et al. 2020) | B | | | | | |
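To make the criteria concrete for the APE rows, the following is a minimal sketch (in Python/NumPy, not code from any of the cited papers) of the sinusoidal encoding of Vaswani et al. (2017). It illustrates why that row reads Learnable = ✖, Recurring = ✖, Unbound = ✔, and #Param = 0: the encoding is a fixed function of the position index, evaluable for arbitrarily long inputs, added once to the token embeddings before the first layer.

```python
import numpy as np

def sinusoidal_ape(t: int, d: int) -> np.ndarray:
    """Fixed sinusoidal position encodings (Vaswani et al. 2017).

    PE(pos, 2i)   = sin(pos / 10000^(2i/d))
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

    No trainable weights (#Param = 0) and defined for any t (Unbound).
    Assumes d is even.
    """
    pos = np.arange(t)[:, None]              # (t, 1) position indices
    i = np.arange(0, d, 2)[None, :]          # (1, d/2) even dimensions
    angles = pos / np.power(10000.0, i / d)  # one angle per (pos, dim) pair
    enc = np.empty((t, d))
    enc[:, 0::2] = np.sin(angles)            # even dimensions: sine
    enc[:, 1::2] = np.cos(angles)            # odd dimensions: cosine
    return enc

# APE injection: added to the token embeddings once, before the first
# layer, hence Recurring = ✖ for the vanilla Transformer.
token_emb = np.random.randn(128, 512)        # (sequence length, d)
layer_input = token_emb + sinusoidal_ape(128, 512)
```

A learned variant (the Transformer w/ emb., BERT, and Reformer rows) replaces this fixed function with a trainable lookup table of one d-dimensional vector per position, which is why those rows read Learnable = ✔ but Unbound = ✖: positions beyond the table size have no representation.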
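The MAM rows instead inject position information by modifying the attention matrix. As a minimal single-head sketch (assuming a T5-style learned scalar bias per relative distance, matching the (2·t_max − 1)·h count in the table per head; not the official implementation), a position-dependent term is added to the attention logits before the softmax:

```python
import numpy as np

def attention_with_relative_bias(q, k, v, bias_table, t_max):
    """Scaled dot-product attention with a learned relative-position
    bias added to the logits (T5-style MAM sketch; single head, no mask).

    bias_table has shape (2*t_max - 1,): one scalar per relative
    distance in [-(t_max - 1), t_max - 1].
    """
    t, d = q.shape
    logits = q @ k.T / np.sqrt(d)                        # (t, t) content term
    rel = np.arange(t)[None, :] - np.arange(t)[:, None]  # distances j - i
    logits = logits + bias_table[rel + t_max - 1]        # position term (MAM)
    weights = np.exp(logits - logits.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)            # row-wise softmax
    return weights @ v
```

Because the bias enters the attention computation itself, it can act at every layer (Recurring = ✔ in the corresponding rows), and because the lookup table only covers distances up to t_max, longer inputs fall outside its range (Unbound = ✖).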
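Finally, the #Param column is a closed-form count, so the formulas can be checked mechanically. The snippet below (hyperparameter values are illustrative, BERT-base-sized, not taken from the paper) evaluates a few of them:

```python
# Evaluate some #Param formulas from the table; the hyperparameter
# values are illustrative (BERT-base-sized), not from the paper.
d, h, l, t_max = 768, 12, 12, 512

param_counts = {
    # Shaw, Uszkoreit, and Vaswani (2018) (rel): 2(2·t_max − 1)·d·l
    "Shaw et al. (rel)": 2 * (2 * t_max - 1) * d * l,
    # T5 (Raffel et al. 2020): (2·t_max − 1)·h
    "T5": (2 * t_max - 1) * h,
    # TUPE (Ke, He, and Liu 2021): 2d² + t_max(d + 2)
    "TUPE": 2 * d**2 + t_max * (d + 2),
    # DeBERTa (He et al. 2021): 3·t_max·d
    "DeBERTa": 3 * t_max * d,
}
for model, count in param_counts.items():
    print(f"{model:>18}: {count:>12,} position parameters")
```

Even among models with the same reference point and injection method, the counts differ by orders of magnitude: with these values, T5's per-head scalar biases amount to roughly 12K parameters, while Shaw et al.'s per-layer key/value offsets exceed 18M.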