Evaluation of various compression methods. * indicates models using task-specific sizes or speedups; average values are reported in such cases. † represents models that use BERTLARGE as the teacher model. ‡ represents speedup values that we calculated. Empty cells in the speedup columns are for papers that do not describe the detailed architecture of their final compressed model. A marks models compressed in a task-agnostic setup, i.e., requiring access to the pre-training dataset. S indicates models compressed in a task-specific setup. V100 is Nvidia Tesla V100; P100 is Nvidia Tesla P100; K80 is Nvidia Tesla K80; Titan V is Nvidia Titan V; K40 is Nvidia Tesla K40; CPU is Intel Xeon E5; TX2 is Nvidia Jetson TX2; and Pixel is Google Pixel Phone.
Methods | Provenance | Target Device | Model Size (w/ emb) | Model Size (w/o emb) | Speedup (GPU) | Speedup (CPU) | Accuracy/F1 (MNLI) | Accuracy/F1 (QQP) | Accuracy/F1 (SST-2) | Accuracy/F1 (SQD) | Avr. Drop |
---|---|---|---|---|---|---|---|---|---|---|---|
BERTBASE | (Devlin et al., 2019) | – | 100% | 100% | 1x | 1x | 84.6 | 89.2 | 93.5 | 88.5 | 0.0 |
Quantization | (Shen et al., 2020) S | – | 15% | 12.5% | 1x | 1x | 83.9 | – | 92.6 | 88.3 | −0.6 |
Quantization | (Zadeh et al., 2020) S | – | 10.2% | 5.5% | 1x | 1x | 83.7 | – | – | – | −0.9 |
Unstructured Pruning | (Guo et al., 2019) A | – | 67.6% | 58.7% | 1x | 1x | – | – | – | 88.5 | 0.0 |
Unstructured Pruning | (Chen et al., 2020b) S | – | 48.9%* | 35.1%* | 1x | 1x | 83.1 | 89.5 | 92.9 | 87.8 | −0.63 |
Unstructured Pruning | (Sanh et al., 2020) S | – | 23.8% | 3% | 1x | 1x | 79.0 | 89.3 | – | 79.9 | −4.73 |
Structured Pruning | (Lin et al., 2020) S | – | 60.7% | 50% | – | – | – | 88.9 | 91.8 | – | −1.0 |
Structured Pruning | (Khetan and Karnin, 2020) A | – | 39.1% | 38.8% | 2.93x‡ | 2.76x‡ | 83.4 | – | 90.9 | 86.7 | −1.86 |
KD from Output Logits | (Song et al., 2020) A,S | V100 | 22.8% | 10.9% | 6.25x | 7.09x | – | 88.6 | 92.9 | – | −0.6 |
KD from Output Logits | (Liu et al., 2019a)†S | V100 | 24.1% | 3.3% | 10.7x | 8.6x‡ | 78.6 | 88.6 | 91.0 | – | −3.03 |
KD from Output Logits | (Chen et al., 2020a) A,S | V100 | 7.4% | 4.8% | 19.5x* | – | 81.6 | 88.7 | 91.8 | – | −2.06 |
KD from Attn. | (Wang et al., 2020c) A | P100 | 60.7% | 50% | 1.94x | 1.73x | 84.0 | 91.0 | 92.0 | – | −0.1 |
Multiple KD combined | (Sanh et al., 2019) A | CPU | 60.7% | 50% | 1.94x | 1.73x | 82.2 | 88.5 | 91.3 | 86.9 | −1.73 |
Multiple KD combined | (Sun et al., 2020b)†A | Pixel | 23.1% | 24.8% | 3.9x‡ | 4.7x‡ | 83.3 | – | 92.8 | 90.0 | −0.16 |
Multiple KD combined | (Jiao et al., 2020) A,S | K80 | 13.3% | 6.4% | 9.4x | 9.3x‡ | 82.5 | 89.2 | 92.6 | – | −1.0 |
Multiple KD combined | (Zhao et al., 2019b) A | – | 1.6% | 1.8% | 25.5x‡ | 22.7x‡ | 71.3 | – | 82.2 | – | −12.3 |
Matrix Decomposition | (Noach and Goldberg, 2020) S | Titan V | 60.6% | 49.1% | 0.92x | 1.05x | 84.8 | 89.7 | 92.4 | – | −0.13 |
Matrix Decomposition | (Cao et al., 2020) S | V100 | 100% | 100% | 3.14x | 3.55x | 82.6 | 90.3 | – | 87.1 | −0.76 |
Dynamic Inference | (Xin et al., 2020) S | P100 | 100% | 100% | 1.25x | 1.28x‡ | 83.9 | 89.2 | 93.4 | – | −0.26 |
Dynamic Inference | (Goyal et al., 2020) S | K80 | 100% | 100% | 2.5x | 3.1x‡ | 83.8 | – | 92.1 | – | −1.1 |
Param. Sharing | (Lan et al., 2020) A | – | 10.7% | 8.8% | 1.2x‡ | 1.2x‡ | 84.3 | 89.6 | 90.3 | 89.3 | −0.58 |
Pruning with KD | (Mao et al., 2020) S | – | 40.0% | 37.3% | 1x | 1x | 83.5 | 88.9 | 92.8 | – | −0.7 |
Pruning with KD | (Hou et al., 2020) S | K40 | 31.2% | 12.4% | 5.9x‡ | 8.7x‡ | 82.0 | 90.4 | 92.0 | – | −0.96 |
Quantization with KD | (Zadeh et al., 2020) S | CPU | 7.6% | 3.9% | 1.94x | 1.73x | 82.0 | – | – | – | −2.6 |
Quantization with KD | (Sun et al., 2020b)†A | Pixel | 5.7% | 6.1% | 3.9x‡ | 4.7x‡ | 83.3 | – | 92.6 | 90.0 | −0.23 |
Compound | (Tambe et al., 2020) S | TX2 | 1.3% | 0.9% | 1.83x | – | 84.4 | 89.8 | 88.5 | – | −1.53 |
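
The caption does not define the Avr. Drop column. The values appear consistent with the mean difference from the BERTBASE row, taken over only the tasks a given paper reports. The sketch below illustrates that reading; it is inferred from the numbers in the table, not stated by the authors, and the names `BASELINE` and `average_drop` are hypothetical.

```python
from typing import Optional, Sequence

# BERTBASE scores from the table, in column order: MNLI, QQP, SST-2, SQD.
BASELINE = (84.6, 89.2, 93.5, 88.5)


def average_drop(
    scores: Sequence[Optional[float]],
    baseline: Sequence[float] = BASELINE,
) -> float:
    """Mean of (score - baseline) over reported tasks.

    Assumption: tasks shown as an en dash in the table are passed as None
    and skipped rather than counted as zero.
    """
    diffs = [s - b for s, b in zip(scores, baseline) if s is not None]
    if not diffs:
        raise ValueError("no reported task scores to average")
    return sum(diffs) / len(diffs)


# (Shen et al., 2020): MNLI 83.9, QQP not reported, SST-2 92.6, SQD 88.3.
shen = average_drop((83.9, None, 92.6, 88.3))
print(f"{shen:.2f}")
```

Under this reading, the Shen et al. row gives (−0.7 − 0.9 − 0.2) / 3 = −0.6, which agrees with the value in the table. Rows with fewer reported tasks are averaged over fewer terms, so drops are not strictly comparable across rows.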