Results of BERT-based experiments on GLUE development sets. Results on MNLI and QQP are the median of five runs; results on the other datasets are the median of nine runs. We used fewer runs for MNLI and QQP because these datasets are very large and take a long time to train on. Because our re-implementation of BERT uses a different optimization method, our median performance differs from that reported in Lan et al. (2019).
Method | CoLA (Matthews Corr.) | SST-2 (Accuracy) | RTE (Accuracy) | QNLI (Accuracy) | MRPC (Accuracy/F1) |
---|---|---|---|---|---|
The median result | | | | | |
BERT, Lan et al., 2019 | 60.6 | 93.2 | 70.4 | 92.3 | 88.0/– |
BERT, our run | 62.1 | 93.1 | 74.0 | 92.1 | 86.8/90.8 |
TAPT | 61.2 | 93.1 | 74.0 | 92.0 | 85.3/89.8 |
SSL-Reg (SATP) | 63.7 | 93.9 | 74.7 | 92.3 | 86.5/90.3 |
SSL-Reg (MTP) | 63.8 | 93.8 | 74.7 | 92.6 | 87.3/90.9 |
The best result | | | | | |
BERT, our run | 63.9 | 93.3 | 75.8 | 92.5 | 89.5/92.6 |
TAPT | 62.0 | 93.9 | 76.2 | 92.4 | 86.5/90.7 |
SSL-Reg (SATP) | 65.3 | 94.6 | 78.0 | 92.8 | 88.5/91.9 |
SSL-Reg (MTP) | 66.3 | 94.7 | 78.0 | 93.1 | 89.5/92.4 |
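
The median and best entries above are per-task aggregates over repeated runs. A minimal sketch of that aggregation is given below, assuming each run yields a single development-set score per task; the function name `summarize_runs` and the scores are hypothetical placeholders for illustration, not values or code from the experiments.

```python
from statistics import median


def summarize_runs(scores):
    """Return (median, best) of one task's dev-set scores across runs."""
    if not scores:
        raise ValueError("need at least one run")
    return median(scores), max(scores)


# Hypothetical scores from nine runs of one task (placeholders only,
# not results from the experiments reported in the table).
hypothetical_runs = [70.0, 71.5, 69.0, 72.0, 70.5, 71.0, 68.5, 73.0, 70.8]

med, best = summarize_runs(hypothetical_runs)
print(f"median={med:.1f}  best={best:.1f}")
```

With an odd number of runs (five or nine, as here), the median is always one of the observed scores, so no interpolation between runs is involved.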