Results of BERT-based experiments on GLUE test sets, scored by the GLUE evaluation server (https://gluebenchmark.com/leaderboard). Models evaluated on AX are trained on the MNLI training set.
Task (Metric) | BERT | TAPT | SSL-Reg (SATP) | SSL-Reg (MTP)
---|---|---|---|---|
CoLA (Matthews Corr.) | 60.5 | 61.3 | 63.0 | 61.2 |
SST-2 (Accuracy) | 94.9 | 94.4 | 95.1 | 95.2 |
RTE (Accuracy) | 70.1 | 70.3 | 71.2 | 72.7 |
QNLI (Accuracy) | 92.7 | 92.4 | 92.5 | 93.2 |
MRPC (Accuracy/F1) | 85.4/89.3 | 85.9/89.5 | 85.3/89.3 | 86.1/89.8 |
MNLI-m/mm (Accuracy) | 86.7/85.9 | 85.7/84.4 | 86.2/85.4 | 86.6/86.1 |
QQP (Accuracy/F1) | 89.3/72.1 | 89.3/71.6 | 89.6/72.2 | 89.7/72.5 |
STS-B (Pearson Corr./Spearman Corr.) | 87.6/86.5 | 88.4/87.3 | 88.3/87.5 | 88.1/87.2 |
WNLI (Accuracy) | 65.1 | 65.8 | 65.8 | 66.4 |
AX (Matthews Corr.) | 39.6 | 39.3 | 40.2 | 40.3 |
Average | 80.5 | 80.6 | 81.0 | 81.3 |