Performance of different LMs on the MC-test dataset. “Original” is the off-the-shelf language model; “+ UnifiedQA” is the model fine-tuned following the UnifiedQA recipe. ACC is accuracy and ECE is expected calibration error (lower is better).
| Method | BART ACC | BART ECE | GPT-2 large ACC | GPT-2 large ECE |
|---|---|---|---|---|
| Original | 0.295 | 0.225 | 0.272 | 0.244 |
| + UnifiedQA | 0.662 | 0.166 | 0.414 | 0.243 |
| + softmax | 0.658 | 0.097 | 0.434 | 0.177 |
| + margin | 0.632 | 0.090 | 0.450 | 0.123 |
| + Temp. | 0.632 | 0.064 | 0.450 | 0.067 |
| + XGB | 0.624 | 0.090 | 0.440 | 0.080 |
| + Para. | 0.624 | 0.084 | 0.436 | 0.104 |
| + Aug. | 0.600 | 0.089 | 0.441 | 0.126 |
| + Combo | 0.591 | 0.065 | 0.429 | 0.069 |
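For readers unfamiliar with the ECE column, the sketch below computes expected calibration error with equal-width confidence bins over the model's chosen answer. The 10-bin setting and the helper name are illustrative assumptions, not details taken from the table above.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins (n_bins=10 is an assumed setting).

    confidences: confidence assigned to the model's chosen answer, in [0, 1].
    correct: boolean array, True where the chosen answer was right.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(confidences)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # first bin includes its left edge; later bins are half-open (lo, hi]
        if lo == 0.0:
            mask = (confidences >= lo) & (confidences <= hi)
        else:
            mask = (confidences > lo) & (confidences <= hi)
        if mask.sum() == 0:
            continue
        # weight each bin's |accuracy - confidence| gap by its share of examples
        ece += (mask.sum() / n) * abs(correct[mask].mean() - confidences[mask].mean())
    return ece
```

Each ECE cell in the table would then correspond to running this kind of computation over that method's confidences on the evaluation set.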
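The “+ Temp.” row refers to temperature scaling: a single scalar T rescales the answer logits before the softmax, fit on held-out data. Below is a minimal sketch that learns T by minimizing negative log-likelihood; the bounded scipy optimizer and the search bounds are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def fit_temperature(logits, labels):
    """Fit a scalar temperature T on held-out data by minimizing NLL.

    logits: (n_examples, n_choices) raw scores for each answer option.
    labels: (n_examples,) index of the correct option.
    """
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels)

    def nll(t):
        scaled = logits / t
        # log-softmax with max-subtraction for numerical stability
        scaled = scaled - scaled.max(axis=1, keepdims=True)
        log_probs = scaled - np.log(np.exp(scaled).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(labels)), labels].mean()

    # bounds are an illustrative assumption
    res = minimize_scalar(nll, bounds=(0.05, 10.0), method="bounded")
    return res.x
```

Because dividing all logits by the same T preserves the argmax, accuracy is unchanged, which is consistent with “+ Temp.” matching “+ margin” on ACC while improving ECE.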