Table 4: Macro F1-score test performance of the models and an ensemble (Ens.) (§5.3) trained on the supervised training splits of each dataset (Supervised), and additionally with the contrastive objective (+CL) (§5.1) and the counterfactually augmented data (+CAD) (§5.2). Results are averaged over three runs with different seeds. The highest result for each test dataset and model is in bold, and the overall highest result of a model for a test dataset is additionally underlined.

| Dataset | Model | Veracity Pred. / Orig.Test | | | | Evidence Sufficiency Pred. / Suff.Facts | | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | | BERT | RoBERTa | ALBERT | Ens. | BERT | RoBERTa | ALBERT | Ens. |
| FEVER | Supervised | 87.16 | 88.69 | 86.67 | 88.81 | 59.51 | 59.10 | 63.00 | 61.36 |
| | + CL | 87.62 | 88.81 | 86.62 | 89.02 | 65.79 | 67.98 | 70.83 | 69.90 |
| | + CAD | 87.86 | 89.23 | 87.31 | 89.14 | 67.18 | 69.58 | 68.56 | 69.25 |
| HoVer | Supervised | 80.75 | 83.37 | 76.88 | 82.73 | 58.15 | 64.81 | 66.28 | 65.88 |
| | + CL | 81.82 | 83.38 | 77.62 | 83.08 | 74.91 | 75.41 | 72.83 | 78.05 |
| | + CAD | 81.87 | 83.65 | 79.44 | 83.65 | 74.98 | 77.14 | 76.12 | 79.07 |
| VitaminC | Supervised | 82.26 | 84.98 | 83.38 | 86.01 | 58.51 | 69.07 | 66.57 | 66.76 |
| | + CL | 83.00 | 85.54 | 83.48 | 86.22 | 62.34 | 72.18 | 68.13 | 70.42 |
| | + CAD | 83.56 | 85.65 | 83.82 | 86.14 | 72.93 | 75.79 | 75.13 | 78.60 |
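
For readers who want to reproduce the kind of score reported in Table 4, the following is a minimal sketch of macro F1 averaged over several seed runs. It is not the authors' evaluation code: the function names `macro_f1` and `mean_macro_f1`, the toy label strings, and the toy predictions are all assumptions made for illustration, and macro F1 is taken here as the unweighted mean of per-class F1 over the classes present in the gold labels or the predictions.

```python
from statistics import mean


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return mean(scores)


def mean_macro_f1(y_true, preds_per_seed):
    """Average macro F1 over the predictions of several seed runs (hypothetical helper)."""
    return mean(macro_f1(y_true, preds) for preds in preds_per_seed)


# Toy data only: label names and predictions are illustrative assumptions.
gold = ["SUPPORTS", "REFUTES", "NEI", "SUPPORTS"]
runs = [
    ["SUPPORTS", "REFUTES", "NEI", "SUPPORTS"],       # seed 1
    ["SUPPORTS", "REFUTES", "SUPPORTS", "SUPPORTS"],  # seed 2
    ["REFUTES", "REFUTES", "NEI", "SUPPORTS"],        # seed 3
]

# Scaled to 0-100 to match the scale used in the table.
score = 100 * mean_macro_f1(gold, runs)
print(f"{score:.2f}")
```

Because every class contributes equally to the average, a class that a model never predicts correctly pulls the score down regardless of how rare that class is.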