Table 5: Label accuracy of models on the FEVER development set (FEVER-DEV) and Symmetric FEVER, with and without fine-tuning (FT). Results marked with * and # are statistically significant (unpaired t-test, p < 0.05) against their FT and Original variants, respectively. FEVER-DEV predictions use gold-standard evidence.

| Model | FEVER-DEV Original | FEVER-DEV FT | FEVER-DEV FT+L2 | Symmetric FEVER Original | Symmetric FEVER FT | Symmetric FEVER FT+L2 |
|---|---|---|---|---|---|---|
| ProoFVer | 89.07±0.3 | 86.41±0.8 | 87.95±1.0* | 81.70±0.4 | 85.88±1.3# | 83.37±1.3#* |
| KGAT | 86.02±0.2 | 76.67±0.3 | 79.93±0.9* | 65.73±0.3 | 84.94±1.1# | 73.34±1.5#* |
| CorefBERT | 88.26±0.4 | 78.79±0.2 | 84.22±1.5* | 68.49±0.6 | 85.45±0.2# | 77.37±0.5#* |