Label accuracy of models on the FEVER development set (FEVER-DEV) and Symmetric FEVER, with and without fine-tuning (FT). Results marked with * and # are statistically significant (unpaired t-test, p < 0.05) against their FT and Original variants, respectively. FEVER-DEV predictions use gold-standard evidence.
Model | FEVER-DEV: Original | FEVER-DEV: FT | FEVER-DEV: FT+L2 | Symmetric FEVER: Original | Symmetric FEVER: FT | Symmetric FEVER: FT+L2
---|---|---|---|---|---|---
ProoFVer | 89.07±0.3 | 86.41±0.8 | 87.95±1.0* | 81.70±0.4 | 85.88±1.3# | 83.37±1.3#*
KGAT | 86.02±0.2 | 76.67±0.3 | 79.93±0.9* | 65.73±0.3 | 84.94±1.1# | 73.34±1.5#*
CorefBERT | 88.26±0.4 | 78.79±0.2 | 84.22±1.5* | 68.49±0.6 | 85.45±0.2# | 77.37±0.5#*