Table A2: Performance of Summary Inconsistency Detection models on the test portion of the SummaC Benchmark, in terms of ROC-AUC. The metric is computed for each model on each of the six datasets in the benchmark, and the average across datasets is reported as the overall benchmark performance. Confidence intervals compare the SummaC models to prior work: * indicates an improvement with 95% confidence, ** with 99% confidence (details in Section 5.2.1).

| Model Type | Model Name | CGS | XSF | Polytope | FactCC | SummEval | FRANK | Overall |
|---|---|---|---|---|---|---|---|---|
| Baseline | NER-Overlap | 53.0 | 61.7 | 51.6 | 53.1 | 56.8 | 60.9 | 56.2 |
| Baseline | MNLI-doc | 59.4 | 59.4 | 62.6 | 62.1 | 70.0 | 67.2 | 63.4 |
| Classifier | FactCC-CLS | 65.0 | 59.2 | 63.5 | 79.6 | 61.4 | 62.7 | 65.2 |
| Parsing | DAE | 67.8 | 41.3 | 64.1 | 82.7 | 77.4 | 64.3 | 66.3 |
| QAG | FEQA | 60.8 | 53.4 | 54.6 | 50.7 | 52.2 | 74.8 | 57.7 |
| QAG | QuestEval | 64.4 | 66.4 | 72.2 | 71.5 | 79.0 | 87.9 | 73.6 |
| NLI | SummaC-ZS | 73.1 | 58.0 | 60.3 | 83.7 | 85.5 | 85.3 | 74.3 |
| NLI | SummaC-Conv | 67.6 | 70.2 | 62.4 | 92.2** | 86.0* | 88.4 | 77.8** |
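
As a rough illustration of the aggregation described in the caption, the sketch below computes a ROC-AUC score per benchmark dataset and averages the six values to obtain the "Overall" column. This is not the authors' evaluation code; the `predictions` input and its field names are hypothetical.

```python
# Minimal sketch of the table's aggregation: per-dataset ROC-AUC, then the
# mean across datasets as the overall benchmark score.
from statistics import mean
from sklearn.metrics import roc_auc_score


def benchmark_roc_auc(predictions):
    """predictions maps dataset name -> (labels, scores), where labels are
    1 for consistent / 0 for inconsistent summaries and scores are the
    model's consistency scores (hypothetical input format)."""
    per_dataset = {
        name: roc_auc_score(labels, scores)
        for name, (labels, scores) in predictions.items()
    }
    overall = mean(per_dataset.values())
    return per_dataset, overall


# Usage with the six SummaC Benchmark datasets (toy placeholder arrays):
# per_dataset, overall = benchmark_roc_auc({
#     "CGS": (cgs_labels, cgs_scores),
#     "XSF": (xsf_labels, xsf_scores),
#     "Polytope": (polytope_labels, polytope_scores),
#     "FactCC": (factcc_labels, factcc_scores),
#     "SummEval": (summeval_labels, summeval_scores),
#     "FRANK": (frank_labels, frank_scores),
# })
```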