Table 2: Performance of Summary Inconsistency Detection models on the test set of the SummaC Benchmark. Balanced accuracy is computed for each model on the six datasets in the benchmark, and the average is computed as the overall performance on the benchmark. We obtain confidence intervals comparing the SummaC models to prior work: * indicates an improvement with 95% confidence, and ** 99% confidence (details in Section 5.2.1). The results of the throughput analysis of Section 5.2.2 are in column Doc./min (Documents per minute).

| Model Type | Model Name   | CGS  | XSF  | Polytope | FactCC  | SummEval | FRANK | Overall | Doc./min |
|------------|--------------|------|------|----------|---------|----------|-------|---------|----------|
| Baseline   | NER-Overlap  | 53.0 | 63.3 | 52.0     | 55.0    | 56.8     | 60.9  | 56.8    | 55,900   |
| Baseline   | MNLI-doc     | 57.6 | 57.5 | 61.0     | 61.3    | 66.6     | 63.6  | 61.3    | 6,200    |
| Classifier | FactCC-CLS   | 63.1 | 57.6 | 61.0     | 75.9    | 60.1     | 59.4  | 62.8    | 13,900   |
| Parsing    | DAE          | 63.4 | 50.8 | 62.8     | 75.9    | 70.3     | 61.7  | 64.2    | 755      |
| QAG        | FEQA         | 61.0 | 56.0 | 57.8     | 53.6    | 53.8     | 69.9  | 58.7    | 33.9     |
| QAG        | QuestEval    | 62.6 | 62.1 | 70.3     | 66.6    | 72.5     | 82.1  | 69.4    | 22.7     |
| NLI        | SummaC-ZS    | 70.4 | 58.4 | 62.0     | 83.8*   | 78.7     | 79.0  | 72.1*   | 435      |
| NLI        | SummaC-Conv  | 64.7 | 66.4 | 62.7     | 89.5**  | 81.7**   | 81.6  | 74.4**  | 433      |
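As the caption notes, each per-dataset score is balanced accuracy (the mean of per-class recall, which is robust to the label imbalance across the benchmark's datasets), and the Overall column is the unweighted average of the six dataset scores. A minimal sketch of both computations, using illustrative inputs rather than the paper's data:

```python
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, computed over the classes present in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)

# Toy labels: 1 = consistent summary, 0 = inconsistent.
y_true = [1, 1, 1, 0]
y_pred = [1, 1, 0, 0]
print(balanced_accuracy(y_true, y_pred))  # recall: 2/3 for class 1, 1/1 for class 0

# Overall score: unweighted mean of the six per-dataset balanced accuracies
# (illustrative values, not the benchmark's numbers).
per_dataset = [0.70, 0.58, 0.62, 0.84, 0.79, 0.79]
overall = sum(per_dataset) / len(per_dataset)
```

Because balanced accuracy weights each class equally, a degenerate model that predicts only the majority class scores 50%, which makes scores comparable across datasets with very different consistent/inconsistent ratios.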