Performance of Summary Inconsistency Detection models on the test portion of the SummaC Benchmark, measured by the ROC-AUC metric. The metric is computed for each model on each of the six datasets in the benchmark, and the average of the six scores is reported as overall performance. Significance is assessed with confidence intervals comparing the SummaC models to prior work: * indicates an improvement with 95% confidence, ** with 99% confidence (details in Section 5.2.1).
| Model Type | Model Name | CGS | XSF | Polytope | FactCC | SummEval | FRANK | Overall |
|---|---|---|---|---|---|---|---|---|
| Baseline | NER-Overlap | 53.0 | 61.7 | 51.6 | 53.1 | 56.8 | 60.9 | 56.2 |
| Baseline | MNLI-doc | 59.4 | 59.4 | 62.6 | 62.1 | 70.0 | 67.2 | 63.4 |
| Classifier | FactCC-CLS | 65.0 | 59.2 | 63.5 | 79.6 | 61.4 | 62.7 | 65.2 |
| Parsing | DAE | 67.8 | 41.3 | 64.1 | 82.7 | 77.4 | 64.3 | 66.3 |
| QAG | FEQA | 60.8 | 53.4 | 54.6 | 50.7 | 52.2 | 74.8 | 57.7 |
| QAG | QuestEval | 64.4 | 66.4 | 72.2 | 71.5 | 79.0 | 87.9 | 73.6 |
| NLI | SummaCZS | 73.1 | 58.0 | 60.3 | 83.7 | 85.5 | 85.3 | 74.3 |
| NLI | SummaCConv | 67.6 | 70.2 | 62.4 | 92.2** | 86.0* | 88.4 | 77.8** |
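The overall column is the unweighted mean of the six per-dataset ROC-AUC scores. A minimal sketch of that computation, assuming binary consistency labels and real-valued model scores per dataset (the arrays and the `benchmark` dict here are illustrative placeholders, not actual benchmark data):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical per-dataset predictions: each entry maps a benchmark dataset
# name to (gold_labels, model_scores), where gold_labels[i] is 1 if summary i
# is consistent with its source document and model_scores[i] is the model's
# consistency score for that (document, summary) pair.
benchmark = {
    "CGS": (np.array([1, 0, 1, 0]), np.array([0.9, 0.2, 0.7, 0.4])),
    "XSF": (np.array([0, 1, 1, 0]), np.array([0.3, 0.8, 0.6, 0.1])),
    # ... Polytope, FactCC, SummEval, and FRANK would follow the same shape.
}

# ROC-AUC per dataset, then the unweighted mean as the overall benchmark score.
per_dataset = {name: roc_auc_score(y, s) for name, (y, s) in benchmark.items()}
overall = np.mean(list(per_dataset.values()))

for name, auc in per_dataset.items():
    print(f"{name}: {auc * 100:.1f}")
print(f"Overall: {overall * 100:.1f}")
```

Because ROC-AUC is threshold-free, this comparison does not require calibrating a decision cutoff for each model's raw consistency scores.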