Statistics of the six datasets in the SummaC Benchmark. For each dataset, we report the validation and test set sizes, the percentage of summaries with positive (consistent) labels (% Positive), the inter-annotator agreement (IAA, when available), the source of the documents (Source: C for CNN/DM, X for XSum), the number of summarizers evaluated, and the number of sublabels annotated.
Dataset | Valid. Size | Test Size | % Positive | IAA | Source | # Summarizer | # Sublabel |
---|---|---|---|---|---|---|---|
CoGenSumm (Falke et al., 2019) | 1281 | 400 | 49.8 | 0.65 | C | 3 | 0 |
XSumFaith (Maynez et al., 2020) | 1250 | 1250 | 10.2 | 0.80 | X | 5 | 2 |
Polytope (Huang et al., 2020) | 634 | 634 | 6.6 | − | C | 10 | 8 |
FactCC (Kryscinski et al., 2020) | 931 | 503 | 85.0 | − | C | 10 | 0 |
SummEval (Fabbri et al., 2021) | 850 | 850 | 90.6 | 0.7 | C | 23 | 4 |
FRANK (Pagnoni et al., 2021) | 671 | 1575 | 33.2 | 0.53 | C+X | 9 | 7 |