The QA model's performance on the summarization datasets drops significantly compared to its performance on SQuAD, especially for TAC’08. This is expected due to the domain shift; however, we suspect the drop is smaller for CNN/DailyMail because the generated and reference summaries are far more similar than they are for TAC, which makes the questions easier to answer.
Dataset | %IsAns | IsAns-F1 | EM (given IsAns) | F1 (given IsAns) | Acc (given IsAns)
---|---|---|---|---|---
SQuAD 2.0 | 50.0% | 92.0 | 88.0 | 94.5 | –
TAC’08 | 14.2% | 52.4 | 56.5 | 69.5 | 84.3
CNN/DM | 36.3% | 75.3 | 73.8 | 83.6 | 86.3
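
For readers unfamiliar with the EM and F1 columns, the following is a minimal sketch of how such answer-level scores are commonly computed, assuming the standard SQuAD-style definitions (normalized exact match and token-overlap F1). The text above does not specify the exact scoring code, so the function names `normalize`, `exact_match`, and `token_f1` are illustrative assumptions, not the authors' implementation.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, split on whitespace.

    Assumption: this mirrors the usual SQuAD-style answer normalization.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction, reference):
    """1.0 if the normalized answers are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction, reference):
    """Harmonic mean of token-level precision and recall."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Two empty answers agree (the unanswerable case); otherwise it is a miss.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Under these definitions, a prediction that contains the reference answer plus extra tokens gets EM of 0 but partial F1 credit, which is consistent with the F1 column being higher than the EM column in every row of the table.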