Overall QA performance (%) in the NarrativeQA Book QA setting. Oracle IR uses the concatenation of the question and the gold answers as the BM25 retrieval query. An asterisk (*) marks the best result reported by Kočiský et al. (2018) across multiple hyper-parameter settings on the dev set. A dagger (†) indicates statistical significance with p-value < 0.01.
| System | ROUGE-L (dev) | ROUGE-L (test) |
| --- | --- | --- |
| Public Extractive Baselines | | |
| BiDAF (Kočiský et al., 2018) | 6.33 | 6.22 |
| R3 (Wang et al., 2018a) | 11.40 | 11.90 |
| DS-ranker + BERT (Mou et al., 2020) | 14.76 | 15.49 |
| BERT-heur (Frermann, 2019) | – | 15.15 |
| ReadTwice (Zemlyanskiy et al., 2021) | 22.7 | 23.3 |
| Public Generative Baselines | | |
| Seq2Seq (Kočiský et al., 2018) | 13.29 | 13.15 |
| AttSum* (Kočiský et al., 2018) | 14.86 | 14.02 |
| IAL-CPG (Tay et al., 2019) | 17.33 | 17.67 |
| DS-Ranker + GPT2 (Mou et al., 2020) | 21.89 | 22.36 |
| Our Book QA Systems | | |
| BART-no-context (baseline) | 16.86 | 16.83 |
| BM25 + BART reader (baseline) | 23.16 | 24.47 |
| Our best ranker + BART reader | 25.83 | 26.95† |
| Our best ranker + our best reader | 27.91 | 29.21† |
| Our best reader + oracle IR (replacing the ranker) | 37.75 | 39.32 |
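To make the oracle IR condition concrete, the sketch below shows one plausible way to reproduce it with the `rank_bm25` package: the BM25 query is the question concatenated with its gold answers, and the top-ranked passages are handed to the reader. The function names, tokenization, and `top_k` value are illustrative assumptions, not the paper's actual implementation.

```python
# Hedged sketch of the "oracle IR" setting: BM25 retrieval over book passages
# where the query concatenates the question with its reference answer(s).
# Assumes the rank_bm25 package; preprocessing in the paper may differ.
from rank_bm25 import BM25Okapi


def tokenize(text: str) -> list[str]:
    # Simple lowercase whitespace tokenization (an assumption for illustration).
    return text.lower().split()


def oracle_retrieve(question: str, gold_answers: list[str],
                    passages: list[str], top_k: int = 5) -> list[str]:
    """Return the top-k passages under BM25 when the gold answers are
    appended to the question (the oracle IR row in the table)."""
    bm25 = BM25Okapi([tokenize(p) for p in passages])
    oracle_query = tokenize(" ".join([question] + gold_answers))
    return bm25.get_top_n(oracle_query, passages, n=top_k)


# Hypothetical usage: the retrieved passages would then be concatenated
# and fed to the generative reader (e.g., BART) as context.
# context = oracle_retrieve("Who raises the orphan?", ["His aunt"], passages)
```

Because the gold answers leak into the retrieval query, this condition upper-bounds what a better ranker alone could contribute, which is why the oracle row sits well above the fully learned pipeline.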