Table 1: Overall QA performance (%) in NarrativeQA Book QA setting. Oracle IR combines question and true answers for BM25 retrieval. We use an asterisk (*) to indicate the best results reported in (Kočiskỳ et al., 2018) with multiple hyper-parameters on dev set. The dagger (†) indicates significance with p-value < 0.01.

System                                    ROUGE-L dev   ROUGE-L test

Public Extractive Baselines
BiDAF (Kočiskỳ et al., 2018)              6.33          6.22
R3 (Wang et al., 2018a)                   11.40         11.90
DS-ranker + BERT (Mou et al., 2020)       14.76         15.49
BERT-heur (Frermann, 2019)                –             15.15
ReadTwice (Zemlyanskiy et al., 2021)      22.7          23.3

Public Generative Baselines
Seq2Seq (Kočiskỳ et al., 2018)            13.29         13.15
AttSum* (Kočiskỳ et al., 2018)            14.86         14.02
IAL-CPG (Tay et al., 2019)                17.33         17.67
DS-Ranker + GPT2 (Mou et al., 2020)       21.89         22.36

Our Book QA Systems
BART-no-context (baseline)                16.86         16.83
BM25 + BART reader (baseline)             23.16         24.47
Our best ranker + BART reader             25.83         26.95
Our best ranker + our best reader         27.91         29.21
repl ranker with oracle IR                37.75         39.32
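
The caption describes Oracle IR as combining the question with the true answers to form the BM25 query. A minimal sketch of that idea follows, assuming a plain Okapi BM25 scorer and whitespace tokenization. The function names (`tokenize`, `bm25_scores`, `oracle_query`), the parameter values `k1=1.5` and `b=0.75`, and the toy passages are illustrative assumptions, not the authors' implementation or data.

```python
import math
from collections import Counter

PUNCT = ".,;:!?\"'()"


def tokenize(text):
    # Simplified tokenizer (assumption): whitespace split, strip punctuation, lowercase.
    tokens = (t.strip(PUNCT).lower() for t in text.split())
    return [t for t in tokens if t]


def bm25_scores(query_tokens, passages, k1=1.5, b=0.75):
    """Score every passage against the query with Okapi BM25."""
    docs = [tokenize(p) for p in passages]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of passages containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


def oracle_query(question, answers):
    # Oracle setting: the query is the question concatenated with the reference answers.
    return tokenize(question + " " + " ".join(answers))


# Toy data, invented for illustration only.
passages = [
    "Holmes examined the letter by the window.",
    "The stolen jewel was hidden inside a goose.",
    "Watson returned to Baker Street late that night.",
]
question = "Where did Holmes find it?"
answers = ["inside a goose", "in the goose"]

plain = bm25_scores(tokenize(question), passages)
oracle = bm25_scores(oracle_query(question, answers), passages)
best_plain = max(range(len(passages)), key=plain.__getitem__)
best_oracle = max(range(len(passages)), key=oracle.__getitem__)
print("question-only top passage:", passages[best_plain])
print("oracle top passage:", passages[best_oracle])
```

Because the oracle query contains answer tokens, it can surface passages that share little vocabulary with the question alone. This is why the oracle row is best read as an upper bound on the ranker rather than as a deployable system.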