PAQ: 65 Million Probably-Asked Questions and What You Can Do With Them

Open-domain Question Answering models which directly leverage question-answer (QA) pairs, such as closed-book QA (CBQA) models and QA-pair retrievers, show promise in terms of speed and memory compared to conventional models which retrieve and read from text corpora. QA-pair retrievers also offer interpretable answers, a high degree of control, and are trivial to update at test time with new knowledge. However, these models lack the accuracy of retrieve-and-read systems, as substantially less knowledge is covered by the available QA-pairs relative to text corpora like Wikipedia. To facilitate improved QA-pair models, we introduce Probably Asked Questions (PAQ), a very large resource of 65M automatically-generated QA-pairs. We introduce a new QA-pair retriever, RePAQ, to complement PAQ. We find that PAQ preempts and caches test questions, enabling RePAQ to match the accuracy of recent retrieve-and-read models, whilst being significantly faster. Using PAQ, we train CBQA models which outperform comparable baselines by 5%, but trail RePAQ by over 15%, indicating the effectiveness of explicit retrieval. RePAQ can be configured for size (under 500MB) or speed (over 1K questions per second) whilst retaining high accuracy. Lastly, we demonstrate RePAQ's strength at selective QA, abstaining from answering when it is likely to be incorrect. This enables RePAQ to "back-off" to a more expensive state-of-the-art model, leading to a combined system which is both more accurate and 2x faster than the state-of-the-art model alone.


Introduction
Open-domain QA (ODQA) systems usually have access to a background corpus that can be used to answer questions. Models which explicitly exploit this corpus are commonly referred to as open-book models. They typically index the whole corpus, and then retrieve-and-read documents in order to answer questions on-the-fly (Chen et al., 2017; Lee et al., 2019a, inter alia).
A second class of models, closed-book question answering (CBQA) models, have recently been proposed. They learn to directly map questions to answers from training question-answer (QA) pairs, without access to a background corpus (Ye et al., 2021). These models usually take the form of pretrained seq2seq models such as T5 or BART, fine-tuned on QA-pairs. It has recently been shown that current closed-book models mostly memorise training QA-pairs, and can struggle to answer questions that do not overlap with training data.
Models which explicitly retrieve (training) QA-pairs, rather than memorizing them in parameters, have been shown to perform competitively with CBQA models (Xiao et al., 2020). These models have a number of useful properties, such as fast inference, interpretable outputs (by inspecting retrieved QA-pairs), and the ability to update the model's knowledge at test time by adding or removing QA-pairs. However, CBQA and QA-pair retriever models are currently not competitive with retrieve-and-read systems in terms of accuracy, largely because the training QA-pairs they operate on cover substantially less knowledge than background corpora like Wikipedia. In this paper, we explore whether massively expanding the coverage of QA-pairs enables CBQA and QA-pair retriever models which are competitive with retrieve-and-read models.
We present Probably Asked Questions (PAQ), a semi-structured Knowledge Base (KB) of 65M natural language QA-pairs, which models can memorise and/or learn to retrieve from. PAQ differs from traditional KBs in that questions and answers are stored in natural language, and that questions are generated such that they are likely to appear in ODQA datasets. PAQ is automatically constructed using a question generation model and Wikipedia. To ensure that generated questions are answerable not only given the passage they are generated from, we employ a global filtering post-processing step using a state-of-the-art ODQA system. This greatly reduces the number of wrong and ambiguous questions compared to other approaches (Fang et al., 2020; Alberti et al., 2019), and is critical for high-accuracy downstream QA models.
To complement PAQ we develop RePAQ, a question answering model based on question retrieval/matching models, using dense Maximum Inner Product Search-based retrieval and, optionally, re-ranking. We show that PAQ and RePAQ provide accurate ODQA predictions, at the level of relatively recent large-scale retrieve-and-read systems such as RAG on NaturalQuestions (Kwiatkowski et al., 2019a) and TriviaQA (Joshi et al., 2017). PAQ instances are annotated with scores that reflect how likely we expect questions to appear, which can be used to control the memory footprint of RePAQ by filtering the KB accordingly. As a result, RePAQ is extremely flexible, allowing us to configure QA systems with near state-of-the-art results, very small memory size, or inference speeds of over 1,000 questions per second. Memory-optimised configurations of RePAQ won two of the four tracks of the 2020 EfficientQA NeurIPS competition (Min et al., 2020a), with system sizes of 336MB and 29MB, respectively.
We also show that PAQ is a useful source of training data for CBQA models. BART models trained on PAQ outperform baselines trained on standard data by 5%. However, these models struggle to effectively memorise all the knowledge in PAQ, lagging behind RePAQ by 15%. This demonstrates the effectiveness of RePAQ at leveraging PAQ.
Finally, we show that since RePAQ's question matching score correlates well with QA accuracy, it effectively "knows when it doesn't know", allowing for selective question answering (Rodriguez et al., 2019) where QA systems may abstain from answering if confidence is too low. Whilst answer abstaining is important in its own right, it also enables an elegant "back-off" approach where we can defer to a more accurate but expensive QA system when answer confidence is low. This enables us to make use of the best of both speed and accuracy.
In summary, we make the following contributions: i) we introduce PAQ, 65M QA-pairs automatically generated from Wikipedia, and demonstrate the importance of global filtering for quality; ii) we introduce RePAQ, a QA system designed to utilize PAQ, and demonstrate how it can be optimised for memory, speed or accuracy; iii) we investigate the utility of PAQ for CBQA models, improving accuracy by 5%, but note significant headroom to RePAQ; iv) we demonstrate RePAQ's strength on selective QA, enabling us to combine RePAQ with a state-of-the-art QA model, making it both more accurate and 2x faster.

Open-Domain Question Answering

ODQA is the task of answering natural language factoid questions from an open set of domains. A typical question might be "when was the last year astronauts landed on the moon?", with a target answer "1972". The goal of ODQA is to develop an answer function m : Q → A, where Q and A respectively are the sets of all possible questions and answers. We assume there is a distribution P(q, a) of QA-pairs, defined over Q × A. A good answer function will minimise the expected error over P(q, a) with respect to some loss function, such as answer string match. In practice, we do not have access to P(q, a), and instead rely on an empirical sample of QA-pairs K drawn from P, and measure the empirical loss of answer functions on K. Our goal in this work is to implicitly model P(q, a) so that we can draw a large sample of QA-pairs, PAQ, which we can train on and/or retrieve from. Drawing a sufficiently large sample will overlap with K, essentially pre-empting and caching questions that humans may ask at test-time. This allows us to shift computation from test-time to train-time compared to retrieve-and-read methods.
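The evaluation setup above can be sketched concretely. This is a minimal illustration, assuming a toy answer function and a string-match loss; `normalize` is a stand-in for standard Exact Match answer normalization, not the paper's exact implementation.

```python
def normalize(ans: str) -> str:
    # minimal stand-in for standard Exact Match answer normalization
    return ans.strip().lower()

def exact_match_accuracy(answer_fn, qa_sample):
    """Empirical accuracy of an answer function m on a sample K from P(q, a)."""
    correct = sum(normalize(answer_fn(q)) == normalize(a) for q, a in qa_sample)
    return correct / len(qa_sample)

# A cached KB of "probably asked" questions acts as a trivial answer function:
# if a test question was pre-empted and stored, answering reduces to a lookup.
kb = {"when was the last year astronauts landed on the moon?": "1972"}
answer_fn = lambda q: kb.get(normalize(q), "")
```

Under this framing, growing the sample of cached QA-pairs directly increases the chance that a test question has already been pre-empted.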

Generating Question-Answer Pairs
In this section, we describe the process for generating PAQ. Given a large background corpus C, our QA-pair generation process consists of the following components:

1. A passage selection model p_s(c), to identify passages which humans are likely to ask questions about.

2. An answer extraction model p_a(a | c), for identifying spans in a passage that are more likely to be answers to a question.

3. A question generator model p_q(q | a, c) that, given a passage and an answer, generates a question.

4. A filtering QA model p_f(a | q, C) that generates an answer for a given question. If an answer generated by p_f does not match the answer the question was generated from, the question is discarded. This ensures generated questions are consistent (Alberti et al., 2019).
As shown in Fig. 1, these models are applied sequentially to generate QA-pairs for PAQ, in a similar manner to related work in contextual QA generation (Alberti et al., 2019). First, a passage c is selected with high probability according to p_s. Next, candidate answers a are extracted from c using p_a, and questions q are generated for each answer in the passage using p_q. Lastly, p_f generates a new answer a′ for the question. If the source answer a matches a′, then (q, a) is deemed consistent and is added to PAQ. In the following, we describe each component in detail.
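The four-stage pipeline can be sketched as follows. The component callables (`p_s`, `p_a`, `p_q`, `p_f`) are stand-ins for the RoBERTa, BERT, BART and FiD models described below; only the control flow and the consistency check are taken from the text.

```python
def generate_qa_pairs(corpus, p_s, p_a, p_q, p_f, top_k_passages):
    """Yield globally-consistent (question, answer) pairs from a corpus."""
    # 1. rank passages by p_s and keep the most question-worthy ones
    passages = sorted(corpus, key=p_s, reverse=True)[:top_k_passages]
    for c in passages:
        # 2. extract candidate answer spans from the passage
        for a in p_a(c):
            # 3. generate questions conditioned on (answer, passage)
            for q in p_q(a, c):
                # 4. global filtering: re-answer q with no access to c;
                #    keep the pair only if the round-trip answer matches
                if p_f(q) == a:
                    yield q, a
```

Each stage is a pure filter/expansion step, so the pipeline can be run lazily over a large corpus and parallelised per passage.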

Passage Selection, p_s
The passage selection model p_s is used to find passages which are likely to contain information that humans may ask about, and would thus be good candidates to generate questions from. We learn p_s using a similar method to Karpukhin et al. (2020b). Concretely, we assume access to a set of positive passages C+ ⊂ C, which we obtain from answer-containing passages from an ODQA training set. Since we do not have a set of labelled negative passages, we sample negatives from the corpus, either randomly or using heuristics. We then train a model to minimise the negative log-likelihood of positive passages relative to negatives.
We implement p_s with RoBERTa and obtain positive passages from Natural Questions (NQ; Kwiatkowski et al., 2019b). We sample easy negatives at random from Wikipedia, and hard negatives from the same Wikipedia article as the positive passage. Easy negatives help the model to learn topics of interest, and hard negatives help to differentiate between interesting and non-interesting passages from the same article. We evaluate by measuring how highly positive validation passages are ranked amongst negatives.
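The training objective can be sketched as a contrastive loss: minimise the negative log-likelihood of the positive passage under a softmax over the positive and its sampled negatives. `score` stands in for the RoBERTa passage scorer; the loss shape is standard, not the paper's exact code.

```python
import math

def selection_nll(score, positive, easy_negatives, hard_negatives):
    """-log p_s(positive) under a softmax over the positive and its negatives."""
    candidates = [positive] + easy_negatives + hard_negatives
    logits = [score(c) for c in candidates]
    log_z = math.log(sum(math.exp(l) for l in logits))
    return -(logits[0] - log_z)
```

Mixing easy (random) and hard (same-article) negatives into `candidates` is what forces the scorer to learn both topical interest and within-article discrimination.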

Answer Extraction, p_a
Given a passage, this component identifies spans that are likely to be answers to questions. We consider two alternatives: an off-the-shelf Named Entity Recogniser (NER), or training a BERT answer extraction model on NQ.
The NER answer extractor simply extracts all named entities from a passage. The majority of questions in ODQA datasets consist of entity mentions (Kwiatkowski et al., 2019a; Joshi et al., 2017), so this approach can achieve high answer coverage. However, as we extract all entity mentions in a passage, we may extract unsuitable mentions, or miss answers that do not conform to the NER system's annotation schema. The trained answer span extractor aims to address these issues.
BERT answer span extraction is typically performed by modelling answer start and end independently, obtaining answer probabilities via p_a(a | c) = p(a_start | c) × p(a_end | c). We found this approach to be sub-optimal for modelling multiple span occurrences in a passage. We instead use an approach that breaks the conditional independence of answer spans by directly predicting p_a(a | c) = p([a_start, a_end] | c). This model first feeds a passage through BERT, then concatenates the start and end token representations of all possible spans of up to length 30, and feeds each span representation into an MLP to compute p_a(a | c). At generation time, the answer extraction component extracts a constant number of spans from each passage, ranked by their extraction probabilities.
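A sketch of the joint span scorer: every span of up to 30 tokens is scored from the concatenation of its start and end token representations, then normalised over all spans. The linear scorer below is a deliberately tiny stand-in for the MLP head, and the token representations would come from BERT.

```python
import math

MAX_SPAN_LEN = 30

def score_spans(token_reps, weights):
    """Return {(start, end): p_a(span | c)} over all spans up to MAX_SPAN_LEN."""
    scores = {}
    n = len(token_reps)
    for start in range(n):
        for end in range(start, min(start + MAX_SPAN_LEN, n)):
            # concatenate start and end token representations
            feat = token_reps[start] + token_reps[end]
            # stand-in for the MLP: a single linear scoring layer
            scores[(start, end)] = sum(w * x for w, x in zip(weights, feat))
    z = sum(math.exp(s) for s in scores.values())
    return {span: math.exp(s) / z for span, s in scores.items()}
```

Because the softmax is taken jointly over spans rather than separately over starts and ends, repeated occurrences of the same surface string compete directly with each other, which is the point of breaking the independence assumption.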

Question Generation, p_q
Given a passage and an answer, this component generates likely questions with that answer. To indicate the answer and its occurrence in the passage, we prepend the answer to the passage and label the answer span with surrounding special tokens. We construct the dataset from NQ, TriviaQA, and SQuAD, and perform standard fine-tuning of BART-base to obtain p_q.
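The input formatting can be sketched as below. The special-token strings (`[ANS]`, `[A0]`, `[A1]`) are assumptions for illustration; only the scheme itself (answer prepended, span wrapped in markers) is from the text.

```python
ANS_SEP, SPAN_START, SPAN_END = "[ANS]", "[A0]", "[A1]"

def build_generator_input(answer: str, passage: str) -> str:
    """Format (answer, passage) for the seq2seq question generator p_q."""
    start = passage.index(answer)  # assumes the answer occurs in the passage
    end = start + len(answer)
    # wrap the answer occurrence in special tokens
    marked = passage[:start] + f"{SPAN_START} {answer} {SPAN_END}" + passage[end:]
    # prepend the answer so the decoder can attend to it directly
    return f"{answer} {ANS_SEP} {marked}"
```

Marking the specific occurrence matters when the answer string appears several times in the passage: the generator should ask about the marked mention, not an arbitrary one.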

Filtering, p_f
The filtering model p_f improves the quality of generated questions by ensuring that they are consistent: that the answer they were generated from is likely to be a valid answer to the question. Previous work (Alberti et al., 2019; Fang et al., 2020) has employed a machine reading comprehension (MRC) QA model for this purpose, p_f(a | q, c), which produces an answer when supplied with a question and the passage it was generated from. We refer to this as local filtering. However, local filtering will not remove questions which are ambiguous, and can only be answered correctly with access to the source passage. Thus, we use an ODQA model for filtering, p_f(a | q, C), supplied with only the generated question, and not the source passage. We refer to this as global filtering, and later show that it is vital for strong downstream results. We use FiD-base with 50 retrieved passages, trained on NQ.
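The distinction can be sketched as two consistency checks; the reader callables are stand-ins for the MRC model and the FiD ODQA model respectively.

```python
def locally_consistent(q, a, c, mrc_reader):
    """Local filtering: the MRC reader sees the source passage c."""
    return mrc_reader(q, c) == a

def globally_consistent(q, a, odqa_model):
    """Global filtering: the ODQA model must answer from the question alone."""
    return odqa_model(q) == a
```

An ambiguous question such as "who was the winner?" can pass the local check, because the passage pins down the referent, yet fail the global one; that is exactly the class of questions global filtering removes.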

Question Answering using PAQ
We consider two uses of PAQ for building QA models. The first is to use PAQ as a source of training QA-pairs for CBQA models. The second treats PAQ as a KB, which models learn to directly retrieve from. These use-cases are related, as CBQA models have been shown to memorise the train data in their parameters, latently retrieving from them at test time (Domingos, 2020).

PAQ for Closed-Book QA
We fine-tune a BART-large model with QA-pairs from the concatenation of the training data and PAQ, using a training procedure similar to prior CBQA work. We use early stopping on the validation set and a batch size of 512, and note learning is slow, requiring 70 epochs on PAQ. Following recent best practices (Alberti et al., 2019), we then fine-tune on the training QA-pairs only, using validation Exact Match score for early stopping (Rajpurkar et al., 2018).
We note that an effective CBQA model must be able to understand the semantics of questions and how to generate answers, in addition to being able to store a large number of facts in its parameters. This model thus represents a kind of combined parametric knowledge base and retrieval system (Petroni et al., 2020a). The model proposed in the next section, RePAQ, represents an explicit non-parametric instantiation of this idea.

RePAQ
RePAQ is a retrieval model which operates on KBs of QA-pairs, such as PAQ. RePAQ extends recently proposed nearest neighbour QA-pair retriever models (Xiao et al., 2020, inter alia). These models assume access to a KB of N QA-pairs K = {(q_1, a_1), ..., (q_N, a_N)}, and provide an answer to a test question q by finding the most relevant QA-pair (q′, a′) in K using a scalable relevance function, then returning a′ as the answer to q. Such a function could be implemented using standard information retrieval techniques (e.g. TF-IDF) or learnt from training data. RePAQ is learnt from ODQA data and consists of a neural dense retriever, optionally followed by a neural reranker.

RePAQ Retriever
Our retriever adopts the dense Maximum Inner Product Search (MIPS) retriever paradigm, which has recently been shown to obtain state-of-the-art results in a number of settings (Karpukhin et al., 2020b; Lee et al., 2021, inter alia). Our goal is to embed queries q and indexed items d into a representation space via embedding functions g_q and g_d, so that the inner product g_q(q)⊤g_d(d) is maximised for items relevant to q. In our case, queries are questions and indexed items are QA-pairs (q′, a′). We make our retriever symmetric by embedding q′ rather than (q′, a′), meaning that only one embedding function g_q is required, which maps questions to embeddings. This applies a useful inductive bias, and we find that it aids stability during training.
Learning the embedding function g_q is complicated by the lack of labelled question-pair paraphrases in ODQA datasets. We propose a latent variable approach similar to retrieval-augmented generation (RAG), where we index training QA-pairs rather than documents. For an input question q, the top K QA-pairs (q′, a′) are retrieved by a retriever p_ret, where p_ret(q′ | q) ∝ exp(g_q(q)⊤g_q(q′)). These are then fed into a seq2seq model p_gen which generates an answer for each retrieved QA-pair, before a final answer is produced by marginalising: p(a | q) = Σ_(q′,a′) p_ret(q′ | q) p_gen(a | q, q′, a′). As p_gen generates answers token-by-token, credit can be given for retrieving helpful QA-pairs which do not exactly match the target answer. For example, for the question "when was the last time anyone was on the moon" and target answer "December 1972", retrieving "when was the last year astronauts landed on the moon" with answer "1972" will help to generate the target answer, despite the answers having different granularity. After training, we discard the generator, retaining only the question embedder g_q. We implement p_ret with ALBERT (Lan et al., 2020) with an output dimension of 768, and p_gen with BART-large. We train using the top 100 retrieved QA-pairs, and refresh the embedding index every 5 training steps.
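The training objective above can be sketched as follows: retrieval probabilities come from embedding inner products, and the generator's answer likelihood is marginalised over the retrieved pairs. `embed` and `gen_log_prob` stand in for ALBERT and BART-large; this shows the objective's shape, not the paper's implementation.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def marginal_log_likelihood(q, answer, kb, embed, gen_log_prob, k):
    """log p(a|q) = log sum_{(q',a')} p_ret(q'|q) * p_gen(a | q, q', a')."""
    qv = embed(q)
    # retrieve the top-k stored QA-pairs by inner product with the query
    top = sorted(kb, key=lambda qa: dot(qv, embed(qa[0])), reverse=True)[:k]
    logits = [dot(qv, embed(qp)) for qp, _ in top]
    log_z = math.log(sum(math.exp(l) for l in logits))
    # each term is log p_ret(q'|q) + log p_gen(a | q, q', a')
    terms = [
        (logit - log_z) + gen_log_prob(answer, q, qp, ap)
        for logit, (qp, ap) in zip(logits, top)
    ]
    m = max(terms)  # log-sum-exp over retrieved pairs
    return m + math.log(sum(math.exp(t - m) for t in terms))
```

Because the generator term is differentiable in the answer tokens, retrieving a near-miss pair ("1972" for "December 1972") still raises the marginal likelihood, which is what rewards useful but inexact retrievals.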
Once the embedder g q is trained, we build a test-time QA system by embedding and indexing a QA KB such as PAQ. Answering is achieved by retrieving the most similar stored question, and returning its answer. The matched QA-pair can be displayed to the user, providing a mechanism for more interpretable answers than CBQA models and many retrieve-and-read generators which consume thousands of tokens to generate an answer. Efficient MIPS libraries such as FAISS (Johnson et al., 2017) enable RePAQ's retriever to answer 100s to 1000s of questions per second (see Section 5.2.3). We use a KB for RePAQ consisting of training set QA-pairs concatenated with QA-pairs from PAQ.
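Test-time answering then reduces to a nearest-neighbour lookup. The linear scan below is a stand-in for the FAISS MIPS index; returning the matched question alongside the answer provides the provenance for interpretability mentioned above.

```python
def answer(q, kb, embed):
    """kb: list of (question, answer, embedding) triples; returns (a', q')."""
    qv = embed(q)
    # argmax inner product; served by a FAISS MIPS index in practice
    best = max(kb, key=lambda item: sum(a * b for a, b in zip(qv, item[2])))
    matched_q, matched_a, _ = best
    return matched_a, matched_q  # answer plus the matched QA-pair as provenance
```

Adding or removing knowledge at test time is just inserting or deleting triples from `kb`, with no retraining.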

RePAQ Reranker
The accuracy of RePAQ can be improved using a reranker on the top-K QA-pairs from the retriever. The reranker uses cross-encoding (Humeau et al., 2020), and includes the retrieved answer in the scoring function for richer featurisation. We concatenate the input question q, the retrieved question q and its answer a with separator tokens, and feed it through ALBERT. We obtain training data in the following manner: For a training QA-pair, we first retrieve the top 2K QA-pairs from PAQ using RePAQ's retriever. If one of the retrieved QA-pairs has the correct answer, we treat that QA-pair as a positive, and randomly sample K-1 of the incorrect retrieved questions as negatives. We train by minimising negative log likelihood of positives relative to 10 negatives, and rerank 50 retrieved pairs at test time. The reranker improves accuracy at the expense of some speed. However, as QA-pairs consist of fewer tokens than passages, the reranker is still faster than retrieve-and-read models, even when using architectures such as ALBERT-xxlarge.
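The reranker's training-example construction can be sketched as below; `retrieve` stands in for RePAQ's retriever and the function name is illustrative.

```python
import random

def build_rerank_example(question, gold_answer, retrieve, k):
    """Build one reranker training example from the top-2K retrieved pairs."""
    candidates = retrieve(question, 2 * k)  # list of (q', a') pairs
    positives = [qa for qa in candidates if qa[1] == gold_answer]
    if not positives:
        return None  # no retrieved pair answers correctly: skip this example
    negatives = [qa for qa in candidates if qa[1] != gold_answer]
    sampled = random.sample(negatives, min(k - 1, len(negatives)))
    return {"input": question, "positive": positives[0], "negatives": sampled}
```

The reranker is then trained to score the positive above its sampled negatives, mirroring the NLL objective used for the retriever.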

Results
We first examine the PAQ resource in general, before exploring how both CBQA models and RePAQ perform using PAQ, comparing to recently published systems. We use the Natural Questions (NQ, Kwiatkowski et al., 2019b) and TriviaQA (Joshi et al., 2017) datasets to assess performance, evaluating with the standard Exact Match (EM) score.

Examining PAQ
We generate PAQ by applying the pipeline described in Section 3 to the Wikipedia dump from Karpukhin et al. (2020a), which splits Wikipedia into 100-word passages. We use the passage selection model p_s to rank all 21M passages, and generate from the top 10M, before applying global filtering. We are interested in understanding the effectiveness of different answer extractors, and whether generating more questions per answer span leads to better results. To address these questions, we create three versions of PAQ, described below. PAQ_L uses the learnt answer extractor, and a question generator trained on NQ and TriviaQA. We extract 8 answers per passage and use a beam size of 4 for question generation. In PAQ_L,1 we only use the top-scoring question in the beam, whereas in PAQ_L,4 we use all four questions from the beam, allowing for several questions to be generated from one answer in a passage. PAQ_NE,1 uses the NER answer extractor, and a generator trained only on NQ. PAQ_NE,1 allows us to assess whether diversity in the form of answer extractors and question generators leads to better results. The final KB, referred to as just "PAQ", is the union of PAQ_L and PAQ_NE. As shown in Table 1, PAQ consists of 65M filtered QA-pairs. This was obtained by extracting 165M answer spans and generating 279M unique questions before applying global filtering. Table 1 shows that the PAQ_L pipeline is more efficient than PAQ_NE, with 24.4% of QA-pairs surviving filtering, compared to 18%.

PAQ Answer Coverage

To evaluate answer extractors, we calculate how many answers in the validation sets of TriviaQA and NQ also occur in PAQ's filtered QA-pairs. Table 1 shows that the answer coverage of PAQ is very high: over 90% for both TriviaQA and NQ. Comparing PAQ_L with PAQ_NE shows that the learnt extractor achieves higher coverage, but the union of the two leads to the highest coverage overall. Comparing PAQ_L,1 and PAQ_L,4 indicates that using more questions from the beam also results in higher coverage.

PAQ Question Generation Quality
Illustrative examples from PAQ can be seen in Table 2. Manual inspection of 50 questions from PAQ reveals that 82% of questions accurately capture information from the passage and contain sufficient details to locate the answer. 16% of questions confuse the semantics of certain answer types, either by conflating similar entities in the passage or by misinterpreting rare phrases (see examples 7 and 8 in Table 2). Finally, we find small numbers of grammar errors (such as example 5) and mismatched wh-words (5% and 2% respectively).

Other observations

PAQ often contains several paraphrases of the same QA-pair. This redundancy reflects how information is distributed in Wikipedia, with facts often mentioned on several different pages. Generating several questions per answer span also increases redundancy. Whilst this means that PAQ could be more information-dense if a de-duplication step were applied, we later show that RePAQ always improves with more questions in its KB (Section 5.2.1). This suggests that it is worth increasing redundancy for greater coverage.

Question Answering Results
In this section, we compare the PAQ-leveraging models proposed in Section 4 to existing approaches. We primarily compare to a state-of-the-art retrieve-and-read model, Fusion-in-Decoder (FiD). FiD uses DPR (Karpukhin et al., 2020b) to retrieve relevant passages from Wikipedia, and feeds them into T5 to generate a final answer. Table 3 shows the highest-accuracy configurations of our models alongside recent state-of-the-art approaches. We make the following observations: comparing rows 2 and 7 shows that a CBQA BART model trained with PAQ outperforms a comparable NQ-only model by 5%, and is only 3% behind T5-11B (row 1), which has 27x more parameters. Second, we note strong results for RePAQ on NQ (47.7%, row 9), outperforming recent retrieve-and-read systems such as RAG by over 3% (row 4).
Multi-task training RePAQ on NQ and TriviaQA improves TriviaQA results by 1-2% (comparing rows 8-9 with 10-11). RePAQ does not perform quite as strongly on TriviaQA (see Section 5.2.6), but is within 5% of RAG, and outperforms concurrent work on real-time QA, DensePhrases (row 6; Lee et al., 2021). Lastly, row 12 shows that combining RePAQ and FiD-large into a combined system is 0.9% more accurate than FiD-large (see Section 5.2.4 for more details).

Table 3: Exact Match score for highest-accuracy RePAQ configurations in comparison to recent state-of-the-art systems. Highest score indicated in bold, highest non-retrieve-and-read model underlined.

Table 4 shows RePAQ's accuracy when using different PAQ variants. To establish the effect of filtering, we evaluate RePAQ with unfiltered, locally-filtered and globally-filtered QA-pairs on PAQ_L,1. Comparing rows 2, 3 and 4 shows that global filtering is crucial, leading to a 9% and 14% improvement over locally-filtered and unfiltered datasets respectively. We also note a general trend in Table 4 that adding more globally-filtered questions improves accuracy. Rows 4 and 5 show that using four questions per answer span is better than generating one (+0.9%), and rows 5, 6 and 7 show that combining PAQ_NE and PAQ_L results in a further 1.2% improvement. Empirically, we did not observe any cases where increasing the number of globally-filtered QA-pairs reduced accuracy, even when there were millions of QA-pairs already.

System Size vs Accuracy
PAQ's QA-pairs are accompanied by scores of how likely they are to be asked. These scores can be used to filter the KB and reduce the RePAQ system size. A similar procedure can be used to filter the background corpus for a retrieve-and-read model. We compare the system size of a FiD-large system and RePAQ as the number of items (passages and QA-pairs respectively) in their indexes is reduced. We select which passages and QA-pairs are included using the passage selection model p_s. Further experimental details can be found in Appendix A.4. The results show that both system sizes can be reduced several-fold with only a small drop in accuracy, demonstrating the effectiveness of p_s. FiD can achieve a higher accuracy, but requires larger system sizes. RePAQ can be reduced to a smaller size before a significant accuracy drop, driven primarily by the higher information density of QA-pairs relative to passages, and the fewer model parameters used by RePAQ compared to FiD. Highly-optimized RePAQ models won the "500MB" and "Smallest System" tracks of the EfficientQA NeurIPS competition, with disk images of 336MB and 29MB respectively. For further details on EfficientQA, the reader is referred to Min et al. (2020a).

Inference Speed vs Accuracy
We train a variety of differently-sized RePAQ models to explore the relationship between accuracy and inference speed. We use a fast Hierarchical Navigable Small World (HNSW) index in FAISS (Malkov and Yashunin, 2018; Johnson et al., 2017), which incurs a negligible (~0.1%) drop in retriever accuracy compared to a flat index, and measure the time required to evaluate the NQ test set on a system with access to one GPU (system details can be found in Appendix A.5). Table 5 shows these results. Some retriever-only RePAQ models can answer over 1,000 questions per second, and are relatively insensitive to model size, with ALBERT-base only scoring 0.5% lower than ALBERT-xlarge. They also outperform retrieve-and-read models like REALM (40.4%; Guu et al., 2020) and recent real-time QA models like DensePhrases (40.9%; Lee et al., 2021). We find that larger, slower RePAQ rerankers achieve higher accuracy. However, even the slowest RePAQ is 3x faster than FiD-base, whilst only being 0.8% less accurate, and 12x faster than FiD-large.

Selective Question Answering
QA systems should not just be able to answer accurately, but also "know when they don't know", and abstain from answering when they are unlikely to produce good answers. This task is challenging for current systems (Asai and Choi, 2020; Jiang et al., 2020b), and has been approached in MRC by training on unanswerable questions (Rajpurkar et al., 2018), and for trivia systems by leveraging incremental QA formats (Rodriguez et al., 2019).
We find that RePAQ's retrieval and reranking scores are well-correlated with answering correctly. This allows RePAQ to be used for selective question answering by abstaining when the score is below a certain threshold. Figure 3 shows a risk-coverage plot (Wang et al., 2018) for RePAQ and FiD, where we use FiD's answer log probability as its answer confidence. The plot shows the accuracy on the top N% highest-confidence answers for NQ. If we require models to answer 75% of user questions, RePAQ's accuracy on the questions it does answer is 59%, whereas FiD, which has poorer calibration, scores only 55%. This difference is even more pronounced with stricter thresholds: with coverage of 50%, RePAQ outperforms FiD by over 10%. FiD only outperforms RePAQ when we require systems to answer more than 85% of questions.
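The risk-coverage evaluation can be sketched as follows: rank predictions by model confidence, keep the top-N% most confident, and measure accuracy on the kept subset. The function name is illustrative.

```python
def accuracy_at_coverage(predictions, coverage):
    """predictions: (confidence, is_correct) pairs; coverage in (0, 1]."""
    ranked = sorted(predictions, key=lambda p: p[0], reverse=True)
    kept = ranked[: max(1, round(len(ranked) * coverage))]
    return sum(correct for _, correct in kept) / len(kept)
```

A well-calibrated system shows sharply higher accuracy at low coverage; a poorly calibrated one gains little from abstaining, which is the behaviour the risk-coverage plot exposes.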
Whilst RePAQ's selective QA is useful in its own right, it also allows us to combine the slow but accurate FiD with the fast and precise RePAQ, which we refer to as back-off. We first try to answer with RePAQ, and if the confidence is below a threshold determined on validation data, we pass the question on to FiD. For NQ, the combined system is 2.1x faster than FiD-large, with RePAQ answering 57% of the questions, and the overall accuracy is 1% higher than FiD-large (see Table 3).
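The back-off combination can be sketched in a few lines; `repaq` and `fid` are stand-in callables returning (answer, confidence) and an answer respectively, and the threshold is the one tuned on validation data.

```python
def backoff_answer(q, repaq, fid, threshold):
    """Answer with RePAQ when confident, otherwise defer to the slower FiD."""
    answer, confidence = repaq(q)
    if confidence >= threshold:
        return answer, "repaq"
    return fid(q), "fid"
```

Raising the threshold shifts more questions to FiD (higher accuracy, slower); lowering it shifts more to RePAQ (faster, at some accuracy cost), which is the speed/accuracy dial described below.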
If inference speed is a priority, the threshold can be decreased so that RePAQ answers 80% of the questions, which retains the same overall accuracy as FiD, with a 4.6x speedup. For TriviaQA, the combined system backs off to FiD earlier, due to the stronger relative performance of FiD. Additional details can be found in Appendix A.6.

Analysing RePAQ's Predictions
Some examples of top retrieved questions are shown in Table 6. When RePAQ answers correctly, the retrieved question is a paraphrase of the test question from PAQ in 89% of cases. As such, there is high similarity (80.8 ROUGE-L) between correctly-answered test questions and the top retrieved questions. 9% of test questions even exist verbatim in PAQ, and are thus trivial to answer. The reranker primarily improves over the retriever for ambiguous cases, and cases where the top retrieved answer does not have the right granularity. In 32% of cases, RePAQ does not retrieve the correct answer in the top 50 QA-pairs, suggesting a lack of coverage may be a significant source of error. In these cases, retrieved questions are much less similar to the test question than for correctly-answered questions, dropping by 20 ROUGE-L. We also observe cases where retrieved questions match the test question, but the retrieved answer does not match the desired answer. This is usually due to different answer granularity, but in a small number of cases was due to factually incorrect answers.

Does the Filtering Model Limit RePAQ's Accuracy?

As RePAQ relies on retrieving paraphrases of test questions, we may expect that the ODQA filtering model places an upper bound on its performance. For example, if a valid QA-pair is generated which overlaps with a test QA-pair, but the filter cannot answer it correctly, that QA-pair will not be added to PAQ, and RePAQ cannot use it to answer the test question. The NQ FiD-base 50-passage model used for filtering scores 46.1% and 53.1% on NQ and TriviaQA respectively. RePAQ actually outperforms the filter model on NQ by 1.6%. This is possible because generated questions may be phrased in such a way that they are easier to answer, e.g. being less ambiguous. RePAQ can then retrieve the paraphrased QA-pair and answer correctly, even if the filter could not answer the test question directly. The filtering model's weaker score on TriviaQA helps explain why RePAQ is not as strong on this dataset. We speculate that using a stronger filtering model for TriviaQA would in turn improve RePAQ's results.

Table 7: "Q overlap" are test questions with paraphrases in training data; "A-only" are test questions where answers appear in training data, but questions do not; "No overlap" are test questions where neither question nor answer overlaps.

Table 7 shows results on test set splits which measure how effectively models memorise QA-pairs from the NQ train set ("Q overlap"), and generalise to novel questions ("A-only overlap" and "No overlap"). Comparing CBQA models trained on NQ with those trained on NQ and PAQ shows that models trained with PAQ answer more questions correctly from the "A-only overlap" and "No overlap" categories, indicating they have learnt facts not present in the NQ train set. Applying additional NQ fine-tuning to the PAQ CBQA model improves scores on "Q overlap" (indicating greater memorisation of NQ), but scores on the other categories drop (indicating reduced memorisation of PAQ). However, RePAQ, which explicitly retrieves from PAQ rather than memorising it in parameters, strongly outperforms the CBQA model in all categories, demonstrating that the CBQA model struggles to memorise enough facts from PAQ. CBQA models with more parameters may be better able to memorise PAQ, but have downsides in terms of system resources. Future work should address how to better store PAQ in CBQA model parameters.

Related Work
ODQA has received much attention, both for its practical applications and as a benchmark for how NLP models store and access knowledge (Chen and Yih, 2020; Petroni et al., 2020b).
KBQA A number of early approaches to ODQA focused on using structured KBs (Berant et al., 2013) such as Freebase (Bollacker et al., 2007), with recent examples from Févry et al. (2020) and Verga et al. (2020). This approach often has high precision, but suffers when the KB does not match user requirements, or when the schema limits what knowledge can be stored. We populate our knowledge base with semi-structured QA-pairs which are specifically likely to be relevant at test time, mitigating both of these drawbacks while sharing many of the benefits, such as precision and extensibility.
Open Information Extraction Our work touches on automatic KB construction and open information extraction (OpenIE) (Angeli et al., 2015). Here, the goal is to mine facts from free text into structured or semi-structured forms, typically (subject, relation, object) triples, for use in tasks such as slot-filling (Surdeanu, 2013). We generate natural language QA-pairs rather than OpenIE triples, and we do not attempt to extract all possible facts in a corpus, focusing only on those that are likely to be asked. QA-pairs have also been used in semantic role labelling, such as QA-SRL (FitzGerald et al., 2018).
Real-time ODQA Systems that prioritise fast runtimes over accuracy are sometimes referred to as real-time QA systems (Seo et al., 2018). DenSPI (Seo et al., 2019) and a contemporary work, DensePhrases (Lee et al., 2021), approach this by indexing all possible phrases in a background corpus, and learning mappings from questions to passage-phrase pairs. We also build an index for fast answering, but generate and index globally-answerable questions. Indexing QA-pairs can be considered as indexing summaries of important facts from the corpus, rather than indexing the corpus itself. We also generate and store multiple questions per passage-answer pair, relieving the information bottleneck that comes from encoding a passage-answer pair into a single vector.
Question Generation for QA Question generation has been used for various purposes, such as data augmentation (Alberti et al., 2019; Lee et al., 2021), improved retrieval (Nogueira et al., 2019), and generative modelling for contextual QA (Lewis and Fan, 2018), as well as being studied in its own right (Du et al., 2017; Hosking and Riedel, 2019). Serban et al. (2016) generate large numbers of questions from Freebase, but do not address how to use them for QA. Closest to our work is the recently-proposed OceanQA (Fang et al., 2020). OceanQA first generates contextual QA-pairs from Wikipedia. At test time, a document retrieval system retrieves the most relevant passage for a question, and the closest pre-generated QA-pair from that passage is then selected. In contrast, we focus on generating a large KB of non-contextual, globally-consistent ODQA questions and exploring what QA systems are facilitated by such a resource.

Conclusion
We have introduced PAQ, a dataset of 65M QA-pairs, and explored how it can be used to improve ODQA models. We demonstrated the effectiveness of RePAQ, a system which retrieves from PAQ, in terms of accuracy, speed, space efficiency and selective QA. Generating PAQ is computationally intensive due to its large scale, but it should be a useful, re-usable resource for building more accurate, smaller and faster QA models. Nevertheless, future work should improve the efficiency of generation in order to expand PAQ's coverage.
We also demonstrated PAQ's utility for improved CBQA models, but note a large accuracy gap between our CBQA models and RePAQ. Exploring the trade-offs between storing and retrieving knowledge parametrically or non-parametrically is a topic of great current interest, and PAQ should be a useful testbed for probing this relationship further. We also note that PAQ could be used for general data augmentation when training any open-domain QA model or retriever. Whilst we consider such work out of scope for this paper, leveraging PAQ to improve retrieve-and-read and other systems should be explored in future work.

A.1 Dataset splits
For Natural Questions (NQ, Kwiatkowski et al., 2019b), we use the standard Open-Domain splits in our experiments, consisting of 79,168 train, 8,757 development, and 3,610 test question-answer pairs.
For TriviaQA, we use the open-domain train-test splits, which correspond to the unfiltered-train and unfiltered-dev reading comprehension splits (Min et al., 2019; Karpukhin et al., 2020a).

A.2 Further details on Passage selection
Our passage selection model is based on RoBERTa BASE (Liu et al., 2019b). We feed each candidate passage into the model and use an MLP on top of RoBERTa's [CLS] token representation to predict whether the passage is positive or negative. We then run inference with this model to obtain a score for every passage in the whole Wikipedia corpus. The top N passages ranked by this score serve as the candidate pool from which we generate answers. The model is optimised for high recall, such that positive passages are retained with high probability; it achieves 84.7% recall on the Natural Questions dev set.
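The recall of such a candidate pool can be measured with a short sketch: rank all passages by the classifier's score, keep the top N, and compute the fraction of gold (answer-bearing) passages that survive the cut-off. The passage ids and scores below are hypothetical toy values.

```python
def recall_at_n(scored_passages, positive_ids, n):
    """Recall of the top-N candidate pool: fraction of positive
    (answer-bearing) passages retained after the ranking cut-off.

    scored_passages: list of (passage_id, score) pairs over the corpus.
    positive_ids: set of gold passage ids.
    """
    top_n = {pid for pid, _ in sorted(scored_passages, key=lambda x: -x[1])[:n]}
    return len(top_n & positive_ids) / len(positive_ids)

# Toy scores standing in for the RoBERTa classifier's outputs:
scores = [("p1", 0.9), ("p2", 0.1), ("p3", 0.8), ("p4", 0.7), ("p5", 0.2)]
print(recall_at_n(scores, {"p1", "p3", "p5"}, n=3))  # 2 of 3 positives kept
```

Tuning the pool size N trades corpus coverage against generation cost; the 84.7% figure above corresponds to this kind of measurement on the NQ dev set.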

A.3 Further Details on Question Quality
For NQ, we find that the retrieved questions are paraphrases of the test questions in the majority of cases. The test questions in TriviaQA are mostly very specific, and whilst the retrieved questions in PAQ contain less detail, they still usually have correct answers. To better evaluate the quality of the generated questions, we conduct a human evaluation on 50 randomly sampled questions generated from the Wikipedia passage pool. We make the following observations: 1) The majority (82%) of questions accurately capture the context of the answer in the passage, and contain relevant and informative details for locating the answer. 2) 16% of questions fail to capture the semantics of certain types of answers. These failures can be traced to: mistaking extremely similar entities, e.g. given the sentence "The Kerch Peninsula is ... located at the eastern end of the Crimean Peninsula" and the answer "the Crimean Peninsula", the generated question is "what is the eastern end of the kerch peninsula"; and poor generalisation to rare phrases, e.g. digit combinations in passages usually denote dates or timestamps, but not always, and the model fails to capture the exceptions, generating "when did ... play for the toronto raptors" from "under a 109-124 loss to the Milwaukee Bucks". 3) Very few (2%) questions mismatch question type words (Wh-words) with their answers, which occurs when the answers are words rarely asked about (e.g. "first").

A.4 Further details on System Size vs Accuracy
The system size experiment described in Section 5.2.2 measures the bytes required to store the models, the text of the documents/QA-pairs, and a dense index. For Figure 2, we assume models are stored at fp16 precision, the text has been compressed using LZMA (https://tukaani.org/xz/), and the indexes both use 768-dimensional vectors with Product Quantization. These are realistic defaults with precedent in the question answering literature (Min et al., 2020a). The RePAQ model used in this figure consists of an ALBERT-base retriever and an ALBERT-xxlarge reranker, and the FiD system consists of DPR (Karpukhin et al., 2020b) (itself consisting of two BERT-base retrievers) and a T5-large reader. Using a different system setup (for example, storing the models at full precision, with no text compression and fp16 index quantization, shown in Figure 4) shifts the relative positions of the curves in Figure 2, but not the qualitative observation that RePAQ models can be compressed to small sizes before a significant drop in accuracy.
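The byte arithmetic behind these index-size comparisons can be sketched as follows. The 64-byte PQ code size is an assumption for illustration, not necessarily the exact Product Quantization setting used in the experiments.

```python
def index_bytes(n_vectors, dim, bytes_per_dim=None, pq_bytes=None):
    """Approximate index storage: dense vectors at a given precision,
    or Product Quantization codes at pq_bytes per vector."""
    if pq_bytes is not None:
        return n_vectors * pq_bytes
    return n_vectors * dim * bytes_per_dim

N, D = 65_000_000, 768                          # 65M QA-pairs, 768-d vectors
full = index_bytes(N, D, bytes_per_dim=4)       # fp32: ~200 GB
half = index_bytes(N, D, bytes_per_dim=2)       # fp16: ~100 GB
pq = index_bytes(N, D, pq_bytes=64)             # 64-byte PQ codes (assumed): ~4 GB
print(f"{full / 1e9:.0f} GB, {half / 1e9:.0f} GB, {pq / 1e9:.1f} GB")
```

The same arithmetic applied to a text corpus index explains why quantization choices dominate the system-size curves far more than model precision does.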

A.5 Further details on Inference speed
The system used for inference speed benchmarking is a machine learning workstation with 80 CPU cores, 512GB of CPU RAM, and access to one 32GB NVIDIA V100 GPU. Inference is carried out at mixed precision for all systems, and questions are allowed to be answered in parallel. Models are implemented in Pytorch (Paszke et al., 2019) using Transformers (Wolf et al., 2020). Measurements are repeated 3 times and the mean time is reported, rounded to an appropriate significant figure. The HNSW index used in this experiment indexes all 65M PAQ QA-pairs with 768-dimensional vectors, uses an ef_construction of 80, an ef_search of 32, and a store_n of 256, and performs up to 2048 index searches in parallel. This index occupies 220GB, but could be considerably compressed with techniques including scalar or product quantization, or training retrievers with smaller projection dimensions. The reader is referred to Lewis et al. (2020a, Appendix C) for experiments demonstrating index compression with little-to-no drop in accuracy.
[Figure 5 caption: Risk-coverage for TriviaQA. FiD has higher overall accuracy, but RePAQ with a reranker still outperforms it for coverages below 50%.]
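A back-of-the-envelope estimate of the uncompressed HNSW footprint is consistent with the 220GB figure. The 64 neighbour links per node is an assumption for illustration; the true graph overhead depends on the exact HNSW configuration.

```python
# Rough memory footprint of the uncompressed HNSW index (a sketch).
N, D = 65_000_000, 768
vector_bytes = N * D * 4                 # fp32 vectors: ~200 GB
links_per_node = 64                      # assumed average graph degree
graph_bytes = N * links_per_node * 4     # 4-byte neighbour ids
total_gb = (vector_bytes + graph_bytes) / 1e9
print(f"~{total_gb:.0f} GB")             # same ballpark as the reported 220GB
```

The vectors dominate: the graph adds only tens of gigabytes, which is why vector quantization (scalar or product) or smaller projection dimensions are the effective levers for shrinking the index.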
A.6 Further details on Selective Question Answering
Figure 5 shows the risk-coverage plot for TriviaQA. The results are qualitatively similar to those for NQ (see Figure 3 in the main paper), although FiD's stronger overall performance on TriviaQA shifts its risk-coverage curve up the accuracy axis relative to RePAQ. FiD also seems a little better calibrated on TriviaQA than on NQ, indicated by a steeper gradient. However, RePAQ remains better calibrated than FiD, and outperforms it for answer coverages below 50%. We also investigate ways to improve the calibration of FiD-large on NQ using post-hoc calibration, with an approach similar to Jiang et al. (2020a). We train a Gradient Boosting Machine (GBM, Friedman, 2001) on the dev set of NQ, which attempts to predict whether FiD-large has answered correctly or not. The GBM is given FiD's answer loss, answer log probability, and the retrieval scores of the top 100 documents retrieved by DPR. Figure 6 shows these results. We first note that FiD-large's answer loss and answer log probabilities perform similarly, and both struggle to calibrate FiD, as mentioned in the main paper. The GBM does improve calibration, especially at lower coverages, but still lags behind RePAQ by 7% EM at 50% coverage. We also note the surprising finding that we can actually use RePAQ's scores to calibrate FiD. Here, we use FiD's predicted answer, but RePAQ's confidence score to decide whether to answer or not. This result is also plotted in Figure 6, and results in the best risk-coverage curve for FiD. However, this is still not as well calibrated as simply using RePAQ, further highlighting the strength of RePAQ for selective QA.
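The risk-coverage analysis used throughout this section can be sketched as a short function: sort questions by model confidence, answer only the top fraction, and measure accuracy at each coverage level. The confidence scores and correctness labels below are toy values, not actual model outputs.

```python
def risk_coverage_curve(confidences, correct):
    """Accuracy at each coverage level when answering only the
    questions the model is most confident about (selective QA).

    Returns a list of (coverage, accuracy) points.
    """
    order = sorted(range(len(confidences)), key=lambda i: -confidences[i])
    curve, n_correct = [], 0
    for k, i in enumerate(order, start=1):
        n_correct += correct[i]
        curve.append((k / len(order), n_correct / k))
    return curve

# Toy example: a well-calibrated model stays accurate at low coverage.
conf = [0.95, 0.90, 0.60, 0.40]
correct = [1, 1, 0, 1]
for cov, acc in risk_coverage_curve(conf, correct):
    print(f"coverage {cov:.2f}: accuracy {acc:.2f}")
```

Plotting accuracy against coverage for each system yields curves like Figures 3 and 5; a better-calibrated system's curve stays higher at low coverage, which is what allows RePAQ to "back off" to FiD only on the questions it is unsure about.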

A.7 Additional Model training details
RePAQ models were trained for up to 3 days on a machine with 8 NVIDIA 32GB V100 GPUs. Validation exact match score was used to determine when to stop training in all cases. RePAQ retrievers were trained using Fairseq (Ott et al., 2019), and rerankers were trained using Transformers (Wolf et al., 2020) in Pytorch. The PAQ BART CBQA models were trained in Fairseq for 6 days on 8 V100s, after which validation accuracy had plateaued. Standard BART hyperparameters were used, apart from batch size and learning rate, which were tuned to promote faster learning; learning became unstable with learning rates greater than 0.0001.