Abstract
We introduce ART, a new corpus-level autoencoding approach for training dense retrieval models that does not require any labeled training data. Dense retrieval is a central challenge for open-domain tasks, such as Open QA, where state-of-the-art methods typically require large supervised datasets with custom hard-negative mining and denoising of positive examples. ART, in contrast, only requires access to unpaired inputs and outputs (e.g., questions and potential answer passages). It uses a new passage-retrieval autoencoding scheme, where (1) an input question is used to retrieve a set of evidence passages, and (2) the passages are then used to compute the probability of reconstructing the original question. Training for retrieval based on question reconstruction enables effective unsupervised learning of both passage and question encoders, which can be later incorporated into complete Open QA systems without any further finetuning. Extensive experiments demonstrate that ART obtains state-of-the-art results on multiple QA retrieval benchmarks with only generic initialization from a pre-trained language model, removing the need for labeled data and task-specific losses.1
1 Introduction
Dense passage retrieval methods (Karpukhin et al., 2020; Xiong et al., 2021), initialized with encoders such as BERT (Devlin et al., 2019) and trained using supervised contrastive losses (Oord et al., 2018), have surpassed the performance achieved by previously popular keyword-based approaches like BM25 (Robertson and Zaragoza, 2009). Such retrievers are core components in models for open-domain tasks, such as Open QA, where state-of-the-art methods typically require large supervised datasets with custom hard-negative mining and denoising of positive examples. In this paper, we introduce the first unsupervised method, based on a new corpus-level autoencoding approach, that can match or surpass strong supervised performance levels with no labeled training data or task-specific losses.
We propose ART, Autoencoding-based Retriever Training, which only assumes access to sets of unpaired questions and passages. Given an input question, ART first retrieves a small set of possible evidence passages. It then reconstructs the original question by attending to these passages (see Figure 1 for an overview). The key idea in ART is to consider the retrieved passages as a noisy representation of the original question and question reconstruction probability as a way of denoising that provides soft-labels for how likely each passage is to have been the correct result.
Figure 1: ART maximizes the retrieved passage likelihood computed from the dense retriever by considering the language model question reconstruction score conditioned on the passage as a soft-label. Colored blocks indicate trainable parameters. Red arrows show gradient flow during backpropagation.
To bootstrap the training of a strong model, it is important both to have a strong initial retrieval model and to be able to compute reliable initial estimates of the question reconstruction probability conditioned on a (retrieved) passage. Although passage representations from BERT-style models are known to be reasonable retrieval baselines, it is less clear how to do zero-shot question generation. We use a generative pre-trained language model (PLM) and prompt it with the passage as input to score the question tokens under teacher-forcing. Because only the retriever is trained and the question-generation PLM is never finetuned, ART can use large PLMs and obtain accurate soft-label estimates of which passages are likely to be of the highest quality.
The retriever is trained to penalize the divergence of a passage's retrieval likelihood from its soft-label score. For example, if the question is “Where is the bowling hall of fame located?” as shown in Figure 1, then training will boost the retrieval likelihood of the passage “Bowling Hall of Fame is located in Arlington,” which is relevant and leads to a higher question reconstruction likelihood, while the likelihood of the passage “Hall of Fame is a song by …” will be penalized as it is irrelevant. In this manner, training reinforces correct retrieval results and suppresses incorrect ones, leading to an iterative improvement in passage retrieval.
Comprehensive experiments on five benchmark QA datasets demonstrate the usefulness of our proposed training approach. By simply using questions from the training set, ART outperforms models like DPR by an average of 5 points absolute in top-20 and 4 points absolute in top-100 accuracy. We also train using all the questions contained in the Natural Questions (NQ) dataset (Kwiatkowski et al., 2019) and find that even with a mix of answerable and unanswerable questions, ART achieves strong generalization on out-of-distribution datasets thanks to its reliance on the PLM. Our analysis further reveals that ART is highly sample-efficient, outperforming BM25 and DPR with just 100 and 1,000 questions, respectively, on the NQ-Open dataset, and that scaling up to larger retriever models consistently improves performance.
2 Method
2.1 Problem Definition
We focus on open-domain retrieval, where, given a question q, the task is to select a small set of matching passages (e.g., 20 or 100) from a large collection of evidence passages D. Our goal is to train a retriever in a zero-shot manner, that is, without using question-passage pairs, such that it retrieves passages relevant to answering the question. Our proposed approach consists of two core modeling components (§2.2, §2.3) and a novel training method (§2.4).
2.2 Dual Encoder Retriever
For the retriever, we use the dual-encoder model (Bromley et al., 1994), which consists of two encoders: a question encoder (with parameters Φq) that computes the question embedding, and a passage encoder (with parameters Φd) that computes the passage embedding. The relevance of a passage to a question is scored by the inner product of the two embeddings.
We use the transformer network (Vaswani et al., 2017) with BERT tokenization (Devlin et al., 2019) to model both encoders. To obtain the question or passage embedding, we do a forward pass through the transformer and select the last-layer hidden state corresponding to the [CLS] token. As the input passage representation, we use both the passage title and text, separated by a [SEP] token.
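As a concrete illustration, the sketch below shows how the two encoders could produce [CLS] embeddings with the HuggingFace `transformers` API; the `bert-base-uncased` checkpoints are stand-ins (the paper initializes the retriever from MSS pre-training, §3.2) and are therefore assumptions.

```python
# Minimal sketch of the dual-encoder embedding step. Assumptions: HuggingFace
# `transformers`; vanilla BERT checkpoints stand in for the MSS-initialized encoders.
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
question_encoder = BertModel.from_pretrained("bert-base-uncased")  # parameters Φq
passage_encoder = BertModel.from_pretrained("bert-base-uncased")   # parameters Φd

def embed_question(question: str) -> torch.Tensor:
    inputs = tokenizer(question, return_tensors="pt", truncation=True)
    hidden = question_encoder(**inputs).last_hidden_state
    return hidden[:, 0]  # last-layer [CLS] vector

def embed_passage(title: str, text: str) -> torch.Tensor:
    # Title and text form a single input sequence separated by a [SEP] token.
    inputs = tokenizer(title, text, return_tensors="pt", truncation=True)
    hidden = passage_encoder(**inputs).last_hidden_state
    return hidden[:, 0]
```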
2.3 Zero-Shot Cross-Attention Scorer
We obtain an estimate of the relevance score for a question-(retrieved) passage pair (q, z) by using a PLM. To do this in a zero-shot manner, we use a large generative PLM to compute the likelihood of the question conditioned on the passage, p(q∣z).
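Concretely, one natural way to write this relevance score (a sketch, since the paper's numbered equations were lost in extraction; the length normalization shown here is an assumption) is the teacher-forced, token-averaged log-likelihood of the question given the passage:

```latex
\log p(q \mid z; \Theta) \;=\; \frac{1}{|q|} \sum_{t=1}^{|q|} \log p\!\left(q_t \mid q_{<t}, z; \Theta\right),
```

where $q_t$ is the $t$-th question token, $|q|$ is the number of question tokens, and $\Theta$ denotes the frozen parameters of the PLM.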
We hypothesize that calculating the relevance score in this way is accurate because it requires performing deep cross-attention over all the question and passage tokens. In a large PLM, the cross-attention step is highly expressive, and in combination with teacher-forcing, it requires the model to explain every token in the question, resulting in a better estimate.
As the input passage representation, we concatenate the passage title and its text. In order to prompt the PLM for question generation, we follow Sachan et al. (2022) and append a simple natural language instruction “Please write a question based on this passage.” to the passage text.
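A minimal sketch of this zero-shot scoring step with the HuggingFace `transformers` API follows; the checkpoint id is illustrative, and the exact prompt layout beyond the quoted instruction is an assumption.

```python
# Sketch of the zero-shot relevance score log p(q | z) with a frozen T5-style PLM.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

plm_name = "bigscience/T0_3B"  # illustrative; a T5-lm-adapted checkpoint also works
tok = AutoTokenizer.from_pretrained(plm_name)
plm = AutoModelForSeq2SeqLM.from_pretrained(plm_name).eval()

@torch.no_grad()
def question_log_likelihood(title: str, passage: str, question: str) -> float:
    prompt = f"{title} {passage} Please write a question based on this passage."
    enc = tok(prompt, return_tensors="pt", truncation=True)
    labels = tok(question, return_tensors="pt").input_ids
    # Teacher forcing: the decoder is fed the gold question tokens and must
    # assign a probability to each one; we average the per-token log-probs.
    logits = plm(**enc, labels=labels).logits                 # (1, |q|, vocab)
    log_probs = torch.log_softmax(logits, dim=-1)
    token_lp = log_probs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    return token_lp.mean().item()
```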
2.4 Training Algorithm
For training the model, our only assumption is that a collection of questions Q and a collection of evidence passages D are provided as input. During training, the weights of the retriever are updated while the PLM is not finetuned, i.e., it is used in inference mode. Our training algorithm consists of five core steps. The first four steps are performed at every training iteration, while the last step is performed every few hundred iterations; a sketch of one iteration is shown after Step 5 below. Figure 1 presents an illustration of our approach.
Step 1: Top-K Passage Retrieval
Step 2: Retriever Likelihood Calculation
Step 3: PLM Relevance Score Estimation
Step 4: Loss Calculation and Optimization
Step 5: Updating Evidence Embeddings
During training, we update the parameters of both the question encoder (Φq) and the passage encoder (Φd). As a result, the pre-computed evidence embeddings, which were computed using the initial passage encoder parameters, become stale, which may affect top-K passage retrieval. To prevent staleness, we re-compute the evidence passage embeddings using the current passage encoder parameters (Φd) after every 500 training steps.
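Putting Steps 1–4 together, one training iteration could look roughly like the sketch below. The helper names, the exact placement of the temperature τ, and the KL formulation are assumptions consistent with the description above; `plm_score_fn` wraps the scorer sketched in §2.3.

```python
# A minimal sketch of one ART training iteration (Steps 1-4), assuming PyTorch.
# `passage_index` holds the pre-computed evidence embeddings refreshed in Step 5.
import torch
import torch.nn.functional as F

def art_training_step(question, encode_question, encode_passages,
                      passage_index, passages, plm_score_fn,
                      optimizer, k=32, tau=0.01):
    # tau is illustrative; the paper selects it by cross-validation (Section 3.2).
    # Step 1: retrieve the top-K candidates with exact inner-product search
    # against the pre-computed index (used only for candidate selection).
    q_emb = encode_question(question)                          # (1, d), requires grad
    with torch.no_grad():
        cand_idx = (q_emb @ passage_index.T).topk(k, dim=-1).indices[0]

    # Step 2: retriever likelihood over the candidates, recomputed with the
    # current passage encoder so that both Φq and Φd receive gradients.
    cand = [passages[i] for i in cand_idx.tolist()]
    p_emb = encode_passages(cand)                               # (k, d)
    retriever_log_probs = F.log_softmax((q_emb @ p_emb.T) / tau, dim=-1)

    # Step 3: soft labels from the frozen PLM's question-reconstruction scores.
    with torch.no_grad():
        plm_scores = torch.tensor([[plm_score_fn(p, question) for p in cand]])
        soft_labels = F.softmax(plm_scores / tau, dim=-1).to(retriever_log_probs.device)

    # Step 4: penalize divergence of the retriever likelihood from the soft labels.
    loss = F.kl_div(retriever_log_probs, soft_labels, reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```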
2.5 ART as an Autoencoder
Since the retriever takes a question q as input and the PLM scores (i.e., reconstructs) the same question when computing the relevance score, we can view our training algorithm as an autoencoder with the retrieved passage as the latent variable.
In the generative process, we start with an observed variable (the collection of evidence passages), which is the support set for our latent variable. Given an input q, we generate an index i and retrieve the passage zi. This index generation and retrieval process is modeled by our dual encoder architecture. Given zi, we decode it back into the question using our PLM.
Recall that our decoder (the PLM) is frozen and its parameters are not updated. However, the signal from the decoder output is used to train the parameters of the dual encoder such that the log-likelihood of reconstructing the question q is maximized. In practice, this trains the dual encoder to select the best passage for a given question, since the only way to maximize the objective is to choose the most relevant zi for the input q.
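One way to formalize the resulting objective (a hedged reconstruction from the description in §2.4; the softmax normalizations and the exact form of the divergence are assumptions):

```latex
q_{\mathrm{retr}}(z \mid q; \Phi) = \frac{\exp\!\left(\langle \mathbf{e}_q, \mathbf{e}_z \rangle / \tau\right)}
{\sum_{z' \in \mathcal{Z}} \exp\!\left(\langle \mathbf{e}_q, \mathbf{e}_{z'} \rangle / \tau\right)},
\qquad
p_{\mathrm{PLM}}(z \mid q; \Theta) \propto \exp\!\left(\tfrac{1}{|q|} \log p(q \mid z; \Theta)\right),
\qquad
\mathcal{L}(\Phi) = \mathrm{KL}\!\left(p_{\mathrm{PLM}} \,\|\, q_{\mathrm{retr}}\right),
```

where $\mathcal{Z}$ is the top-K retrieved set for question $q$, $\mathbf{e}_q$ and $\mathbf{e}_z$ are the [CLS] embeddings from §2.2, $\Phi = (\Phi_q, \Phi_d)$ are the retriever parameters, and only $\Phi$ receives gradients.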
3 Experimental Setup
In this section, we describe the datasets, evaluation protocol, implementation details, and baseline methods for our passage retrieval experiments.
3.1 Datasets and Evaluation
Evidence Passages
The evidence corpus is the preprocessed English Wikipedia dump from December 2018 (Karpukhin et al., 2020). Following convention, we split each article into non-overlapping segments of 100 words, resulting in over 21 million passages; a sketch of this splitting appears below. The same evidence is used for both training and evaluation.
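A small sketch of this splitting convention (the `articles` iterable of (title, text) pairs is an assumption about the preprocessed dump's format):

```python
# Sketch of the 100-word passage splitting convention described above.
def split_into_passages(articles, words_per_passage=100):
    passages = []
    for title, text in articles:
        words = text.split()
        for start in range(0, len(words), words_per_passage):
            chunk = " ".join(words[start:start + words_per_passage])
            if chunk:
                passages.append({"title": title, "text": chunk})
    return passages
```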
Question-Answering Datasets
Following previous work, we use the open-retrieval version of Natural Questions (NQ-Open; Kwiatkowski et al., 2019), TriviaQA (Joshi et al., 2017), SQuAD-1.0 (SQuAD-Open; Rajpurkar et al., 2016), WebQuestions (WebQ; Berant et al., 2013), and EntityQuestions (EQ; Sciavolino et al., 2021) datasets. Table 1 lists their training, development, and test set sizes.
Table 1: Dataset statistics. During the training process, ART only uses the questions, while evaluation is performed over the canonical development and test sets.

Dataset | Train Questions | Dev | Test
--- | --- | --- | ---
Question-Answering Datasets | | |
WebQ | 3,417 | 361 | 2,032
NQ-Open | 79,168 | 8,757 | 3,610
SQuAD-Open | 78,713 | 8,886 | 10,570
TriviaQA | 78,785 | 8,837 | 11,313
EQ | – | 22,068 | 22,075
All Questions Datasets | | |
NQ-Full | 307,373 | – | –
MS MARCO | 502,939 | – | –
All Questions Datasets
For our transfer learning experiments, we use all the questions from the Natural Questions (henceforth referred to as NQ-Full) and MS MARCO passage ranking (Bajaj et al., 2016) datasets. Table 1 lists the number of questions. The questions in NQ-Full are information-seeking, as they were asked by real users, and the dataset is four times the size of NQ-Open. NQ-Full consists of questions with only long-form answers (such as paragraphs), all the questions in NQ-Open (which have both long-form and short-form answers), questions with yes/no answers, and questions that have no annotated answer (i.e., are unanswerable). For MS MARCO, we use its provided passage collection (around 8.8 million passages in total) as the evidence corpus.
Evaluation
To evaluate retriever performance, we report the conventional top-K accuracy metric. It is the fraction of questions for which at least one passage among the top-K retrieved passages contains a span of words that matches human-annotated answer(s) to the question.
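A simplified sketch of this metric follows; the actual evaluation typically applies answer-string normalization and tokenization, so the plain substring check here is an approximation.

```python
# Sketch of the top-K retrieval accuracy metric: a question counts as a hit if any
# of its top-K retrieved passage texts contains one of the annotated answers.
def top_k_accuracy(retrieved_passages, gold_answers, k=20):
    hits = 0
    for passages, answers in zip(retrieved_passages, gold_answers):
        texts = [p.lower() for p in passages[:k]]
        if any(ans.lower() in t for ans in answers for t in texts):
            hits += 1
    return hits / len(gold_answers)
```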
3.2 Implementation Details
Model Sizes
We use BERT base configuration (Devlin et al., 2019) for the retriever, which consists of 12 layers, 12 attention heads, and 768 embedding dimensions, leading to around 220M trainable parameters. For the teacher PLM, we use two configurations: (i) T5-XL configuration (Raffel et al., 2020) consisting of 24 layers, 32 attention heads, and 2048 embedding dimensions, leading to 3B parameters, and (ii) a larger T5-XXL configuration consisting of 11B parameters.
Model Initialization
We initialize the retriever with unsupervised masked salient spans (MSS) pre-training (Sachan et al., 2021a) as it provides an improved zero-shot retrieval over BERT pre-training.3 We initialize the cross-attention (or teacher) PLM with the T5-lm-adapted (Lester et al., 2021) or instruction-tuned T0 (Sanh et al., 2022) language models, which have been shown to be effective zero-shot re-rankers for information retrieval tasks (Sachan et al., 2022).
Compute Hardware
We perform training on instances containing 8 or 16 A100 GPUs, each containing 40 GB RAM.
Passage Retrieval
To perform fast top-K passage retrieval at every training step, we pre-compute the embeddings of all the evidence passages. Computing embeddings of 21M passages takes roughly 10 minutes on 16 GPUs. The total size of these embeddings is around 30 GB (768-dimensional vectors in FP16 format). For scalable retrieval, we shard these embeddings across all the GPUs and perform exact maximum inner product search using distributed matrix multiplication.
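A rough sketch of such sharded exact search with `torch.distributed` (the all_gather pattern, shard bookkeeping, and function names are assumptions, not the released implementation):

```python
# Sketch of sharded exact maximum inner product search across GPUs. Assumptions:
# torch.distributed is initialized, every rank sees the same query batch, each rank
# holds one shard of the FP16 passage-embedding matrix, and global passage ids are
# recovered from per-shard offsets.
import torch
import torch.distributed as dist

def distributed_topk(q_emb, shard_emb, shard_offset, k=32):
    # Local exact search: one matrix multiplication against this GPU's shard.
    local_scores = q_emb.half() @ shard_emb.T                  # (B, shard_size)
    local_top, local_idx = local_scores.topk(k, dim=-1)
    local_idx = local_idx + shard_offset                       # map to global ids

    # Gather each shard's candidates and re-select the global top-K.
    world = dist.get_world_size()
    gathered_scores = [torch.empty_like(local_top) for _ in range(world)]
    gathered_idx = [torch.empty_like(local_idx) for _ in range(world)]
    dist.all_gather(gathered_scores, local_top)
    dist.all_gather(gathered_idx, local_idx)
    all_scores = torch.cat(gathered_scores, dim=-1)
    all_idx = torch.cat(gathered_idx, dim=-1)
    final_scores, pos = all_scores.topk(k, dim=-1)
    return final_scores, all_idx.gather(-1, pos)
```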
Training Details
When training with T0 (3B) PLM, for all the datasets except WebQ, we perform training for 10 epochs using Adam with a batch size of 64, 32 retrieved passages, dropout value of 0.1, and peak learning rate of 2 × 10−5 with warmup and linear scheduling. Due to the smaller size of WebQ, we train for 20 epochs with a batch size of 16. When training with the T5-lm-adapted (11B) PLM, we use a smaller batch size of 32 with 16 retrieved passages. We save the retriever checkpoint every 500 steps and perform model selection by evaluating it on the development set. We use mixed precision training to train the retriever and perform inference over the PLM using bfloat16 format. We set the value of the temperature hyperparameter (τ) using cross-validation.
3.3 Baselines
We compare ART to both unsupervised and supervised models. Unsupervised models train a single retriever using unlabeled text corpora from the Internet, while supervised models train a separate retriever for each dataset. We report performance numbers from the original papers when available; otherwise, we run their open-source implementations.
Unsupervised Models
These include the popular BM25 algorithm (Robertson and Zaragoza, 2009), which is based on a sparse bag-of-words representation of text. Dense models typically use Wikipedia paragraphs to create (pseudo-) query and context pairs to perform contrastive training of the retriever. These differ in how the negative examples are obtained during contrastive training: they can come from the same batch (ICT; Lee et al., 2019; Sachan et al., 2021a), from context passages of previous batches (Contriever; Izacard et al., 2022), or from other passages in the same article (Spider; Ram et al., 2022). Context passages can also be sampled from articles connected via hyperlinks (HLP; Zhou et al., 2022).
Supervised Models
These consist of approaches that use questions and positive passages to perform contrastive training of the retriever. To obtain improved performance, an additional set of hard-negative passages is often used (DPR; Karpukhin et al., 2020), negative contexts are mined iteratively using model weights (ANCE; Xiong et al., 2021), or the retriever is first initialized with ICT or MSS pre-training followed by DPR-style finetuning (ICT-DPR / MSS-DPR; Sachan et al., 2021a). The pre-trained retriever can be further trained with ANCE-style mining of hard-negative passages to further improve accuracy (coCondenser; Gao and Callan, 2022). Previous methods have also explored finetuning the cross-encoder PLM jointly with the retriever so that the cross-encoder provides more accurate training signals to improve retrieval accuracy. These include end-to-end training of the PLM and retriever, which infuses supervision from the annotated answers to a question (EMDR2; Sachan et al., 2021b), and a multi-stage mixed-objective distillation approach that jointly trains the re-ranker (Nogueira and Cho, 2019) and the retriever (RocketQAv2; Ren et al., 2021). A combination of adversarial and distillation-based training of the re-ranker and retriever has been shown to obtain state-of-the-art performance (AR2; Zhang et al., 2022).
4 Experiments and Results
4.1 Zero-Shot Passage Retrieval
For the passage retrieval task, we report results on SQuAD-Open, TriviaQA, NQ-Open, and WebQ and train ART under two settings. In the first setting, we train a separate retriever for each dataset using questions from its training set. In the second setting, to examine the robustness of ART training to different question types, we train a single retriever by combining the questions from all four datasets, which we refer to as ART-Multi. For both settings, we train ART using T5-lm-adapted (11B) and T0 (3B) cross-attention PLM scorers. As our training process does not require annotated passages for a question, we refer to this as zero-shot passage retrieval.
Table 2 presents the top-20 and top-100 retrieval accuracy in these settings alongside recent baselines that train a similarly sized retriever (110M). All the variants of ART achieve substantially better performance than previous unsupervised approaches. For example, ART trained with T0 (3B) outperforms the recent Spider and Contriever models by an average of 9 points on top-20 and 6 points on top-100 accuracy. When comparing with supervised models, despite using just questions, ART outperforms strong baselines like DPR and ANCE and is on par or slightly better than pre-trained retrievers like MSS-DPR. In addition, ART-Multi obtains comparable performance to its single dataset version, a considerable advantage in practical applications as a single retriever can be deployed rather than training a custom retriever for each use case.
Table 2: Top-20 and top-100 retrieval accuracy on the test sets of the datasets. For more details regarding the unsupervised and supervised models, please see §3.3. Best supervised results are highlighted in bold, while the best results from our proposed model (ART) are underlined. ART substantially outperforms previous unsupervised models and comes close to or matches the performance of supervised models by using only questions during training. * indicates that the cross-attention PLM is finetuned. † denotes that the ‘cpt-text S’ model (Neelakantan et al., 2022) contains around 300M parameters. ‡ denotes that DPR-Multi was not trained on SQuAD-Open. ◇ indicates that the results on SQuAD-Open and WebQ are obtained by finetuning the open-source MSS checkpoint. ∘ indicates that EMDR2 results are obtained using their open-source checkpoints.

Retriever | Cross-Attention Language Model | SQuAD-Open Top-20 | SQuAD-Open Top-100 | TriviaQA Top-20 | TriviaQA Top-100 | NQ-Open Top-20 | NQ-Open Top-100 | WebQ Top-20 | WebQ Top-100
--- | --- | --- | --- | --- | --- | --- | --- | --- | ---
Unsupervised Approaches (trained using Wikipedia / Internet data) | | | | | | | | |
BERT | – | 5.2 | 13.5 | 7.2 | 17.8 | 9.4 | 20.3 | 3.7 | 12.8
ICT | – | 45.1 | 65.2 | 57.5 | 73.6 | 50.6 | 66.8 | 43.4 | 65.7
MSS | T5* (220M) | 51.3 | 68.4 | 68.2 | 79.4 | 59.8 | 74.9 | 49.2 | 68.4
BM25 | – | 71.1 | 81.8 | 76.4 | 83.2 | 62.9 | 78.3 | 62.4 | 75.5
Contriever | – | 63.4 | 78.2 | 74.2 | 83.2 | 67.8 | 82.1 | 74.9 | 80.1
Spider | – | 61.0 | 76.0 | 75.8 | 83.5 | 68.3 | 81.2 | 65.9 | 79.7
cpt-text S† | – | – | – | 75.1 | 81.7 | 65.5 | 77.2 | – | –
HLP | – | – | – | 76.9 | 84.0 | 70.2 | 82.0 | 66.9 | 80.8
Supervised Approaches (trained using question-passage aligned data) | | | | | | | | |
DPR | – | 63.2 | 77.2 | 79.4 | 85.0 | 78.4 | 85.4 | 73.2 | 81.4
DPR-Multi‡ | – | 51.6 | 67.6 | 78.8 | 84.7 | 79.4 | 86.0 | 75.0 | 82.9
ANCE | – | – | – | 80.3 | 85.3 | 81.9 | 87.5 | – | –
ICT-DPR | – | – | – | 81.7 | 86.3 | 81.8 | 88.0 | 72.5 | 82.3
MSS-DPR◇ | – | 73.1 | 84.5 | 81.8 | 86.6 | 82.1 | 87.8 | 76.9 | 84.6
coCondenser | – | – | – | 83.2 | 87.3 | 84.3 | 89.0 | – | –
RocketQAv2 | ERNIE* (110M) | – | – | – | – | 83.7 | 89.0 | – | –
EMDR2∘ | T5* (220M) | – | – | 83.4 | 87.3 | 85.3 | 89.7 | 79.1 | 85.2
AR2 | ERNIE* (330M) | – | – | 84.4 | 87.9 | 86.0 | 90.1 | – | –
Our Approach (trained using questions and Wikipedia text) | | | | | | | | |
ART | T5-lm-adapt (11B) | 74.2 | 84.3 | 82.5 | 86.6 | 80.2 | 88.4 | 74.4 | 82.7
ART-Multi | T5-lm-adapt (11B) | 72.8 | 83.2 | 82.2 | 86.6 | 81.5 | 88.5 | 74.8 | 83.7
ART | T0 (3B) | 75.3 | 85.0 | 82.9 | 87.1 | 81.6 | 89.0 | 75.7 | 84.3
ART-Multi | T0 (3B) | 74.7 | 84.5 | 82.9 | 87.0 | 82.0 | 88.9 | 76.6 | 85.0
ART also comes close to state-of-the-art supervised models such as AR2 and EMDR2, especially in top-100 accuracy, though it lags behind in top-20 accuracy. In addition to obtaining strong performance without requiring aligned passages for training, ART's training process is much simpler than AR2's: it does not require cross-encoder finetuning and is thus faster to train. As generative language models continue to become more accurate (Chowdhery et al., 2022), we hypothesize that the performance gap between state-of-the-art supervised models and ART will narrow further.
Our results showcase that both the PLM scorers, T5-lm-adapt (11B) and T0 (3B), achieve strong results on the QA retrieval tasks, with T0 achieving higher performance gains. This illustrates that the relevance score estimates of candidate passages obtained in the zero-shot cross-attention step are accurate enough to provide strong supervision for retriever training. We believe that this is a direct consequence of the knowledge stored in the PLM weights. While T5-lm-adapt’s knowledge is obtained by training on unsupervised text corpora, T0 was further finetuned using instruction-prompted datasets of tasks such as summarization, QA, text classification, etc.4 Hence, in addition to learning from instructions, the performance gains from T0 can be attributed to the knowledge infused in its weights by (indirect) supervision from these manually curated datasets. Instruction-based finetuning is helpful in the case of smaller datasets like WebQ and especially in improving the performance on lower values of top-K accuracy (such as top-20).5
Overall, our results suggest that accurate and robust passage retrieval can be achieved by training with questions alone. This is a considerably more favorable setting than current approaches, which require obtaining positive and hard-negative passages for such questions. Due to its better performance, we use the T0 (3B) PLM for subsequent experiments unless stated otherwise.
4.2 Sample Efficiency
To measure the sample efficiency of ART, we train the model on randomly selected subsets of NQ-Open training questions of varying size and compute top-K accuracy on its development set. These results are presented in Figure 2, along with the results of BM25 and DPR for comparison. Performance increases with the number of questions up to about 10k, after which the gains become less pronounced.
Figure 2: Top-K accuracy as the number of training questions (denoted as ‘Q’ in the legend) is varied. When trained with 100 questions, ART outperforms BM25, and when trained with 1k questions, it matches DPR's performance for top-K > 50 passages, illustrating that ART is highly sample efficient.
When trained with just 100 questions, ART significantly outperforms BM25, and when trained with 1k questions, it matches DPR performance levels for top-{50, …, 100} accuracy. This demonstrates that ART, in addition to using only questions, is also much more data-efficient than DPR, requiring almost ten times fewer questions to reach similar performance.
4.3 Zero-Shot Out-of-Distribution Transfer
In the previous experiments, both the training and test sets contained questions that were sampled from the same underlying distribution, a setting that we refer to as in-distribution training. However, obtaining in-domain questions for training is not always feasible in practice. Instead, a model trained on an existing collection of questions must be evaluated on new datasets, a setting that we refer to as out-of-distribution (OOD) transfer.
We train ART using NQ-Open and NQ-Full questions and then evaluate its performance on the SQuAD-Open, TriviaQA, WebQ, and EQ datasets. It is desirable to train on answerable questions such as the ones included in NQ-Open, but this is not always possible, as real user questions are often imprecisely worded or ambiguous. Due to this, training on NQ-Full can be considered a practical testbed for evaluating true OOD generalization, as a majority of the questions (51%) were marked as unanswerable from Wikipedia by human annotators.6
Table 3 presents OOD generalization results on the four QA datasets, including the results of DPR and Spider models trained on NQ-Open.7 ART trained on NQ-Open always performs significantly better than both DPR and Spider, showing that it generalizes better than supervised models. When trained on NQ-Full, ART's performance further improves over NQ-Open by 3 points on EQ and 0.5–1 points on the other datasets. This highlights that, in addition to questions annotated with short answers, questions annotated with long answers also provide meaningful supervisory signals, and unanswerable questions do not necessarily degrade performance.
Table 3: Top-20 and top-100 retrieval accuracy when evaluating zero-shot out-of-distribution (OOD) generalization of models on the test sets of the datasets. † denotes that these results are from Ram et al. (2022). ART generalizes better than supervised models on OOD evaluation even when trained on all the questions of the Natural Questions dataset, which contains a mix of answerable and unanswerable questions.

Retriever | Training Dataset | SQuAD-Open Top-20 | SQuAD-Open Top-100 | TriviaQA Top-20 | TriviaQA Top-100 | WebQ Top-20 | WebQ Top-100 | EQ Top-20 | EQ Top-100
--- | --- | --- | --- | --- | --- | --- | --- | --- | ---
Training on answerable questions | | | | | | | | |
BM25 | – | 71.1 | 81.8 | 76.4 | 83.2 | 62.4 | 75.5 | 71.2 | 79.8
DPR† | NQ-Open | 48.9 | 65.2 | 69.0 | 78.7 | 68.8 | 78.3 | 49.7 | 63.2
EMDR2 | NQ-Open | 66.8 | 79.0 | 79.7 | 85.3 | 74.2 | 83.2 | 62.7 | 75.1
Spider† | NQ-Open | 57.7 | 72.8 | 77.2 | 83.7 | 74.2 | 82.5 | 61.9 | 74.1
ART | NQ-Open | 68.0 | 80.2 | 79.8 | 85.1 | 73.4 | 83.1 | 64.3 | 75.5
ART | MS MARCO | 68.4 | 80.4 | 78.0 | 84.1 | 74.8 | 83.2 | 75.3 | 81.9
Training on a mix of answerable and unanswerable questions | | | | | | | | |
ART | NQ-Full | 69.4 | 81.1 | 80.3 | 85.7 | 74.3 | 83.9 | 67.8 | 78.3
ART | MS MARCO + NQ-Full | 69.6 | 81.1 | 80.7 | 85.7 | 75.3 | 84.5 | 69.2 | 79.1
We also train ART using MS MARCO questions and perform OOD evaluation. Due to the larger size of MS MARCO and its smaller number of evidence passages, we use a batch size of 512 and retrieve 8 passages for training. Quite surprisingly, this model obtains much better performance than previous approaches, including BM25, on EQ (a gain of more than 10 points in top-20 accuracy over ART trained on NQ-Open). We suspect that this may be due to the similar nature of questions in MS MARCO and EQ. Further finetuning the pre-trained MS MARCO model on NQ-Full significantly improves performance on WebQ.
4.4 Scaling Model Size
We examine if scaling up the retriever parameters can offer further performance improvements. To this end, we train a retriever of BERT-large configuration (24 layers, 16 attention heads, 1024 embedding dimensions) containing around 650M parameters on NQ-Open and TriviaQA. Results are presented in Table 4 for both the development and test sets. We also include the results of other relevant baselines containing a similar number of trainable parameters.
Table 4: Top-20 and top-100 accuracy when training a large-configuration retriever, which contains around 650M parameters. EMDR2 (base configuration) (Sachan et al., 2021b) contains 440M parameters. Best supervised results are underlined, while the best unsupervised results are highlighted in bold.

Split | Retriever | NQ-Open Top-20 | NQ-Open Top-100 | TriviaQA Top-20 | TriviaQA Top-100
--- | --- | --- | --- | --- | ---
Dev | ICT | 44.2 | 61.0 | 58.8 | 74.4
Dev | DPR | 79.1 | 85.5 | 81.1 | 85.9
Dev | ICT-DPR | 81.4 | 87.4 | 82.8 | 86.9
Dev | EMDR2 | 83.1 | 88.0 | 83.7 | 87.4
Dev | ART-base | 80.6 | 87.4 | 83.6 | 87.4
Dev | ART-large | 81.0 | 87.8 | 83.7 | 87.5
Test | ICT | 49.3 | 66.1 | 58.5 | 74.1
Test | DPR | 81.0 | 87.2 | 81.4 | 86.0
Test | ICT-DPR | 82.6 | 88.3 | 82.9 | 87.1
Test | EMDR2 | 85.3 | 89.7 | 83.4 | 87.3
Test | ART-base | 81.6 | 89.0 | 82.9 | 87.1
Test | ART-large | 82.1 | 88.8 | 83.6 | 87.6
By scaling up the retriever size, we see small but consistent improvements in retrieval accuracy across both datasets. On TriviaQA in particular, ART matches or exceeds the performance of previous best models. On NQ-Open, it comes close to the performance of EMDR2 (Sachan et al., 2021b), a supervised model trained using thousands of question-answer pairs.
We also attempted to use larger teacher PLMs such as T0 (11B). However, our initial experiments did not lead to further improvements over the T0 (3B) PLM. We conjecture that this might be specific to these QA datasets, or that the capacity of the teacher PLM needs to be increased even more to observe improvements. We leave an in-depth analysis of larger teacher PLMs for future work.
4.5 Analysis
Sensitivity to Retriever Initialization
To examine how the convergence of ART training is affected by the initial retriever parameters, we initialize the retriever with (1) BERT weights, (2) ICT weights (as trained in Sachan et al., 2021a), and (3) MSS weights, and train using NQ-Open questions. Figure 3 displays the top-20 performance on the NQ development set as the training progresses. It reveals that ART training is not sensitive to the initial retriever parameters as all three initialization schemes converge to similar results. However, the convergence properties might be different under low-resource settings, an exploration of which we leave for future work.
Figure 3: Effect of retriever initialization on ART training. The plot reveals that the training process is not sensitive to initial retriever parameters.
Effect of the Number of Retrieved Passages
Table 5 quantifies the effect of the number of retrieved passages used during training on performance. A smaller number of retrieved passages, such as 2 or 4, leads to somewhat better top-{1,5} accuracy at the expense of a drop in top-{20,100} accuracy. Retrieving 32 passages offers a reasonable middle ground; beyond that, top-K retrieval performance tends to drop.
Table 5: Effect of using a different number of retrieved passages during ART training, evaluated on the NQ-Open development set. For each case, we list the absolute gain or loss in top-K accuracy when compared to the setting using 32 retrieved passages.

Retrieved Passages | Top-1 | Top-5 | Top-20 | Top-100
--- | --- | --- | --- | ---
32 | 36.7 | 65.8 | 80.6 | 87.4
2 | +2.4 | +0.4 | −0.9 | −0.6
4 | +1.9 | +0.9 | −0.6 | −0.6
8 | +0.8 | +0.8 | −0.5 | −0.1
16 | +0.9 | +0.9 | −0.3 | −0.1
64 | −0.5 | −0.7 | −0.3 | 0
128 | −2.3 | −1.7 | −0.8 | −0.2
A Closer Inspection of ART with Supervised Models
To better understand the tradeoff between supervised models and ART, we examine their top-1 and top-5 accuracy in addition to the commonly reported top-20 and top-100 scores. Table 6 presents these results for ART (large) along with the supervised models DPR (large) and EMDR2. Supervised models achieve much better performance for top-K ∈ {1,…,5} passages, i.e., they are more precise. This is likely because DPR is trained with hard-negative passages and EMDR2 finetunes the PLM using answers, resulting in more accurate relevance feedback to the retriever. When considering top-K ∈ {20,…,100} passages, ART comes close to or matches the performance of EMDR2. As top-performing models for knowledge-intensive tasks such as open-domain QA rely on a larger set of retrieved passages, such as top-K = 100 (Sachan et al., 2022), this supports adopting zero-shot ART over supervised retrievers.
Table 6: Analysis reveals that ART (large) can even match the performance of end-to-end trained models like EMDR2 when retrieving a larger number of passages. However, DPR (large) and EMDR2 still outperform ART when retrieving a small number of passages such as top-K ∈ {1,…,5} (highlighted in bold).

Retriever | Top-1 | Top-5 | Top-20 | Top-100
--- | --- | --- | --- | ---
NQ-Open (dev) | | | |
DPR | 50.1 | 69.6 | 79.1 | 85.5
EMDR2 | 55.3 | 74.9 | 83.1 | 88.0
ART | 37.6 | 66.8 | 81.0 | 87.8
TriviaQA (dev) | | | |
DPR | 59.6 | 74.4 | 81.1 | 85.9
EMDR2 | 63.7 | 78.0 | 83.7 | 87.4
ART | 58.3 | 77.5 | 83.7 | 87.5
Why Training using Passage Retrieval?
To assess the importance of the retrieved passage set used during training, we train the retriever under different settings by varying the types of passages in this set. Specifically, we train with a mix of positive, hard-negative, and uniformly sampled passages. We also perform in-batch training by defining the set to be the union of positive and hard-negative passages for all the questions in a batch. Results in Table 7 illustrate that when the set consists only of uniformly sampled passages, performance is poor. Including a (gold) positive passage leads to a large improvement, and results improve further with the inclusion of a hard-negative passage. However, in-batch training leads to a slight drop in performance. As gold passages are not always available, our method of selecting the top passages from the evidence at every training step can be seen as an approximation to using gold passages. With this, ART obtains even better results than the previous settings, an improvement of 4 points absolute in top-20 accuracy.
Table 7: Effect of passage types on ART training when evaluated on the NQ-Open development set. P denotes a positive passage, N denotes a hard-negative passage (mined using BM25), U denotes that the passages are randomly sampled from the evidence, and IB denotes in-batch training.

P | N | U | IB | Top-1 | Top-5 | Top-20 | Top-100
--- | --- | --- | --- | --- | --- | --- | ---
0 | 0 | 32 | ✗ | 6.0 | 16.6 | 30.8 | 46.7
1 | 0 | 31 | ✗ | 31.8 | 58.9 | 74.8 | 84.4
1 | 1 | 30 | ✗ | 33.7 | 61.0 | 76.0 | 85.5
1 | 1 | 0 | ✓ | 32.6 | 59.5 | 75.1 | 84.9
Top-32 passages | | | | 36.7 | 65.8 | 80.6 | 87.4
Impact of Language Model Training Strategy
We examine which PLMs can provide good cross-attention scores during training. We compare PLMs trained with three different objectives: (i) generative denoising of masked spans (T5 series; Raffel et al., 2020), (ii) further pre-training with an autoregressive language modeling objective (T5-lm-adapt series; Lester et al., 2021), and (iii) finetuning T5-lm-adapt models on unrelated tasks using instructions (T0 series; Sanh et al., 2022). Our results in Table 8 highlight that PLM training methodology and model size can have a large effect on retrieval performance. The T5-base model leads to low scores, possibly because pre-training by predicting masked spans is not ideal for question reconstruction; however, accuracy improves as model size increases. T5-lm-adapt models are more stable and lead to improved performance, with the best result achieved by the 11B model. Instruction-finetuned T0 models outperform the T5-lm-adapt models. However, scaling T0 up to 11B parameters does not result in meaningful improvements.
Table 8: Comparison of different PLMs when used as cross-attention scorers during training (§2.3). All results are top-K accuracy on the NQ-Open development set. The T0 (3B) PLM achieves the highest accuracy among the compared PLMs, showcasing that training language models with instruction-tuning provides accurate relevance scores.

Language Model (Θ) | Top-1 | Top-5 | Top-20 | Top-100
--- | --- | --- | --- | ---
Models trained using Denoising Masked Spans | | | |
T5-base (250M) | 12.8 | 30.9 | 47.8 | 63.0
T5-xl (3B) | 25.0 | 53.9 | 74.4 | 85.3
T5-xxl (11B) | 29.5 | 59.8 | 77.8 | 86.3
Models trained using Language Modeling Objective | | | |
T5-lm-adapt (250M) | 29.4 | 56.6 | 74.4 | 84.7
T5-lm-adapt (800M) | 30.9 | 59.1 | 76.5 | 85.9
T5-lm-adapt (3B) | 31.8 | 61.0 | 77.9 | 86.5
T5-lm-adapt (11B) | 32.7 | 62.6 | 78.6 | 87.0
Models trained using Natural Language Instructions | | | |
T0 (3B) | 36.7 | 65.8 | 80.6 | 87.4
T0 (11B) | 34.3 | 64.5 | 79.8 | 87.2
Ad-Hoc Retrieval Tasks
While the previous experiments were conducted on QA datasets, here we examine the robustness of the question-trained ART model on different ad-hoc retrieval tasks. For this analysis, we evaluate ART on the BEIR benchmark (Thakur et al., 2021), a heterogeneous collection of retrieval datasets, each consisting of test-set queries, evidence documents, and gold document annotations. BEIR spans multiple domains and diverse retrieval tasks, presenting a strong challenge suite, especially for dense retrievers. We train ART using MS MARCO questions and report its nDCG@10 and Recall@100 scores on each dataset. For comparison, we include the results of three baselines: BM25, Contriever, and DPR trained on NQ-Open. Our results in Table 9 show strong generalization: ART outperforms both DPR and Contriever. ART is also on par with the strong BM25 baseline overall, outperforming it on 8 of the 15 datasets according to nDCG@10.
Table 9: Results on the BEIR benchmark. #Q and #E denote the size of the test set and the evidence, respectively. Best scores for each dataset are highlighted in bold. ART is trained using MS MARCO questions. DPR is trained using NQ-Open. † denotes that these results are from Thakur et al. (2021).

Dataset | #Q | #E | nDCG@10 (DPR†) | nDCG@10 (BM25†) | nDCG@10 (Contriever) | nDCG@10 (ART) | Recall@100 (DPR†) | Recall@100 (BM25†) | Recall@100 (Contriever) | Recall@100 (ART)
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
Scifact | 300 | 5K | 31.8 | 66.5 | 64.9 | 55.2 | 72.7 | 90.8 | 92.6 | 88.0
Scidocs | 1000 | 25K | 7.7 | 15.8 | 14.9 | 14.4 | 21.9 | 35.6 | 36.0 | 32.4
Nfcorpus | 323 | 3.5K | 18.9 | 32.5 | 31.7 | 29.9 | 20.8 | 25.0 | 29.0 | 26.6
FIQA-2018 | 648 | 57K | 11.2 | 23.6 | 24.5 | 26.5 | 34.2 | 53.9 | 56.2 | 55.4
Trec-covid | 50 | 0.2M | 33.2 | 65.5 | 27.4 | 50.3 | 21.2 | 49.8 | 17.2 | 36.9
Touche-2020 | 49 | 0.4M | 13.1 | 36.8 | 19.3 | 16.2 | 30.1 | 53.8 | 22.5 | 44.7
NQ | 3452 | 2.7M | 47.4 | 32.9 | 25.4 | 40.5 | 88.0 | 76.0 | 77.1 | 88.7
MS-Marco | 6980 | 8.8M | 17.7 | 22.8 | 20.6 | 32.6 | 55.2 | 65.8 | 67.2 | 81.7
HotpotQA | 7405 | 5.2M | 39.1 | 60.3 | 48.1 | 61.0 | 59.1 | 74.0 | 70.4 | 73.9
ArguAna | 1406 | 8.7K | 17.5 | 31.5 | 37.9 | 32.2 | 75.1 | 94.2 | 90.1 | 95.3
CQADupStack | 13145 | 0.5M | 15.3 | 29.9 | 28.4 | 33.5 | 40.3 | 60.6 | 61.4 | 62.6
Quora | 10000 | 0.5M | 24.8 | 78.9 | 83.5 | 84.2 | 47.0 | 97.3 | 98.7 | 98.8
DBpedia | 400 | 4.6M | 26.3 | 31.3 | 29.2 | 36.3 | 34.9 | 39.8 | 45.3 | 47.2
Fever | 6666 | 5.4M | 56.2 | 75.3 | 68.2 | 72.4 | 84.0 | 93.1 | 93.6 | 93.1
Climate-Fever | 1535 | 5.4M | 14.8 | 21.3 | 15.5 | 21.4 | 39.0 | 43.6 | 44.1 | 47.1
Average Score | | | 25.0 | 41.6 | 36.0 | 40.4 | 48.2 | 63.6 | 60.1 | 64.8
5 Related Work
Our work is based on training a dense retriever using PLMs, which we have covered in previous sections. Here, we instead focus on other related approaches.
A popular method to train the dual-encoder retriever is to optimize a contrastive loss using in-batch negatives (Gillick et al., 2019) and hard-negatives (Karpukhin et al., 2020; Xiong et al., 2021). Alternatives to hard-negatives, such as sampling from cached evidence embeddings, have also been shown to work well in practice (Lindgren et al., 2021). Multi-vector encoders for questions and passages are more accurate than dual-encoders (Luan et al., 2021; Khattab and Zaharia, 2020; Humeau et al., 2020), although at the cost of increased latency and storage requirements.
PLMs have been shown to improve passage ranking as they can perform cross-attention between the question and the retrieved passages (Lin et al., 2021). Supervised re-ranking approaches either finetune PLMs using question-passage pairs (Nogueira et al., 2020) or finetune PLMs to generate the question conditioned on the passage (Nogueira dos Santos et al., 2020), while unsupervised re-rankers are based on zero-shot question scoring (Sachan et al., 2022). The re-ranking process is slow due to the cross-attention step and is bottlenecked by the accuracy of first-stage retrievers. To address these limitations, approaches that distill cross-attention signals from the PLM into the retriever have been proposed (Qu et al., 2021). Such distillation can be performed either in a single end-to-end training step (Guu et al., 2020; Sachan et al., 2021b) or in a multi-stage process (Khattab et al., 2021; Izacard and Grave, 2021).
An alternative approach to using PLMs is to generate data that can aid retrieval. The data can be either the title or an answer that provides more information about the question (Mao et al., 2021). Generating new questions to augment the training data has also been shown to improve performance (Ma et al., 2021; Bonifacio et al., 2022; Dai et al., 2022). In comparison, we do not generate new questions but train the retriever using existing questions and PLM feedback. Data augmentation is likely complementary, and can further improve accuracy.
6 Conclusions and Future Work
We introduced ART, a novel approach to training a dense passage retriever using only questions. ART does not require question-passage pairs or hard-negative examples for training and yet achieves state-of-the-art results. The key to making ART work is to optimize the retriever to select passages such that the question generation likelihood, computed by a large pre-trained language model conditioned on those passages, iteratively improves. Despite requiring much less supervision, ART substantially outperforms DPR when evaluated on multiple QA datasets and also generalizes better on out-of-distribution questions.
ART presents several directions for future work. It would be interesting to apply this approach in low-resource retrieval including multi-lingual (Clark et al., 2020) and cross-lingual question answering (Asai et al., 2021). Our training framework can also be extended to train cross-modality retrievers such as for image or code search (Li et al., 2022; Neelakantan et al., 2022) using textual queries. Finally, other directions worth exploring would be to make use of labeled data when available such as by finetuning PLM on passage-question aligned data and to train multi-vector retrievers (Luan et al., 2021) with ART.
Acknowledgments
We are grateful to the TACL action editor and the three anonymous reviewers for providing us valuable feedback and useful suggestions that helped to improve this work. We would also like to thank Elena Gribovskaya from DeepMind for providing us valuable comments to improve the paper.
Notes
1. Our code and model checkpoints are available at: https://github.com/DevSinghSachan/art.
2. As the selection operation requires performing inner products with millions of passage embeddings, it can be efficiently performed on accelerators such as GPUs.
3. We use the open-source MSS retriever checkpoint from https://github.com/DevSinghSachan/emdr2.
4. However, we note that T0 was not finetuned on the question generation task and was not trained on any of the datasets used in this work. We refer the reader to the original paper for more training details.
5. §4.5 includes more detailed comparisons of different PLMs as cross-encoders.
6. The reasons for question unanswerability can be partly attributed to imprecise Wikipedia article retrieval during the annotation process, ambiguity in information-seeking questions, information required to answer not being localized to a single paragraph, etc. (Kwiatkowski et al., 2019).
7. We also include BM25 results for reference but do not directly compare with them because there is high lexical overlap between question and passage tokens in the SQuAD-Open and EQ datasets, which puts dense retrievers at a disadvantage relative to BM25, especially in the transfer setting.