Questions Are All You Need to Train a Dense Passage Retriever

We introduce ART, a new corpus-level autoencoding approach for training dense retrieval models that does not require any labeled training data. Dense retrieval is a central challenge for open-domain tasks, such as Open QA, where state-of-the-art methods typically require large supervised datasets with custom hard-negative mining and denoising of positive examples. ART, in contrast, only requires access to unpaired inputs and outputs (e.g., questions and potential answer passages). It uses a new passage-retrieval autoencoding scheme, where (1) an input question is used to retrieve a set of evidence passages, and (2) the passages are then used to compute the probability of reconstructing the original question. Training for retrieval based on question reconstruction enables effective unsupervised learning of both passage and question encoders, which can be later incorporated into complete Open QA systems without any further finetuning. Extensive experiments demonstrate that ART obtains state-of-the-art results on multiple QA retrieval benchmarks with only generic initialization from a pre-trained language model, removing the need for labeled data and task-specific losses. Our code and model checkpoints are available at: https://github.com/DevSinghSachan/art.


Introduction
Dense passage retrieval methods (Karpukhin et al., 2020; Xiong et al., 2021), initialized with encoders such as BERT (Devlin et al., 2019) and trained using supervised contrastive losses (Oord et al., 2018), have surpassed the performance of previously popular keyword-based approaches like BM25 (Robertson and Zaragoza, 2009). Such retrievers are core components in models for open-domain tasks, such as Open QA, where state-of-the-art methods typically require large supervised datasets with custom hard-negative mining and denoising of positive examples. In this paper, we introduce the first unsupervised method, based on a new corpus-level autoencoding approach, that can match or surpass strong supervised performance levels with no labeled training data or task-specific losses.
We propose ART: Autoencoding-based Retriever Training, which only assumes access to sets of unpaired questions and passages. Given an input question, ART first retrieves a small set of possible evidence passages. It then reconstructs the original question by attending to these passages (see Figure 1 for an overview). The key idea in ART is to consider the retrieved passages as a noisy representation of the original question, and the question reconstruction probability as a denoising signal that provides soft-labels for how likely each passage is to have been the correct result.
To bootstrap the training of a strong model, it is important both to have a strong initial retrieval model and to be able to compute reliable initial estimates of the question reconstruction probability when conditioned on a (retrieved) passage. Although passage representations from BERT-style models are known to be reasonable retrieval baselines, it is less clear how to do zero-shot question generation. We use a generative pre-trained language model (PLM) and prompt it with the passage as input to generate the question tokens using teacher forcing. As only the retrieval model is finetuned and the question-generation PLM is not, ART can use large PLMs and obtain accurate soft-label estimates of which passages are likely to be of the highest quality.
The retriever is trained to penalize the divergence of a passage likelihood from its soft-label score.
For example, if the question is "Where is the bowling hall of fame located?" as shown in Figure 1, then the training process will boost the retrieval likelihood of the passage "Bowling Hall of Fame is located in Arlington," as it is relevant and would lead to a higher question reconstruction likelihood, while the likelihood of the passage "Hall of Fame is a song by ..." would be penalized as it is irrelevant. In this manner, the training process rewards correct retrieval results and penalizes incorrect ones, leading to an iterative improvement in passage retrieval.
Comprehensive experiments on five benchmark QA datasets demonstrate the usefulness of our proposed training approach. By simply using questions from the training set, ART outperforms models like DPR by an average of 5 points absolute in top-20 and 4 points absolute in top-100 accuracy. We also train using all the questions contained in the Natural Questions (NQ) dataset (Kwiatkowski et al., 2019) and find that even with a mix of answerable and unanswerable questions, ART achieves strong generalization on out-of-distribution datasets due to its reliance on the PLM. Our analysis further reveals that ART is highly sample efficient, outperforming BM25 and DPR with just 100 and 1000 questions, respectively, on the NQ-Open dataset, and that scaling up to larger retriever models consistently improves performance.

Problem Definition
We focus on open-domain retrieval, where given a question q, the task is to select a small set of matching passages (i.e., 20 or 100) from a large collection of evidence passages D = {d_1, ..., d_m}. Our goal is to train a retriever in a zero-shot manner, i.e., without using question-passage pairs, such that it retrieves relevant passages to answer the question. Our proposed approach consists of two core modeling components (§2.2, §2.3) and a novel training method (§2.4).

Dual Encoder Retriever
For the retriever, we use the dual-encoder model (Bromley et al., 1994), which consists of two encoders, where
• one encoder computes the question embedding f_q(q; Φ_q) : X → R^d, and
• the other encoder computes the passage embedding f_d(d; Φ_d) : X → R^d.
Here, X = V^n denotes the universal set of text sequences, V denotes the vocabulary consisting of discrete tokens, and R^d denotes the (latent) embedding space. We assume that both the question and passage embeddings lie in the same latent space. The retrieval score for a question-passage pair (q, d) is then defined as the inner product between their respective embeddings,

score(q, d; Φ) = f_q(q; Φ_q)^T f_d(d; Φ_d),    (1)

where Φ = [Φ_q, Φ_d] denotes the retriever parameters. We select the top-K passages with maximum inner product scores and denote them as Z = {z_1, ..., z_K}. We use the transformer network (Vaswani et al., 2017) with BERT tokenization (Devlin et al., 2019) to model both encoders. To obtain the question or passage embedding, we do a forward pass through the transformer and select the last-layer hidden state corresponding to the [CLS] token. As the input passage representation, we use both the passage title and text, separated by the [SEP] token.
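As an illustrative sketch (not the authors' implementation), the inner-product scoring and top-K selection of Eq. 1 could look as follows, with random vectors standing in for the BERT [CLS] embeddings produced by the two encoders:

```python
import numpy as np

def retrieval_scores(question_emb: np.ndarray, passage_embs: np.ndarray) -> np.ndarray:
    # score(q, d; Φ) = f_q(q)^T f_d(d), computed for all m passages at once
    return passage_embs @ question_emb

def top_k_passages(question_emb: np.ndarray, passage_embs: np.ndarray, k: int):
    scores = retrieval_scores(question_emb, passage_embs)
    order = np.argsort(-scores)[:k]  # indices of the k largest inner products
    return order, scores[order]

# Toy stand-ins for encoder outputs (dimensions are illustrative)
rng = np.random.default_rng(0)
q_emb = rng.normal(size=768)            # f_q(q; Φ_q): question [CLS] embedding
d_embs = rng.normal(size=(1000, 768))   # f_d(d_i; Φ_d): passage [CLS] embeddings
idx, scores = top_k_passages(q_emb, d_embs, k=20)
```

In a real system the embeddings would come from the two BERT-initialized transformers, but the scoring itself is exactly this matrix-vector product.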

Zero-Shot Cross-Attention Scorer
We obtain an estimate of the relevance score for a question-(retrieved) passage pair (q, z) by using a pre-trained language model (PLM). In order to do this in a zero-shot manner, we use a large generative PLM to compute the likelihood score of a passage conditioned on the question, p(z | q).
The quantity p(z | q) can be better approximated by the autoregressive generation of question tokens conditioned on the passage with teacher forcing (Sachan et al., 2022). More formally, this can be written as

log p(z | q) = log p(q | z) + log p(z) + c'    (2a)
             = (1/|q|) Σ_{t=1}^{|q|} log p(q_t | q_{<t}, z; Θ) + c,    (2b)

where Θ denotes the parameters of the PLM, c and c' are constants independent of the passage z, and |q| denotes the number of question tokens. Here, Eq. 2a follows from a simple application of Bayes' rule to p(z | q), and Eq. 2b from assuming that the passage prior p(z) is uniform for all z ∈ Z. We hypothesize that calculating the relevance score using Eq. 2b is accurate because it requires performing deep cross-attention involving all the question and passage tokens. In a large PLM, the cross-attention step is highly expressive and, in combination with teacher forcing, requires the model to explain every token in the question, resulting in a better estimate.
As the input passage representation, we concatenate the passage title and its text. In order to prompt the PLM for question generation, we follow Sachan et al. (2022) and append a simple natural language instruction, "Please write a question based on this passage.", to the passage text.
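To make the scoring concrete, here is a minimal sketch of the length-normalized, teacher-forced score of Eq. 2b. The per-token log-probabilities are mocked with made-up values; a real implementation would obtain them from the PLM (e.g., T0) under teacher forcing:

```python
def build_prompt(title: str, text: str) -> str:
    # Passage representation: title and text concatenated, followed by the
    # question-generation instruction from Sachan et al. (2022).
    return f"{title} {text} Please write a question based on this passage."

def relevance_score(token_logprobs: list) -> float:
    # Eq. 2b up to a constant: the length-normalized log-likelihood of the
    # question tokens conditioned on the passage.
    return sum(token_logprobs) / len(token_logprobs)

# Mock per-token log-probs a PLM would return under teacher forcing; a
# relevant passage should make the question tokens more predictable.
relevant_passage_lps = [-0.5, -0.8, -0.3, -0.6]
irrelevant_passage_lps = [-3.1, -2.7, -3.5, -2.9]
```

Length normalization (dividing by |q|) keeps scores comparable across questions of different lengths.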

Training Algorithm
For training the model, our only assumption is that a collection of questions (T) and evidence passages (D) are provided as input. During training, the weights of the retriever are updated while the PLM is not finetuned, i.e., it is used in inference mode. Our training algorithm consists of five core steps. The first four steps are performed at every training iteration, while the last step is performed every few hundred iterations. Figure 1 presents an illustration of our approach.
Step 1: Top-K Passage Retrieval For fast retrieval, we pre-compute the evidence passage embeddings using the initial retriever parameters (Φ̂_d). Given a question q, we compute its embedding using the current question encoder parameters (Φ_q) and then retrieve the top-K passages Z according to Eq. 1. We then embed these top-K passages using the current passage encoder parameters (Φ_d) and compute fresh retriever scores as

score(q, z_i; Φ) = f_q(q; Φ_q)^T f_d(z_i; Φ_d).

Step 2: Retriever Likelihood Calculation Computing the exact likelihood of a passage conditioned on the question requires normalizing over all the evidence passages,

p(z_i | q; Φ) = exp(score(q, z_i; Φ)/τ) / Σ_{d∈D} exp(score(q, d; Φ)/τ),

where τ is a temperature hyperparameter. Computing this term is intractable, as it would require re-embedding all the evidence passages using Φ_d. Hence, we define a new distribution to approximate the likelihood of z_i as

q(z_i | q, Z; Φ) = exp(score(q, z_i; Φ)/τ) / Σ_{z∈Z} exp(score(q, z; Φ)/τ),    (3)

which we also refer to as the student distribution.
We assume that passages beyond the top-K contribute a very small probability mass, so we only sum over the retrieved passages Z in the denominator. While this approximation leads to a biased estimate of the retrieved passage likelihood, it works well in practice. Computing Eq. 3 is tractable as it requires embedding and backpropagating through a much smaller set of passages.
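A short sketch of the student distribution of Eq. 3, under the assumption that we only normalize over the K retrieved passages (scores here are made-up values):

```python
import numpy as np

def student_distribution(scores: np.ndarray, tau: float) -> np.ndarray:
    # Eq. 3: softmax over the K retrieved passages only, approximating the
    # intractable normalization over all evidence passages.
    logits = scores / tau
    logits = logits - logits.max()  # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Fresh retriever scores for K = 4 retrieved passages (illustrative values)
scores = np.array([12.0, 10.5, 9.0, 8.5])
probs = student_distribution(scores, tau=1.0)
```

The temperature τ controls how peaked the distribution is: smaller τ concentrates probability mass on the highest-scoring passage.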
Step 3: PLM Relevance Score Estimation We compute the relevance score log p(z_i | q) of each passage in Z using a large PLM (Θ). This requires scoring the question tokens with teacher forcing conditioned on a passage, as described in §2.3. We then define a teacher distribution by applying softmax to the relevance scores,

p(z_i | q, Z; Θ) = exp(log p(z_i | q)) / Σ_{z∈Z} exp(log p(z | q)).
Step 4: Loss Calculation and Optimization We train the retriever (Φ) by minimizing the KL divergence loss between the teacher distribution (obtained by PLM) and the student distribution (computed by retriever).
Intuitively, optimizing the KL divergence pushes the retriever's passage likelihood scores to match the passage relevance scores from the PLM, treating the relevance scores as soft-labels.
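The loss of Step 4 can be sketched as follows (a plain numpy illustration; in training, gradients would flow only into the student scores since the teacher PLM is frozen, and the distributions here are made-up values):

```python
import numpy as np

def kl_loss(teacher: np.ndarray, student: np.ndarray) -> float:
    # KL(teacher || student): penalizes the student for assigning low
    # probability to passages the teacher considers relevant.
    eps = 1e-12  # guard against log(0)
    return float(np.sum(teacher * (np.log(teacher + eps) - np.log(student + eps))))

teacher = np.array([0.7, 0.2, 0.08, 0.02])  # softmax of PLM relevance scores
student = np.array([0.4, 0.3, 0.2, 0.1])    # softmax of retriever scores
loss = kl_loss(teacher, student)
```

The loss is zero exactly when the two distributions match, which is the fixed point the retriever is trained toward.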
Step 5: Updating Evidence Embeddings During training, we update the parameters of both the question encoder (Φ_q) and the passage encoder (Φ_d). Due to this, the pre-computed evidence embeddings that were computed using the initial retriever parameters (Φ̂_d) become stale, which may affect top-K passage retrieval. To prevent staleness, we re-compute the evidence passage embeddings using the current passage encoder parameters (Φ_d) after every 500 training steps.
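The five steps above can be tied together in a compact loop. This is a toy sketch with linear maps standing in for the transformer encoders and mock updates in place of Steps 2-4; the point is the interaction between the stale pre-computed index and the periodic refresh:

```python
import numpy as np

rng = np.random.default_rng(0)
num_passages, dim, K, refresh_every = 1000, 16, 8, 500

passage_feats = rng.normal(size=(num_passages, dim))  # stand-in for passage text
phi_d = rng.normal(size=(dim, dim))                   # toy linear "passage encoder"
evidence_index = passage_feats @ phi_d                # pre-computed embeddings

for step in range(1, 1001):
    q_emb = rng.normal(size=dim)                      # stand-in for f_q(q; Φ_q)
    # Step 1: retrieve top-K against the (possibly stale) pre-computed index
    top = np.argsort(-(evidence_index @ q_emb))[:K]
    # Steps 2-4 (student/teacher distributions, KL loss, gradient step) would
    # run here; the gradient step changes phi_d, making the index stale.
    phi_d += 1e-4 * rng.normal(size=(dim, dim))       # mock parameter update
    # Step 5: refresh the evidence index every `refresh_every` steps
    if step % refresh_every == 0:
        evidence_index = passage_feats @ phi_d
```

Refreshing every 500 steps trades a bounded amount of staleness for avoiding the cost of re-embedding all passages at every iteration.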

ART as an Autoencoder
Since our encoder takes the question q as input and the PLM scores (or reconstructs) the same question when computing the relevance score, we can consider our training algorithm as an autoencoder with a retrieved passage as the latent variable.

In the generative process, we start with an observed variable D (the collection of evidence passages), which is the support set for our latent variable. Given an input q, we generate an index i and retrieve the passage z_i. This index generation and retrieval process is modeled by our dual-encoder architecture. Given z_i, we decode it back into the question using our PLM.

Recall that our decoder (the PLM) is frozen and its parameters are not updated. However, the signal from the decoder output is used to train the parameters of the dual encoder such that the log-likelihood of reconstructing the question q is maximized. In practice, this trains the dual encoder to select the best passage for a given question, since the only way to maximize the objective is by choosing the most relevant z_i given the input q.
Experimental Setup

In this section, we describe the datasets, evaluation protocol, implementation details, and baseline methods for our passage retrieval experiments.

Datasets and Evaluation
Evidence Passages The evidence corpus is the preprocessed English Wikipedia dump from December 2018 (Karpukhin et al., 2020). Following convention, we split each article into non-overlapping segments containing 100 words each, resulting in over 21 million passages. The same evidence is used for both training and evaluation.
Question-Answering Datasets Following previous work, we use the open-retrieval versions of the Natural Questions (NQ-Open; Kwiatkowski et al., 2019), TriviaQA (Joshi et al., 2017), SQuAD-1.0 (SQuAD-Open; Rajpurkar et al., 2016), WebQuestions (WebQ; Berant et al., 2013), and EntityQuestions (EQ; Sciavolino et al., 2021) datasets.

All Questions Datasets For our transfer learning experiments, we use all the questions from the Natural Questions (henceforth referred to as NQ-Full) and MS MARCO passage ranking (Bajaj et al., 2016) datasets. Table 1 lists the number of questions. The questions in NQ-Full are information-seeking, as they were asked by real users, and its size is four times that of NQ-Open. NQ-Full consists of questions with only long-form answers (such as paragraphs), all the questions in NQ-Open (which have both long-form and short-form answers), questions with yes/no answers, and questions that are unanswerable. For MS MARCO, we use its provided passage collection (around 8.8 million passages in total) as the evidence corpus.
Evaluation To evaluate retriever performance, we report the conventional top-K accuracy metric. It is the fraction of questions for which at least one passage among the top-K retrieved passages contains a span of words that matches a human-annotated answer to the question.
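A minimal sketch of this metric. Standard evaluations apply normalized span matching; plain case-insensitive substring matching is used here for brevity, and the example passages and answers are made up:

```python
def top_k_accuracy(retrieved, answers, k: int) -> float:
    # Fraction of questions where at least one of the top-k retrieved
    # passages contains one of the annotated answer strings.
    hits = 0
    for passages, answer_set in zip(retrieved, answers):
        if any(ans.lower() in p.lower()
               for p in passages[:k] for ans in answer_set):
            hits += 1
    return hits / len(retrieved)

retrieved = [
    ["Bowling Hall of Fame is located in Arlington.", "Hall of Fame is a song."],
    ["Paris is the capital of France.", "The Seine flows through Paris."],
]
answers = [["Arlington"], ["London"]]
accuracy = top_k_accuracy(retrieved, answers, k=2)
```

Note the metric only checks for answer-span presence, not whether the passage actually supports answering the question.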

Implementation Details
Model Sizes We use BERT base configuration (Devlin et al., 2019) for the retriever, which consists of 12 layers, 12 attention heads, and 768 embedding dimensions, leading to around 220M trainable parameters.For the teacher PLM, we use two configurations: (i) T5-XL configuration (Raffel et al., 2020) consisting of 24 layers, 32 attention heads, and 2048 embedding dimensions, leading to 3B parameters, and (ii) a larger T5-XXL configuration consisting of 11B parameters.

Model Initialization
We initialize the retriever with unsupervised masked salient spans (MSS) pre-training (Sachan et al., 2021a), as it provides improved zero-shot retrieval over BERT pre-training. We initialize the cross-attention (or teacher) PLM with the T5-lm-adapted (Lester et al., 2021) or instruction-tuned T0 (Sanh et al., 2022) language models, which have been shown to be effective zero-shot re-rankers for information retrieval tasks (Sachan et al., 2022).
Compute Hardware We perform training on instances containing 8 or 16 A100 GPUs, each containing 40 GB RAM.
Passage Retrieval To perform fast top-K passage retrieval at every training step, we pre-compute the embeddings of all the evidence passages. Computing embeddings of 21M passages takes roughly 10 minutes on 16 GPUs. The total size of these embeddings is around 30 GB (768-dimensional vectors in FP16 format). For scalable retrieval, we shard these embeddings across all the GPUs and perform exact maximum inner product search using distributed matrix multiplication.
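The sharded exact search can be sketched as follows. This is an illustrative single-process version: each shard would live on a separate GPU in the real setup, with per-shard scoring done as a local matrix multiplication and the top-k merged globally:

```python
import numpy as np

def sharded_mips(query: np.ndarray, shards: list, k: int):
    # Exact maximum inner product search over sharded embeddings.
    scores, ids, offset = [], [], 0
    for shard in shards:
        scores.append(shard @ query)  # local matmul per shard (per GPU)
        ids.append(np.arange(offset, offset + len(shard)))
        offset += len(shard)
    scores, ids = np.concatenate(scores), np.concatenate(ids)
    top = np.argsort(-scores)[:k]     # global top-k merge
    return ids[top], scores[top]

rng = np.random.default_rng(1)
# 120 toy passage embeddings split across 4 "devices"
# (the paper stores 21M 768-dim vectors in FP16, ~30 GB in total)
embeddings = rng.normal(size=(120, 32))
shards = np.array_split(embeddings, 4)
query = rng.normal(size=32)
top_ids, top_scores = sharded_mips(query, shards, k=5)
```

Because the search is exact (no approximate index), sharded results are identical to scoring the full matrix at once.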
Training Details When training with the T0 (3B) PLM, for all datasets except WebQ, we train for 10 epochs using Adam with a batch size of 64, 32 retrieved passages, a dropout value of 0.1, and a peak learning rate of 2 × 10^-5 with warmup and linear scheduling. Due to the smaller size of WebQ, we train on it for 20 epochs with a batch size of 16. When training with the T5-lm-adapted (11B) PLM, we use a smaller batch size of 32 with 16 retrieved passages. We save a retriever checkpoint every 500 steps and perform model selection by evaluating on the development set. We use mixed precision training for the retriever and perform inference over the PLM in bfloat16 format. We set the value of the temperature hyperparameter (τ) using cross-validation.

Baselines
We compare ART to both unsupervised and supervised models. Unsupervised models train a single retriever using unlabeled text corpora from the Internet, while supervised models train a separate retriever for each dataset. We report performance numbers from the original papers when results are available, or run their open-source implementations when they are not.
Unsupervised Models These include the popular BM25 algorithm (Robertson and Zaragoza, 2009) as well as recent unsupervised dense retrievers such as Contriever and Spider.

Zero-Shot Passage Retrieval
For the passage retrieval task, we report results on SQuAD-Open, TriviaQA, NQ-Open, and WebQ, and train ART under two settings. In the first setting, we train a separate retriever for each dataset using questions from its training set. In the second setting, to examine the robustness of ART training to different question types, we train a single retriever by combining the questions from all four datasets, which we refer to as ART-Multi. For both settings, we train ART using the T5-lm-adapted (11B) and T0 (3B) cross-attention PLM scorers. As our training process does not require annotated passages for a question, we refer to this as zero-shot passage retrieval.
Table 2 presents the top-20 and top-100 retrieval accuracy in these settings alongside recent baselines that train a similarly sized retriever (110M). All the variants of ART achieve substantially better performance than previous unsupervised approaches. For example, ART trained with T0 (3B) outperforms the recent Spider and Contriever models by an average of 9 points in top-20 and 6 points in top-100 accuracy. When comparing to supervised models, despite using just questions, ART outperforms strong baselines like DPR and ANCE and is on par with or slightly better than pre-trained retrievers like MSS-DPR. In addition, ART-Multi obtains comparable performance to its single-dataset version, a considerable advantage in practical applications, as a single retriever can be deployed rather than training a custom retriever for each use case.
ART's performance also comes close to state-of-the-art supervised models like AR2 and EMDR², especially in top-100 accuracy, but lags behind in top-20 accuracy. In addition to obtaining reasonable performance without requiring aligned passages for training, ART's training process is much simpler than AR2's. It also does not require cross-encoder finetuning and is thus faster to train. As generative language models continue to become more accurate (Chowdhery et al., 2022), we hypothesize that the performance gap between state-of-the-art supervised models and ART will narrow further.
Our results showcase that both PLM scorers, T5-lm-adapt (11B) and T0 (3B), achieve strong results on the QA retrieval tasks, with T0 achieving higher performance gains. This illustrates that the relevance score estimates of candidate passages obtained in the zero-shot cross-attention step are accurate enough to provide strong supervision for retriever training. We believe that this is a direct consequence of the knowledge stored in the PLM weights. While T5-lm-adapt's knowledge is obtained by training on unsupervised text corpora, T0 was further finetuned using instruction-prompted datasets of tasks such as summarization, QA, and text classification (we note, however, that T0 was not finetuned on the question generation task and was not trained on any of the datasets used in this work; we refer the reader to the original paper for more training details). Hence, in addition to learning from instructions, the performance gains from T0 can be attributed to the knowledge infused in its weights by (indirect) supervision from these manually curated datasets. Instruction-based finetuning is helpful in the case of smaller datasets like WebQ and especially in improving performance at lower values of top-K accuracy (such as top-20).

Overall, our results suggest that accurate and robust passage retrieval can be achieved by training with questions alone. This presents a considerably more favorable setting than current approaches, which require obtaining positive and hard-negative passages for such questions. Due to its better performance, we use the T0 (3B) PLM for subsequent experiments unless stated otherwise.

Sample Efficiency
To measure the sample efficiency of ART, we train the model by randomly selecting a varying number of questions from the NQ-Open training set and compute the top-K accuracy on its development set. These results are presented in Figure 2, where we also include the results of BM25 and DPR for comparison. We see that performance increases with the number of questions until about 10k questions, after which the gains become less pronounced.
When trained with just 100 questions, ART significantly outperforms BM25, and when trained with 1k questions, it matches DPR performance levels for top-{50, . . ., 100} accuracy. This demonstrates that ART, in addition to using just questions, is also much more data efficient than DPR, as it requires almost ten times fewer questions to reach similar performance.

Table 3: Top-20 and top-100 retrieval accuracy when evaluating zero-shot out-of-distribution (OOD) generalization of models on the test sets of the datasets. † denotes that these results are from Ram et al. (2022). For EQ, we report macro-average scores. ART generalizes better than supervised models on OOD evaluation even when trained on all the questions of the Natural Questions dataset, which contains a mix of answerable and unanswerable questions.

Zero-Shot Out-of-Distribution Transfer
In the previous experiments, both the training and test sets contained questions sampled from the same underlying distribution, a setting that we refer to as in-distribution training. However, obtaining in-domain questions for training is not always feasible in practice. Instead, a model trained on an existing collection of questions must be evaluated on new datasets, a setting that we refer to as out-of-distribution (OOD) transfer. We train ART using NQ-Open and NQ-Full questions and then evaluate its performance on the SQuAD-Open, TriviaQA, WebQ, and EQ datasets. While it is desirable to train on answerable questions such as those included in NQ-Open, this is not always possible, as real user questions are often imprecisely worded or ambiguous. Due to this, training on NQ-Full can be considered a practical testbed for evaluating true OOD generalization, as a majority of the questions (51%) were marked as unanswerable from Wikipedia by human annotators.

Table 3 presents OOD generalization results on the four QA datasets, including the results of DPR and Spider models trained on NQ-Open. ART trained on NQ-Open always performs significantly better than both DPR and Spider, showing that it generalizes better than supervised models. When trained using NQ-Full, ART's performance further improves over NQ-Open by 3 points on EQ and by 0.5-1 points on the other datasets. This highlights that, in addition to questions annotated as having short answers, questions annotated with long answers also provide a meaningful supervisory signal, and unanswerable questions do not necessarily degrade performance.
We also train ART using MS MARCO questions and perform OOD evaluation. Due to the larger size of MS MARCO and its smaller number of evidence passages, we use a batch size of 512 and retrieve 8 passages for training. Quite surprisingly, it obtains much better performance than previous approaches, including BM25, on EQ (more than 10 points gain in top-20 over training ART on NQ-Open). We suspect that this may be due to the similar nature of questions in MS MARCO and EQ. Further finetuning the pre-trained MS MARCO model on NQ-Full significantly improves performance on WebQ.

Scaling Model Size
We examine whether scaling up the retriever parameters can offer further performance improvements. To this end, we train a retriever with the BERT-large configuration (24 layers, 16 attention heads, 1024 embedding dimensions), containing around 650M parameters, on NQ-Open and TriviaQA. Results are presented in Table 4 for both the development and test sets. We also include the results of other relevant baselines containing a similar number of trainable parameters.
By scaling up the retriever size, we see small but consistent improvements in retrieval accuracy across both datasets. On TriviaQA especially, ART matches or exceeds the performance of previous best models. On NQ-Open, it comes close to the performance of EMDR² (Sachan et al., 2021b), a supervised model trained using thousands of question-answer pairs.
We also attempted to use larger teacher PLMs such as T0 (11B). However, our initial experiments did not lead to any further improvements over the T0 (3B) PLM. We conjecture that this might be either specific to these QA datasets or that the capacity of the teacher PLM needs to be increased even more to observe improvements. We leave an in-depth analysis of larger teacher PLMs to future work.

Analysis
Sensitivity to Retriever Initialization To examine how the convergence of ART training is affected by the initial retriever parameters, we initialize the retriever with (1) BERT weights, (2) ICT weights (as trained in Sachan et al., 2021a), and (3) MSS weights, and train using NQ-Open questions. Figure 3 displays the top-20 performance on the NQ development set as training progresses. It reveals that ART training is not sensitive to the initial retriever parameters, as all three initialization schemes converge to similar results. However, the convergence properties might differ under low-resource settings, an exploration we leave for future work.

Table 5: Effect of using a different number of retrieved passages during ART training, as evaluated on the NQ-Open development set. For each case, we list the absolute gain or loss in top-K accuracy compared to the setting using 32 retrieved passages.

Effect of the Number of Retrieved Passages
Table 5 quantifies the effect on performance of the number of retrieved passages used during training. A smaller number of retrieved passages, such as 2 or 4, leads to somewhat better top-{1, 5} accuracy, at the expense of a drop in top-{20, 100} accuracy. Retrieving 32 passages offers a reasonable middle ground; beyond that, top-K retrieval performance tends to drop.
A Closer Inspection of ART with Supervised Models To better understand the tradeoff between supervised models and ART, we examine their top-1 and top-5 accuracy in addition to the commonly reported top-20 and top-100 scores. Table 6 presents these results for ART (large) along with the supervised DPR (large) and EMDR² models. Supervised models achieve much better performance for top-K ∈ {1, . . ., 5} passages, i.e., these models are more precise. This is likely because DPR is trained with hard-negative passages and EMDR² finetunes the PLM using answers, resulting in accurate relevance feedback to the retriever.

Effect of Passage Types To examine the role of the retrieved passages in the training process, we train the retriever under different settings by varying the passage types. Specifically, we train with a mix of positive, hard-negative, and uniformly sampled passages. We also perform in-batch training by defining Z to be the union of positive and hard-negative passages for all the questions in a batch. Results in Table 7 illustrate that when Z consists of uniformly sampled passages, training leads to poor performance. Including a (gold) positive passage in Z leads to good performance improvements. Results further improve with the inclusion of a hard-negative passage in Z. However, in-batch training leads to a slight drop in performance. As gold passages are not always available, our method of selecting the top passages from the evidence at every training step can be seen as an approximation to using the gold passages. With this, ART obtains even better results than the previous settings, an improvement of 4 points absolute in top-20 accuracy.

Impact of Language Model To study the effect of the cross-attention PLM, we compare T5 models pre-trained with span prediction, T5-lm-adapt models, and models finetuned on unrelated tasks using instructions (T0 series; Sanh et al., 2022). Our results in Table 8 highlight that PLM training methodology and model size can have a large effect on retrieval performance. The T5 base model leads to low scores, possibly because pre-training by predicting masked spans is not ideal for question reconstruction. However, accuracy improves with an increase in model size. T5-lm-adapt models are more stable and lead to improved performance, with the best result achieved by the 11B model. Instruction-finetuned T0 models outperform the T5-lm-adapt models. However, scaling up T0 to 11B parameters does not result in meaningful improvements.

Ad-Hoc Retrieval Tasks While the previous experiments were conducted on QA datasets, here we examine the robustness of the question-trained ART model on different ad-hoc retrieval tasks. For this analysis, we evaluate the performance of ART on the BEIR benchmark (Thakur et al., 2021). It is a heterogeneous collection of retrieval datasets, each consisting of test set queries, evidence documents, and gold document annotations. BEIR spans multiple domains and diverse retrieval tasks, presenting a strong challenge suite, especially for dense retrievers. We train ART using MS MARCO questions and report its nDCG@10 and Recall@100 scores on each dataset. For comparison, we include the results of three baselines: BM25, Contriever, and DPR trained using NQ-Open. Our results, presented in Table 9, show strong generalization performance for ART, as it outperforms the DPR and Contriever results. ART also achieves results on par with the strong BM25 baseline, outperforming BM25 on 8 of the 15 datasets (according to nDCG@10 scores).

Related Work
Our work is based on training a dense retriever using pre-trained language models (PLMs), which we have covered in previous sections. Here, we instead focus on other related approaches.
A popular method to train the dual-encoder retriever is to optimize a contrastive loss using in-batch negatives (Gillick et al., 2019) and hard-negatives (Karpukhin et al., 2020; Xiong et al., 2021). Alternatives to using hard-negatives, such as sampling from cached evidence embeddings, have also been shown to work well in practice (Lindgren et al., 2021). Multi-vector encoders for questions and passages are more accurate than dual-encoders (Luan et al., 2021; Khattab and Zaharia, 2020; Humeau et al., 2020), although at the cost of increased latency and storage requirements.
PLMs have been shown to improve passage rankings as they can perform cross-attention between the question and the retrieved passages (Lin et al., 2021). Supervised approaches to re-ranking either finetune PLMs using question-passage pairs (Nogueira et al., 2020) or finetune PLMs to generate the question conditioned on the passage (Nogueira dos Santos et al., 2020), while unsupervised re-rankers are based on zero-shot question scoring (Sachan et al., 2022). The re-ranking process is slow due to the cross-attention step and is bottlenecked by the accuracy of first-stage retrievers. To address these limitations, approaches that distill cross-attention from the PLM into the retriever have been proposed (Qu et al., 2021). Such distillation can be performed either in a single end-to-end training step (Guu et al., 2020; Sachan et al., 2021b) or in a multi-stage process (Khattab et al., 2021; Izacard and Grave, 2021).
An alternative approach to using PLMs is to generate data that can aid retrieval. The data can be either a title or an answer that provides more information about the question (Mao et al., 2021). Generating new questions to augment the training data has also been shown to improve performance (Ma et al., 2021; Bonifacio et al., 2022; Dai et al., 2022). In comparison, we do not generate new questions but train the retriever using existing questions and PLM feedback. Data augmentation is likely complementary and could further improve accuracy.

Conclusions and Future Work
We introduced ART, a novel approach to train a dense passage retriever using only questions. ART does not require question-passage pairs or hard-negative examples for training, and yet achieves state-of-the-art results. The key to making ART work is to optimize the retriever to select relevant passages such that, conditioned on them, the question generation likelihood computed using a large pre-trained language model iteratively improves. Despite requiring much less supervision, ART substantially outperforms DPR when evaluated on multiple QA datasets and also generalizes better to out-of-distribution questions.
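The training signal described above can be sketched for a single question as follows. This is a simplified sketch of the idea, not the paper's exact implementation: the frozen PLM's question-reconstruction log-likelihoods over the top-K retrieved passages are normalized into a soft-label distribution (as in Figure 1), and the retriever's distribution over the same passages is pulled toward it via a KL divergence; the temperature `tau` is an illustrative knob.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def art_loss(retriever_scores, lm_question_logprobs, tau=1.0):
    """ART-style training signal for one question (a sketch).

    retriever_scores: (K,) similarities between the question embedding and
        the top-K retrieved passage embeddings (trainable side).
    lm_question_logprobs: (K,) log p(question | passage_k) from a frozen
        PLM, serving as soft labels for passage relevance (no gradients).
    """
    q_ret = softmax(np.asarray(retriever_scores, dtype=float))
    p_lm = softmax(np.asarray(lm_question_logprobs, dtype=float) / tau)
    return float(np.sum(p_lm * (np.log(p_lm) - np.log(q_ret))))  # KL(p_lm || q_ret)
```

The loss is zero exactly when the retriever ranks passages the same way the PLM's reconstruction scores do, which is the fixed point the iterative training drives toward.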
ART presents several directions for future work. It would be interesting to apply this approach to low-resource retrieval, including multilingual (Clark et al., 2020) and cross-lingual question answering (Asai et al., 2021). Our training framework can also be extended to train cross-modality retrievers, such as for image or code search using textual queries (Li et al., 2022; Neelakantan et al., 2022). Finally, other directions worth exploring would be to make use of labeled data when available, such as by finetuning the PLM on passage-question aligned data, and to train multi-vector retrievers (Luan et al., 2021) with ART.

Figure 1: ART maximizes the retrieved-passage likelihood computed from the dense retriever by considering the language model's question reconstruction score conditioned on the passage as a soft label. Colored blocks indicate trainable parameters. Red arrows show gradient flow during backpropagation.

Figure 2: Top-K accuracy as the number of training questions (denoted as 'Q' in the legend) is varied. When trained with 100 questions, ART outperforms BM25, and when trained with 1k questions, it matches DPR's performance for top-K > 50 passages, illustrating that ART is highly sample efficient.

Figure 3: Effect of retriever initialization on ART training. The plot reveals that the training process is not sensitive to the initial retriever parameters.

Table 1 :
Dataset statistics. During the training process, ART only uses the questions, while evaluation is performed over the canonical development and test sets.

Table 2 :
Top-20 and top-100 retrieval accuracy on the test set of each dataset. For more details regarding the unsupervised and supervised models, please see §3.3 in the text. Best supervised results are highlighted in bold, while the best results from our proposed model (ART) are underlined. ART substantially outperforms previous unsupervised models and comes close to or matches the performance of supervised models by using just questions during training.
Dense models typically use Wikipedia paragraphs to create (pseudo-) query and context pairs to perform contrastive training of the retriever. These approaches differ in how the negative examples are obtained during contrastive training: they can come from the same batch (ICT; Lee et al., 2019; Sachan et al., 2021a), from context passages in previous batches (Contriever; Izacard et al., 2022), or from other passages in the same article (Spider; Ram et al., 2022). Context passages can also be sampled from articles connected via hyperlinks (HLP; Zhou et al., 2022). In supervised training, hard negatives are mined using BM25 (DPR; Karpukhin et al., 2020), negative contexts are mined iteratively using model weights (ANCE; Xiong et al., 2021), or the retriever is first initialized with ICT or MSS pre-training followed by DPR-style finetuning (ICT-DPR / MSS-DPR). The re-ranker and retriever can also be trained jointly via distillation (RocketQAv2; Ren et al., 2021). A combination of adversarial and distillation-based training of re-ranker and retriever has been shown to obtain state-of-the-art performance (AR2; Zhang et al., 2022).
Table footnotes: * indicates that the cross-attention PLM is finetuned. † denotes that the 'cpt-text S' model (Neelakantan et al., 2022) contains around 300M parameters. ‡ denotes that DPR-Multi was not trained on SQuAD-Open. The results on SQuAD-Open and WebQ are obtained by finetuning the open-source MSS checkpoint. • indicates that EMDR 2 results are obtained using their open-source checkpoints.

Table 4 :
Top-20 and top-100 accuracy when training the large-configuration retriever, which contains around 650M parameters. EMDR 2 (base configuration; Sachan et al., 2021b) contains 440M parameters. Best supervised results are underlined, while the best unsupervised results are highlighted in bold.

Table 6 :
Analysis reveals that ART (large) can even match the performance of end-to-end trained models like EMDR 2 when retrieving a larger number of passages. However, DPR (large) and EMDR 2 still outperform ART when retrieving a small number of passages, such as top-K ∈ {1, ..., 5} (highlighted in bold).

Table 7 :
Effect of passage types on ART training when evaluated on the NQ-Open development set.P denotes a positive passage, N denotes a hard-negative passage (mined using BM25), U denotes that the passages are randomly sampled from the evidence, and IB denotes in-batch training.

Table 9 :
Results on the Thakur et al. (2021) benchmark. #Q and #E denote the size of the test set and evidence, respectively. Best scores for each dataset are highlighted in bold. ART is trained using MS MARCO questions. DPR is trained using NQ-Open. † denotes that these results are from Thakur et al. (2021).