Aggretriever: A Simple Approach to Aggregate Textual Representations for Robust Dense Passage Retrieval

Pre-trained language models have been successful in many knowledge-intensive NLP tasks. However, recent work has shown that models such as BERT are not “structurally ready” to aggregate textual information into a [CLS] vector for dense passage retrieval (DPR). This “lack of readiness” results from the gap between language model pre-training and DPR fine-tuning. Previous solutions call for computationally expensive techniques such as hard negative mining, cross-encoder distillation, and further pre-training to learn a robust DPR model. In this work, we instead propose to fully exploit knowledge in a pre-trained language model for DPR by aggregating the contextualized token embeddings into a dense vector, which we call agg★. By concatenating vectors from the [CLS] token and agg★, our Aggretriever model substantially improves the effectiveness of dense retrieval models on both in-domain and zero-shot evaluations without introducing substantial training overhead. Code is available at https://github.com/castorini/dhr.


Introduction
A bi-encoder architecture (Reimers and Gurevych, 2019; Karpukhin et al., 2020) based on pre-trained language models (Devlin et al., 2018; Liu et al., 2019; Raffel et al., 2020) has been widely used for first-stage retrieval in knowledge-intensive tasks such as open-domain question answering and fact checking. Compared to bag-of-words models such as BM25, these approaches circumvent lexical mismatches between queries and passages by encoding text into dense vectors.
Despite their success, recent research calls into question the robustness of these single-vector models (Thakur et al., 2021). As shown in Fig. 1, single-vector dense retrievers (e.g., BERT CLS and TAS-B) trained with well-designed knowledge distillation strategies (Hofstätter et al., 2021) still underperform BM25 on out-of-domain datasets. Along the same lines, Sciavolino et al. (2021) find that simple entity-centric questions are challenging for these dense retrievers.
Recently, Gao and Callan (2021) observe that pre-trained language models such as BERT are not "structurally ready" for fine-tuning on downstream retrieval tasks. This is because the [CLS] token, pre-trained on the task of next sentence prediction (NSP), does not have the proper attention structure to aggregate fine-grained textual information. To address this issue, the authors propose to further pre-train the [CLS] vector before fine-tuning and show that the gap between pre-training and fine-tuning tasks can be mitigated (see coCondenser CLS illustrated in Fig. 1). However, further pre-training introduces additional computational costs, which motivates us to ask the following question: Can we directly bridge the gap between pre-training and fine-tuning without any further pre-training?
Before diving into our proposed solution, we briefly overview the language modeling pre-training and DPR fine-tuning tasks using BERT. [...] of fine-tuning a dense retriever. We observe that solely relying on the [CLS] vector as the dense representation does not exploit the full capacity of the pre-trained model, as the [CLS] vector participates directly only in NSP during pre-training, and therefore lacks information captured in the contextualized token embeddings. A simple solution is to aggregate the token embeddings by pooling (max or mean) into a single vector. However, information is lost in this process and empirical results do not show any consistent effectiveness gains. Hence, we see the need for better aggregation schemes.
[Figure 2: (a) BERT pre-training; (b) [CLS] DPR fine-tuning; (c) Aggretriever DPR fine-tuning.]
In this paper, we propose a novel approach to generate textual representations for retrieval that fully exploit contextualized token embeddings from BERT, shown in Fig. 2(c). Specifically, we reuse the pre-trained MLM head to map each contextualized token embedding into a high-dimensional wordpiece lexical space. Following a simple max-pooling and pruning strategy, we obtain a compact lexical vector that we call agg ⋆. By concatenating agg ⋆ and the [CLS] vector, our novel Aggretriever dense retrieval model captures representations pre-trained from both NSP and MLM, improving retrieval effectiveness by a noticeable margin compared to fine-tuned models that solely rely on the [CLS] vector (see BERT AGG vs BERT CLS in Fig. 1).
Importantly, fine-tuning Aggretriever does not require any sophisticated and computationally expensive techniques, making it a simple yet competitive baseline for dense retrieval. However, our approach is orthogonal to previously proposed further pre-training strategies, and can still benefit from them to improve retrieval effectiveness even more (see coCondenser AGG in Fig. 1). To the best of our knowledge, this is the first work in the DPR literature that leverages the BERT pre-trained MLM head to encode textual information into a single dense vector.

Background and Motivation
Given a query q, our task is to retrieve a list of passages to maximize some ranking metric such as nDCG or MRR. Dense retrievers (Reimers and Gurevych, 2019; Karpukhin et al., 2020) based on pre-trained language models encode queries and passages as low-dimensional vectors with a bi-encoder architecture and use the dot product between the encoded vectors as the similarity score:
$$\text{sim}_{[\text{CLS}]}(q, p) = e_{q_{[\text{CLS}]}} \cdot e_{p_{[\text{CLS}]}}, \tag{1}$$
where $e_{q_{[\text{CLS}]}}$ and $e_{p_{[\text{CLS}]}}$ are the [CLS] vectors at the last layer of BERT (Devlin et al., 2018). Subsequent work leverages expensive fine-tuning strategies (e.g., hard negative mining, knowledge distillation) to guide models to learn more effective and robust single-vector representations (Xiong et al., 2021; Zhan et al., 2021b; Lin et al., 2021b; Hofstätter et al., 2021; Qu et al., 2021).
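As a minimal sketch of the [CLS]-based scoring described above, the snippet below ranks passages by the dot product between a query vector and each passage vector. The three-dimensional vectors are made-up values standing in for encoder outputs, and the helper names `dot` and `rank_passages` are illustrative rather than taken from the authors' code.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    """Dot product, used as the similarity between two encoded vectors."""
    assert len(u) == len(v), "vectors must share a dimensionality"
    return sum(a * b for a, b in zip(u, v))


def rank_passages(query_vec: Vector,
                  passage_vecs: Dict[str, Vector],
                  k: int) -> List[Tuple[str, float]]:
    """Score every passage against the query and return the top-k by score."""
    scored = [(pid, dot(query_vec, vec)) for pid, vec in passage_vecs.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy vectors (assumed values); a real system would take them from the encoder.
query = [0.2, 0.9, 0.1]
passages = {
    "p1": [0.1, 0.8, 0.0],
    "p2": [0.9, 0.1, 0.3],
    "p3": [0.0, 0.2, 0.9],
}
top = rank_passages(query, passages, k=2)
print(top)
```

A brute-force scan like this is only practical for toy data; the paper itself relies on a Faiss index for end-to-end retrieval.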
Recent work (Gao and Callan, 2021; Lu et al., 2021) shows that the [CLS] vector remains "dormant" in most layers of pre-trained models and fails to adequately aggregate information from the input sequence during pre-training. Thus, researchers argue that the models are not "structurally ready" for fine-tuning. To tackle this issue, unsupervised contrastive learning has been proposed, which creates pseudo relevance labels from the target corpus to "prepare" the [CLS] vector for retrieval. The most representative technique is the Inverse Cloze Task (ICT; Lee et al., 2019). However, since the generated relevance data is noisy, further pre-training with ICT often requires a huge amount of computation due to the need for large batch sizes or other sophisticated training techniques (Chang et al., 2020; Izacard et al., 2021; Ni et al., 2021).
Another thread of work (Gao and Callan, 2021; Lu et al., 2021) manages to guide transformers to aggregate textual information into the [CLS] vector through auto-encoding. This method does not require as much computation as unsupervised contrastive learning but is still much more computationally intensive than fine-tuning. For example, Gao and Callan (2021) report that the further pre-training process still requires one week on four RTX 2080 Ti GPUs, while fine-tuning consumes less than one day in the same environment.
Recent work on neural sparse retrievers (Bai et al., 2020; Formal et al., 2021b) projects contextualized token embeddings into a high-dimensional wordpiece lexical space through the BERT pre-trained MLM projector and directly performs retrieval in wordpiece lexical space. These models demonstrate that MLM pre-trained weights can be used to learn effective lexical representations for retrieval tasks, a finding that has not been fully explored in the DPR literature. Inspired by this work, we explore reusing MLM pre-trained weights for DPR fine-tuning and further combine the [CLS] vector to fully exploit textual information in a pre-trained language model.

Aggretriever
In this section, we first introduce our method for text aggregation to form agg ⋆, which consists of two steps: pooling and pruning. Then, we describe how to concatenate the aggregated text representation agg ⋆ and [CLS] into a 768-dimensional dense vector for fine-tuning and retrieval.

Text Aggregation Pooling
The goal of text aggregation is to transform contextualized token embeddings into a single-vector token representation. Let the input sequence q denote a tokenized query sequence with a length of l, $([\text{CLS}], q_1, q_2, \cdots, q_l, [\text{SEP}])$, or alternatively, a passage p of length m, $([\text{CLS}], p_1, p_2, \cdots, p_m, [\text{SEP}])$. One simple approach is to directly pool (mean or max) contextualized token embeddings from the final layer. Such pooling strategies have been studied in previous work (Reimers and Gurevych, 2019), but do not appear to be consistently more effective than just using the [CLS] token; this is also confirmed in our ablation study (Section 5.4).
We instead propose to reuse the pre-trained MLM head to project each contextualized token embedding $e_{q_i}$ into a high-dimensional vector in the wordpiece lexical space:
$$p_{q_i} = \text{softmax}(e_{q_i} \cdot W_{\text{mlm}} + b_{\text{mlm}}), \tag{2}$$
where $W_{\text{mlm}}$ and $b_{\text{mlm}}$ are the weights of the pre-trained MLM linear projector, and $p_{q_i} \in \mathbb{R}^{|V_{\text{BERT}}|}$ is the i-th contextualized token represented by a probability distribution over the 30522 tokens of the BERT wordpiece vocabulary, $V_{\text{BERT}}$. We then perform weighted max pooling over the sequential representations $(p_{q_1}, p_{q_2}, \cdots, p_{q_l})$ to obtain a single-vector lexical representation:
$$v_q[v] = \max_{i \in (1, 2, \cdots, l)} w_i \cdot p_{q_i}[v], \tag{3}$$
where $w_i = |e_{q_i} \cdot W + b|$ is a scalar term weight, and $W$ and $b \in \mathbb{R}^1$ are trainable weights. Note that the scalar $w_i$ for each token $q_i$ is essential to capture term importance, which $p_{q_i}$ alone cannot capture since it is normalized by softmax. We exclude the [CLS] token embedding at this stage since it is used for next-sentence prediction during pre-training and thus we argue that it does not carry much lexical information.
Our design has three advantages: (1) The MLM head with softmax is used for BERT pre-training; thus, the output probabilities can accurately model each contextualized token semantically. (2) In contrast to directly pooling contextualized embeddings, important dimensions of the token representations in the high-dimensional space are less likely to overlap, resulting in non-interfering max pooling (Jang et al., 2021). (3) Finally, $w_i$ and $p_{q_i}$ disentangle the effects of term importance from the MLM head. We will study the effectiveness of this design in Section 5.4 through ablations. Note that compared to previous work on sparse retrieval (Bai et al., 2020; Formal et al., 2021b), which switches softmax to ReLU to create sparse representations, our design sticks to the original activation function for MLM pre-training and directly outputs 30522-dimensional dense lexical vectors ($v_q$).
Figure 3: Illustration of text aggregation: (a) pooling of token representations to form $v_q$; (b) pruning of $v_q$ to form $\text{agg}^{\star}_q$ (or $\text{agg}^{+}_q$). While pruning, $\text{agg}^{\star}_q[n]$ receives a negative value if the pooled element belongs to $S^{-}_n$; i.e., the second element in each slice (red box).
Fig. 3(a) illustrates the generation of $v_q$ with $|V_{\text{BERT}}| = 10$ for simplicity. Ideally, we can directly compute $v_q \cdot v_p$ as a lexical matching similarity score for the wordpiece lexical representations. However, the vectors ($v_q, v_p \in \mathbb{R}^{|V_{\text{BERT}}|}$) are too large for efficient retrieval using dense vector search libraries such as Faiss. To address this issue, we introduce our non-parametric pruning method to convert $v_q$ ($v_p$) into a low-dimensional vector for dense retrieval.
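The pooling step in this subsection can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' implementation: the hidden size (4), vocabulary size (10), and random weights are toy stand-ins for BERT's MLM head, and the term weight is assumed to be the absolute value of a learned linear projection of the token embedding. All function and variable names are hypothetical.

```python
import math
import random
from typing import List

Vector = List[float]


def softmax(logits: Vector) -> Vector:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [x / z for x in exps]


def project(e: Vector, W: List[Vector], b: Vector) -> Vector:
    """Linear layer: e (hidden) times W (hidden x out) plus b (out)."""
    return [sum(e[h] * W[h][o] for h in range(len(e))) + b[o]
            for o in range(len(b))]


def pool_lexical(token_embs: List[Vector], W_mlm: List[Vector], b_mlm: Vector,
                 W_imp: Vector, b_imp: float) -> Vector:
    """Weighted max pooling of per-token vocabulary distributions."""
    vocab = len(b_mlm)
    v = [0.0] * vocab  # safe initial value because every w * p[t] is >= 0
    for e in token_embs:
        p = softmax(project(e, W_mlm, b_mlm))                   # MLM projection
        w = abs(sum(x * y for x, y in zip(e, W_imp)) + b_imp)   # assumed term weight
        for t in range(vocab):
            v[t] = max(v[t], w * p[t])
    return v


rng = random.Random(0)
HIDDEN, VOCAB, LENGTH = 4, 10, 3  # toy sizes, not BERT's
W_mlm = [[rng.uniform(-1, 1) for _ in range(VOCAB)] for _ in range(HIDDEN)]
b_mlm = [0.0] * VOCAB
W_imp = [rng.uniform(-1, 1) for _ in range(HIDDEN)]
b_imp = 0.1
token_embs = [[rng.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(LENGTH)]

v_q = pool_lexical(token_embs, W_mlm, b_mlm, W_imp, b_imp)
print(len(v_q), [round(x, 3) for x in v_q])
```

Because each softmax output sums to one, pooling a single token yields a vector whose entries sum to that token's weight, which is one way to see why the term weight has to be modeled separately from the softmax output.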

Text Aggregation Pruning
We consider $v_q \in \mathbb{R}^{|V_{\text{BERT}}|}$ as a bag-of-words representation with each dimension storing the corresponding term weight. Thus, dimensions with low term weights indicate that the corresponding terms are not important and can be pruned.
Based on this intuition, we propose to prune term weights in $v_q$ by evenly and randomly dividing the dimensions (vocabulary) into d slices, $(S_1, S_2, \cdots, S_d)$, where each slice consists of a set of $|V_{\text{BERT}}| / d$ index positions. Then, we condense $v_q$ into a d-dimensional vector by pruning the term weights in each slice $S_n$:
$$\text{agg}^{+}_q[n] = \max_{v \in S_n} v_q[v], \quad n \in \{1, 2, \cdots, d\}. \tag{4}$$
We call the operation in Eq. (4) slice max pooling, where each value in $\text{agg}^{+}_q$ represents the weight of the most important term in the slice. Slice max pooling is an important operation to prune the term weights while performing dimensionality reduction for dense passage retrieval. Other effective approaches to pruning lexical representations, e.g., top-k pruning (Yang et al., 2021) and FLOP regularization (Formal et al., 2021b), do not reduce the vector dimensionality. Thus, they generate sparse representation models that require inverted indexes for efficient retrieval.
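Slice max pooling can be illustrated with a toy vocabulary of 10 terms and d = 5 slices. The sketch below is hedged: the shuffling seed and the round-robin assignment are assumptions, and because 30522 is not a multiple of 640, this version simply lets slice sizes differ by at most one; how the remainder is handled in the authors' code is not stated here. The names `make_slices` and `slice_max_pool` are hypothetical.

```python
import random
from typing import List


def make_slices(vocab_size: int, d: int, seed: int = 0) -> List[List[int]]:
    """Randomly partition vocabulary indices into d near-equal slices."""
    assert 0 < d <= vocab_size, "need at least one index per slice"
    indices = list(range(vocab_size))
    random.Random(seed).shuffle(indices)
    # Round-robin over the shuffled indices keeps slice sizes within one of each other.
    return [indices[n::d] for n in range(d)]


def slice_max_pool(v: List[float], slices: List[List[int]]) -> List[float]:
    """Keep only the largest term weight in each slice."""
    return [max(v[t] for t in s) for s in slices]


# Toy lexical vector over a 10-term vocabulary (assumed values).
v_q = [0.0, 0.7, 0.0, 0.1, 0.0, 0.0, 0.3, 0.0, 0.0, 0.2]
slices = make_slices(vocab_size=10, d=5)
agg_plus_q = slice_max_pool(v_q, slices)
print(slices)
print(agg_plus_q)
```

The same slices must be shared by queries and passages, since the pruned vectors are only comparable position by position.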
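We call $\text{agg}^{+}_q \in \mathbb{R}^d$ the semi-aggregated lexical representation for query q since it only distributes vectors over the positive orthant and does not fully use the d-dimensional vector space. Our goal is to approximate the dot product between $v_q$ and $v_p$ in Eq. (3) by the ones in Eq. (4):
$$v_q \cdot v_p \approx \text{agg}^{+}_q \cdot \text{agg}^{+}_p. \tag{5}$$
Note that the approximation error in Eq. (5) partially comes from term misalignment:
$$\text{id}_q[n] \neq \text{id}_p[n], \quad \text{id}_q[n] = \arg\max_{v \in S_n} v_q[v], \tag{6}$$
where the values in $\text{agg}^{+}_q[n]$ and $\text{agg}^{+}_p[n]$ do not represent the same term. Alternatively, this can be explained as fuzzy matching between two lexical representations since the two different wordpiece tokens may interact and contribute to the dot product. Term misalignment increases as d becomes smaller with respect to $|V_{\text{BERT}}|$; thus, the error increases as well, which we show in Section 5.4.
To mitigate this error, we distribute the semi-aggregated lexical representation to the negative orthants to form what we call the fully aggregated lexical representation, distributed over the entire d-dimensional space:
$$\text{agg}^{\star}_q[n] = \begin{cases} \text{agg}^{+}_q[n], & \text{if } \text{id}_q[n] \in S^{+}_n \\ -\text{agg}^{+}_q[n], & \text{if } \text{id}_q[n] \in S^{-}_n \end{cases} \tag{7}$$
where $S^{+}_n$ and $S^{-}_n$ are disjoint subsets of $S_n$ (i.e., $S^{+}_n \cup S^{-}_n = S_n$ and $S^{+}_n \cap S^{-}_n = \emptyset$). That is, we evenly distribute the elements in $S_n$ to $S^{+}_n$ and $S^{-}_n$.
The dot product between two fully aggregated lexical representations then becomes:
$$\text{agg}^{\star}_q \cdot \text{agg}^{\star}_p = \sum_{n=1}^{d} \sigma_n \cdot \text{agg}^{+}_q[n] \cdot \text{agg}^{+}_p[n], \tag{8}$$
where the cases are:
$$\sigma_n = \begin{cases} +1, & \text{if } \text{id}_q[n], \text{id}_p[n] \in S^{+}_n \text{ or } \text{id}_q[n], \text{id}_p[n] \in S^{-}_n \\ -1, & \text{otherwise.} \end{cases}$$
That is, the dot product of agg ⋆ in Eq. (8) avoids interactions between misaligned terms in the above cases (with 50% probability), which agg + in Eq. (5) does not consider. Note that we do not store the vectors $\text{id}_p$ and $\text{id}_q$ to compute Eq. (8).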
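A small worked example makes the effect of the sign assignment concrete. The sketch below follows the rule stated in the Figure 3 caption (a slice's value becomes negative when the pooled term belongs to the negative half of the slice). The fixed slices, the choice of the first half of each slice as the positive subset, and all names are assumptions made for illustration.

```python
from typing import List

Vector = List[float]


def dot(u: Vector, w: Vector) -> float:
    return sum(a * b for a, b in zip(u, w))


def semi_aggregate(v: Vector, slices: List[List[int]]) -> Vector:
    """agg+: maximum term weight per slice (non-negative values only)."""
    return [max(v[t] for t in s) for s in slices]


def full_aggregate(v: Vector, slices: List[List[int]]) -> Vector:
    """agg*: same magnitude as agg+, negated when the winning term is in S-."""
    out = []
    for s in slices:
        s_pos = set(s[: (len(s) + 1) // 2])     # assumed split: first half is S+
        best = max(s, key=lambda t: v[t])       # index of the pooled term
        sign = 1.0 if best in s_pos else -1.0
        out.append(sign * v[best])
    return out


# Vocabulary of 4 terms, 2 slices; S+ = {0}, {2} and S- = {1}, {3}.
slices = [[0, 1], [2, 3]]
v_q = [0.9, 0.0, 0.0, 0.5]   # query weights (toy values)
v_p = [0.0, 0.8, 0.0, 0.5]   # passage weights (toy values)

exact = dot(v_q, v_p)                                                  # only term 3 matches
semi = dot(semi_aggregate(v_q, slices), semi_aggregate(v_p, slices))
full = dot(full_aggregate(v_q, slices), full_aggregate(v_p, slices))
print(round(exact, 2), round(semi, 2), round(full, 2))
```

In slice 0 the query's term 0 and the passage's term 1 are misaligned. With agg + they contribute +0.72 to the score as if they matched; with agg ⋆ they land in opposite halves, so the contribution flips to -0.72, while the truly matching term in slice 1 contributes +0.25 in both cases. Had the two misaligned terms fallen into the same half, the sign flip would not have applied, which is consistent with the 50% figure in the text.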

Fine-Tuning and Retrieval
Although agg ⋆ can mitigate the issue of term misalignment, the approximation error cannot be completely eliminated unless $d = |V_{\text{BERT}}|$. To enhance retrieval effectiveness, we concatenate the agg ⋆ vector with the [CLS] vector since they are pre-trained to capture textual representations in different ways, focusing on the lexical and semantic, respectively.
In our Aggretriever model, the scoring function is the dot product of the concatenated vectors:
$$\text{sim}(q, p) = (e_{q_{[\text{CLS}]}} \oplus \text{agg}^{\star}_q) \cdot (e_{p_{[\text{CLS}]}} \oplus \text{agg}^{\star}_p), \tag{9}$$
where ⊕ means vector concatenation. The vector $e_{q_{[\text{CLS}]}} \oplus \text{agg}^{\star}_q$ captures representations pre-trained from both NSP and MLM.
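Since the score is a dot product over concatenated vectors, it decomposes into a [CLS] part plus a lexical part. The toy check below uses made-up two- and three-dimensional vectors (the real model uses 128 and 640 dimensions); the variable names are illustrative.

```python
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def concat(a: Vector, b: Vector) -> Vector:
    """Vector concatenation (the ⊕ operator in the text)."""
    return a + b


# Assumed toy values for the [CLS] and aggregated lexical vectors.
cls_q, agg_q = [0.5, -0.2], [0.9, -0.5, 0.0]
cls_p, agg_p = [0.4, 0.1], [0.8, -0.5, 0.3]

sim = dot(concat(cls_q, agg_q), concat(cls_p, agg_p))
sim_cls = dot(cls_q, cls_p)   # contribution of the [CLS] vectors
sim_agg = dot(agg_q, agg_p)   # contribution of the lexical vectors
print(sim, sim_cls + sim_agg)
```

This decomposition is what makes it possible to talk about the two components of the score separately while still searching a single dense index.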
During fine-tuning, we minimize the negative log-likelihood of a relevant query-passage pair. Specifically, given a query q, its relevant passage $p^{+}$, and a set of negative passages $\{p^{-}_j\}$, we train our model by minimizing the negative log-likelihood (NLL) of the positive $\{q, p^{+}\}$ pair over all the passages, i.e.,
$$L = -\log \frac{\exp(\text{sim}(q, p^{+}))}{\exp(\text{sim}(q, p^{+})) + \sum_{j=1}^{bs} \exp(\text{sim}(q, p^{-}_j))}. \tag{10}$$
Following Karpukhin et al. (2020), we include the positive and negative passages from the other queries in the same batch as the negatives. In addition, we also use the same NLL loss, $L_{\text{agg}}$ and $L_{[\text{CLS}]}$, to optimize $\text{sim}_{\text{agg}}$ and $\text{sim}_{[\text{CLS}]}$ separately. The final loss combines $L$ with $L_{\text{agg}}$ and $L_{[\text{CLS}]}$, using $\lambda_1$ and $\lambda_2$ as weights. We set $\lambda_1$ and $\lambda_2$ to 0.5 in all our experiments. While conducting end-to-end retrieval, we use Flat IP in Faiss (Johnson et al., 2021) to index the passage vectors. Note that in our main experiments, we project $e_{q_{[\text{CLS}]}}$ and $e_{p_{[\text{CLS}]}}$ to 128 dimensions through a linear layer and set d = 640 for agg ⋆ so that the total dimensionality is 768.
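The NLL objective with in-batch negatives can be sketched as a softmax cross-entropy over a query-by-passage score matrix. This is a simplified illustration: the score values are invented, each row treats every passage other than the query's own positive as a negative, and the function name `nll_loss` is hypothetical.

```python
import math
from typing import List


def nll_loss(scores: List[List[float]], positive_idx: List[int]) -> float:
    """Mean negative log-likelihood of each query's positive passage.

    scores[i][j] is sim(q_i, p_j) over all passages in the batch;
    positive_idx[i] is the column of the passage relevant to q_i.
    """
    total = 0.0
    for row, pos in zip(scores, positive_idx):
        m = max(row)  # subtract the max for a stable log-sum-exp
        log_z = m + math.log(sum(math.exp(s - m) for s in row))
        total += log_z - row[pos]
    return total / len(scores)


# Two queries, four passages in the batch (assumed scores).
scores = [
    [5.0, 1.0, 0.5, 0.2],   # q_0: passage 0 is the positive
    [0.3, 4.0, 0.1, 0.6],   # q_1: passage 1 is the positive
]
loss = nll_loss(scores, positive_idx=[0, 1])
print(round(loss, 4))
```

The same function could be applied to the full score and to each component score in turn, which is one straightforward way to obtain the separate loss terms mentioned above.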

Datasets
In-Domain Evaluations. We evaluate in-domain retrieval effectiveness on web search and open-domain question answering. Table 1 provides statistics of the datasets.
For web search, we use the MS MARCO passage ranking dataset introduced by Bajaj et al. (2016), comprising a corpus with 8.8M passages and around 500K training queries. We evaluate model effectiveness on the following query sets: (1) MARCO Dev, 6980 queries from the development set with one relevant passage per query on average. Following the established procedure, we report RR@10 and R@1000 as the evaluation metrics. (2) TREC DL (Craswell et al., 2019, 2020), created by the organizers of the 2019 (2020) Deep Learning Tracks at the Text REtrieval Conferences (TRECs), where 43 (53) queries with graded relevance labels are released. We report nDCG@10, used by the organizers as the main metric.
For open-domain question answering, we use the Wikipedia corpus released by Karpukhin et al. (2020) and conduct experiments on two query sets, Natural Questions (NQ; Kwiatkowski et al., 2019) and Trivia QA (TQA; Joshi et al., 2017). We directly use the training and test sets released by Karpukhin et al. (2020) for training and evaluation, respectively. For this task, we report hit accuracy at cutoffs 5, 20, and 100, denoted R@5/20/100.
[...] (Sciavolino et al., 2021), which are challenging for dense retrieval models. We report hit accuracy at cutoffs 20 and 100 (R@20/100). In addition, we use BEIR (Thakur et al., 2021), consisting of 18 distinct IR datasets spanning diverse domains and tasks, including retrieval, question answering, fact checking, question paraphrasing, and citation prediction. We conduct zero-shot retrieval on 14 of the 18 datasets that are publicly available (see footnote 2). We report nDCG@10 averaged over the 14 datasets.
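Two of the metrics named in this section, RR@10 and hit accuracy at a cutoff, are easy to state in code. The sketch below assumes that relevance is given as a set of passage ids per query; for the question answering datasets, hit accuracy is typically judged by whether a retrieved passage contains the answer, which this simplification does not model. The run and judgment data are invented.

```python
from typing import Dict, List, Set


def reciprocal_rank_at_k(ranked: List[str], relevant: Set[str], k: int = 10) -> float:
    """1 / rank of the first relevant passage within the top k, else 0."""
    for rank, pid in enumerate(ranked[:k], start=1):
        if pid in relevant:
            return 1.0 / rank
    return 0.0


def hit_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """1 if any relevant passage appears within the top k, else 0."""
    return 1.0 if any(pid in relevant for pid in ranked[:k]) else 0.0


def mean(values: List[float]) -> float:
    return sum(values) / len(values)


# Toy retrieval runs and relevance judgments (assumed data).
runs: Dict[str, List[str]] = {
    "q1": ["p3", "p1", "p9"],
    "q2": ["p4", "p5", "p6"],
}
qrels: Dict[str, Set[str]] = {"q1": {"p1"}, "q2": {"p8"}}

mrr10 = mean([reciprocal_rank_at_k(runs[q], qrels[q], k=10) for q in runs])
hit5 = mean([hit_at_k(runs[q], qrels[q], k=5) for q in runs])
print(mrr10, hit5)
```

Both metrics are averaged over queries, so a single query with no relevant passage in the top k pulls the mean down proportionally.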

Models
Since our approach to text aggregation can be applied to any existing pre-trained encoder-only model, we test the effectiveness of Aggretriever on two pre-trained language models and two further pre-trained models: (1) BERT (Devlin et al., 2018); (2) DistilBERT (Sanh et al., 2019), a 6-layer transformer distilled from BERT; (3) Condenser (Gao and Callan, 2021), a BERT model further pre-trained with the tasks of auto-encoding and skip-connection MLM; and (4) coCondenser (Gao and Callan, 2022), a corpus-aware Condenser combining the tasks of skip-connection MLM and an ICT variant that comes in two separate flavors, further pre-trained on the MS MARCO and Wikipedia corpora, respectively. All model checkpoints can be downloaded from the Hugging Face Model Hub (https://huggingface.co/models). We compare models fine-tuned using only the [CLS] vector and based on our approach with the subscripts "CLS" and "AGG", respectively, e.g., BERT CLS and BERT AGG. In addition, we also report the effectiveness of BM25 as a reference point; these results come from the Pyserini IR toolkit (Lin et al., 2021a).
Footnote 2: We exclude BioASQ, Signal-1M, TREC-NEWS, and Robust04.
For implementation details, we refer readers to Appendix A.1. It is worth emphasizing that in our main experiments, we do not leverage any expensive fine-tuning strategies such as hard negative mining or knowledge distillation. Thus, we fine-tune all the DPR models under the same settings for a fair comparison. Additional detailed comparisons are provided in Appendix A.2.

In-Domain Evaluations
Fine-Tuning with Full Training Data. Table 2 compares in-domain retrieval effectiveness across the various models. We observe that our approach consistently improves on DistilBERT and BERT across all datasets, especially for metrics that emphasize top rankings. For example, [...]
Table 4: Near-domain zero-shot retrieval effectiveness comparisons using NQ or TQA for fine-tuning. Bold denotes the best model for that metric.
For the further pre-trained models, we observe that both Condenser AGG and coCondenser AGG yield effectiveness gains on MS MARCO and TQA (rows 6 and 8), which suggests that our approach is orthogonal and additive to further pre-training methods.We observe that in some cases, Aggretriever using pre-trained BERT as the backbone can obtain better retrieval effectiveness than further pre-trained models that are fine-tuned only on the [CLS] vector.For example, BERT AGG outperforms Condenser CLS for MS MARCO and TQA (row 4 vs 5).This indicates that existing language models pre-trained on MLM can serve as an effective single-vector dense retriever, without further pre-training, using our proposed methods.Without corpus-aware further pre-training, Condenser AGG is competitive with coCondenser CLS on MS MARCO and TQA (row 6 vs 7).
Fine-Tuning with Limited Data. Table 3 reports retrieval effectiveness when the models are fine-tuned on subsets of the MS MARCO training data. Specifically, we randomly sample 1K and 10K queries from the training queries and fine-tune the models on each set for 40 epochs. We first observe that with only 1K training queries, both DistilBERT CLS and BERT CLS underperform BM25 (rows 1, 3 vs a), while both DistilBERT AGG and BERT AGG surpass BM25 (rows 2, 4 vs a) and are on par with Condenser CLS (row 5), indicating that our approach successfully aggregates text information into a single vector without any further pre-training. We observe similar trends when fine-tuning models with 10K training queries. Finally, we find that coCondenser CLS performs the best when fine-tuning with limited training data. This is probably because coCondenser's further pre-training is designed for the [CLS] vector to learn corpus-aware signals from pseudo relevance in addition to skip-connection MLM. Thus, the [CLS] vector is more "ready" for retrieval with small training data.

Zero-Shot Evaluations
Near-Domain Retrieval Effectiveness. In these experiments, we examine robustness in a zero-shot retrieval setting. We first consider transfer to "near-domain" (Wikipedia) datasets, reported in Table 4. Specifically, we perform retrieval on test queries from SQuAD and EntityQs using models fine-tuned on NQ or TQA.
We see that Aggretriever with any backbone yields sizable gains over its [CLS] counterpart [...] (and coCondenser CLS) in SQuAD using NQ as the source (e.g., row 6 vs 5). It is worth mentioning that using TQA as the source, Aggretriever with any backbone is competitive with BM25 while the other [CLS] models still lag behind BM25 on the EntityQs test queries. Finally, we observe that models fine-tuned on TQA have better zero-shot retrieval effectiveness in near-domain datasets compared to those fine-tuned on NQ, which is also observed by Ram et al. (2022).
Multi-Domain Retrieval Effectiveness. In addition, we evaluate zero-shot retrieval effectiveness on the multi-domain BEIR dataset, reported in Table 5. We evaluate the models fine-tuned on three different sources: MS MARCO, NQ, and TQA. Similarly, Aggretriever shows better zero-shot retrieval effectiveness compared to its [CLS] counterpart with any backbone. For example, our model consistently and substantially outperforms the comparable baselines using MS MARCO and TQA as the source dataset for fine-tuning. Although models fine-tuned on NQ show the worst zero-shot retrieval capability, Aggretriever with any backbone still slightly outperforms its [CLS] counterpart. It is also worth mentioning that Aggretriever with any backbone fine-tuned on MS MARCO outperforms the strong BM25 baseline.

Fine-Tuning with Noisy Hard Negatives
In this experiment, we use DistilBERT AGG to examine Aggretriever's robustness to fine-tuning with noisy hard negatives. Following TCT (Lin et al., 2021b) and RocketQA (Qu et al., 2021), for each [...]
The results are listed in Table 6; we directly copy the numbers of TCT and RocketQA from the original papers. We notice that hard negatives reduce the effectiveness of both TCT and RocketQA since there are many false negatives in the candidates, as noted by Qu et al. (2021). They address this issue using expensive training strategies: knowledge distillation, denoising, and cross-batch negative sampling. On the other hand, DistilBERT AGG obtains competitive retrieval effectiveness without any expensive training strategies. This experiment demonstrates that Aggretriever is robust and able to extract useful information when fine-tuned with hard negatives.

Ablation Study
In this experiment, we use DistilBERT AGG fine-tuned on the MS MARCO dataset to conduct an ablation study. In addition to MARCO Dev, to understand the zero-shot effectiveness of each condition, we conduct retrieval on a subset of BEIR (denoted BEIR small), consisting of five datasets from different domains: NFCorpus, FiQA, ArguAna, SCIDOCS, and SciFact. We report nDCG@10 averaged over these five datasets.
Dimensionality Ablation. We first study the effects of dimensionality on the [CLS] and agg ⋆ vectors in Table 7. We find that [CLS] alone slightly outperforms agg ⋆ alone (row 1 vs 4) on in-domain evaluation while the reverse trend is seen on zero-shot evaluation. This observation indicates that the [CLS] and agg ⋆ vectors encode text in different ways and that combining them further improves retrieval effectiveness (row 5). Compared to [CLS] alone and agg ⋆ alone, we still see a slight improvement for in-domain evaluation at 256 dimensions (row 6 vs 1 and 4). Holding the number of dimensions constant (rows 1-4), the best condition (row 3) indicates that the agg ⋆ vector requires more space than the [CLS] vector. Finally, we report the retrieval effectiveness of the original wordpiece lexical representations before pruning (row 7), which can be considered the effectiveness upper bound of agg ⋆. Although agg ⋆ with 768 dimensions has lower effectiveness (row 4 vs 7), combined with [CLS], Aggretriever reduces the gap (rows 3, 5 vs 7), with better retrieval efficiency in terms of smaller index size and lower retrieval latency. For example, on the MS MARCO dataset, representing each passage as a 768-dimensional vector in a Faiss Flat index with 32 (16) bits requires 26 (13) GB and has a retrieval latency of 100 ms/q on a single V100 GPU, while the 30522-dimensional vectors (without pruning) require around 40 times more index storage and are not practical for end-to-end retrieval.
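The index sizes quoted above can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes the corpus size rounded to 8.8M passages and counts raw vector storage only, ignoring any index overhead; the function name is hypothetical.

```python
def flat_index_size_gb(num_vectors: int, dim: int, bits_per_value: int) -> float:
    """Raw storage of a flat (uncompressed) index in gigabytes (10^9 bytes)."""
    return num_vectors * dim * bits_per_value / 8 / 1e9


NUM_PASSAGES = 8_800_000  # MS MARCO corpus size, rounded as in the text

size_32 = flat_index_size_gb(NUM_PASSAGES, dim=768, bits_per_value=32)
size_16 = flat_index_size_gb(NUM_PASSAGES, dim=768, bits_per_value=16)
ratio = 30522 / 768  # storage growth when keeping unpruned lexical vectors

print(round(size_32, 1), round(size_16, 1), round(ratio, 1))
```

The estimate comes out at about 27 GB (roughly 25 GiB) for 32-bit values and half of that for 16-bit values, in line with the reported 26 (13) GB, and the ratio of about 39.7 matches the "around 40 times" figure for the unpruned 30522-dimensional vectors.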
Pooling Stage Ablation. In the second ablation experiment, we fix [CLS] and agg ⋆ to 128 and 640 dimensions, respectively, and compare different designs of the pooling stage to form agg ⋆, as discussed in Section 3.1. The results are reported in the first main block of Table 8; row 1 is our default condition. In row 2, we remove the term importance component and assign a term weight of one for weighted max pooling. A substantial drop in retrieval effectiveness can be observed. In row 3, we remove MLM projection and represent each query (or passage) token with the 30522-dimensional indicator vector in Eq. (2); that is, $p_{q_i} = x_j \in \{0, 1\}^{|V_{\text{BERT}}|}$ for $j \in \{\text{token\_id}(q_i)\}$. We notice that skipping the MLM projector modestly harms retrieval effectiveness. This means that most textual information can be captured without the MLM projector, but it does help. This is sensible since the 30522-dimensional indicator vector still retains each original query (or passage) term. A comparison of row 2 and row 3 shows that learned term weights for each token are more important than the term semantic distribution (projected by MLM) over the wordpiece vocabulary.
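The row 3 condition (no MLM projection) has a particularly simple form: with an indicator vector per token, weighted max pooling reduces to writing each token's learned weight at its own vocabulary position. The sketch below uses invented token ids and weights and a toy vocabulary of size 10; `pool_indicator` is a hypothetical name.

```python
from typing import List


def pool_indicator(token_ids: List[int], term_weights: List[float],
                   vocab_size: int) -> List[float]:
    """Weighted max pooling when each token is a one-hot vector over the vocabulary."""
    v = [0.0] * vocab_size
    for tid, w in zip(token_ids, term_weights):
        # A repeated token keeps the largest weight it received.
        v[tid] = max(v[tid], w)
    return v


# Token id 3 occurs twice with different contextual weights (assumed values).
v_q = pool_indicator(token_ids=[3, 7, 3], term_weights=[0.4, 1.2, 0.9], vocab_size=10)
print(v_q)
```

Only positions of terms that actually occur in the text are non-zero here, so this variant cannot expand to related terms the way the softmax over the full vocabulary can.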
Pruning Stage Ablation.In the second main block of Table 8, we study the effects of pruning wordpiece lexical representations on Aggretriever.For example, we semi-aggregate (linearly project) the lexical representations into 640-dimensional dense vectors, as shown in row 4 (5).We observe that our non-parametric pruning approaches are better than the learned ones (rows 1, 4 vs 5).Although agg + shows the same retrieval effectiveness as agg ⋆ on in-domain evaluation, a substantial drop can be observed on out-of-domain evaluation (row 1 vs 4).This result demonstrates that our fully aggregated representations better preserve information from lexical representations and appear to be more robust to domain shifts.
We observe that directly projecting averaged contextualized embeddings (excluding the [CLS]), denoted AVERAGE, into 640 dimensions, and then concatenating with [CLS] (row 6), does not perform well, indicating that projecting contextualized token embeddings into the high-dimensional wordpiece lexical space before pooling is key to preserving lexical information. Finally, we also try average pooling over all contextualized embeddings (including the [CLS]), which corresponds to RepBERT (Zhan et al., 2020). This yields negligible effectiveness difference from AVERAGE concatenated with [CLS] (row 7 vs 6); i.e., 0.306 (RR@10) and 0.264 (nDCG@10) on MARCO Dev and BEIR small, respectively.
To further understand the differences between pruned lexical representations (rows 1, 4, 5 in Table 8), we fine-tune DistilBERT using each representation alone (without using [CLS]) with 128, 256, and 768 dimensions on the MS MARCO dataset and compare their retrieval effectiveness on MS MARCO Dev and BEIR small in Fig. 4. We observe that agg ⋆ performs better than agg + under all conditions, demonstrating that distributing representations to the full vector space can mitigate the problem of term misalignment (rectangles vs triangles) mentioned in Section 3.2, especially when the number of dimensions is small. Although the linearly projected lexical representations (diamonds) show better in-domain retrieval effectiveness than our non-parametric pruning approaches (agg + and agg ⋆) with 128 and 256 dimensions, agg ⋆ still exhibits better zero-shot retrieval effectiveness. This indicates that the learned linear projector helps compress textual information into low-dimensional space in a way that is biased toward the training data. In addition, in Fig. 4, we also show the retrieval effectiveness of [CLS] and AVERAGE (solid and hollow circles) as comparisons. We observe that although all 768-dimensional textual representations reach similar in-domain retrieval effectiveness, [CLS] and AVERAGE show poor zero-shot retrieval effectiveness on BEIR small compared to the other models pruned from 30K-dimensional lexical representations. We hypothesize that [CLS] and AVERAGE capture textual information in a different manner than our lexical representations. This explains why fusing [CLS] with pruned lexical representations performs better than AVERAGE (rows 1, 4, 5 vs 6 in Table 8).
However, [CLS] and AVERAGE do not exhibit much retrieval effectiveness drop on both in-domain and zero-shot evaluations when reducing the number of dimensions. This is probably because lexical representations contain fine-grained textual information in 30K-dimensional lexical space while [CLS] and AVERAGE embeddings capture high-level textual information in low-dimensional semantic space. This result also explains the optimal balance in Table 7, where agg ⋆ requires more space than [CLS] when restricting the total vector dimension to 768.

Query Encoding Latency
Although different single-vector dense retrievers with the same vector dimensionality have similar retrieval latency under the same software and environment when performing top-k retrieval, query encoding latency is also an important component to consider. In this experiment, we compare the query encoding latency of DistilBERT AGG and DistilBERT CLS. We measure the time required to encode the 6980 queries from MARCO Dev with batch size one on the CPU and GPU, using one thread on a Linux machine with a 2.2 GHz Intel Xeon Silver 4210 CPU and a single Tesla V100 GPU (32GB), respectively. We report the latency at the 1st, 50th, and 99th percentiles in Table 9.
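A measurement of this kind can be sketched as a per-query timing loop followed by percentile extraction. Everything below is an assumption made for illustration: `toy_encode` is a placeholder for a real query encoder, the queries are invented, and the nearest-rank percentile definition may differ from the one used to produce Table 9.

```python
import math
import time
from typing import Callable, Dict, List


def percentile(values: List[float], p: float) -> float:
    """Nearest-rank percentile of a non-empty list (p in (0, 100])."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def measure_latencies_ms(encode: Callable[[str], object],
                         queries: List[str]) -> List[float]:
    """Encode queries one at a time (batch size one) and record wall-clock time."""
    latencies = []
    for query in queries:
        start = time.perf_counter()
        encode(query)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def toy_encode(query: str) -> List[float]:
    """Stand-in for a real encoder; returns one number per whitespace token."""
    return [float(len(token)) for token in query.split()]


queries = ["what is dense retrieval", "bm25", "how does max pooling work"]
latencies = measure_latencies_ms(toy_encode, queries)
summary: Dict[int, float] = {p: percentile(latencies, p) for p in (1, 50, 99)}
print(summary)
```

With a real model, warm-up runs and a fixed thread count would matter for stable numbers; the text above fixes the latter to one thread.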
We observe that query encoding with Aggretriever is slightly slower than its [CLS] counterpart on the GPU (row 2 vs 1).On the CPU, the gap is much larger, especially for tail queries.However, from row 3 (the same condition as row 3 in Table 8), we see that skipping the MLM head projection step reduces the query encoding latency with only a small retrieval effectiveness loss.For a real-world application, this might be a sensible option, bringing query encoding latency roughly in line with the [CLS]-only model.

Comparison with Sparse Retrievers
In our final set of experiments, we compare Aggretriever and sparse retrievers since we borrow ideas from existing learned sparse retrieval models such as SPLADE-max (Formal et al., 2021a), which uses a different activation function after the MLM projector and adds sparsity regularization to generate sparse lexical representations for inverted indexes. For comparison to a sparse retriever without MLM projection, we use uniCOIL without expansions from T5 (Nogueira and Lin, 2019). Both models are fine-tuned on MS MARCO with BM25 negatives; thus, they represent reasonably fair comparisons to DistilBERT AGG and its variant without MLM, respectively (although uniCOIL uses BERT as a backbone). We index and evaluate SPLADE-max and uniCOIL using the code provided by Formal et al. (2021a) and Pyserini (Lin et al., 2021a), respectively.
Results are shown in Table 10. We first observe that DistilBERT CLS shows competitive in-domain retrieval effectiveness but underperforms sparse retrievers on out-of-domain evaluations (row 1 vs 5). This indicates that sparse retrieval using lexical matching has better generalization across retrieval tasks than dense retrieval with [CLS] alone. On the other hand, DistilBERT AGG and its variant show equally good generalization capability compared to the sparse retrievers (rows 2, 3 vs 4, 5). We attribute the transferability of Aggretriever to agg ⋆, which effectively aggregates and preserves information from wordpiece lexical representations.
Finally, we observe that without the MLM projector, the effectiveness of the sparse retrievers degrades, especially on in-domain evaluation (row 4 vs 5), while agg ⋆ only sees a slight degradation (row 2 vs 3).We hypothesize that the MLM projector helps sparse retrievers learn semantic matching as well as exact term matching.In contrast, Aggretriever can still learn semantic matching, even without the MLM projector, because it benefits from fusion with the [CLS] vector.

Related Work
Dense Retrieval. The most related line of research to our own work is the literature on how to effectively fine-tune a single-vector dense retriever. On the one hand, some researchers propose computationally expensive fine-tuning techniques such as hard negative mining strategies (Xiong et al., 2021; Zhan et al., 2021b), knowledge distillation (Lin et al., 2021b; Hofstätter et al., 2021), or their combination (Qu et al., 2021). On the other hand, others leverage further pre-training to improve the subsequent fine-tuning (Lee et al., 2019; Gao et al., 2021b; Lu et al., 2021; Gao and Callan, 2021; Izacard et al., 2021; Gao and Callan, 2022; Liu and Shao, 2022). As far as we are aware, our work is the first to discuss how to fine-tune dense retrieval models to effectively aggregate textual information from the pre-trained MLM head rather than directly using the [CLS] vector or contextualized embeddings from max or average pooling (Reimers and Gurevych, 2019).
Sparse Retrieval. Previous work (Bai et al., 2020; Mallia et al., 2021; Formal et al., 2021b; Lin and Ma, 2021) has demonstrated that projecting contextualized token embeddings into a high-dimensional vector in the wordpiece vocabulary space is an effective way to represent token-level information from transformers for lexical matching. These models directly feed the high-dimensional vectors into an inverted index for retrieval. Thus, sparsity control for effectiveness-efficiency tradeoffs involves additional considerations (Mackenzie et al., 2021). In contrast, our approach converts high-dimensional vectors into low-dimensional ones where top-k retrieval can be performed directly using ANN search libraries (Guo et al., 2020; Johnson et al., 2021).
Hybrid Retrieval. Our work can be characterized as hybrid since we "fuse" semantic and lexical representations into a single dense vector. Recent work (Gao et al., 2021a; Hofstätter et al., 2022; Shen et al., 2022; Lin and Lin, 2022) proposes to jointly train [CLS] and token-level representations for semantic and lexical matching, respectively. The two kinds of representations require different implementations for top-k retrieval, so multiple software stacks are required to perform retrieval. In contrast, our representations retain the best of semantic and lexical matching, but entirely as dense vectors. Thus, retrieval can be performed in a simple execution environment.
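The fusion described above can be illustrated with a toy sketch. It max-pools token-level vocabulary-space vectors, shrinks the result to a few dimensions, and concatenates it with a [CLS]-style vector, so that a single inner product covers both parts. The slice-wise max used for shrinking is an assumption chosen for illustration and may differ from the pruning strategy used in our experiments; all function names and the tiny dimensions are hypothetical.

```python
from typing import List, Sequence


def max_pool_tokens(token_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise max over the vocabulary-space vectors of all tokens."""
    return [max(column) for column in zip(*token_vectors)]


def prune_by_slices(vector: Sequence[float], num_slices: int) -> List[float]:
    """Keep the max of each contiguous slice (assumed pruning strategy)."""
    if len(vector) % num_slices != 0:
        raise ValueError("toy example expects the length to divide evenly")
    width = len(vector) // num_slices
    return [max(vector[i * width:(i + 1) * width]) for i in range(num_slices)]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))


# Toy inputs: a 2-dim [CLS]-style vector and per-token vectors over a 6-word vocabulary.
query_cls = [1.0, 2.0]
query_tokens = [[0.0, 2.0, 0.0, 0.0, 1.0, 0.0],
                [1.0, 0.0, 0.0, 0.0, 0.0, 3.0]]
passage_cls = [0.5, 0.5]
passage_tokens = [[0.0, 1.0, 0.0, 2.0, 0.0, 0.0],
                  [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
                  [2.0, 0.0, 0.0, 0.0, 0.0, 0.0]]

query_agg = prune_by_slices(max_pool_tokens(query_tokens), num_slices=3)
passage_agg = prune_by_slices(max_pool_tokens(passage_tokens), num_slices=3)

# Concatenation yields one dense vector per text.
query_vec = query_cls + query_agg
passage_vec = passage_cls + passage_agg

fused_score = dot(query_vec, passage_vec)
print(fused_score)
```

Because the two parts are concatenated, the fused score is simply the sum of the [CLS] inner product and the aggregated lexical inner product, which is why a single dense index suffices at retrieval time.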

Conclusion
In this paper, we present Aggretriever, a single-vector dense retrieval model that exploits all contextualized token embeddings from the input to BERT. We introduce a simple approach to aggregate the contextualized token embeddings into a dense vector, agg★. Experiments show that agg★ combined with the standard [CLS] vector achieves better retrieval effectiveness than using the [CLS] vector alone for both in-domain and zero-shot evaluations. Our work demonstrates that MLM pre-trained transformers can be fine-tuned into effective dense retrievers without further pre-training or expensive fine-tuning strategies.
Our work leads to a few open questions for future research: (1) Since we have demonstrated that Aggretriever still benefits from further pre-training, can we design additional pre-training tasks tailored directly to our model? The design of these tasks, of course, needs to be mindful of the computational costs. (2) Can we apply current state-of-the-art compression techniques to Aggretriever? Zhan et al. (2021a, 2022) have shown that 768-dimensional dense representations can be effectively compressed into much smaller vectors. However, it is still unknown whether these techniques can be applied to Aggretriever while retaining both in-domain and zero-shot retrieval effectiveness. (3) Finally, can we apply Aggretriever to multi-lingual retrieval? Since the MLM head of a multi-lingual BERT model can project into tokens in multiple languages, we can envision a natural extension. However, as shown in Section 5.5, MLM projection is expensive, and the issue becomes worse when using a pre-trained multi-lingual model since the vocabulary size is usually even larger.

A.1 Implementation Details
We implement our models using Tevatron (Gao et al., 2022) and apply its default training settings in most tasks. For MS MARCO, we train models for three epochs with a learning rate of 5e-6, and each batch includes 8 queries. Each query is paired with a randomly sampled positive passage and 7 negative passages mined using BM25. The maximum query and passage lengths are set to 32 and 128, respectively. Note that we use the official training set and corpus instead of the ones in Tevatron, which are further processed by Qu et al. (2021). For open-domain QA, we follow the original settings used by Karpukhin et al. (2020) except for two modifications: (1) we use shared instead of independent weights between the query and passage encoders; (2) we set the maximum query and passage lengths to 32 and 156 for faster fine-tuning and inference. Note that we use one and four Tesla V100 GPUs (32GB) for model fine-tuning on MS MARCO and open-domain QA, respectively. For BEIR evaluation, we use the APIs provided by Thakur et al. (2021) and set maximum query and passage input lengths to 512.
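For reference, the MS MARCO settings above can be collected into a small configuration object. This is only a sketch: the class and field names are hypothetical and are not Tevatron's actual argument names, while the values are the ones stated in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MsMarcoFineTuneConfig:
    """MS MARCO fine-tuning settings from this section (field names are hypothetical)."""
    num_epochs: int = 3
    learning_rate: float = 5e-6
    queries_per_batch: int = 8
    positives_per_query: int = 1       # one randomly sampled positive passage
    bm25_negatives_per_query: int = 7  # negatives mined using BM25
    max_query_length: int = 32
    max_passage_length: int = 128

    @property
    def passages_per_batch(self) -> int:
        """Total number of passages encoded in each batch."""
        per_query = self.positives_per_query + self.bm25_negatives_per_query
        return self.queries_per_batch * per_query


config = MsMarcoFineTuneConfig()
print(config.passages_per_batch)
```

With 8 queries and 8 passages per query, each batch therefore encodes 64 passages.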

A.2 Comparison with Existing DPR Models
Table 11 compares Aggretriever with existing dense retrievers fine-tuned with more expensive strategies, i.e., cross-encoder knowledge distillation (KD), hard negative mining (HNM), and large in-batch negatives, on both in-domain and out-of-domain evaluations. The two baseline models without further pre-training are: (1) TAS-B (Hofstätter et al., 2021), which distills ColBERT and a cross-encoder to DPR with an efficient topic-aware sampling strategy; (2) CL-DRD (Zeng et al., 2022), which further improves TAS-B by combining curriculum learning, HNM, and cross-encoder KD. Three models with further pre-training are included: (1) coCondenser (Gao and Callan, 2022), already discussed in Section 4.2; (2) Contriever (Izacard et al., 2021), which leverages pre-training by combining advanced contrastive learning techniques with an Inverse Cloze Task (ICT) variant; (3) GTR-Base (Ni et al., 2021), which trains a T5-Base encoder model that combines pre-training, KD, and HNM. For TAS-B, Contriever, and GTR-Base, we directly copy numbers from Izacard et al. (2021) and Ni et al. (2021), respectively. For CL-DRD and coCondenser, we use the models provided by the authors to conduct in-domain and out-of-domain evaluations ourselves. Note that the coCondenser model provided by the authors is fine-tuned in another round with self-mined hard negatives. Furthermore, they use a "non-standard" MS MARCO corpus where each passage is concatenated with a title; thus, the MS MARCO Dev results are different from the values for coCondenser CLS reported in Table 2.
First, we observe that DistilBERT AGG is not only competitive with TAS-B on in-domain evaluation but also outperforms both TAS-B and CL-DRD on out-of-domain evaluation, without needing supervision from an expensive cross-encoder teacher. Second, Contriever yields the best out-of-domain results at the cost of in-domain effectiveness. On the other hand, coCondenser AGG reaches the same level of retrieval effectiveness as GTR-Base without leveraging any expensive fine-tuning strategies. Aggretriever could also be fine-tuned with KD, HNM, and large batch sizes to further improve retrieval effectiveness, but these techniques are orthogonal to our proposed model.

Figure 1: In-domain versus zero-shot effectiveness. All DPR models are trained with BM25 negatives.

Figure 4: In-domain versus zero-shot effectiveness comparisons between textual representations under different numbers of dimensions.

Table 2: In-domain retrieval effectiveness comparisons. All models are fine-tuned with negatives from BM25. Bold denotes the best model for that metric.

Table 3: In-domain retrieval effectiveness while fine-tuning models using limited training data.

Table 5: Multi-domain zero-shot retrieval effectiveness comparisons using various sources for fine-tuning. Bold denotes the best model for that metric.

Table 6: Fine-tuning with noisy hard negatives.

Table 8: DistilBERT AGG text aggregation ablation. We project [CLS] to 128 dimensions and concatenate with a 640-dimensional embedding pooled and pruned using different strategies. AVERAGE denotes average pooling over all 768-dimensional contextualized token embeddings other than [CLS].

Table 11: Comparisons with existing DPR models. * These numbers are not comparable due to the use of a "non-standard" MS MARCO passage corpus that has been augmented with title.