Abstract
Pre-trained language models have been successful in many knowledge-intensive NLP tasks. However, recent work has shown that models such as BERT are not “structurally ready” to aggregate textual information into a [CLS] vector for dense passage retrieval (DPR). This “lack of readiness” results from the gap between language model pre-training and DPR fine-tuning. Previous solutions call for computationally expensive techniques such as hard negative mining, cross-encoder distillation, and further pre-training to learn a robust DPR model. In this work, we instead propose to fully exploit knowledge in a pre-trained language model for DPR by aggregating the contextualized token embeddings into a dense vector, which we call agg★. By concatenating vectors from the [CLS] token and agg★, our Aggretriever model substantially improves the effectiveness of dense retrieval models on both in-domain and zero-shot evaluations without introducing substantial training overhead. Code is available at https://github.com/castorini/dhr.
1 Introduction
A bi-encoder architecture (Reimers and Gurevych, 2019; Karpukhin et al., 2020) based on pre-trained language models (Devlin et al., 2018; Liu et al., 2019; Raffel et al., 2020) has been widely used for first-stage retrieval in knowledge-intensive tasks such as open-domain question answering and fact checking. Compared to bag-of-words models such as BM25, these approaches circumvent lexical mismatches between queries and passages by encoding text into dense vectors.
Despite their success, recent research calls into question the robustness of these single-vector models (Thakur et al., 2021). As shown in Figure 1, single-vector dense retrievers (e.g., BERTCLS and TAS-B) trained with well-designed knowledge distillation strategies (Hofstätter et al., 2021) still underperform BM25 on out-of-domain datasets. Along the same lines, Sciavolino et al. (2021) find that simple entity-centric questions are challenging for these dense retrievers.
Recently, Gao and Callan (2021) observe that pre-trained language models such as BERT are not “structurally ready” for fine-tuning on downstream retrieval tasks. This is because the [CLS] token, pre-trained on the task of next sentence prediction (NSP), does not have the proper attention structure to aggregate fine-grained textual information. To address this issue, the authors propose to further pre-train the [CLS] vector before fine-tuning and show that the gap between pre-training and fine-tuning tasks can be mitigated (see coCondenserCLS illustrated in Figure 1). However, further pre-training introduces additional computational costs, which motivates us to ask the following question: Can we directly bridge the gap between pre-training and fine-tuning without any further pre-training?
Before diving into our proposed solution, we briefly overview the language modeling pre-training and DPR fine-tuning tasks using BERT. Figure 2(a) illustrates the BERT pre-training tasks, NSP and masked language modeling (MLM), while Figure 2(b) shows the task of fine-tuning a dense retriever. We observe that relying solely on the [CLS] vector as the dense representation does not exploit the full capacity of the pre-trained model: the [CLS] vector participates directly only in NSP during pre-training and therefore lacks information captured in the contextualized token embeddings. A simple solution is to aggregate the token embeddings by pooling (max or mean) into a single vector. However, information is lost in this process and empirical results do not show any consistent effectiveness gains. Hence, we see the need for better aggregation schemes.
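To make these alternatives concrete, the following sketch (an illustration only, not code from our released repository) contrasts taking the [CLS] vector with mean and max pooling over contextualized token embeddings, using the HuggingFace transformers API; the checkpoint and query string are placeholders.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Placeholder checkpoint; any BERT-style encoder works the same way.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("what is dense passage retrieval?", return_tensors="pt")
hidden = encoder(**inputs).last_hidden_state           # [1, seq_len, 768]
mask = inputs["attention_mask"].unsqueeze(-1).float()  # [1, seq_len, 1]

cls_vec = hidden[:, 0]                                           # [CLS] vector
mean_vec = (hidden * mask).sum(1) / mask.sum(1)                  # mean pooling
max_vec = hidden.masked_fill(mask == 0, -1e9).max(dim=1).values  # max pooling
```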
In this paper, we propose a novel approach to generate textual representations for retrieval that fully exploit contextualized token embeddings from BERT, shown in Figure 2(c). Specifically, we reuse the pre-trained MLM head to map each contextualized token embedding into a high-dimensional wordpiece lexical space. Following a simple max-pooling and pruning strategy, we obtain a compact lexical vector that we call agg★. By concatenating agg★ and the [CLS] vector, our novel Aggretriever dense retrieval model captures representations pre-trained from both NSP and MLM, improving retrieval effectiveness by a noticeable margin compared to fine-tuned models that solely rely on the [CLS] vector (see BERTAGG vs BERTCLS in Figure 1).
Importantly, fine-tuning Aggretriever does not require any sophisticated and computationally expensive techniques, making it a simple yet competitive baseline for dense retrieval. However, our approach is orthogonal to previously proposed further pre-training strategies, and can still benefit from them to improve retrieval effectiveness even more (see coCondenserAGG in Figure 1). To the best of our knowledge, this is the first work in the DPR literature that leverages the BERT pre-trained MLM head to encode textual information into a single dense vector.
2 Background and Motivation
Recent work (Gao and Callan, 2021; Lu et al., 2021) shows that the [CLS] vector remains “dormant” in most layers of pre-trained models and fails to adequately aggregate information from the input sequence during pre-training. Thus, researchers argue that the models are not “structurally ready” for fine-tuning. To tackle this issue, unsupervised contrastive learning has been proposed, which creates pseudo relevance labels from the target corpus to “prepare” the [CLS] vector for retrieval. The most representative technique is the Inverse Cloze Task (ICT; Lee et al., 2019). However, since the generated relevance data is noisy, further pre-training with ICT often requires a huge amount of computation due to the need for large batch sizes or other sophisticated training techniques (Chang et al., 2020; Izacard et al., 2021; Ni et al., 2021).
Another thread of work (Gao and Callan, 2021; Lu et al., 2021) guides transformers to aggregate textual information into the [CLS] vector through auto-encoding. This approach does not require as much computation as unsupervised contrastive learning but is still far more computationally intensive than fine-tuning. For example, Gao and Callan (2021) report that the further pre-training process still requires one week on four RTX 2080 Ti GPUs, while fine-tuning consumes less than one day in the same environment.
Recent work on neural sparse retrievers (Bai et al., 2020; Formal et al., 2021b) projects contextualized token embeddings into a high-dimensional wordpiece lexical space through the BERT pre-trained MLM projector and directly performs retrieval in wordpiece lexical space. These models demonstrate that MLM pre-trained weights can be used to learn effective lexical representations for retrieval tasks, a finding that has not been fully explored in the DPR literature. Inspired by this work, we explore reusing MLM pre-trained weights for DPR fine-tuning and further combine the [CLS] vector to fully exploit textual information in a pre-trained language model.
3 Aggretriever
In this section, we first introduce our method for text aggregation to form agg★, which consists of two steps: pooling and pruning. Then, we describe how to concatenate the aggregated text representation agg★ and [CLS] into a 768-dimensional dense vector for fine-tuning and retrieval.
3.1 Text Aggregation Pooling
The goal of text aggregation is to transform contextualized token embeddings into a single-vector text representation. Let q denote a tokenized query of length l, ([CLS], q1, q2, ⋯, ql, [SEP]), or alternatively, let p denote a tokenized passage of length m, ([CLS], p1, p2, ⋯, pm, [SEP]). One simple approach is to directly pool (mean or max) contextualized token embeddings from the final layer. Such pooling strategies have been studied in previous work (Reimers and Gurevych, 2019) but do not appear to be consistently more effective than just using the [CLS] token; this is also confirmed in our ablation study (Section 5.4).
Our design has three advantages: (1) the MLM head with softmax is used for BERT pre-training; thus, the output probabilities can accurately model each contextualized token semantically. (2) In contrast to directly pooling contextualized embeddings, important dimensions of the token representations in the high-dimensional space are less likely to overlap, resulting in non-interfering max pooling (Jang et al., 2021). (3) Finally, the term weights wi disentangle the effects of term importance from the MLM head. We study the effectiveness of this design in Section 5.4 through ablations. Note that compared to previous work on sparse retrieval (Bai et al., 2020; Formal et al., 2021b), which switches softmax to ReLU to create sparse representations, our design sticks to the original activation function used in MLM pre-training and directly outputs 30,522-dimensional dense lexical vectors (vq).
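The sketch below illustrates one way the 30,522-dimensional lexical vector vq described above can be computed with a pre-trained MLM head. It is a simplified reading of our design: the scalar term-importance weights wi are modeled here by an assumed linear layer (term_weight) followed by a ReLU, not necessarily the exact parameterization in our equations.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Assumed parameterization of the per-token importance weight w_i.
term_weight = torch.nn.Linear(model.config.hidden_size, 1)

inputs = tokenizer("what is dense passage retrieval?", return_tensors="pt")
out = model(**inputs, output_hidden_states=True)

# The MLM head maps each contextualized embedding into the 30,522-d wordpiece space.
probs = out.logits.softmax(dim=-1)        # [1, seq_len, 30522]
hidden = out.hidden_states[-1]            # [1, seq_len, 768]
w = torch.relu(term_weight(hidden))       # [1, seq_len, 1]; non-negativity is an assumption
mask = inputs["attention_mask"].unsqueeze(-1)

# Weighted max pooling over tokens yields the dense lexical vector v_q.
v_q = (w * probs).masked_fill(mask == 0, 0.0).max(dim=1).values  # [1, 30522]
```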
Figure 3(a) illustrates the generation of vq with |VBERT| = 10 for simplicity. Ideally, we could directly compute vq · vp as a lexical matching similarity score over the wordpiece lexical representations. However, the 30,522-dimensional vectors vq (vp) are too large for efficient retrieval using dense vector search libraries such as Faiss. To address this issue, we introduce a non-parametric pruning method that converts vq (vp) into a low-dimensional vector for dense retrieval.
3.2 Text Aggregation Pruning
We consider vq (vp) as a bag-of-words representation, with each dimension storing the weight of the corresponding wordpiece term. Thus, dimensions with low term weights indicate that the corresponding terms are not important and can be pruned. Our non-parametric pruning produces two variants: a semi-aggregated vector agg+ (Eq. (5)) and a fully aggregated vector agg★ (Eq. (8)). Two cases of term misalignment, (a) and (b), arise when the entries selected for the same dimension of the query and passage vectors correspond to different wordpiece terms. The dot product of agg★ in Eq. (8) avoids interactions between misaligned terms in these cases (with 50% probability), which agg+ in Eq. (5) does not account for. Note that we do not store the vectors idp and idq to compute Eq. (8). Figure 3(b) illustrates the difference between agg+ and agg★ with d = 5 and |Sn| = 2 for simplicity.
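For intuition, the sketch below shows a slice-wise max-pooling reduction in the spirit of the semi-aggregated agg+: the 30,522-dimensional lexical vector is split into d slices and each slice keeps only its maximum weight. The sign-based refinement and stored argmax indices that distinguish agg★ (Eq. (8)) are omitted, so this is an approximation for illustration rather than the exact procedure.

```python
import torch

def prune_lexical_vector(v, d=640):
    """Slice-wise max pooling of a |V_BERT|-dim lexical vector into d dims (agg+-style sketch)."""
    vocab_size = v.shape[-1]                  # e.g., 30522
    pad = (-vocab_size) % d                   # pad so the vocabulary splits evenly into d slices
    v = torch.nn.functional.pad(v, (0, pad))
    slices = v.view(*v.shape[:-1], d, -1)     # [..., d, slice_size]
    return slices.max(dim=-1).values          # [..., d]

v_q = torch.rand(1, 30522)                    # stand-in for the lexical vector from the MLM head
agg_plus = prune_lexical_vector(v_q, d=640)   # low-dimensional dense vector for retrieval
```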
3.3 Fine-Tuning and Retrieval
Although agg★ can mitigate the issue of term misalignment, the approximation error cannot be completely eliminated unless d = |VBERT|. To enhance retrieval effectiveness, we concatenate the agg★ vector with the [CLS] vector since they are pre-trained to capture textual representations in different ways, focusing on the lexical and semantic, respectively.
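A minimal sketch of the final representation and scoring follows, assuming (as in our default ablation setting) 128 dimensions for [CLS] and 640 for agg★; the linear layer used here to reduce the [CLS] vector is a hypothetical stand-in for whatever dimensionality reduction is applied.

```python
import torch

hidden_size, d_cls, d_agg = 768, 128, 640

# Hypothetical projection reducing the 768-d [CLS] vector to d_cls dimensions.
cls_proj = torch.nn.Linear(hidden_size, d_cls)

def encode(cls_vec, agg_star):
    """Concatenate the (projected) [CLS] vector and agg* into one 768-d dense vector."""
    return torch.cat([cls_proj(cls_vec), agg_star], dim=-1)

q_vec = encode(torch.rand(1, hidden_size), torch.rand(1, d_agg))
p_vec = encode(torch.rand(1, hidden_size), torch.rand(1, d_agg))
score = (q_vec * p_vec).sum(-1)  # relevance score is a simple dot product
```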
4 Experimental Setup
4.1 Datasets
In-Domain Evaluations.
We evaluate in-domain retrieval effectiveness on web search and open-domain question answering. Table 1 provides statistics of the datasets.
| | MARCO | NQ | TQA |
|---|---|---|---|
| # passages | 8,841,823 | 21,015,325 | 21,015,325 |
| # training queries | 532,761 | 58,880 | 60,413 |
| # test queries | Dev / DL19 / DL20: 6,980 / 43 / 53 | Test: 3,610 | Test: 11,313 |
For web search, we use the MS MARCO passage ranking dataset introduced by Bajaj et al. (2016), comprising a corpus of 8.8M passages and around 500K training queries. We evaluate model effectiveness on the following query sets: (1) MARCO Dev, 6,980 queries from the development set with one relevant passage per query on average; following the established procedure, we report RR@10 and R@1000 as the evaluation metrics. (2) TREC DL (Craswell et al., 2019, 2020), created by the organizers of the 2019 (2020) Deep Learning Track at the Text REtrieval Conference (TREC), where 43 (53) queries with graded relevance labels are released; we report nDCG@10, used by the organizers as the main metric.
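For reference, RR@10 for a single query can be computed from a ranked list as follows (illustrative only; we rely on standard evaluation tools in practice):

```python
def rr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant passage within the top-k results (0 if none)."""
    for rank, pid in enumerate(ranked_ids[:k], start=1):
        if pid in relevant_ids:
            return 1.0 / rank
    return 0.0

# Example: the only relevant passage appears at rank 3, so RR@10 = 1/3.
print(rr_at_k(["p9", "p4", "p7", "p1"], {"p7"}))  # 0.333...
```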
For open-domain question answering, we use the Wikipedia corpus released by Karpukhin et al. (2020) and conduct experiments on two query sets, Natural Questions (NQ; Kwiatkowski et al., 2019) and Trivia QA (TQA; Joshi et al., 2017). We directly use the training and test sets released by Karpukhin et al. (2020) for training and evaluation, respectively. For this task, we report hit accuracy at cutoffs 5, 20, and 100, denoted R@5/20/100.
Zero-Shot Evaluations.
We evaluate zero-shot retrieval effectiveness on open-domain QA with two query sets, SQuAD (Rajpurkar et al., 2016) and EntityQuestions (EntityQs; Sciavolino et al., 2021), which are challenging for dense retrieval models. We report hit accuracy at cutoffs 20 and 100 (R@20/100). In addition, we use BEIR (Thakur et al., 2021), consisting of 18 distinct IR datasets spanning diverse domains and tasks, including retrieval, question answering, fact checking, question paraphrasing, and citation prediction. We conduct zero-shot retrieval on 14 of the 18 datasets that are publicly available.2 We report nDCG@10 averaged over the 14 datasets.
4.2 Models
Since our approach to text aggregation can be applied to any existing pre-trained encoder-only model, we test the effectiveness of Aggretriever on two pre-trained language models and two further pre-trained models: (1) BERT (Devlin et al., 2018); (2) DistilBERT (Sanh et al., 2019), a 6-layer transformer distilled from BERT; (3) Condenser (Gao and Callan, 2021), a BERT model further pre-trained with the tasks of auto-encoding and skip-connection MLM; and (4) coCondenser (Gao and Callan, 2022), a corpus-aware Condenser combining skip-connection MLM with an ICT variant, which comes in two separate flavors further pre-trained on the MS MARCO and Wikipedia corpora, respectively. All model checkpoints can be downloaded from the HuggingFace Model Hub.3 We compare models fine-tuned using only the [CLS] vector and models based on our approach with the subscripts “CLS” and “AGG”, respectively, e.g., BERTCLS and BERTAGG. In addition, we also report the effectiveness of BM25 as a reference point; these results come from the Pyserini IR toolkit (Lin et al., 2021a).
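For reference, the backbones can be loaded from the HuggingFace Model Hub roughly as follows; the BERT and DistilBERT identifiers are the standard ones, while the Condenser and coCondenser identifiers shown are our best guesses and should be verified against the hub.

```python
from transformers import AutoModel, AutoTokenizer

checkpoints = {
    "BERT": "bert-base-uncased",
    "DistilBERT": "distilbert-base-uncased",
    # The identifiers below are assumptions; verify the exact names on the Model Hub.
    "Condenser": "Luyu/condenser",
    "coCondenser (MS MARCO)": "Luyu/co-condenser-marco",
}

models = {name: AutoModel.from_pretrained(ckpt) for name, ckpt in checkpoints.items()}
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
```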
For implementation details, we refer readers to Appendix A.1. It is worth emphasizing that in our main experiments, we do not leverage any expensive fine-tuning strategies such as hard negative mining or knowledge distillation. Thus, we fine-tune all the DPR models under the same settings for a fair comparison. Additional detailed comparisons are provided in Appendix A.2.
5 Results
5.1 In-Domain Evaluations
Fine-Tuning with Full Training Data.
Table 2 compares in-domain retrieval effectiveness across the various models. We observe that our approach consistently improves on DistilBERT and BERT across all datasets, especially for metrics that emphasize top rankings. For example, DistilBERTAGG sees a three-point and five-point improvement over DistilBERTCLS on RR@10 and nDCG@10 for MS MARCO Dev and TREC DL, respectively, and over two points on R@5 for both NQ and TQA (row 2 vs 1). Similar trends can be observed on BERT (row 4 vs 3).
| Model | MARCO Dev RR@10 | MARCO Dev R@1K | DL19 / 20 nDCG@10 | NQ Test R@5 | NQ Test R@20 | NQ Test R@100 | TQA Test R@5 | TQA Test R@20 | TQA Test R@100 |
|---|---|---|---|---|---|---|---|---|---|
| (a) BM25 | 0.188 | 0.858 | 0.506 / 0.475 | 0.438 | 0.629 | 0.783 | 0.663 | 0.764 | 0.832 |
| (1) DistilBERTCLS | 0.308 | 0.940 | 0.633 / 0.629 | 0.660 | 0.785 | 0.860 | 0.698 | 0.790 | 0.849 |
| (2) DistilBERTAGG | 0.341 | 0.960 | 0.682 / 0.674 | 0.681 | 0.805 | 0.869 | 0.729 | 0.808 | 0.857 |
| (3) BERTCLS | 0.314 | 0.942 | 0.612 / 0.643 | 0.677 | 0.799 | 0.863 | 0.710 | 0.796 | 0.852 |
| (4) BERTAGG | 0.343 | 0.962 | 0.677 / 0.666 | 0.696 | 0.805 | 0.867 | 0.735 | 0.813 | 0.860 |
| (5) CondenserCLS | 0.335 | 0.954 | 0.663 / 0.666 | 0.701 | 0.814 | 0.872 | 0.732 | 0.812 | 0.858 |
| (6) CondenserAGG | 0.356 | 0.966 | 0.674 / 0.697 | 0.699 | 0.810 | 0.873 | 0.747 | 0.821 | 0.864 |
| (7) coCondenserCLS | 0.352 | 0.973 | 0.674 / 0.684 | 0.707 | 0.818 | 0.878 | 0.745 | 0.819 | 0.867 |
| (8) coCondenserAGG | 0.363 | 0.973 | 0.678 / 0.697 | 0.699 | 0.812 | 0.875 | 0.751 | 0.823 | 0.867 |
For the further pre-trained models, we observe that both CondenserAGG and coCondenserAGG yield effectiveness gains on MS MARCO and TQA (rows 6 and 8), which suggests that our approach is orthogonal and additive to further pre-training methods. We observe that in some cases, Aggretriever using pre-trained BERT as the backbone can obtain better retrieval effectiveness than further pre-trained models that are fine-tuned only on the [CLS] vector. For example, BERTAGG outperforms CondenserCLS for MS MARCO and TQA (row 4 vs 5). This indicates that existing language models pre-trained on MLM can serve as an effective single-vector dense retriever, without further pre-training, using our proposed methods. Without corpus-aware further pre-training, CondenserAGG is competitive with coCondenserCLS on MS MARCO and TQA (row 6 vs 7).
Fine-Tuning with Limited Data.
Table 3 reports retrieval effectiveness when the models are fine-tuned on subsets of the MS MARCO training data. Specifically, we randomly sample 1K and 10K queries from the training queries and fine-tune the models on each set for 40 epochs. We first observe that with only 1K training queries, both DistilBERTCLS and BERTCLS underperform BM25 (rows 1, 3 vs a), while both DistilBERTAGG and BERTAGG surpass BM25 (rows 2, 4 vs a) and are on par with CondenserCLS (row 5), indicating that our approach successfully aggregates text information into a single vector without any further pre-training. We observe similar trends when fine-tuning models with 10K training queries.
| Model | MARCO Dev RR@10 (1K train) | RR@10 (10K train) | R@1K (1K train) | R@1K (10K train) |
|---|---|---|---|---|
| (a) BM25 | 0.188 | 0.188 | 0.858 | 0.858 |
| (1) DistilBERTCLS | 0.145 | 0.222 | 0.754 | 0.865 |
| (2) DistilBERTAGG | 0.207 | 0.260 | 0.868 | 0.905 |
| (3) BERTCLS | 0.153 | 0.230 | 0.778 | 0.866 |
| (4) BERTAGG | 0.207 | 0.258 | 0.871 | 0.906 |
| (5) CondenserCLS | 0.191 | 0.259 | 0.841 | 0.903 |
| (6) CondenserAGG | 0.211 | 0.258 | 0.873 | 0.899 |
| (7) coCondenserCLS | 0.234 | 0.287 | 0.935 | 0.948 |
| (8) coCondenserAGG | 0.209 | 0.280 | 0.880 | 0.914 |
Finally, we find that coCondenserCLS performs the best when fine-tuning with limited training data. This is probably because coCondenser’s further pre-training is designed for the [CLS] vector to learn corpus-aware signals from pseudo relevance in addition to skip-connection MLM. Thus, the [CLS] vector is more “ready” for retrieval with small training data.
5.2 Zero-Shot Evaluations
Near-Domain Retrieval Effectiveness.
In these experiments, we examine robustness in a zero-shot retrieval setting. We first consider transfer to “near-domain” (Wikipedia) datasets, reported in Table 4. Specifically, we perform retrieval on test queries from SQuAD and EntityQs using models fine-tuned on NQ or TQA.
| Model | SQuAD (NQ) R@20 | SQuAD (NQ) R@100 | EntityQs (NQ) R@20 | EntityQs (NQ) R@100 | SQuAD (TQA) R@20 | SQuAD (TQA) R@100 | EntityQs (TQA) R@20 | EntityQs (TQA) R@100 |
|---|---|---|---|---|---|---|---|---|
| (a) BM25 | 0.712 | 0.820 | 0.714 | 0.800 | 0.712 | 0.820 | 0.714 | 0.800 |
| (1) DistilBERTCLS | 0.514 | 0.670 | 0.518 | 0.650 | 0.573 | 0.725 | 0.640 | 0.751 |
| (2) DistilBERTAGG | 0.529 | 0.688 | 0.564 | 0.683 | 0.648 | 0.775 | 0.713 | 0.797 |
| (3) BERTCLS | 0.512 | 0.671 | 0.534 | 0.664 | 0.581 | 0.722 | 0.637 | 0.747 |
| (4) BERTAGG | 0.539 | 0.692 | 0.562 | 0.681 | 0.651 | 0.779 | 0.716 | 0.798 |
| (5) CondenserCLS | 0.559 | 0.705 | 0.567 | 0.692 | 0.605 | 0.742 | 0.671 | 0.775 |
| (6) CondenserAGG | 0.541 | 0.692 | 0.564 | 0.684 | 0.643 | 0.772 | 0.716 | 0.800 |
| (7) coCondenserCLS | 0.567 | 0.715 | 0.556 | 0.684 | 0.629 | 0.762 | 0.695 | 0.791 |
| (8) coCondenserAGG | 0.535 | 0.696 | 0.584 | 0.701 | 0.646 | 0.777 | 0.724 | 0.804 |
We see that Aggretriever with any backbone yields sizable gains over its [CLS] counterpart, with the exception that CondenserAGG (and coCondenserAGG) underperforms CondenserCLS (and coCondenserCLS) in SQuAD using NQ as the source (e.g., row 6 vs 5). It is worth mentioning that using TQA as the source, Aggretriever with any backbone is competitive with BM25 while the other [CLS] models still lag behind BM25 on the EntityQs test queries. Finally, we observe that models fine-tuned on TQA have better zero-shot retrieval effectiveness in near-domain datasets compared to those fine-tuned on NQ, which is also observed by Ram et al. (2022).
Multi-Domain Retrieval Effectiveness.
In addition, we evaluate zero-shot retrieval effectiveness on the multi-domain BEIR dataset, reported in Table 5. We evaluate the models fine-tuned on three different sources: MS MARCO, NQ, and TQA. Similarly, Aggretriever shows better zero-shot retrieval effectiveness compared to its [CLS] counterpart with any backbone. For example, our model consistently and substantially outperforms the comparable baselines using MS MARCO and TQA as the source dataset for fine-tuning. Although models fine-tuned on NQ show the worst zero-shot retrieval capability, Aggretriever with any backbone still slightly outperforms its [CLS] counterpart. It is also worth mentioning that Aggretriever with any backbone fine-tuned on MS MARCO outperforms the strong BM25 baseline.
| Model | BEIR nDCG@10 (source: MARCO) | BEIR nDCG@10 (source: NQ) | BEIR nDCG@10 (source: TQA) |
|---|---|---|---|
| (a) BM25 | 0.430 | 0.430 | 0.430 |
| (1) DistilBERTCLS | 0.364 | 0.262 | 0.266 |
| (2) DistilBERTAGG | 0.450 | 0.277 | 0.386 |
| (3) BERTCLS | 0.382 | 0.283 | 0.305 |
| (4) BERTAGG | 0.449 | 0.299 | 0.394 |
| (5) CondenserCLS | 0.393 | 0.286 | 0.314 |
| (6) CondenserAGG | 0.447 | 0.295 | 0.385 |
| (7) coCondenserCLS | 0.414 | 0.277 | 0.307 |
| (8) coCondenserAGG | 0.446 | 0.280 | 0.376 |
5.3 Fine-Tuning with Noisy Hard Negatives
In this experiment, we use DistilBERTAGG to examine Aggretriever’s robustness to fine-tuning with noisy hard negatives. Following TCT (Lin et al., 2021b) and RocketQA (Qu et al., 2021), for each query in the MS MARCO training set, we retrieve the top-200 candidates using DistilBERTAGG and then fine-tune the model for two additional epochs, randomly sampling negatives from these candidates, using the same settings as the previous fine-tuning setup.
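A sketch of this mining step is shown below, assuming query and passage embeddings have already been encoded with DistilBERTAGG and stored as NumPy arrays; the function and variable names, and the number of sampled negatives, are illustrative.

```python
import numpy as np
import faiss

def mine_hard_negatives(query_embs, passage_embs, positives, k=200, n_neg=7, seed=42):
    """Retrieve top-k candidates per query and randomly sample negatives from them.

    positives: list of sets of relevant passage indices, one set per query.
    """
    rng = np.random.default_rng(seed)
    index = faiss.IndexFlatIP(passage_embs.shape[1])   # inner-product (dot-product) search
    index.add(passage_embs.astype(np.float32))
    _, candidates = index.search(query_embs.astype(np.float32), k)

    negatives = []
    for qid, cand in enumerate(candidates):
        pool = [pid for pid in cand.tolist() if pid not in positives[qid]]
        negatives.append(rng.choice(pool, size=n_neg, replace=False).tolist())
    return negatives  # used as training negatives in the next fine-tuning round
```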
The results are listed in Table 6; we directly copy the numbers of TCT and RocketQA from the original papers. We notice that hard negatives reduce the effectiveness of both TCT and RocketQA since there are many false negatives in the candidates, as noted by Qu et al. (2021). They address this issue using expensive training strategies: knowledge distillation, denoising, and cross-batch negative sampling. On the other hand, DistilBERTAGG obtains competitive retrieval effectiveness without any expensive training strategies. This experiment demonstrates that Aggretriever is robust and able to extract useful information when fine-tuned with hard negatives.
| Model | batch size | MARCO Dev RR@10 | MARCO Dev R@1K |
|---|---|---|---|
| RocketQA (Qu et al., 2021) | | | |
| BM25 Neg. | 8K | 0.333 | – |
| + Hard Neg. | 4K | 0.260 | – |
| + Denoise | 4K | 0.364 | – |
| + Data Aug. | 4K | 0.370 | 0.979 |
| TCT (Lin et al., 2021b) | | | |
| BM25 Neg. + KD | 96 | 0.344 | 0.967 |
| + Hard Neg. | 96 | 0.237 | 0.929 |
| + KD | 96 | 0.359 | 0.970 |
| DistilBERTAGG | | | |
| BM25 Neg. | 64 | 0.341 | 0.960 |
| + Hard Neg. | 64 | 0.360 | 0.967 |
5.4 Ablation Study
In this experiment, we use DistilBERTAGG fine-tuned on the MS MARCO dataset to conduct an ablation study. In addition to MARCO Dev, to understand the zero-shot effectiveness of each condition, we conduct retrieval on a subset of BEIR (denoted BEIR small), consisting of five datasets from different domains: NFCorpus, FiQA, ArguAna, SCIDOCS, and SciFact. We report nDCG@10 averaged over these five datasets.
Dimensionality Ablation.
We first study the effects of dimensionality on the [CLS] and agg★ vectors in Table 7. We find that [CLS] alone slightly outperforms agg★ alone (row 1 vs 4) on in-domain evaluation while the reverse trend is seen on zero-shot evaluation. This observation indicates that the [CLS] and agg★ vectors encode text in different ways and that combining them further improves retrieval effectiveness (row 5). Compared to [CLS] alone and agg★ alone, we still see a slight improvement for in-domain evaluation at 256 dimensions (row 6 vs 1 and 4). Holding the number of dimensions constant (rows 1–4), the best condition (row 3) indicates that the agg★ vector requires more space than the [CLS] vector.
| | [CLS] dim | agg★ dim | MARCO Dev RR@10 | MARCO Dev R@1K | BEIR small nDCG@10 |
|---|---|---|---|---|---|
| (1) | 768 | 0 | 0.308 | 0.940 | 0.259 |
| (2) | 640 | 128 | 0.327 | 0.954 | 0.307 |
| (3) | 128 | 640 | 0.341 | 0.960 | 0.355 |
| (4) | 0 | 768 | 0.307 | 0.926 | 0.328 |
| (5) | 768 | 768 | 0.350 | 0.966 | 0.358 |
| (6) | 128 | 128 | 0.320 | 0.946 | 0.300 |
| (7) | 0 | 30522 | 0.345 | 0.956 | 0.363 |
Finally, we report the retrieval effectiveness of the original wordpiece lexical representations before pruning (row 7), which can be considered the effectiveness upper bound of agg★. Although agg★ with 768 dimensions has lower effectiveness (row 4 vs 7), combined with [CLS], Aggretriever reduces the gap (rows 3, 5 vs 7), with better retrieval efficiency in terms of smaller index size and lower retrieval latency. For example, on the MS MARCO dataset, representing each passage as a 768-dimensional vector in a Faiss Flat index with 32 (16) bits requires 26 (13) GB and 100 ms/q retrieval latency on a single V100 GPU, while the 30522-dimensional vectors (without pruning) require around 40 times more index storage and are not practical for end-to-end retrieval.
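These storage figures follow from simple arithmetic (approximate, ignoring index metadata):

```python
num_passages = 8_841_823

def flat_index_gb(dim, bytes_per_value):
    """Approximate size of a Faiss Flat index storing one vector per passage."""
    return num_passages * dim * bytes_per_value / 1e9

print(flat_index_gb(768, 4))    # ~27e9 bytes, consistent with the ~26 GB reported above
print(flat_index_gb(768, 2))    # ~14e9 bytes with 16-bit values, i.e., roughly half
print(flat_index_gb(30522, 4))  # ~1.1e12 bytes, about 40x the 768-d float32 index
```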
Pooling Stage Ablation.
In the second ablation experiment, we fix [CLS] and agg★ to 128 and 640 dimensions, respectively, and compare different designs of the pooling stage that forms agg★, as discussed in Section 3.1. The results are reported in the first main block of Table 8; row 1 is our default condition. In row 2, we remove the term importance component and assign a term weight of one for weighted max pooling. A substantial drop in retrieval effectiveness can be observed. In row 3, we remove the MLM projection and represent each query (or passage) token with the 30,522-dimensional indicator vector in Eq. (2); that is, the vector is nonzero only at dimension j ∈ {token_id(qi)}. We notice that skipping the MLM projector modestly harms retrieval effectiveness. This means that most textual information can be captured without the MLM projector, but it does help. This is sensible since the 30,522-dimensional indicator vector still retains each original query (or passage) term. A comparison of rows 2 and 3 shows that learned term weights for each token are more important than the term semantic distribution (projected by MLM) over the wordpiece vocabulary.
| | Pooling: MLM | Pooling: Weight | Pruning | MARCO Dev RR@10 | MARCO Dev R@1K | BEIR small nDCG@10 |
|---|---|---|---|---|---|---|
| (1) | ✓ | ✓ | full aggregation | 0.341 | 0.960 | 0.355 |
| (2) | ✓ | ✗ | full aggregation | 0.308 | 0.937 | 0.308 |
| (3) | ✗ | ✓ | full aggregation | 0.332 | 0.953 | 0.355 |
| (4) | ✓ | ✓ | semi aggregation | 0.341 | 0.960 | 0.322 |
| (5) | ✓ | ✓ | linear projection | 0.327 | 0.959 | 0.313 |
| (6) | AVERAGE | | linear projection | 0.300 | 0.933 | 0.270 |
| (7) | RepBERT (Zhan et al., 2020) | | – | 0.306 | 0.942 | 0.264 |
Pruning Stage Ablation.
In the second main block of Table 8, we study the effects of pruning wordpiece lexical representations on Aggretriever. Specifically, we semi-aggregate (row 4) or linearly project (row 5) the lexical representations into 640-dimensional dense vectors. We observe that our non-parametric pruning approaches are better than the learned projection (rows 1, 4 vs 5). Although agg+ shows the same retrieval effectiveness as agg★ on in-domain evaluation, a substantial drop can be observed on out-of-domain evaluation (row 1 vs 4). This result demonstrates that our fully aggregated representations better preserve information from the lexical representations and appear to be more robust to domain shifts.
We observe that directly projecting the averaged contextualized embeddings (excluding [CLS]), denoted AVERAGE, into 640 dimensions and then concatenating with [CLS] (row 6) does not perform well, indicating that projecting contextualized token embeddings into the high-dimensional wordpiece lexical space before pooling is key to preserving lexical information. Finally, we also try average pooling over all contextualized embeddings (including [CLS]), which corresponds to RepBERT (Zhan et al., 2020). This yields a negligible effectiveness difference from AVERAGE concatenated with [CLS] (row 7 vs 6); i.e., 0.306 (RR@10) and 0.264 (nDCG@10) on MARCO Dev and BEIR small, respectively.
To further understand the differences between pruned lexical representations (rows 1, 4, 5 in Table 8), we fine-tune DistilBERT using each representation alone (without using [CLS]) with 128, 256, and 768 dimensions on the MS MARCO dataset and compare their retrieval effectiveness on MS MARCO Dev and BEIR small in Figure 4. We observe that agg★ performs better than agg + under all conditions, demonstrating that distributing representations to the full vector space can mitigate the problem of term misalignment (rectangles vs triangles) mentioned in Section 3.2, especially when the number of dimensions is small. Although the linearly projected lexical representations (diamonds) show better in-domain retrieval effectiveness than our non-parametric pruning approaches (agg + and agg★) with 128 and 256 dimensions, agg★ still exhibits better zero-shot retrieval effectiveness. This indicates that the learned linear projector helps compress textual information into low-dimensional space in a way that is biased toward the training data.
In addition, in Figure 4, we also show the retrieval effectiveness of [CLS] and AVERAGE (solid and hollow circles) as comparisons. We observe that although all 768-dimensional textual representations reach similar in-domain retrieval effectiveness, [CLS] and AVERAGE show poor zero-shot retrieval effectiveness on BEIR small compared to the other models pruned from 30K-dimensional lexical representations. We hypothesize that [CLS] and AVERAGE capture textual information in a different manner than our lexical representations. This explains why fusing [CLS] with pruned lexical representations performs better than AVERAGE (rows 1, 4, 5 vs 6 in Table 8).
However, [CLS] and AVERAGE do not exhibit much of a drop in retrieval effectiveness on either in-domain or zero-shot evaluations when the number of dimensions is reduced. This is probably because lexical representations contain fine-grained textual information in the 30K-dimensional lexical space, while the [CLS] and AVERAGE embeddings capture high-level textual information in a low-dimensional semantic space. This result also explains the optimal balance in Table 7, where agg★ requires more space than [CLS] when restricting the total vector dimension to 768.
5.5 Query Encoding Latency
Although different single-vector dense retrievers with the same vector dimensionality have similar retrieval latency under the same software and environment when performing top-k retrieval, query encoding latency is also an important component to consider. In this experiment, we compare the query encoding latency of DistilBERTAGG and DistilBERTCLS. We measure the time required to encode the 6980 queries from MARCO Dev with batch size one on the CPU and GPU, using one thread on a Linux machine with a 2.2 GHz Intel Xeon Silver 4210 CPU and a single Tesla V100 GPU (32GB), respectively. We report the latency at 1st, 50th, and 99th percentiles in Table 9.
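The measurement itself is straightforward; the sketch below shows how per-query encoding times and their percentiles can be collected (an illustration of the setup, not our exact script):

```python
import time
import numpy as np

def encoding_latency_ms(encode_fn, queries):
    """Time encode_fn on each query (batch size 1) and report 1st/50th/99th percentiles in ms."""
    times = []
    for q in queries:
        start = time.perf_counter()
        encode_fn(q)
        times.append((time.perf_counter() - start) * 1000.0)
    return np.percentile(times, [1, 50, 99])

# Usage: encoding_latency_ms(lambda q: model.encode(q), dev_queries),
# where model.encode is the query encoder under test (placeholder name).
```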
| | Model | CPU latency (1st / 50th / 99th perc.) | GPU latency (1st / 50th / 99th perc.) |
|---|---|---|---|
| (1) | DistilBERTCLS | 93 / 103 / 122 ms | 15 / 16 / 18 ms |
| (2) | DistilBERTAGG | 155 / 163 / 191 ms | 18 / 19 / 24 ms |
| (3) | w/o MLM | 103 / 109 / 138 ms | 16 / 19 / 20 ms |
We observe that query encoding with Aggretriever is slightly slower than its [CLS] counterpart on the GPU (row 2 vs 1). On the CPU, the gap is much larger, especially for tail queries. However, from row 3 (the same condition as row 3 in Table 8), we see that skipping the MLM head projection step reduces the query encoding latency with only a small retrieval effectiveness loss. For a real-world application, this might be a sensible option, bringing query encoding latency roughly in line with the [CLS]-only model.
5.6 Comparison with Sparse Retrievers
In our final set of experiments, we compare Aggretriever and sparse retrievers since we borrow ideas from existing learned sparse retrieval models such as SPLADE-max (Formal et al., 2021a), which uses a different activation function after the MLM projector and adds sparsity regularization to generate sparse lexical representations for inverted indexes. For comparison to a sparse retriever without MLM projection, we use uniCOIL without expansions from T5 (Nogueira and Lin, 2019). Both models are fine-tuned on MS MARCO with BM25 negatives; thus, they represent reasonably fair comparisons to DistilBERTAGG and its variant without MLM, respectively (although uniCOIL uses BERT as a backbone). We index and evaluate SPLADE-max and uniCOIL using the code provided by Formal et al. (2021a)4 and Pyserini (Lin et al., 2021a), respectively.5
Results are shown in Table 10. We first observe that DistilBERTCLS shows competitive in-domain retrieval effectiveness but underperforms sparse retrievers on out-of-domain evaluations (row 1 vs 5). This indicates that sparse retrieval using lexical matching has better generalization across retrieval tasks than dense retrieval with [CLS] alone. On the other hand, DistilBERTAGG and its variant show equally good generalization capability compared to the sparse retrievers (rows 2, 3 vs 4, 5). We attribute the transferability of Aggretriever to agg★, which effectively aggregates and preserves information from wordpiece lexical representations.
| | Model | MARCO Dev RR@10 | MARCO Dev R@1K | BEIR nDCG@10 |
|---|---|---|---|---|
| (1) | DistilBERTCLS | 0.308 | 0.940 | 0.364 |
| (2) | DistilBERTAGG | 0.341 | 0.960 | 0.450 |
| (3) | w/o MLM | 0.332 | 0.953 | 0.445 |
| (4) | SPLADE-max | 0.340 | 0.965 | 0.447 |
| (5) | w/o MLM* | 0.315 | 0.924 | 0.441 |
uniCOIL w/o expansion (Lin and Ma, 2021) can be considered a variant of SPLADE-max w/o MLM.
Finally, we observe that without the MLM projector, the effectiveness of the sparse retrievers degrades, especially on in-domain evaluation (row 4 vs 5), while agg★ only sees a slight degradation (row 2 vs 3). We hypothesize that the MLM projector helps sparse retrievers learn semantic matching as well as exact term matching. In contrast, Aggretriever can still learn semantic matching, even without the MLM projector, because it benefits from fusion with the [CLS] vector.
6 Related Work
Dense Retrieval.
The most related line of research to our own work is the literature on how to effectively fine-tune a single-vector dense retriever. On the one hand, some researchers propose computationally expensive fine-tuning techniques such as hard negative mining strategies (Xiong et al., 2021; Zhan et al., 2021b), knowledge distillation (Lin et al., 2021b; Hofstätter et al., 2021), or their combination (Qu et al., 2021). On the other hand, others leverage further pre-training to improve the subsequent fine-tuning (Lee et al., 2019; Gao et al., 2021b; Lu et al., 2021; Gao and Callan, 2021; Izacard et al., 2021; Gao and Callan, 2022; Liu and Shao, 2022). As far as we are aware, our work is the first to discuss how to fine-tune dense retrieval models to effectively aggregate textual information from the pre-trained MLM head rather than directly using the [CLS] vector or contextualized embeddings from max or average pooling (Reimers and Gurevych, 2019).
Sparse Retrieval.
Previous work (Bai et al., 2020; Mallia et al., 2021; Formal et al., 2021b; Lin and Ma, 2021) has demonstrated that projecting contextualized token embeddings into a high-dimensional vector in the wordpiece vocabulary space is an effective way to represent token-level information from transformers for lexical matching. These models directly feed the high-dimensional vectors into an inverted index for retrieval. Thus, sparsity control for effectiveness–efficiency tradeoffs involves additional considerations (Mackenzie et al., 2021). In contrast, our approach converts high-dimensional vectors into low-dimensional ones where top-k retrieval can be performed directly using ANN search libraries (Guo et al., 2020; Johnson et al., 2021).
Hybrid Retrieval.
Our work can be characterized as hybrid since we “fuse” semantic and lexical representations into a single dense vector. Recent work (Gao et al., 2021a; Hofstätter et al., 2022; Shen et al., 2022; Lin and Lin, 2022) proposes to jointly train [CLS] and token-level representations for semantic and lexical matching, respectively. The two kinds of representations require different implementations for top-k retrieval, so multiple software stacks are required to perform retrieval. In contrast, our representations retain the best of semantic and lexical matching, but entirely as dense vectors. Thus, retrieval can be performed in a simple execution environment.
7 Conclusion and Future Work
In this paper, we present Aggretriever, a single-vector dense retrieval model that exploits all contextualized token embeddings from the input to BERT. We introduce a simple approach to aggregate the contextualized token embeddings into a dense vector, agg★. Experiments show that agg★ combined with the standard [CLS] vector achieves better retrieval effectiveness than using the [CLS] vector alone for both in-domain and zero-shot evaluations. Our work demonstrates that MLM pre-trained transformers can be fine-tuned into effective dense retrievers without further pre-training or expensive fine-tuning strategies.
Our work leads to a few open questions for future research: (1) Since we have demonstrated that Aggretriever still benefits from further pre-training, can we design additional pre-training tasks tailored directly to our model? The design of these tasks, of course, needs to be mindful of the computational costs. (2) Can we apply current state-of-the-art compression techniques to Aggretriever? Zhan et al. (2021a, 2022) have shown that 768-dimensional dense representations can be effectively compressed into much smaller vectors. However, it is still unknown whether these techniques can be applied to Aggretriever while retaining both in-domain and zero-shot retrieval effectiveness. (3) Finally, can we apply Aggretriever to multi-lingual retrieval? Since the MLM head of a multi-lingual BERT model projects onto tokens in multiple languages, a natural extension can be envisioned. However, as shown in Section 5.5, MLM projection is expensive, and the issue becomes worse with a pre-trained multi-lingual model since the vocabulary size is usually even larger.
Acknowledgments
This research was supported in part by the Canada First Research Excellence Fund and the Natural Sciences and Engineering Research Council (NSERC) of Canada. We thank the anonymous referees who provided useful feedback to improve this work.
A Appendix
A.1 Implementation Details
We implement our models using Tevatron (Gao et al., 2022) and apply its default training settings in most tasks. For MS MARCO, we train models for three epochs with a learning rate of 5e-6; each batch includes 8 queries, and each query is paired with a randomly sampled positive passage and 7 negative passages mined using BM25. The maximum query and passage lengths are set to 32 and 128, respectively. Note that we use the official training set and corpus6 instead of the ones in Tevatron, which are further processed by Qu et al. (2021). For open-domain QA, we follow the original settings used by Karpukhin et al. (2020) except for two modifications: (1) we use shared rather than independent weights for the query and passage encoders; (2) we set the maximum query and passage lengths to 32 and 156 for faster fine-tuning and inference. We use one and four Tesla V100 GPUs (32GB) for model fine-tuning on MS MARCO and open-domain QA, respectively. For BEIR evaluation, we use the APIs provided by Thakur et al. (2021) and set the maximum query and passage input lengths to 512.7
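For convenience, the MS MARCO fine-tuning settings above can be summarized as a configuration dictionary; the keys are our own naming, not Tevatron's actual argument names.

```python
msmarco_finetune_config = {
    "epochs": 3,
    "learning_rate": 5e-6,
    "queries_per_batch": 8,
    "positives_per_query": 1,        # one randomly sampled positive passage
    "bm25_negatives_per_query": 7,
    "max_query_length": 32,
    "max_passage_length": 128,
}
```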
A.2 Comparison with Existing DPR Models
Table 11 compares Aggretriever with existing dense retrievers fine-tuned with more expensive strategies, i.e., cross-encoder knowledge distillation (KD), hard negative mining (HNM), and large in-batch negatives, on both in-domain and out-of-domain evaluations. The two baseline models without further pre-training are: (1) TAS-B (Hofstätter et al., 2021), which distills ColBERT and a cross-encoder into DPR with an efficient topic-aware sampling strategy; and (2) CL-DRD (Zeng et al., 2022), which further improves TAS-B by combining curriculum learning, HNM, and cross-encoder KD. Three models with further pre-training are included: (1) coCondenser (Gao and Callan, 2022), already discussed in Section 4.2; (2) Contriever (Izacard et al., 2021), which combines advanced contrastive learning techniques with an Inverse Cloze Task (ICT) variant during pre-training; and (3) GTR-Base (Ni et al., 2021), which trains a T5-Base encoder model combining pre-training, KD, and HNM. For TAS-B, Contriever, and GTR-Base, we directly copy numbers from Izacard et al. (2021) and Ni et al. (2021). For CL-DRD8 and coCondenser,9 we use the models provided by the authors to conduct the in-domain and out-of-domain evaluations ourselves. Note that the coCondenser model provided by the authors is fine-tuned in another round with self-mined hard negatives. Furthermore, it uses a “non-standard” MS MARCO corpus where each passage is concatenated with a title; thus, the MS MARCO Dev results differ from the values for coCondenserCLS reported in Table 2.
| | DistilBERTAGG | TAS-B | CL-DRD | coCondenserAGG | coCondenser | Contriever | GTR-Base |
|---|---|---|---|---|---|---|---|
| Further pre-training | ✗ | ✗ | ✗ | ✓ | ✓ | ✓ | ✓ |
| KD | ✗ | ✓ | ✓ | ✗ | ✗ | ✗ | ✓ |
| HNM | ✗ | ✗ | ✓ | ✗ | ✓ | ✓ | ✓ |
| batch size >1K | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ | ✓ |
| MARCO Dev (RR@10) | 0.341 | 0.344 | 0.381 | 0.363 | 0.382* | 0.341 | 0.366 |
| BEIR (nDCG@10) | | | | | | | |
| TREC-COVID | 0.661 | 0.481 | 0.584 | 0.751 | 0.712 | 0.596 | 0.539 |
| NFCorpus | 0.297 | 0.319 | 0.315 | 0.323 | 0.325 | 0.328 | 0.308 |
| NQ | 0.474 | 0.463 | 0.500 | 0.490 | 0.487 | 0.498 | 0.495 |
| HotpotQA | 0.616 | 0.584 | 0.589 | 0.609 | 0.563 | 0.638 | 0.535 |
| FiQA-2018 | 0.292 | 0.300 | 0.308 | 0.305 | 0.276 | 0.329 | 0.349 |
| ArguAna | 0.417 | 0.429 | 0.413 | 0.438 | 0.299 | 0.446 | 0.511 |
| Touché-2020 (v2) | 0.263 | 0.162 | 0.203 | 0.213 | 0.191 | 0.230 | 0.205 |
| Quora | 0.834 | 0.835 | 0.826 | 0.851 | 0.856 | 0.865 | 0.881 |
| DBPedia | 0.362 | 0.384 | 0.381 | 0.380 | 0.363 | 0.413 | 0.347 |
| SCIDOCS | 0.138 | 0.149 | 0.146 | 0.143 | 0.137 | 0.165 | 0.149 |
| FEVER | 0.781 | 0.700 | 0.734 | 0.600 | 0.495 | 0.758 | 0.660 |
| Climate-FEVER | 0.210 | 0.228 | 0.204 | 0.155 | 0.144 | 0.237 | 0.241 |
| SciFact | 0.630 | 0.643 | 0.621 | 0.650 | 0.615 | 0.677 | 0.600 |
| CQADupStack | 0.318 | 0.314 | 0.325 | 0.338 | 0.320 | 0.345 | 0.357 |
| Avg. nDCG@10 | 0.450 | 0.428 | 0.439 | 0.446 | 0.413 | 0.466 | 0.441 |
These numbers are not comparable due to the use of a “non-standard” MS MARCO passage corpus that has been augmented with titles.
First, we observe that DistilBERTAGG is not only competitive with TAS-B on in-domain evaluation but also outperforms both TAS-B and CL-DRD on out-of-domain evaluation, without needing supervision from an expensive cross-encoder teacher. Second, Contriever yields the best out-of-domain results at the cost of in-domain effectiveness. On the other hand, coCondenserAGG reaches the same level of retrieval effectiveness as GTR-Base without leveraging any expensive fine-tuning strategies. Fine-tuning Aggretriever with KD, HNM, and large batch sizes could further improve retrieval effectiveness, but these techniques are orthogonal to our proposed model.
Notes
Slice mean pooling is less effective in our experiment.
We exclude BioASQ, Signal-1M, TREC-NEWS, and Robust04.
Note that the BEIR figures for SPLADE-max reported in Formal et al. (2021a) do not include CQADupStack and use Touché-2020 (v1) instead of Touché-2020 (v2).