Sparse, Dense, and Attentional Representations for Text Retrieval

Dual encoder architectures perform retrieval by encoding documents and queries into dense low-dimensional vectors, and selecting the document that has the highest inner product with the query. We investigate the capacity of this architecture relative to sparse bag-of-words retrieval models and attentional neural networks. We establish new connections between the encoding dimension and the number of unique terms in each document and query, using both theoretical and empirical analysis. We show an upper bound on the encoding size, which may be unsustainably large for long documents. For cross-attention models, we show an upper bound using much smaller encodings per token, but such models are difficult to scale to realistic retrieval problems due to computational cost. Building on these insights, we propose a simple neural model that combines the efficiency of dual encoders with some of the expressiveness of attentional architectures, and explore a sparse-dense hybrid to capitalize on the precision of sparse retrieval. These models outperform strong alternatives in open retrieval.


Introduction
Retrieving relevant documents is a core task for language technology, and is a component of other applications, such as information extraction (e.g., Narasimhan et al., 2016) and question answering (e.g., Kwok et al., 2001; Voorhees, 2001). While classical information retrieval has focused on heuristic weights for sparse bag-of-words representations (Spärck Jones, 1972), more recent work has adopted a two-stage retrieval and ranking pipeline, where a large number (e.g., 1000) of documents are retrieved using sparse high-dimensional query/document representations, and are further reranked with learned neural models (see Mitra and Craswell (2018) for an overview). This two-stage approach is powerful and has achieved state-of-the-art results on multiple IR benchmarks (Nogueira and Cho, 2019; Yang et al., 2019; Nogueira et al., 2019a), especially since large-scale annotated data has become available for training deep neural models (Dietz et al., 2018; Craswell et al., 2020). However, this pipeline approach suffers from a strict upper bound on performance imposed by the potentially limited ability of the first-stage retrieval model to include relevant documents in its top candidates (for example, the Recall@1000 for BM25 reported by Yan et al. (2020) is 69.4). Therefore, work on improving the first-stage retriever is also important. Note that the effective BERT (Devlin et al., 2019) models jointly encoding queries and documents for reranking are not computationally feasible for large-scale first-stage retrieval. One approach to take advantage of neural models while still employing sparse term-based retrieval is to expand the documents with neural models before indexing (Nogueira et al., 2019b) or to learn contextual term weights (Dai and Callan, 2020).
A promising alternative first-stage retriever is one based on learned dense low-dimensional encodings of documents and queries (Huang et al., 2013;Reimers and Gurevych, 2019;Gillick et al., 2019;Karpukhin et al., 2020). The dual encoder model scores each document by the inner product between its encoding and that of the query, and is widely used due to its scalability and its ability to generalize across related words.
Recent history in NLP might suggest that learned dense representations should always outperform sparse features, but this is not necessarily true: as shown in Figure 1, the BM25 model (Robertson et al., 2009) can outperform a dual encoder based on BERT, particularly on longer documents (See § 7). This raises questions about the utility and limitations of dual encoders, and the circumstances in which these powerful models do not yet reach the state-of-the-art. We explore these questions using both theoretical and empirical tools, and propose new architectures that leverage the strengths of dual encoders while avoiding some of their weaknesses.
More formally, a dual encoder is a function that encodes query and document strings into vectors q, d ∈ R^k and computes the relevance score for a document-query pair as the inner product ⟨q, d⟩ = Σ_{i=1}^k q_i d_i. Dual encoders can be built from bag-of-words and bag-of-bigrams vectors, including weighted representations such as BM25. However, our focus is on compressive encoders, where k is less than the vocabulary size v.
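As a concrete toy sketch of this scoring setup, the snippet below scores a document against a query by inner product, with a random linear map standing in for a learned compressive encoder. The vocabulary size, dimension k, and term indices are all illustrative assumptions, not values from the paper:

```python
import numpy as np

def score(q_vec, d_vec):
    # Relevance score is the inner product <q, d>.
    return float(np.dot(q_vec, d_vec))

# Toy compressive "encoder": a fixed linear map from the v-dimensional
# bag-of-words space down to k dimensions (stand-in for a learned model).
rng = np.random.default_rng(0)
v, k = 1000, 64
A = rng.normal(size=(k, v)) / np.sqrt(k)

def encode(bow):
    # f(x) = Ax
    return A @ bow

q = np.zeros(v); q[[3, 17]] = 1.0          # query contains terms 3 and 17
d = np.zeros(v); d[[3, 17, 42]] = 1.0      # document contains both, plus 42

exact = score(q, d)                   # sparse boolean score: 2.0
approx = score(encode(q), encode(d))  # compressed score: close to 2.0
```

The compressed score only approximates the sparse one; how large k must be for the approximation to preserve rankings is exactly the question studied in the following sections.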
As with all representation learning approaches, the performance of a dual encoder model depends on two factors: whether the representations have the capacity to capture meaningful distinctions in the inputs, and whether the learning procedure results in representations that generalize from the training data to linguistically similar unseen phenomena. We focus on the capacity of the dual encoder model, because capacity limitations impose a strict upper bound on performance, and because they do not depend on details of the training data and learning algorithm. Specifically, we assess the capacity of compressive dual encoder retrieval to mimic the retrieval results of sparse bag-of-words models. We address the following questions: • Under what conditions does a dual encoder have the capacity to match the retrieval decisions of a sparse bag-of-words model such as BM25? • How does the capacity of the dual encoder vary with document length and vocabulary size?
We first establish theoretical links between the error with which a compressive dual encoder preserves distances and its ability to replicate the rankings of a sparse boolean inner product retrieval model, and show that a dual encoder with embedding size that grows with the square of the number of unique terms in the longest document can preserve all pairwise rankings of boolean sparse inner product. We then derive an upper bound on embedding size for more general sparse retrieval models such as BM25, which implies that random projections achieve suitably low error when the encoding size grows with a measure of query-dependent normalized margin between documents. While this is an upper bound, and could be pessimistic, we show empirically that the normalized margin is highly indicative of the random projection dimension required to preserve a given ranking between two documents for a given query. Attention-based architectures enable an upper bound on embedding size using far more compact token encodings; however, such architectures have a significantly higher computational cost. Empirical evaluation of learned neural models shows how the performance of compressive encoders degrades with document length, and demonstrates that significant improvements can be obtained from relatively simple scalable hybrids of dual encoding, attention, and sparse retrieval.

Analyzing Dual Encoder Retrieval
A query or a document is a sequence of words drawn from some vocabulary V = {1, 2, . . . , v}. 1 Throughout this section we assume a representation of queries and documents typically used in sparse bag-of-words models: each query q is a vector in R^v, where v is the vocabulary size, and similarly each document d is a vector in R^v. We take the inner product ⟨q, d⟩ to be the relevance score of document d for query q. For part of this section, we will focus on the simple case of boolean inner product, where q and d are elements of {0, 1}^v, with d_i = 1 iff term i appears at least once in the document, and q_i defined analogously. Extensions to TF-IDF (Spärck Jones, 1972) and BM25 (Robertson et al., 2009) are discussed in Section 3.1.
We will compare relevance scores ⟨q, d⟩ with those of compressive dual encoders, for which we write f(d) and f(q) to indicate compression of d and q to R^k, with k < v, and where k does not vary with the document length. We write the relevance score as the inner product ⟨f(q), f(d)⟩. (In § 4, we consider encoder functions that apply to sequences of tokens rather than vectors of word counts.) A fundamental question is how the capacity of dual encoder retrieval varies with the embedding size k. A tractable proxy for capacity is fidelity: how much can we compress the input while maintaining the ability to mimic the performance of bag-of-words retrieval? To answer this question, we will relate the embedding size to bounds on the absolute difference between inner products under compression.

Definition 2.1. Let Q ⊆ R^v be a set of queries and let D ⊆ R^v be a set of documents. An encoder f : R^v → R^k is rank preserving over D and Q iff for all triples (q, d_1, d_2) ∈ Q × D × D, pairwise relevance rankings are preserved after encoding: 2

⟨f(q), f(d_1)⟩ > ⟨f(q), f(d_2)⟩ whenever ⟨q, d_1⟩ > ⟨q, d_2⟩.   (1)

We wish to guarantee that a compressive encoder is rank preserving if it is sufficiently precise. Much of the literature on dimensionality reduction (e.g., Vempala, 2004) characterizes precision in terms of distances: f is ε-precise over a pair (q, d) when

(1 − ε) ‖q − d‖² ≤ ‖f(q) − f(d)‖² ≤ (1 + ε) ‖q − d‖².   (2)

We say f is ε-precise over sets Q and D when f is ε-precise over all (q, d) ∈ Q × D.
The precision of distances can be related to inner products by the following lemma, adapted from Ben-David et al. (2002, corollary 19).

Lemma 1. If f is ε-precise over all pairs drawn from {q, d, 0}, where 0 is the zero vector, then

|⟨f(q), f(d)⟩ − ⟨q, d⟩| ≤ (ε/2)(‖q‖² + ‖d‖²).   (3)
We now state a theorem that links rank preservation and ε-precision in the case where each query or document is a vector in {0, 1}^v:

Theorem 1. Let Q ⊆ {0, 1}^v be a set of queries and let D ⊆ {0, 1}^v be a set of documents. Assume constants ε_q and ε_d such that Σ_{i=1}^v |q_i| < ε_q for all q ∈ Q and Σ_{i=1}^v |d_i| < ε_d for all d ∈ D. If f is ε-precise over Q and D with ε ≤ (ε_q + ε_d)^{−1}, then it is rank preserving over D and Q.
Proof. Under the assumption that each query or document is a vector in {0, 1}^v, if ⟨q, d_1⟩ > ⟨q, d_2⟩ then the minimum possible margin is ⟨q, d_1⟩ − ⟨q, d_2⟩ = 1. Thus, if f's errors for each document-query inner product are < 1/2, then f is rank preserving. If Σ_{i=1}^v |d_i| < ε_d then ‖d‖ < √ε_d, and analogously, ‖q‖ < √ε_q. Let ε = (ε_q + ε_d)^{−1} and plug this value into Equation 3:

|⟨f(q), f(d)⟩ − ⟨q, d⟩| ≤ (‖q‖² + ‖d‖²) / (2(ε_q + ε_d)) < 1/2,

because ε_q > ‖q‖² and ε_d > ‖d‖² for all d, q.

Dual Encoder Retrieval by Projection
To establish baselines on the performance of compressive dual encoder retrieval, we now consider encoders based on random projections (Vempala, 2004). The encoder is defined as f(x) = Ax, where A ∈ R^{k×v} is a random matrix. In Rademacher embeddings, each element a_{i,j} of the matrix A is sampled with equal probability from the two values {−k^{−1/2}, +k^{−1/2}}. In Gaussian embeddings, each a_{i,j} ∼ N(0, k^{−1/2}). The following theorem applies:

Theorem 2 (Achlioptas (2003)). Given a set of queries Q ⊆ R^v and documents D ⊆ R^v, define n = |Q ∪ D|. Given ε, β > 0, let

k_0 = ((4 + 2β) / (ε²/2 − ε³/3)) ln n.

For integer k ≥ k_0, let A be a k × v matrix of Rademacher or Gaussian embeddings and let f(x) = Ax. Then with probability at least 1 − n^{−β}, f is ε-precise over all (q, d) ∈ Q × D.
Theorem 1 states that if a compressive encoder is ε-precise with ε ≤ (ε_q + ε_d)^{−1}, then it is rank preserving relative to boolean dot product retrieval. Combining this with Theorem 2 and substituting n = |Q ∪ D| ≤ |Q| + |D|, if a random projection has size k ≥ 6(2 + β)(ε_q + ε_d)² ln(|Q| + |D|), it is rank preserving relative to boolean retrieval with probability at least 1 − n^{−β}. The bound requires that the size of the embeddings grow with the square of the number of unique terms in the query and document. 3 Although this is an upper bound and may be too pessimistic, empirical studies and related theoretical work may support a similar asymptotic growth of a lower bound (see § 3.1 and § 3.2).
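To illustrate the mechanics of this argument (though not its exact constants), the sketch below builds a Rademacher projection and checks that a comfortable boolean margin survives compression. The vocabulary size, projection dimension, and sampled documents are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
v, k = 2000, 512  # vocabulary size and projection dimension (illustrative)

# Rademacher projection: entries are +/- k^(-1/2) with equal probability.
A = rng.choice([-1.0, 1.0], size=(k, v)) / np.sqrt(k)

def random_doc(n_terms):
    # Boolean bag-of-words vector with n_terms distinct terms.
    d = np.zeros(v)
    d[rng.choice(v, size=n_terms, replace=False)] = 1.0
    return d

q = random_doc(5)
d1 = np.minimum(q + random_doc(25), 1.0)  # contains every query term
d2 = random_doc(30)                       # overlaps with q only by chance

fq, fd1, fd2 = A @ q, A @ d1, A @ d2
sparse_margin = np.dot(q, d1) - np.dot(q, d2)    # large boolean margin
dense_margin = np.dot(fq, fd1) - np.dot(fq, fd2) # should stay positive
```

With a margin this wide, even a modest k preserves the ranking; the theorem's bound is about guaranteeing this for *all* triples, including those with a minimum margin of 1.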

Rank preservation for more general sparse models
We now introduce results for more general sparse models such as TF-IDF (Spärck Jones, 1972) and BM25 (Robertson et al., 2009), which employ more powerful weighting schemes than boolean representations. Both TF-IDF and BM25 can be written as inner products between bag-of-words representations of the document and query. Set the query representation q̃_i = q_i × IDF_i, where q_i indicates the presence of term i in the query and IDF_i indicates the inverse document frequency of term i. The TF-IDF score is then ⟨q̃, d⟩. For BM25, we must define a vector d̃ ∈ R^v, with each d̃_i a function of the count d_i and the document length, along with several hyperparameters (Robertson et al., 2009). The key point is that for every document d there exists some d̃ ∈ R^v such that ⟨q̃, d̃⟩ = BM25(q, d).
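A minimal sketch of this reduction in Python, using one common variant of the IDF and BM25 saturation formulas; the hyperparameters k1 and b and the toy corpus are assumptions for illustration:

```python
import math
from collections import Counter

def bm25_vectors(docs, k1=1.2, b=0.75):
    """Map each document to a sparse vector d~ such that
    BM25(q, d) = <q~, d~>, with q~_i = 1[term i in query] * IDF_i."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))
    idf = {t: math.log(1 + (N - n + 0.5) / (n + 0.5)) for t, n in df.items()}
    vecs = []
    for d in docs:
        tf = Counter(d)
        # d~_i depends only on the count of term i and the document length.
        vecs.append({t: c * (k1 + 1) / (c + k1 * (1 - b + b * len(d) / avgdl))
                     for t, c in tf.items()})
    return idf, vecs

def bm25_score(query_terms, idf, dvec):
    # <q~, d~> over the shared non-zero coordinates.
    return sum(idf.get(t, 0.0) * dvec.get(t, 0.0) for t in set(query_terms))

docs = [["dense", "retrieval", "models"],
        ["sparse", "retrieval", "with", "bm25", "retrieval"]]
idf, vecs = bm25_vectors(docs)
s0 = bm25_score(["retrieval", "bm25"], idf, vecs[0])
s1 = bm25_score(["retrieval", "bm25"], idf, vecs[1])  # higher: contains "bm25"
```

The dictionaries stand in for sparse vectors over R^v; the point is only that the document-side weights are query-independent, so BM25 is itself a (sparse, high-dimensional) dual encoder.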
For BM25 and other classical retrieval models, the minimum non-zero margin ⟨q̃, d̃_1⟩ − ⟨q̃, d̃_2⟩ can be much smaller than 1 and is dataset-dependent. For example, the minimum observed margin for BM25 on the MS MARCO documents dataset (see § 9) is less than 10^{−5}. Although such close rankings are difficult to preserve with random projections, the majority of rankings for triples (q, d_1, d_2) we care about may be easier to preserve.
Here we derive a lower bound on the probability that a random projection will preserve a specific pairwise ranking of interest, depending on the dimensionality k and properties of the vectors q, d_1, and d_2. Without loss of generality, we assume that the sparse model prefers d_1, meaning that ⟨q, d_1⟩ > ⟨q, d_2⟩.
For convenience, define

δ = ⟨q, d_1 − d_2⟩ / (‖q‖ ‖d_1 − d_2‖).

We term δ the normalized margin for the triple (q, d_1, d_2).
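The normalized margin is straightforward to compute; a small sketch with illustrative boolean vectors:

```python
import numpy as np

def normalized_margin(q, d1, d2):
    # delta = <q, d1 - d2> / (||q|| * ||d1 - d2||); larger values mean the
    # pairwise ranking is easier for a random projection to preserve.
    diff = d1 - d2
    return float(q @ diff / (np.linalg.norm(q) * np.linalg.norm(diff)))

q  = np.array([1.0, 1.0, 0.0, 0.0])   # two query terms
d1 = np.array([1.0, 1.0, 1.0, 0.0])   # contains both query terms
d2 = np.array([0.0, 1.0, 0.0, 1.0])   # contains one query term
delta = normalized_margin(q, d1, d2)  # positive since <q,d1> > <q,d2>
```

Here delta = 1/√6 ≈ 0.41; triples with tiny delta (like the 10^-5 BM25 margins mentioned above) are the ones that force large projection dimensions.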
To see whether the predicted relationship between δ and the minimum dimensionality k of a random projection that preserves the pairwise ranking of a triple (q, d_1, d_2) with normalized margin δ paints too pessimistic a picture, we measure this relationship on a real natural language text retrieval problem.
Using the MS MARCO document retrieval dataset (see § 9 for data processing details), we evaluate the ability of Rademacher random projections to achieve accuracy of at least 95% on pairwise rankings (q, d_1, d_2), with respect to both boolean (Figure 2) and BM25 sparse representations (Figure 3).
Input triples (q, d_1, d_2) are binned by their corresponding 1/(δ² − δ³) and represented on the x-axes. The y-axes show the minimum projection dimension k that reached pairwise ranking error of less than .05 for the corresponding bin. We tested a grid of 50 values of k (examples that could not be ranked with the desired accuracy within the explored range were excluded). The empirical lower bound obtained from these experiments displays a linear relationship with 1/(δ² − δ³), matching the theoretical upper bound (ignoring constants).

Deriving Lower Bounds
This section has presented upper bounds on the dimensionality required for effective retrieval. An open question is whether related lower bounds can be derived, and under which assumptions such lower bounds hold. The work of Larsen and Nelson (2017) on the optimal dimensionality of Johnson-Lindenstrauss embeddings is highly relevant to this question, and related results have been derived for inner products.

Attention
Cross attention involves computations over pairs of tokens, rather than aggregating texts into single vectors. This requires more computation, but as we show, makes it possible to use much more compact representations than the dual encoder. 4 Let x = (x_1, x_2, . . . , x_{T_x}) and y = (y_1, y_2, . . . , y_{T_y}), with x_t, y_{t'} ∈ R^k. (This notation is meant to distinguish between sequence representations and the word count representations q and d from the previous sections.) For example, x_t might represent token t with a contextualized embedding or with an indicator vector. The cross attention inner product is defined as

ψ^{(X)}(x, y) = Σ_{t=1}^{T_x} Σ_{t'=1}^{T_y} a_{x,y}(t, t') ⟨x_t, y_{t'}⟩,

where a_{x,y}(t, t') ∈ R_+ is the attention from token x_t to y_{t'} (Yang et al., 2016; Hao et al., 2017).
Definition 4.1. The analysis will focus on normalized hard attention, which we define as

a_{x,y}(t, t') = 1[⟨x_t, y_{t'}⟩ ≥ 1] / z_{x,y}(t),   where z_{x,y}(t) = Σ_{t'=1}^{T_y} 1[⟨x_t, y_{t'}⟩ ≥ 1],

with 0/0 = 0.
We now extend the concept of rank preservation (Definition 2.1) to token sequences.
Definition 4.2. Let X, Y ⊆ R^k and let N : (X* ∪ Y*) → R^v be an aggregation function 5 such that the inner product ⟨N(x), N(y)⟩ is a relevance score for the pair x, y. A scoring function ψ : X* × Y* → R is rank preserving with respect to N iff ψ(x, y_1) > ψ(x, y_2) whenever ⟨N(x), N(y_1)⟩ > ⟨N(x), N(y_2)⟩. We next show that cross attention is rank preserving with respect to the simple inner product, using indicator vectors and normalized hard attention.
Let I_v ⊂ {0, 1}^v be the set of indicator vectors over a vocabulary of size v, and define N_{0,1}(x) as the vector of elementwise minima min(Σ_{t=1}^{T_x} x_t, 1), so that N_{0,1}(x) ∈ {0, 1}^v and ⟨N_{0,1}(x), N_{0,1}(y)⟩ corresponds to simple inner product scoring. The lemma and theorem that follow are restricted to queries without repeated terms, which can be ensured in preprocessing.
Lemma 3. For x ∈ I_v* without repeated terms and for all y ∈ I_v*, the normalized hard attention inner product is identical to the simple inner product: ψ^{(X)}(x, y) = ⟨N_{0,1}(x), N_{0,1}(y)⟩.
Proof. Space limits permit only a proof sketch. Each x_t has non-zero attention on exactly those y_{t'} such that ⟨x_t, y_{t'}⟩ = 1. There are z_{x,y}(t) such items, and a_{x,y}(t, t') is scaled inversely by this quantity, so that x_t contributes exactly 1 toward the score if it matches any terms in y, and zero otherwise. The assumption that there are no repeated terms in x completes the proof.
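Lemma 3 is easy to check numerically; the following sketch implements normalized hard attention over indicator vectors and compares it with the boolean inner product (the toy vocabulary and sequences are assumptions):

```python
import numpy as np

def hard_attention_score(x, y):
    """Normalized hard attention over indicator vectors (the Lemma 3
    setting): each query token spreads weight 1/z over its z matches."""
    total = 0.0
    for xt in x:
        sims = [float(xt @ yt) for yt in y]
        z = sum(s == 1.0 for s in sims)  # number of matching doc tokens
        if z:
            # each matching pair contributes 1/z, for a total of 1
            total += sum(s / z for s in sims if s == 1.0)
    return total

def boolean_score(x, y):
    # <N(x), N(y)> where N collapses a sequence to a binary bag of words.
    nx = np.minimum(np.sum(x, axis=0), 1)
    ny = np.minimum(np.sum(y, axis=0), 1)
    return float(nx @ ny)

v = 6
def tok(i):
    # indicator vector for vocabulary item i
    e = np.zeros(v); e[i] = 1.0; return e

x = [tok(0), tok(2)]                  # query without repeated terms
y = [tok(2), tok(2), tok(5), tok(0)]  # document may repeat terms
```

Both scores equal 2 here: each distinct query term matched in the document contributes exactly 1, regardless of how often it repeats in y.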
Let f be a compressive encoder applied tokenwise, so that f(x) = (f(x_1), . . . , f(x_{T_x})), and define f(y) analogously. This encoder can be viewed as a word embedding function. Let ψ_f^{(X)} denote normalized hard attention applied to the encoded sequences f(x) and f(y). It is possible to define f by random projection such that k is constant in the document length T_y, and with high probability ψ_f^{(X)} is rank preserving with respect to N_{0,1} for queries without repeated terms.
Theorem 3. Let T_X ≥ 1 be an upper bound on the number of tokens in a query, and let x_t, y_{t'} ∈ I_v be indicator vectors. Given β > 0, let A ∈ R^{k×v} be a random projection such that k ≥ k_0 = 24(2 + β) T_X² ln v. Then with probability at least 1 − v^{−β}, ψ_f^{(X)} is rank preserving with respect to N_{0,1}, X, Y for queries without repeated terms.

5 Here X* and Y* use Kleene star notation; for example, X* is the set of all sequences of vectors drawn from X.
Proof. By the Johnson-Lindenstrauss lemma (Theorem 2), with probability at least 1 − v^{−β}, for all pairs of indicator vectors drawn from the vocabulary of size v, |⟨f(x_t), f(y_{t'})⟩ − ⟨x_t, y_{t'}⟩| < 1/(2T_X). Each query token x_t attends only to the tokens y_{t'} that match it, and there are z_{x,y}(t) such terms for each t; since the attention weights 1/z_{x,y}(t) sum to at most one, the total error for each query token is then < 1/(2T_X). Summing over the at most T_X query tokens, |ψ_f^{(X)}(x, y) − ⟨N_{0,1}(x), N_{0,1}(y)⟩| < 1/2 when x has no repeating terms. As argued in Theorem 1, errors < 1/2 are rank preserving for simple inner product.
Attention enables the use of word encodings whose size is constant with respect to document length. This lends theoretical support to empirical findings that attention improves over dual encoder-style translation models on long inputs, such as for the translation of long sentences (Bahdanau et al., 2015). While it is still necessary to store the integer identifiers of all terms in each document, the more practical impediment for the use of cross attention is the cost of computing O(T_x T_y) inner products for each document-query pair.
Although this analysis has focused on the ability of cross attention to replicate simple inner product retrieval, we can also use contextualized embeddings in x t and y t to pool information across tokens and gain additional linguistic sensitivity. This is outside the scope of our theoretical analysis and must be evaluated empirically.

Multi-Vector Encodings
The theoretical analysis suggests that fixed-length vector representations of documents may in general need to be large for long documents, if fidelity with respect to sparse high-dimensional representations is important. Cross-attentional representations have higher capacity, but are impractical for retrieval. We therefore propose a new architecture that represents each document as a fixed-size set of m vectors. Relevance scores are computed as the maximum inner product over this set.
Formally, let x = (x_1, . . . , x_T) represent a sequence of tokens, with x_1 equal to the special token [CLS], and define y analogously. Then [h_1(x), . . . , h_T(x)] represents the sequence of contextualized embeddings at the top level of a deep transformer. We define a single-vector representation of the query x as f^{(1)}(x) = h_1(x), and a multi-vector representation of the document y as f^{(m)}(y) = [h_1(y), . . . , h_m(y)], the first m representation vectors for the sequence of tokens in y, with m < T. The relevance score is defined as:

ψ^{(m)}(x, y) = max_{j=1...m} ⟨f^{(1)}(x), f_j^{(m)}(y)⟩.   (12)

Although this scoring function is not a dual encoder, the search for the highest-scoring document can be implemented efficiently with standard approximate nearest-neighbor search by adding multiple (m) entries for each document to the index data structure used for search. If some vector f_j^{(m)}(y) yields the largest inner product with the query vector f^{(1)}(x), it is easy to show the corresponding document must be the one that maximizes the relevance score ψ^{(m)}(x, y). The size of the index must grow by a factor of m, but due to the efficiency of contemporary approximate nearest neighbor and maximum inner product search, the time complexity can be sub-linear (Andoni et al., 2019). This efficiency is a key difference from the POLY-ENCODER (Humeau et al., 2020), which computes a fixed number of vectors per query and aggregates them by softmax attention against document vectors. Yang et al. (2018) use a similar architecture for language modeling. Because of the use of softmax in these approaches, it is not possible to decompose the relevance score into a max over inner products, and so fast nearest-neighbor search cannot be applied. In addition, these works did not address retrieval from a large collection with long documents, and instead focused on exhaustively ranking a fixed set of candidates while using enriched representations of queries.
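A small sketch of the expanded-index trick, with random vectors standing in for learned encodings: each document contributes m entries, one max-inner-product lookup over the flat index is performed (exhaustively here; an ANN library would replace the argmax), and the winning entry is mapped back to a document id. The sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
k, m, n_docs = 16, 4, 100

# f^(m)(y): m vectors per document; f^(1)(x): one query vector.
doc_vecs = rng.normal(size=(n_docs, m, k))
query = rng.normal(size=k)

# Flatten to an (n_docs * m, k) index; entry j belongs to document j // m.
index = doc_vecs.reshape(-1, k)

# Because psi^(m) is a max over inner products, a single MIPS lookup over
# the expanded index recovers the top-scoring document.
best_entry = int(np.argmax(index @ query))
best_doc = best_entry // m

# psi^(m)(x, y) computed directly for every document, for comparison.
direct_scores = (doc_vecs @ query).max(axis=1)
```

The document retrieved via the expanded index agrees with a direct evaluation of ψ^(m) over all documents, which is the decomposition property the POLY-ENCODER's softmax aggregation lacks.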
We are not ready to provide a formal analysis of the relationship between embedding size and rank preservation for the multi-vector model. As an informal sketch, consider an idealized setting in which documents are composed of m segments, and each query is guaranteed to refer to exactly one segment in the gold document. If the multi-vector encoding model learns to encode each segment into a separate vector, then the error on the inner product between the query and the maximizing segment can be bounded by a term involving the segment norm. Writing the maximum segment bound ε_s = max_{j=1...m} ε_{d_j}, this offers an improvement from ε = (ε_d + ε_q)^{−1} to ε = (ε_s + ε_q)^{−1}, with ε_s ∈ (ε_d/m, ε_d]. (Even with equal-size segments, ε_s > ε_d/m because of the overlap in vocabulary between segments.) Due to the dependence of the embedding size k on ε^{−2}, this could be a significant advantage for long documents. However, note that the bounds in § 3 involve terms for the log of the number of vectors to be encoded, resulting in an additional ln m in the bound on the embedding size. Thus, even in the ideal scenario where ε_d ≫ ε_q and max_{j=1...m} ε_{d_j} ≈ ε_d/m, the improvement in the bound on the total encoding size cannot exceed a factor of m.

Experimental Setup
Our theoretical results relate the dimensionality of compressive dual encoders to their ability to accurately approximate rankings defined by bag-of-words representations like BM25. But our theoretical setup differs from realistic information-seeking scenarios in two ways.
First, the distribution of natural language texts may have special structure (e.g., the texts may lie on a low-dimensional subspace). This in turn could enable precise approximation of sparse bag-of-words models with a lower-dimensional compressive dual encoder. Second, information-seeking tasks require retrieval of semantically related documents. The notion of semantic similarity is task-dependent and imperfectly modeled by weighted exact-word overlap models like BM25. Dual encoders can introduce trained distributed representations of texts, better equipped to capture graded notions of semantic similarity. Nevertheless, if they cannot make the distinctions that sparse models make, they could suffer a performance ceiling.
We relate the theoretical analysis to text retrieval in practice through experimental studies on three tasks. The first task, described in § 7, tests the ability of models to retrieve natural language documents that exactly contain a query. It allows us to test the capacity of fixed-dimensional dual encoder models in relation to BM25 on a task where BM25 shines, and where ample training data is available. The second task, described in § 8, is the open-domain QA version of the Natural Questions (Kwiatkowski et al., 2019), following the setup defined by Lee et al. (2019). This benchmark reflects the need to capture graded notions of semantic similarity and has a natural query text distribution. The task requires both passage retrieval and reading comprehension, and has been explored primarily in the natural language processing community.
To evaluate the performance of our best models in comparison to state-of-the-art works on large-scale retrieval and ranking, in § 9 we report results on a third group of tasks focusing on passage/document ranking: the passage and document-level MS MARCO retrieval datasets (Nguyen et al., 2016) also used in the TREC 2019 Deep Learning Track (Craswell et al., 2020) and the TREC-CAR passage retrieval dataset (Dietz et al., 2018).
For the first two tasks, we consider a reranking setting where only 200 document candidates from a first-pass BM25 system are ranked per query, and a full retrieval setting where millions of documents are ranked. We study the impact of encoder dimension as passage length varies, in comparison to more expressive neural models, and BM25. The reranking setting allows comparative evaluation of all models studied, whereas only the most efficient models are directly applicable to first-pass large scale retrieval.
For the third group of tasks, we follow the standard approach, evaluating a two-stage retrieval and ranking system: a first-stage retrieval from a large document collection, followed by reranking with a cross-attention model. We focus on evaluating the impact of the first-stage retrieval component.

Models
Our experiments compare compressive and sparse dual encoders, cross attention, and hybrid models.
Rademacher Embeddings We experiment with dual encoders based on Rademacher projections of varying dimension k (see § 3), applied to BM25 and indicator vector representations.
Dual encoders from BERT We encode queries and documents using BERT-base, a pre-trained transformer network (12 layers, 768 dimensions). We implement dual encoders from BERT as a special case of the multi-vector model formalized in § 5, with the number of vectors for the document m = 1: the representations for queries and documents are the top-layer transformer representations at the [CLS] token. This approach is widely used for retrieval (Reimers and Gurevych, 2019; Humeau et al., 2020; Wu et al., 2019). 7 To learn lower-dimensional encodings, we learn down-projections from d = 768 to k ∈ {32, 64, 128, 512}, 8 implemented as a single feed-forward layer, followed by layer normalization (Ba et al., 2016). All parameters of all BERT-based models are fine-tuned for the retrieval tasks. We refer to these models as DE-BERT-k.

Multi-Vector Models:
Cross-Attentional BERT The most expressive model we consider is cross-attentional BERT, which we implement by applying the BERT encoder to the concatenation of the query and document, with a special [SEP] separator inserted between x and y. The relevance score is then a learned linear function of the encoding of the [CLS] token. Due to the computational cost, cross-attentional BERT is applied only in reranking, as in prior work (Nogueira and Cho, 2019; Yang et al., 2019). These models are referred to as Cross-Attention.
Sum-of-Max As a more lightweight alternative to Cross-Attention, we compute a score by separately encoding the query x and document y, and summing over the maximum inner product for each token in the query:

ψ(x, y) = Σ_{t=1}^{T_x} max_{t' ∈ [T_y]} ⟨x_t, y_{t'}⟩,

where x_t is the BERT contextualized embedding for token t in the query and y_{t'} is the contextualized embedding for token t' in the document. This model is closely related to the "hard attention" model that was analyzed in § 4. Although it cannot be efficiently implemented at large scale, Sum-of-Max is considerably faster than cross-attentional BERT. Considering prediction-time cost, Sum-of-Max is only O(k T_x T_y), whereas Cross-Attention needs to jointly encode x and y, leading to a cost of O(k (T_x + T_y)²) per layer (Vaswani et al., 2017).

7 Based on preliminary experiments with multiple pooling strategies to derive single-vector text representations from BERT, we selected the [CLS] vectors (without the feed-forward projection learned on the next sentence prediction task). 8 We experimented with adding a similar layer for d = 768, but this did not offer empirical gains.
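The Sum-of-Max score can be sketched in a few lines, with random matrices standing in for the BERT contextualized embeddings (shapes are illustrative assumptions):

```python
import numpy as np

def sum_of_max(x_emb, y_emb):
    """Sum-of-Max score: for each query token, take the max inner product
    with any document token, then sum. x_emb: (Tx, k), y_emb: (Ty, k)."""
    sims = x_emb @ y_emb.T          # (Tx, Ty) inner products: O(k*Tx*Ty)
    return float(sims.max(axis=1).sum())

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 32))   # query: 5 contextualized token embeddings
y = rng.normal(size=(40, 32))  # document: 40 token embeddings
s = sum_of_max(x, y)
```

A single matrix multiply plus a row-wise max makes the O(k T_x T_y) cost explicit, compared with the quadratic joint encoding of Cross-Attention.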
Multi-Vector (ME8) BERT In § 5 we introduced a model in which every document is represented by exactly m vectors. We use m = 8 as a good compromise between cost and accuracy in § 7 and § 8, and find lower values of m more accurate on the datasets in § 9. In addition to using BERT output representations directly in this method, we also consider down-projected representations, implemented using a single feed-forward layer with dimension 768 × k. Unlike for the DE-BERT models, for the multi-vector case we found that an additional projection layer helps even when k = 768. A multi-vector model with 8 vectors of k dimensions is referred to as ME8-k.

Sparse-Dense Hybrids:
A natural approach to balancing the fidelity of sparse representations against the generalization of learned dense ones is to build a hybrid. To do this, we linearly combine the scores of a sparse and a dense system using a single trainable weight λ, tuned on a development set. Prior work on hybrid sparse-dense models for QA has employed pipeline architectures (Seo et al., 2019; Karpukhin et al., 2020), where the first model's candidates are rescored by a linear combination with the second model's score; this is similar to our approach, but has the disadvantage that the first model in the pipeline imposes an absolute upper bound on the recall.
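A minimal sketch of this combination with made-up scores: the single weight λ is selected on a toy development set by grid search. The grid, the score matrices, and the accuracy objective are assumptions for illustration; in practice λ would be tuned against the retrieval metric of interest:

```python
import numpy as np

def hybrid_scores(sparse, dense, lam):
    # Linear sparse-dense combination with one scalar weight.
    return sparse + lam * dense

def tune_lambda(sparse, dense, gold, grid=np.linspace(0, 5, 51)):
    # Pick the weight that ranks the gold document first most often.
    def acc(lam):
        preds = np.argmax(hybrid_scores(sparse, dense, lam), axis=1)
        return float(np.mean(preds == gold))
    return max(grid, key=acc)

# Toy dev set: 3 queries x 4 candidate documents.
sparse = np.array([[2.0, 1.0, 0.0, 0.0],
                   [0.0, 3.0, 1.0, 0.0],
                   [1.0, 0.0, 0.0, 2.0]])
dense  = np.array([[0.1, 0.9, 0.0, 0.0],
                   [0.0, 0.2, 0.1, 0.0],
                   [0.8, 0.0, 0.1, 0.2]])
gold = np.array([1, 1, 0])  # index of the relevant document per query
lam = tune_lambda(sparse, dense, gold)
```

In this toy example neither system alone ranks every gold document first, but a suitable λ does, which is the intuition behind the hybrid's gains.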

Learning and Inference
For the experiments in § 7 and § 8, all trained models are initialized from pre-trained BERT-base, and all parameters are fine-tuned using a cross-entropy loss with 7 sampled negatives from a pre-computed 200-document list and additional in-batch negatives (with a total of 1024 candidates in a batch); the pre-computed candidates include the 100 top neighbors from BM25 and 100 random samples. This is similar to previously used training objectives, but with additional fixed candidates, as also used in concurrent work (Karpukhin et al., 2020). Given a model trained in this way, for the scalable methods we also applied hard-negative mining as in Gillick et al. (2019), using one iteration of this approach for the first task studied. No benefits were seen for the open-domain QA task. For retrieval from large document collections with the scalable models, we used the exact search settings of an efficient approximate nearest neighbor search library. In § 9, the same general approach with slightly different hyperparameters (detailed in that section) was used, to enable more direct comparisons to prior work.
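The in-batch negative part of this objective can be sketched as follows, with random vectors in place of BERT encodings. The batch size, dimensions, and the alignment construction are illustrative assumptions; the real setup also mixes in the fixed BM25 and random candidates described above:

```python
import numpy as np

def in_batch_softmax_loss(q_emb, d_emb):
    """Cross-entropy where each query's positive is its own document and
    all other documents in the batch serve as negatives (a simplified
    sketch of in-batch negative training)."""
    logits = q_emb @ d_emb.T                      # (B, B) score matrix
    logits = logits - logits.max(axis=1, keepdims=True)  # stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))    # diagonal = positives

rng = np.random.default_rng(0)
B, k = 8, 32
q = rng.normal(size=(B, k))
loss_random = in_batch_softmax_loss(q, rng.normal(size=(B, k)))
loss_aligned = in_batch_softmax_loss(q, 10.0 * q)  # positives score highest
```

When document encodings align with their queries, the diagonal dominates each row of the score matrix and the loss drops, which is the gradient signal that shapes the dual encoder.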

Containing Passage ICT Task
We begin with experiments on the first task of retrieving a Wikipedia passage y containing a sequence of words x. We create a dataset using Wikipedia, following the Inverse Cloze Task definition of Lee et al. (2019), but adapted to suit the goals of our experimental study. The task is defined by first breaking Wikipedia texts into segments (also termed documents or passages) of length at most l. These form the document collection D. Queries x_i are generated by sampling subsequences from the documents y_i. We use queries of lengths between 5 and 25, and do not remove the queries x_i from their corresponding documents y_i. We create a dataset with one million queries Q, and evaluate retrieval against four different document collections D_l, for l ∈ {50, 100, 200, 400}. Each D_l contains three million documents of maximum length l tokens. In addition to original Wikipedia passages, each D_l contains synthetic distractor documents, which contain the large majority of words in x but differ by one or two tokens. 5,000 queries are used for evaluation and the rest are used for training and validation.
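Query generation for this task can be sketched in a few lines; the passage text and helper name are illustrative, and the real dataset additionally adds the distractor documents described above:

```python
import random

def make_ict_example(passage_tokens, min_len=5, max_len=25,
                     rng=random.Random(0)):
    """Sample a contiguous subsequence of a passage as a query, leaving
    the query inside its source document (as in the task above)."""
    n = len(passage_tokens)
    qlen = rng.randint(min_len, min(max_len, n))
    start = rng.randint(0, n - qlen)
    return passage_tokens[start:start + qlen]

doc = ("retrieval with dense low dimensional encodings can struggle to "
       "match sparse bag of words models on long documents").split()
query = make_ict_example(doc)
```

Because the query is kept inside its source passage, exact-match models such as BM25 are strongly favored, which is precisely what makes this a useful capacity stress test for compressive encoders.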
Although checking containment is straightforward from a machine learning standpoint, this task is a good testbed for assessing the capacity of compressive neural models relative to sparse bag-of-words ones. BM25-bi achieves over 90% accuracy across document collections for this task. Figure 4 (left) shows reranking results, illustrating how strong the sparse retrieval models are relative to even a moderately sized 768-dimensional DE-BERT. The accuracy of DE-BERT models also falls more rapidly as document length increases. Full cross attention is nearly perfect, and the simpler Sum-of-Max model is almost as strong while being much faster. The multi-vector method ME8-BERT, which uses 8 vectors of dimension 768 to represent documents, strongly outperforms the best DE-BERT model. Even ME8-BERT-64, which instead uses 8 vectors of size only 64 (thus requiring the same document index size as DE-BERT-512), outperforms the DE-BERT models by a large margin. It was not feasible to evaluate DE-BERT models with larger embedding size, but Rademacher embeddings were observed to require k of approximately 4K, 6K, 8K, and 32K for the four document collections, respectively, to achieve 99% of BM25-uni's accuracy.
Figure 4 (right) shows results for the much more challenging task of retrieval from three million candidates. In this setting, we only evaluate models that can efficiently retrieve nearest neighbors from such a large set. We see similar behavior to the reranking setting, with the multi-vector methods matching BM25-uni for all but the longest documents.

Open-domain QA
For this task we similarly use English Wikipedia (the snapshot at https://archive.org/download/enwiki-20181220) as four different document collections, of maximum passage length l ∈ {50, 100, 200, 400}, with corresponding approximate sizes of 39 million, 27.3 million, 16.1 million, and 10.2 million documents, respectively. Here we use real user queries contained in the Natural Questions dataset (Kwiatkowski et al., 2019), and follow the setup of Lee et al. (2019). There are 87,925 QA pairs in training and 3,610 QA pairs in the test set. We hold out a subset of training for development.
For document retrieval, a passage is correct for a query x if it contains a string that exactly matches one of the annotator-provided short answers for the question. We form a reranking task by considering the top 100 results from BM25-uni plus another 100 random samples, and also consider the full retrieval setting. To define candidates for reranking and model training, BM25-uni is used here instead of BM25-bi because it is the stronger sparse retrieval model for this task. Figure 5 (left) shows heldout-set results on the reranking task. To fairly compare systems that operate over document collections of different-sized passages, we allow each model to select approximately the same number of tokens (400) and evaluate whether an answer is contained in them. For example, models retrieving passages of length 50 return their top 8 passages, and ones retrieving from D_100 return their top 4. The figure shows this recall at 400 tokens across models and for the four document collections.
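The recall-at-400-tokens metric above can be sketched as a small predicate; the `is_correct` callback stands in for the exact-match answer check and is an illustrative name, not part of our codebase:

```python
def recall_at_400_tokens(ranked_passages, passage_len, is_correct):
    """Take the top passages whose total length is ~400 tokens
    (top 8 for l=50, top 4 for l=100, etc.) and report whether any
    of them contains an answer, per the is_correct predicate."""
    budget = max(1, 400 // passage_len)
    return any(is_correct(p) for p in ranked_passages[:budget])
```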
The relative performance of BM25-uni and the DE-BERT models differs from that seen on the ICT task, due to the semantic generalization this task requires. Nevertheless, higher-dimensional DE-BERT models generally perform better, and multi-vector models provide a further advantage over DE-BERT. Figure 5 (right) shows heldout-set results for the open-domain task of retrieving from Wikipedia for each of the four document collections D_l. Due to its computational cost, it was not possible to run Cross-Attention in this setting. Unlike in the reranking setting, only higher-dimensional DE-BERT models outperform BM25. We also explore an efficient BM25-uni–neural hybrid: each system retrieves its 100 top-scoring documents, and the documents in the union of the two 100-best lists are scored using a linear combination of the two systems' scores. 11 The hybrid models offer large improvements over their components, capturing both precise word overlap and semantic similarity.
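The hybrid scoring scheme above admits a compact sketch. This is a simplified illustration assuming each system's results arrive as a document-to-score mapping; the interpolation weight is a hypothetical tuning parameter:

```python
def hybrid_scores(sparse_top, dense_top, weight=0.5):
    """Score the union of the two 100-best lists with a linear
    combination of system scores. A document missing from one list
    gets score 0 from that system, per the approximation in the text."""
    union = set(sparse_top) | set(dense_top)
    return {
        doc: weight * sparse_top.get(doc, 0.0)
             + (1.0 - weight) * dense_top.get(doc, 0.0)
        for doc in union
    }
```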

Short Answer Model
To put our models in the context of prior work and to evaluate the accuracy of complete open-domain QA systems, we additionally implement a short-answer QA model. We train a short answer model for a retrieval system M in a pipeline fashion, by training a BERT-based reader model given the fixed and separately trained retrieval system M . Given a retriever M , a training set of queries x paired with short answer string answers a(x), and a document collection D l , we generate training examples for the reader as follows: For each training query x we use M to retrieve the top 100 documents of length up to l, and group these into larger segments (blocks) of length up to 400, which are used as inputs to the reading comprehension model; the original passage boundaries are indicated by a special token. The reader uses the SQuAD2.0 BERT-base architecture to select an answer span or a NULL answer.
A trained retriever and its corresponding reader are used for open-domain QA by similarly considering the top b 400-token text blocks, which are read independently (Table 1 shows performance for b = 1 and b = 8). 12 Our system is thus in the class of pipeline models. Table 1 shows the short answer exact match score on the standard test set. We evaluate the impact of the document collection passage length on the performance of the three main classes of models we consider: BM25-uni; efficient trained dual encoder models and their multi-vector extensions; and hybrid model combinations. The hybrid models in the table combine BM25-uni with the best dual encoder or multi-vector model for each document collection passage length, based on the retrieval heldout-set results shown in Figure 5 (right).

11 Any documents not present in a system's n-best list are assigned an approximating score of 0 for that system.
Given an inference-time constraint of using only one 400-token text block for the reader (top part of Table 1), the dual encoder models outperform BM25-uni across document collection passage lengths, and the hybrid model strongly improves upon its components. When QA models are allowed to do inference over an increasing number of blocks, they can potentially approximate a full cross-attention retrieval model, and the differences among first-pass retrieval models are therefore diminished. Both BM25-uni and dense retrieval models peak at document collection passage length 200, and their combination outperforms the best prior pipeline model (Min et al., 2019). It also matches the performance of the end-to-end ORQA model (Lee et al., 2019), but uses two times more text at inference time. 13 Two concurrent works, Guu et al. (2020) and Karpukhin et al. (2020), brought significant improvements, reaching up to 41.5 short answer exact match, through better unsupervised model pretraining and through full supervision for passage relevance with careful selection of negatives, respectively. Our study offers a complementary analysis of the relative and combined strengths of sparse and dense dual encoder and multi-vector models as the length of documents in the retrieval collection grows.

12 For multiple blocks, 20-best answer spans are generated from each block and answer string probabilities are weighted by the retriever block probability and aggregated (summed) across blocks, following Lin et al. (2018).

13 The end-to-end training and retrieval-specific pretraining methods from ORQA (not used in this work) are expected to further improve the trained models' performance.
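The per-block answer aggregation used for b > 1 (20-best spans per block, weighted by the retriever's block probability and summed across blocks) can be sketched as follows; the data layout is an illustrative assumption, not our actual reader interface:

```python
from collections import defaultdict

def aggregate_answers(blocks):
    """blocks: list of (block_prob, [(answer_string, span_prob), ...])
    with up to 20 candidate spans per block. Answer-string scores are
    weighted by the retriever block probability and summed across
    blocks; the highest-scoring answer string is returned."""
    totals = defaultdict(float)
    for block_prob, spans in blocks:
        for answer, span_prob in spans:
            totals[answer] += block_prob * span_prob
    return max(totals, key=totals.get)
```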

Large Scale Supervised Passage Retrieval and Ranking
The previous two experimental sections related representation dimensionality to document length, examining the accuracy of compressive dense encoders on a memorization task (§ 7) and on weakly supervised open-domain question answering with information-seeking queries (§ 8). In this section we evaluate whether our newly proposed efficient multi-vector dense retrieval model ME-BERT, its corresponding dual encoder baseline DE-BERT, and sparse-dense hybrids compare favorably to state-of-the-art models on large-scale supervised retrieval and ranking IR benchmarks.
Datasets The MS MARCO passage ranking task focuses on ranking passages from a collection of about 8.8 million; about 532 thousand queries paired with relevant passages are provided for supervised training. The MS MARCO document ranking task focuses on ranking full documents instead, where documents that contain relevant passages are assumed to be relevant. The full collection contains about 3 million documents, and the training set has about 367 thousand queries. We report results on the passage and document ranking development sets, comprising 6,980 and 5,193 queries, respectively. For the TREC-CAR dataset, we follow the setup of Nogueira and Cho (2019) and use automatic annotations for training and evaluation; the training set has about 2.3 million queries for supervised training, 583 thousand queries for development, and 2,254 queries for testing.
Model Settings For MS MARCO passage and TREC-CAR, we apply retrieval and ranking models to the provided passage collections. For MS MARCO document, we follow Yan et al. (2020) and break documents into a set of overlapping passages, each including the document URL and title. For each task, we train the models on that task's training data only. We initialize the retriever and reranker models with BERT-large; for TREC-CAR we use the pretrained model provided by Nogueira and Cho (2019) to avoid pre-training on test data. We train dense retrieval models on positive and negative candidates from the 1000-best list of BM25, additionally using one iteration of hard negative mining when beneficial. For the multi-vector model ME-BERT, using up to m = 4 vectors per document performed best.
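The candidate-construction step above might be organized as in the following sketch; the function name and data layout are illustrative, not the paper's actual training pipeline:

```python
def build_training_examples(queries, bm25_top1000, positives, mined_negatives=None):
    """For each query, pair its gold passage(s) with negatives drawn
    from the BM25 1000-best list; optionally append 'hard' negatives
    mined with an earlier checkpoint of the dense retriever."""
    examples = []
    for q in queries:
        gold = positives[q]
        negs = [p for p in bm25_top1000[q] if p not in gold]
        if mined_negatives is not None:
            negs = negs + [p for p in mined_negatives[q] if p not in gold]
        examples.append((q, gold, negs))
    return examples
```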
Results Table 2 compares our models to state-of-the-art and baseline models on the three tasks. The prior state-of-the-art work follows the two-stage retrieval and reranking approach, where an efficient first-stage system retrieves a (usually large) list of candidates from the document collection, and a more expensive second-stage model, such as cross-attention BERT, reranks the candidates.
Our focus is on improving the first, efficient retrieval stage, and we compare to prior work in two settings: Retrieval (top part of the table), where only first-stage efficient retrieval systems are used, and Reranking (bottom part of the table), where more expensive second-stage models re-rank candidates. Table 3 examines the impact of the first-stage retrieval system as the number of candidates the second-stage reranker has access to (often termed the retrieval depth) is substantially reduced, improving overall system efficiency.
We report results in comparison to the following systems: 1) MULTI-STAGE is the multi-stage ranking architecture proposed by Nogueira et al. (2019a), where BM25 candidates are first ranked by a reranker trained with a binary classification loss (monoBERT) and then passed to a second, pairwise reranker (duoBERT). 2) DOC2QUERY (Nogueira et al., 2019b) focuses on improving first-stage retrieval by using a sequence-to-sequence model to expand documents before indexing and scoring with sparse retrieval models. 3) DEEPCT is the DeepCT-Index model of Dai and Callan (2020), which learns to map BERT's contextualized text representations to context-aware term weights for sentences and passages; these term weights can be stored in an ordinary inverted index for first-stage passage retrieval. 4) IDST is a two-stage cascade ranking pipeline proposed by Yan et al. (2020), which uses both document expansion and cross-attention ensemble reranking with tailored BERT model pre-training. 5) Leaderboard is the best reported development set score on the MS MARCO passage leaderboard.
Looking at the top part of Table 2, we can see that our DE-BERT model already outperforms prior efficient retrieval systems across all datasets. The multi-vector model brings the largest improvement on the dataset containing the longest documents (MS MARCO document), and the sparse-dense hybrid model brings improvements over dense-only models on all datasets; the improvement from the hybrid is smallest on the collection with the shortest passages (MS MARCO passage). When a large number of candidates can be reranked with a more powerful cross-attention model, the impact of the first-stage retrieval system decreases. Compared to the state of the art, our models outperform prior work on TREC-CAR and are comparable to prior work on the other datasets.
The accuracy of the first-stage retrieval system is particularly important when the cost of reranking a large set of first-pass candidates is prohibitive. Table 3 shows the performance of systems that rerank a smaller number of candidates. Especially when only a very small number of candidates, such as 10 or 20, can be scored with expensive cross-attention models, our multi-vector ME-BERT and hybrid models achieve significant improvements over prior first-pass retrieval systems on both MS MARCO tasks.

Related work
We have discussed research on improving the accuracy of retrieval and ranking from a large candidate space throughout the paper. Here we focus on prior work related to our research questions on the capacity of dense dual encoder representations relative to sparse high-dimensional bag-of-words ones.
A number of other works relate to the general problem of recovering bag-of-words representations from dense encodings. For example, the literature on compressive sensing shows that it is possible to recover a bag-of-words vector x from the projection Ax for suitable A. Bounds for the sufficient dimensionality of isotropic Gaussian projections (Candes and Tao, 2005; Arora et al., 2018) are a factor of T log v worse than the bound described in § 3, but this is unsurprising because the task of recovering bags of words from a compressed measurement is strictly harder than recovering inner products. Subramani et al. (2019) ask whether it is possible to exactly recover sentences (token sequences) from pretrained decoders, using vector embeddings that are added as a bias to the decoder hidden state. Because their decoding model is more expressive (and thus more computationally intensive) than inner product retrieval, the theoretical bounds derived here do not apply. Nonetheless, Subramani et al. empirically observe a similar dependence between sentence length and embedding size. Wieting and Kiela (2019) represent sentences as bags of random projections, finding that high-dimensional projections (k = 4096) perform nearly as well as trained encoding models such as SkipThought (Kiros et al., 2015) and InferSent (Conneau et al., 2017). These results provide further empirical support for the hypothesis that bag-of-words vectors from real text are "hard to embed" in the sense of Larsen and Nelson (2017). Our contribution is to systematically explore the relationship between document length and encoding dimension, focusing on the case of exact inner product-based retrieval. Approximate retrieval (Indyk and Motwani, 1998; Har-Peled et al., 2012) is often necessary in practice. We leave the combination of representation learning and approximate retrieval for future work.

Conclusion
Transformers perform well on an unreasonable range of problems in natural language processing. Yet the computational demands of large-scale retrieval push us to seek other architectures: cross attention over contextualized embeddings is too slow, but dual encoding over fixed-length vectors may be insufficiently expressive, failing even to match the performance of sparse bag-of-words competitors. We have used both theoretical and empirical techniques to characterize the limitations of fixed-length dual encoders, focusing on the role of document length. Based on these observations, we explore a set of hybrid models: attention-like computations over a limited number of vectors per document, and integration of sparse and dense representations. These methods yield strong performance, while maintaining scalability.