Retrieve Fast, Rerank Smart: Cooperative and Joint Approaches for Improved Cross-Modal Retrieval

Current state-of-the-art approaches to cross-modal retrieval process text and visual input jointly, relying on Transformer-based architectures with cross-attention mechanisms that attend over all words and objects in an image. While offering unmatched retrieval performance, such models: 1) are typically pretrained from scratch and thus less scalable, and 2) suffer from huge retrieval latency and inefficiency issues, which makes them impractical in realistic applications. To address these crucial gaps towards both improved and efficient cross-modal retrieval, we propose a novel fine-tuning framework that turns any pretrained text-image multi-modal model into an efficient retrieval model. The framework is based on a cooperative retrieve-and-rerank approach that combines: 1) twin networks (i.e., a bi-encoder) to separately encode all items of a corpus, enabling efficient initial retrieval, and 2) a cross-encoder component for a more nuanced (i.e., smarter) ranking of the retrieved small set of items. We also propose to jointly fine-tune the two components with shared weights, yielding a more parameter-efficient model. Our experiments on a series of standard cross-modal retrieval benchmarks in monolingual, multilingual, and zero-shot setups demonstrate improved accuracy and huge efficiency benefits over the state-of-the-art cross-encoders.1


Introduction
Information-rich and efficient methods for dealing with large unstructured data in both computer vision and NLP are required to process and understand huge amounts of user-created content and beyond. In multi-modal contexts, such methods enable fundamental applications such as image retrieval. A typical efficient bi-encoder approach encodes images and text separately and then induces a shared high-dimensional multi-modal feature space. This enables cross-modal retrieval, where standard distance metrics identify the most similar examples for each query in the data collection via nearest-neighbor search (Arya et al., 1998; Kushilevitz et al., 2000; Liu et al., 2004; Andoni and Indyk, 2008; Hajebi et al., 2011).
* Both authors contributed equally to this work.
1 We release the code and model weights at github.com/UKPLab/MMT-Retrieval.
These bi-encoder approaches have already been shown to achieve reasonable performance in search and retrieval applications, both monolingually for English (Nam et al., 2017;Faghri et al., 2018;Zheng et al., 2020;Wang et al., 2019a;Shi et al., 2019) and in multilingual contexts (Gella et al., 2017;Kádár et al., 2018;Kim et al., 2020;Wehrmann et al., 2019;Burns et al., 2020). However, they cannot match performance of more recent attention-based methods. Here, a typical modus operandi is to apply a cross-attention mechanism between examples from the two modalities to compute their similarity score, relying on Transformer-based neural architectures (Vaswani et al., 2017). Such so-called multi-modal cross-encoders (CE) (Tan and Bansal, 2019;Li et al., 2020a;Li et al., 2020b;Ni et al., 2021) pass each text-image pair through the multi-modal encoder to compute their similarity, see Figure 1a.
While the results accomplished by the CE methods look impressive (Li et al., 2020b; Bugliarello et al., 2021; Ni et al., 2021), this comes at a prohibitive cost. In particular, they have extremely high search latency: Processing a single text query with an image collection of 1M items may take up to 36 minutes using a single NVIDIA V100 GPU (see Table 3). Due to this issue, they are evaluated only on extremely small benchmarks; that is, the maximum size of typical image collections for image retrieval tasks is 5k images, and evaluation still lasts ≈50 hours (see Table 4). In sum, cross-encoders are impractical for deployment in realistic application scenarios, while the use of small benchmarks results in inflated and thus misleading evaluation performance.
In unimodal text-only setups, Transformer-based architectures have recently been integrated with bi-encoder (BE) methods (Guo et al., 2018; Reimers and Gurevych, 2019; Humeau et al., 2020; Henderson et al., 2020; Feng et al., 2020, inter alia), yielding computationally more efficient sentence encoders. Instead of jointly encoding sentence pairs with cross-attention, a pretrained Transformer model (e.g., BERT [Devlin et al., 2019]) is fine-tuned within a twin network with shared Transformer weights, as illustrated in Figure 1b. In a nutshell, each sentence is passed through the encoder separately, and a loss function is defined on top of the two respective separately computed encodings. However, despite their strong performance on sentence retrieval and similarity tasks (Reimers and Gurevych, 2019; Litschko et al., 2021), these encoders cannot match the task performance of cross-encoders (Humeau et al., 2020).
Motivated by these insights, in this work we aim to leverage the best of both worlds towards improved and more efficient cross-modal search and retrieval: 1) efficiency and simplicity of BE approaches based on twin networks, as well as 2) expressiveness and cutting-edge performance of CE methods. We first provide a systematic comparative analysis on the effectiveness and efficiency of Transformer-based multi-modal BE and CE methods across a range of image search evaluation benchmarks. We then propose two novel models that aim to blend the main strengths of CE and BE. The idea behind the first model variant, termed cooperative (SEP+COOP), is to retrieve and rerank with two separate, independently trained retrieval models: 1) an initial top-k list of potentially relevant items (i.e., texts or images) is retrieved by the more efficient BE model, and then 2) this top-k list is reranked ''smartly'' by the more accurate CE model, as illustrated in Figure 1c. Our second, joint (JOINT+COOP) model variant also operates in the same retrieve-and-rerank setup, but it now trains a multi-modal cross-encoder and a multi-modal BE model jointly with tied weights, as illustrated in Figure 1d. The retrieve step, where efficiency is paramount, is again executed by the BE sub-model, and the precision-oriented rerank step is conducted via the CE sub-model.
We propose a general framework for cross-modal search and retrieval, where JOINT+COOP and SEP+COOP models are independent of the chosen pretrained vision-language representation architectures. The experiments are thus based on the state-of-the-art vision-language architectures OSCAR (Li et al., 2020b) (experiments in English) and M3P (Ni et al., 2021) (multilingual), and we demonstrate consistent improvements over the original OSCAR model on the standard benchmarks MSCOCO and Flickr30k and improvements over the original M3P in multiple languages on the Multi30k dataset. We empirically validate huge efficiency benefits of the proposed framework.
Contributions. 1) We construct and systematically evaluate twin networks combined with multi-modal Transformers (BE); they outperform all previous bi-encoder approaches, but lag behind their CE counterparts. 2) We evaluate BE and CE approaches within a cooperative retrieve-and-rerank approach; their combination outperforms the individual models, while offering substantial efficiency boosts compared to CE methods. 3) We propose a novel joint CE-BE model (JOINT+COOP), which is trained to simultaneously cross-encode and embed multi-modal input; it achieves the highest scores overall while maintaining retrieval efficiency. 4) Finally, we propose a more realistic evaluation benchmark; we demonstrate harsh drops in overall cross-modal retrieval performance of all models in this more difficult scenario, calling for improved evaluation benchmarks and protocols in future work.

Related Work
Efficient approaches to cross-modal image-text retrieval relied on the induction of shared multi-modal visual-semantic embedding spaces (VSEs) (Frome et al., 2013; Faghri et al., 2018; Shi et al., 2019; Mahajan et al., 2019). In a multilingual setup, all languages share the same embedding space along with the visual data (Kim et al., 2020; Wehrmann et al., 2019; Burns et al., 2020). More recently, attention-based cross-encoder models, typically based on Transformer architectures (Vaswani et al., 2017), have considerably outperformed the VSE-based approaches. However, this comes at a severe cost of decreased retrieval efficiency and increased latency (Lee et al., 2018; Wang et al., 2019b). The current state-of-the-art multi-modal models jointly encode and cross-attend over text tokens and image features (Tan and Bansal, 2019; Li et al., 2020a; Li et al., 2020b; Bugliarello et al., 2021; Ni et al., 2021, inter alia). These CE methods leverage image captioning datasets such as MSCOCO (Lin et al., 2014) and Flickr30k (Plummer et al., 2015) and train a classification head that learns to identify whether or not an (image, caption) input pair constitutes an aligned pair. Each image-text combination must be passed through the network, which scales quadratically with the number of examples.
To handle this quadratic increase, we use a cooperative retrieve-and-rerank approach. Although to the best of our knowledge this has not been proposed for cross-modal settings, it has a long history in NLP, where Yates et al. (2021) date it back to the 1960s (Simmons, 1965). Until recently, bag-of-words methods (BoW; e.g., BM25) were commonly used for the first retrieval step. For the second step, pretrained language models (LMs) were fine-tuned to either rerank candidates or, for question-answering tasks, directly generate the answer span. More recent work on text-based retrieval and QA tasks has moved away from BoW methods towards learned (neural) models for the first retrieval step (Karpukhin et al., 2020; Qu et al., 2021; Xiong et al., 2021).
Our work is inspired by recent BE-based approaches in unimodal text-only setups. Here, LMs are fine-tuned via twin-network architectures on auxiliary tasks such as semantic textual similarity (Reimers and Gurevych, 2019; Humeau et al., 2020), paraphrasing (Wieting et al., 2019), response retrieval (Henderson et al., 2019; Henderson et al., 2020; Humeau et al., 2020), or translation ranking (Chidambaram et al., 2019; Feng et al., 2020). This effectively turns the LMs into universal sentence encoders which can then be used off-the-shelf for efficient text-based monolingual and cross-lingual retrieval (Litschko et al., 2021). In this work, we first extend this idea to multi-modal setups, and then show that our cooperative and joint approaches yield improved cross-modal retrieval models while maintaining retrieval efficiency.
Joint approaches like our JOINT+COOP model, which aim to align the retriever and reranker, can be found in different forms: Boualili et al. (2020) ''mark'' exact matches from the bag-of-words retrieval for the reranker; Yan et al. (2021) share the parameters between a passage expander (which adds more relevant terms for a bag-of-words retriever) and the reranker; Hofstätter et al. (2020) distill knowledge from the reranker into the retriever model with soft labels generated by the teacher. Specifically for question answering, where a two-stage retriever-reader setup similar to the retrieve-and-rerank approach is common, research aims to synchronize the models through knowledge distillation from the reader to the retriever (Yang and Seo, 2020; Izacard and Grave, 2021) or by directly training both models end-to-end (Sachan et al., 2021a,b). The challenge here is that the reader and the retriever are coupled: the reader requires candidates from the retriever that contain the solution. Our proposed reranker side-steps this problem as it uses no candidates from the retriever during training and only learns if a given input pair is (dis)similar. This way, we can train both components, the retriever and the reranker, side-by-side and align them by sharing their weights.
The work most closely related to ours includes contemporaneous models: ALBEF, CLIP (Radford et al., 2021), ALIGN (Jia et al., 2021), and VisualSparta (Lu et al., 2021). ALBEF includes contrastive learning as one of its pretraining tasks but then uses a CE approach for downstream retrieval. CLIP and ALIGN use contrastive learning strategies similar to ours, but are cast as full-fledged pretraining architectures that learn from scratch and require orders of magnitude more data than our approach. We show that it is possible to fine-tune pretrained models with less data and offer a general framework, applicable to a spectrum of pretrained models. Further, unlike prior work, we demonstrate the benefits of combining BE-based (contrastive) learning with cross-encoders for improved and efficient retrieval. Finally, VisualSparta (Lu et al., 2021) fine-tunes OSCAR, but at the level of token (text) and image-region embeddings. This enables the use of extremely fast lookup tables for efficient retrieval. However, this comes with a major disadvantage: the model discards wider context information. Our cooperative methods do leverage the finer-grained information at retrieval.

Methodology
The predominant Transformer-based multi-modal text-vision architecture is a single-stream encoder: It shares the majority of weights between the two modalities, including the multi-head cross-attention (Li et al., 2020a; Li et al., 2020b; Ni et al., 2021). The Transformer weights and text embeddings are typically initialized with weights of a pretrained LM (e.g., BERT [Devlin et al., 2019] for English, XLM-R [Conneau et al., 2020] for multilingual models), where the corresponding vocabulary and tokenizer are utilized. Images are preprocessed via object detection models such as Faster R-CNN (Ren et al., 2015) to extract feature representations for regions of interest (Anderson et al., 2018). The image features are passed through an affine-transformation layer which learns to align the vision input with the pretrained Transformer. The position of the region of interest (or in some models also the region's width and height) is used to generate positional embeddings. By combining these two representations, each object region is passed into the Transformer separately. The cross-attention mechanism of the Transformer attends over all text and image inputs at every layer, thus learning a joint representation of both modalities.
We focus on different fine-tuning strategies of the pretrained models for the downstream task of image-text retrieval. We illustrate these approaches in Figure 1 and describe them in what follows.

Cross-Encoders
For image and text retrieval tasks, the prevailing approach with pretrained multi-modal Transformer models is to cross-encode each image-text combination (see Figure 1a).
Training. A pretrained model receives as input positive and negative pairs of images and captions. Negative pairs are also sampled from the training dataset (e.g., MSCOCO, Flickr30k). A binary classification head is placed on top of the Transformer model, where the contextualized embedding of the [CLS] token is passed into the classification head. The weights of the classifier together with the Transformer, word embeddings, and image feature transformation matrices are fully fine-tuned using a binary cross-entropy (BCE) loss:
$$\mathcal{L}_{CE}(i, c) = -\big(y \log p(i, c) + (1 - y) \log(1 - p(i, c))\big)$$
where p(i, c) indicates the probability of the input combination of image i and caption c to have the positive label (i.e., whether it is the correct image-caption combination); y = 1 if (i, c) is a positive pair and y = 0 if either the image or text has been replaced (i.e., a negative pair).
Retrieval. At retrieval, all (i, c) combinations need to be processed, and are ranked by the probability p(i, c). For instance, given a text query c, retrieving the single most relevant image i from an image collection I proceeds as follows:
$$\arg\max_{i \in I} \; p(i, c) \quad (1)$$
Despite its typically high performance, this approach comes at high computational costs as each target instance needs to be passed through the entire network along with the query to obtain the score p(i, c); that is, the approach does not leverage any pre-computed representations during retrieval.
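The cross-encoder objective and its exhaustive retrieval procedure can be sketched in a few lines. This is a minimal illustration, not the released implementation: `bce_loss`, `ce_retrieve`, and the stand-in scorer `toy_scores` are hypothetical names, and a real scorer would run the full multi-modal Transformer on each (image, caption) pair.

```python
import math


def bce_loss(p, y):
    """Binary cross-entropy for one (image, caption) pair.

    p: predicted probability that the pair is aligned.
    y: 1 for a positive pair, 0 for a negative pair.
    """
    eps = 1e-12  # clamp to avoid log(0)
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def ce_retrieve(caption, images, score_fn):
    """Exhaustive CE retrieval: one scoring call per image in the collection."""
    return max(images, key=lambda image: score_fn(image, caption))


# Toy usage with a stand-in scorer (assumption: a lookup table replaces the model).
toy_scores = {"img_a": 0.10, "img_b": 0.85, "img_c": 0.40}
best = ce_retrieve("a person skiing", list(toy_scores), lambda i, c: toy_scores[i])
```

Note that `ce_retrieve` calls the scorer once per collection item, which is exactly the cost that makes pure CE retrieval slow on large collections.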

Bi-Encoders
Training. Each image and text caption is passed separately through the pretrained Transformer model (Figure 1b). The contextualized representations are mean-pooled to represent the embedding of the respective image i and text caption c. The objective of the twin network is to place positive training instances (i, c) closely in the shared multi-modal space, while unrelated instances should be placed farther apart. This is formulated through a standard triplet loss function. It leverages (i, c, c′) and (i, i′, c) triplets, where (i, c) are positive image-caption pairs from the training corpus, while c′ and i′ are negative examples sampled from the same corpus such that image-caption pairs/instances (i, c′) and (i′, c) do not occur in the corpus. The triplet loss is then:
$$\mathcal{L}_{BE}(i, c) = [\cos(\mathbf{i}, \mathbf{c}') - \cos(\mathbf{i}, \mathbf{c}) + \alpha]^{+} + [\cos(\mathbf{i}', \mathbf{c}) - \cos(\mathbf{i}, \mathbf{c}) + \alpha]^{+} \quad (2)$$
where [·]^+ = max(0, ·), α defines a margin, and i′ and c′ are embeddings of respective image and caption negatives.
Sampling Negative Examples. Negative examples may have a profound impact on training and performance, and it has been shown that selecting hard negative examples typically yields improved performance (Faghri et al., 2018). However, detecting such hard negatives is only possible with BE-based approaches, as cross-encoding all instances is computationally infeasible. We rely on the In-Batch Hard Negatives (BHN) method (Hermans et al., 2017), a computationally efficient sampling of hard negative examples. In a nutshell, BHN randomly samples a set of N negative examples from the training corpus and then ranks them according to their distance to all positive examples; for each positive example, the closest negative example is selected as the hardest negative example. By scaling up N, the probability of sampling truly hard negatives increases.
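A margin-based triplet loss with hardest-negative selection can be sketched as follows. The sketch makes several assumptions that are not stated in the text: the candidate pool of negatives is the current batch itself, every non-matching in-batch item is a true negative, cosine similarity is the similarity function, and the function name `triplet_loss_bhn` is hypothetical. It needs at least two pairs per batch.

```python
import math


def cos(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def triplet_loss_bhn(image_embs, caption_embs, margin=0.1):
    """Mean triplet loss where image_embs[j] and caption_embs[j] form a positive pair.

    For each positive pair, the most similar non-matching caption and the most
    similar non-matching image in the batch are used as the hardest negatives.
    """
    n = len(image_embs)
    sim = [[cos(image_embs[a], caption_embs[b]) for b in range(n)] for a in range(n)]
    total = 0.0
    for j in range(n):
        pos = sim[j][j]
        hardest_caption = max(sim[j][b] for b in range(n) if b != j)
        hardest_image = max(sim[a][j] for a in range(n) if a != j)
        total += max(0.0, hardest_caption - pos + margin)
        total += max(0.0, hardest_image - pos + margin)
    return total / n


aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_good = triplet_loss_bhn(aligned, aligned)  # positives already closest
loss_bad = triplet_loss_bhn(aligned, swapped)   # negatives closer than positives
```

With well-separated embeddings the loss vanishes, whereas a batch whose negatives are closer than its positives is penalized by the similarity gap plus the margin.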
Retrieval. The BE approach enables pre-encoding of all items for efficient retrieval lookup. For instance, a text query q is encoded with the bi-encoder and the most similar pre-encoded instance from an image collection I is retrieved: $\arg\max_{i \in I} \cos(\mathbf{i}, \mathbf{q})$. This approach can scale to even billions of images (Johnson et al., 2021), but it cannot be guaranteed that the important idiosyncratic information necessary to distinguish truly relevant from related examples is sufficiently encoded in the embedding. Further, the approach might not generalize well in low-resource scenarios as the model is not required to learn finer-grained parts of the input if they are never demanded by the training data.
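As a minimal sketch of this lookup, the snippet below mean-pools toy token vectors into embeddings, pre-encodes a two-image collection, and retrieves the nearest image by cosine similarity. The helper names (`mean_pool`, `normalize`, `be_retrieve`) and the two-dimensional toy vectors are assumptions for illustration; in practice the vectors would come from the fine-tuned Transformer.

```python
import math


def mean_pool(token_vectors):
    """Average contextualized token/region vectors into a single embedding."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]


def normalize(vector):
    """Scale a vector to unit length so that dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def be_retrieve(query_embedding, image_index):
    """Return the id of the pre-encoded image most similar to the query."""
    q = normalize(query_embedding)

    def cosine(image_id):
        return sum(a * b for a, b in zip(q, image_index[image_id]))

    return max(image_index, key=cosine)


# Pre-encode the collection once, offline (toy 2-d vectors).
image_index = {
    "img_a": normalize(mean_pool([[1.0, 0.0], [0.8, 0.2]])),
    "img_b": normalize(mean_pool([[0.0, 1.0], [0.1, 0.9]])),
}
query = mean_pool([[0.9, 0.1], [0.7, 0.3]])
nearest = be_retrieve(query, image_index)
```

Only the query is encoded at search time; the image embeddings are reused across all queries, which is what makes the approach fast.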

Separate Training, Cooperative Retrieval
We combine the benefits of the two model types (CE and BE) within a cooperative retrieval approach (SEP+COOP), as illustrated in Figure 1c.
Training and Retrieval. Two models, one CE (§3.1) and one BE (§3.2), are trained independently. Following that, the retrieval step is split into two stages. First, the efficient BE model is used to retrieve the top-k relevant items from the entire large collection, yielding a much smaller collection I_k:
$$I_k = \mathrm{top}_k\big(\{\cos(\mathbf{i}, \mathbf{c}) : i \in I\}\big)$$
where top_k(·) retrieves a set of the top-k most similar instances. Second, we rerank the instances from I_k with the more precise but computationally more expensive CE model: $\arg\max_{i \in I_k} p(i, c)$. This cooperative approach thus combines the benefits of both approaches and is able to efficiently retrieve instances. 9 However, given that this approach requires two models to be stored in memory, it is less parameter-efficient than the previous methods.
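The two-stage procedure can be sketched as below. The function `retrieve_and_rerank` and the toy scorers are hypothetical stand-ins: `be_score` would be a cosine similarity over pre-computed embeddings and `ce_score` a full cross-encoder forward pass.

```python
import heapq


def retrieve_and_rerank(query, collection, be_score, ce_score, k=20):
    """Retrieve the top-k items with the cheap scorer, then rerank them with the expensive one."""
    # Stage 1: bi-encoder scores over the whole collection, keep the k best.
    candidates = heapq.nlargest(k, collection, key=lambda item: be_score(item, query))
    # Stage 2: cross-encoder scores only for the k candidates.
    return sorted(candidates, key=lambda item: ce_score(item, query), reverse=True)


# Toy usage: items and queries are integers; the "cross-encoder" counts its calls.
ce_calls = []


def toy_be_score(item, query):
    return -abs(item - query)  # closer numbers are more similar


def toy_ce_score(item, query):
    ce_calls.append(item)
    return 1.0 if item == query + 2 else 0.0  # finer-grained preference


ranked = retrieve_and_rerank(50, range(100), toy_be_score, toy_ce_score, k=5)
```

In the toy run the expensive scorer is called five times rather than one hundred, and it still promotes an item that the cheap scorer had placed at the edge of the candidate list.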

Joint Training, Cooperative Retrieval
Training and Retrieval. Instead of relying on two fully separated models, we propose to train a single joint model, able to both cross-encode and embed (i.e., 'bi-encode'), see Figure 1d. The joint model with shared parameters trains by alternating between the respective sub-models and their input types. When cross-encoding, a dedicated prediction head is trained using BCE loss ( §3.1). In order to train the BE-based sub-model, we again rely on a twin architecture with a triplet loss from Eq. (2).
Retrieval proceeds with the same two-step retrieve-and-rerank procedure from §3.3. We first obtain the set I k with the much cheaper BE-based submodel, and then rerank its items with the CE submodel. We combine the best traits of CE and BE, while maintaining parameter efficiency. Using both learning objectives at training, the joint model is forced to observe the input from different viewpoints, thus improving its generalization capability while offering parameter efficiency.
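One way to picture the alternating training is the loop below. It is a sketch under explicit assumptions: the alternation granularity (one CE update, then one BE update) is our guess, since the text only says that training alternates between the sub-models, and `train_joint`, `ce_update`, and `be_update` are hypothetical names. Both update functions operate on the same parameter object, which is what ties the weights.

```python
def train_joint(params, ce_batches, be_batches, ce_update, be_update, num_steps):
    """Alternate CE (BCE loss) and BE (triplet loss) updates on shared parameters."""
    ce_iter, be_iter = iter(ce_batches), iter(be_batches)
    for step in range(num_steps):
        if step % 2 == 0:
            # Cross-encoder step: image-caption pairs are encoded jointly.
            params = ce_update(params, next(ce_iter))
        else:
            # Bi-encoder step: images and captions are encoded separately.
            params = be_update(params, next(be_iter))
    return params


# Toy usage: the "parameters" are a single number and each update adds to it.
update_log = []


def toy_ce_update(params, batch):
    update_log.append("CE")
    return params + batch


def toy_be_update(params, batch):
    update_log.append("BE")
    return params + 10 * batch


final_params = train_joint(0, [1, 1], [1, 1], toy_ce_update, toy_be_update, num_steps=4)
```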

Experimental Setup
Our fine-tuning framework from §3 can be applied to any pretrained multi-modal Transformer. In all the experiments, we opt for state-of-the-art pretrained multi-modal models for monolingual (English) and multilingual contexts: OSCAR (Li et al., 2020b) and M3P (Ni et al., 2021), respectively.
OSCAR is a single-stream multi-modal Transformer (Bugliarello et al., 2021), with its weights initialized with those of the pretrained BERT Base model, and then subsequently fine-tuned on multi-modal data (see §3). Unlike prior work, OSCAR additionally uses object labels of detected regions: Those labels serve as anchors for visual grounding, with large improvements achieved over prior work. M3P is a single-stream multilingual multi-modal Transformer. Its weights are initialized with those of pretrained XLM-R Base and then fine-tuned on multi-modal data (see §3) as well as multilingual text-only data.
9 Retrieval time for 1M images: 94ms (GPU), 13s (CPU).
Training and Test Data. We primarily experiment with the English image-text retrieval benchmarks MSCOCO and Flickr30k. They comprise 123k and 31.8k images, respectively, with 5 captions describing each image. MSCOCO provides two test benchmarks of sizes 1k and 5k, where the smaller set is a subset of the 5k test set. The standard Flickr30k test set consists of 1k images. In addition, we use the development set of Conceptual Captions (CC) (Sharma et al., 2018) for zero-shot evaluation, and also to construct a larger and more difficult test set (see later in §6). The original CC dev set contained 15.8k images, but currently, only 14k images are still available online.
For multilingual experiments, we use the standard Multi30k dataset (Elliott et al., 2016, 2017; Barrault et al., 2018), which extends Flickr30k with 5 German captions and one French and one Czech caption per image. Its test set also comprises 1k images.
The evaluation metric is the standard Recall-at-M (R@M): It reports the proportion of queries for which the relevant target item is present within the top-M retrieved items.
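A minimal sketch of the metric follows. The function names and the toy rankings are illustrative assumptions; a query counts as a hit if any of its relevant items appears in the top M, which covers the case of several reference captions per image. `mean_recall` averages over cutoffs for a single retrieval direction only.

```python
def recall_at_m(ranked_lists, relevant_sets, m):
    """Fraction of queries with at least one relevant item among the top-m results."""
    hits = sum(
        1
        for ranked, relevant in zip(ranked_lists, relevant_sets)
        if any(item in relevant for item in ranked[:m])
    )
    return hits / len(ranked_lists)


def mean_recall(ranked_lists, relevant_sets, cutoffs=(1, 5, 10)):
    """Average of R@M over several cutoffs for one retrieval direction."""
    return sum(recall_at_m(ranked_lists, relevant_sets, m) for m in cutoffs) / len(cutoffs)


# Toy usage: two queries, each with one relevant item.
rankings = [["a", "b", "c"], ["c", "a", "b"]]
relevant = [{"a"}, {"b"}]
r_at_1 = recall_at_m(rankings, relevant, 1)
r_at_3 = recall_at_m(rankings, relevant, 3)
```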
Training Setup and Hyperparameters. Our setup largely follows Li et al. (2020b) and Ni et al. (2021) unless noted otherwise. We experiment with learning rates [5e-5, 2e-5], and with the number of update steps between 25k and 125k. One batch contains 128 positive pairs plus 128 negative pairs with $\mathcal{L}_{CE}$. We use the AdamW optimizer (Loshchilov and Hutter, 2019) with a linear learning rate decay without warmup, and a weight decay of 0.05. We take model checkpoints every 5k steps and select the checkpoint with the best development set performance.

Baselines and Model Variants
CE. Our main baselines are OSCAR and M3P models used in the standard CE setting, described in §3.1. We fully fine-tune the Transformer weights along with a randomly initialized classification head. 11 At retrieval, we cross-encode each text-image combination and rank them according to the corresponding probability, see Eq. (1).

BE. We rely on BHN negative sampling, finding that training for 30k steps, with a learning rate of 5e-5, and with a margin α = 0.1 works best. 12
SEP+COOP. For the cooperative method without joint training (§3.3), we retrieve the top-20 instances with BE and rerank them via CE. 13
JOINT+COOP. We alternate between the two objective functions while training the joint model (see §3.4). We find that training for 60k update steps with a learning rate of 2e-5 (OSCAR) or 5e-5 (M3P) works best; the rest of the hyperparameters are the same as with separately trained models. For retrieval, we again set k = 20. To demonstrate the benefits of cooperative retrieval, we also evaluate two non-cooperative variants originating from the joint model: JOINT+CE uses the CE sub-model for a single-step CE-style retrieval, while JOINT+BE operates in the fully BE retrieval setup.
The underlying pretrained Transformer is denoted with a superscript: For example, JOINT+COOP OSCAR denotes that: 1) pretrained OSCAR is 2) fine-tuned with the joint variant from §3.4, and 3) then used in the cooperative retrieval setup.

Results and Discussion
The main results on English-only monolingual datasets Flickr30k and MSCOCO are summarized in Table 1, and the scores on multilingual Multi30k are provided in Table 2.
As expected, all Transformer-based approaches (groups G2 and G3) substantially outperform the pre-Transformer (PT) models (G1). While this has already been established in prior work for CE methods, our findings confirm that the same holds also for the efficient BE approach. This validates the effectiveness of Transformer architectures pretrained on large corpora for the retrieval task. R@1 scores with BE lag slightly behind the CE scores, but the respective R@10 scores are mostly on-par. This suggests that the BE approach is ''coarser-grained'', and mostly relies on ''global'' interactions between the modalities. We investigate this conjecture further in §6. This is also illustrated by an example in Figure 2. When dealing with related target items, CE's cross-attention mechanism is able to explicitly attend over each token and image region, capturing additional (non-global) information relevant to the query. Although the high-level ''global'' concept of a skiing person is present in (almost) every example, the additional important information related to what the person is wearing is not adequately represented in the embeddings. Therefore, the BE (sub)model does not rank this instance at the top position. The CE (sub)model then directly compares the instances, identifying that clothing is important, and reranks the target examples accordingly.
11 Training for 100k steps with a learning rate of 2e-5 (OSCAR) or 5e-5 (M3P) performed best.
12 We also experimented with Approximate nearest neighbor Negative Contrastive Estimation (ANCE) (Xiong et al., 2021); however, it did not yield performance benefits.
13 We provide an ablation study of different k values in §6. We have also experimented with training a CE model using hard negative samples from a pretrained BE model. However, the CE model is able to easily overfit on those negative examples, resulting in inferior performance.
Most importantly, the relative comparison of R@1 versus R@10 scores empirically hints at the necessity of the retrieve-and-rerank cooperative approach: The BE approach efficiently retrieves 20 relevant examples, but the increased expressiveness of CE is required to refine the initially retrieved list. Moreover, the results in the cooperative setup even without joint training (SEP+COOP OSCAR and SEP+COOP M3P ) demonstrate that the two models support each other: Slight improvements are observed over the pure CE, while offering massive efficiency boosts over CE. Our speculation is that the BE model filters out false positives, which in turn makes the CE model more robust.
The results of the JOINT+COOP variant indicate that it is indeed possible to maintain retrieval efficiency with improved parameter efficiency: This approach performs on-par with or even slightly outperforms the standard state-of-the-art CE models. The results verify that the two objective functions do not interfere with each other and that a single model is able to both embed and cross-encode. We note that the JOINT+COOP variant offers the best trade-off between parameter and retrieval efficiency, achieving the peak scores on the monolingual MSCOCO and Flickr30k benchmarks, and very competitive results on the multilingual Multi30k benchmark.
Table 2: [...] (Ni et al., 2021), we report mean Recall (mR) scores: mR computes an average score of Recall@1, Recall@5 and Recall@10 on image-to-text retrieval and text-to-image retrieval tasks. All methods in the comparison use text data from all four languages. We divide the models into groups G1-G3 as in Table 1. † indicates results taken directly from the literature (Ni et al., 2021) and ‡ indicates our own results. MULE (Kim et al., 2020); S-LIWE (Wehrmann et al., 2019); SMALR (Burns et al., 2020); CE M3P † (Ni et al., 2021).

Model             V100, 50k   V100, 1M   CPU, 50k   CPU, 1M
BE                16ms        37ms       0.2s       1.6s
SEP/JOINT+COOP    74ms        94ms       6s         13s
CE                2min        36min      2.4h       47h
Table 3: Retrieval latency for one query with an image collection of 50k or 1M images (with pre-encoded images) using a single GPU/CPU. Batch size for cross-encoding of the query with the images is 512. The GPU is an NVIDIA V100; the CPU is an Intel Xeon Gold 6154.

Further Analysis
We now discuss a series of additional experiments that further profile and analyze the proposed multi-modal retrieval approaches, focusing especially on the multiple efficiency aspects related to fine-tuning and retrieval stages.
Retrieval Efficiency. We empirically validate the time efficiency of our cooperative approaches for retrieval in an image search scenario (Table 3) and for evaluation on huge datasets (Table 4). To allow for a fair comparison between the approaches, we implement the entire retrieval pipeline, from model to nearest-neighbor search, in PyTorch without additional optimization such as multi-processing or optimized nearest-neighbor search libraries like FAISS (Johnson et al., 2021). Our measurements confirm the efficiency of BEs in comparison to CEs. The cooperative approaches, which only have to cross-encode a constant number of items independent of the collection size, are close in retrieval latency to BE for image search and remain feasible even for large datasets.
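The reason the cooperative variants scale can be made concrete by counting cross-encoder work per query. The sketch below is a back-of-the-envelope cost model, not a measurement: the function names are hypothetical, while the batch size of 512 and k = 20 are the values reported in this paper.

```python
import math


def cross_encoded_items(collection_size, k=None):
    """Items the CE must score for one query; k=None means pure CE retrieval."""
    return collection_size if k is None else min(k, collection_size)


def ce_forward_batches(num_items, batch_size=512):
    """Number of cross-encoder forward batches needed to score num_items."""
    return math.ceil(num_items / batch_size)


# Pure CE over 1M images versus reranking the top 20 BE candidates.
pure_ce_batches = ce_forward_batches(cross_encoded_items(1_000_000))
coop_batches = ce_forward_batches(cross_encoded_items(1_000_000, k=20))
```

Under this model the pure CE workload grows linearly with the collection, while the cooperative workload stays at a single batch regardless of collection size; the remaining cooperative latency comes from the BE nearest-neighbor search.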
Larger Benchmarks. The results in Table 1 indicate that current top-performance models achieve very high scores in absolute terms on the standard retrieval benchmarks. However, this is partially due to image collections that are too small, with only a few thousand instances; one undesired effect is that it becomes increasingly difficult to identify significant differences between model performances. Unfortunately, the inefficiency of CE models, as empirically validated in Tables 3-4, has prevented evaluation with larger collections. However, the more efficient fully BE-based and SEP+COOP methods now enable evaluation on larger collections and in realistic scenarios. We thus increase the benchmark size by merging test instances from different available evaluation sets. In particular, we construct a collection spanning 20k images: It blends the test sets of MSCOCO (5k instances), Flickr30k (1k), and the development set of CC (14k). Note that we simply augment the benchmarks but the query set with labels for each standardized evaluation task/set remains unchanged; in other words, the instances from other datasets are used as distractors that increase the search space and make the retrieval task more difficult. The results thus provide insights into the model performance in the target domain, as well as its robustness regarding out-of-distribution data. In Table 5, we now observe differences that were lacking with the smaller benchmarks. The pure BE-based approach now substantially underperforms SEP/JOINT+COOP variants. JOINT+COOP remains the best-scoring variant overall.
Zero-Shot Performance. Relying on multi-modal and multilingual representations fine-tuned for cross-modal retrieval, the proposed methods should also generalize to new unseen captions and images beyond the dataset used for fine-tuning. Therefore, we directly transfer the model fine-tuned on one dataset to the test data of another dataset (e.g., fine-tune on MSCOCO data, test on Flickr30k). As baselines, we use the reported zero-shot results of UNITER for Flickr30k 14 and we also evaluate the CLIP model. 15 The zero-shot results in Table 6 reveal that the CE variant slightly outperforms other approaches when transferring from Flickr30k to MSCOCO, while JOINT+COOP OSCAR remains competitive. However, for the opposite direction, we achieve considerable performance gains with the JOINT+COOP OSCAR variant. On CC, all variants considerably underperform CLIP; we speculate that it might be due to a more diverse set of images included in CC, including illustrations, which exist in neither MSCOCO nor Flickr30k. This means that CLIP has a considerable advantage on CC due to its exposure to massive amounts of data during pretraining.
We also report multilingual zero-shot results, where we fine-tune on the English Multi30k captions and test on the captions in other languages.
14 They do not report results for MSCOCO.
15 We use the ViT-B/32 model variant. Retrieval results from Radford et al. (2021)
Sample Efficiency. We also analyze how the amount of image-text data for fine-tuning impacts the retrieval performance; we thus sample smaller datasets from the full MSCOCO training set, covering 1k, 10k, and 50k images with their captions (5 per image). The results in Figure 3 reveal that BE-based approaches are in general considerably less sample-efficient than cross-encoders. They particularly struggle in the lowest-data scenario with only 1k images available; this is also reflected in the lower performance of JOINT+COOP in the 1k setup. A reason behind the more effective adaptation of CE to low-data regimes might be their richer "input consumption": with 1k images and 5k captions, CE runs a whole grid of 1k×5k items through its network, which provides more learning signal when fewer data are available. In contrast, BE-based approaches are expected to learn effective encoders of both modalities separately based solely on 1k images and 5k captions, without any cross-modal interaction.
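The size of the 1k×5k grid mentioned above can be made concrete with a short arithmetic sketch. The helper `pair_counts` and its field names are hypothetical; the only inputs taken from the text are the subset sizes and the 5 captions per image.

```python
CAPTIONS_PER_IMAGE = 5  # stated in the text: 5 captions per image


def pair_counts(num_images: int, captions_per_image: int = CAPTIONS_PER_IMAGE) -> dict:
    """Count single items, matching pairs, and all possible image-caption pairings."""
    num_captions = num_images * captions_per_image
    return {
        "images": num_images,
        "captions": num_captions,
        # Each caption matches exactly one image.
        "matching_pairs": num_captions,
        # Every image paired with every caption: the full grid.
        "grid_pairs": num_images * num_captions,
    }


for subset_size in (1_000, 10_000, 50_000):
    print(pair_counts(subset_size))
```

For the 1k subset this gives 5,000 captions and 5,000,000 possible image-caption pairings, compared with only 6,000 individual items available for encoding each modality separately.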
Parameter Efficiency. We also provide a simple parameter efficiency analysis by initializing the models with pretrained OSCAR weights, but only passing the representations through every second layer, effectively halving the total number of Transformer parameters. The results are shown in Figure 4. With all approaches, the "halved" model reaches around 90% of the performance of the full Transformer. Overall, the JOINT+COOP method again achieves the highest scores. This suggests that the proposed fine-tuning approaches are also applicable to smaller models, with similar relative trends in retrieval results.
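Passing the representations through every second layer can be sketched as follows. This is a toy illustration, not the released implementation: the layers are stand-in functions that only record their index, the 12-layer depth is an assumption for the example, and starting the stride at the first layer is also an assumption, since the text does not say which half of the layers is kept.

```python
from typing import Callable, List

# A stand-in "layer" that maps a trace of visited layer indices to a longer trace.
Layer = Callable[[List[int]], List[int]]


def make_toy_layer(index: int) -> Layer:
    """Return a toy layer that appends its own index to the running trace."""
    def layer(trace: List[int]) -> List[int]:
        return trace + [index]
    return layer


def run_encoder(layers: List[Layer], hidden: List[int], stride: int = 1) -> List[int]:
    """Apply every `stride`-th layer in order; stride=2 skips every other layer."""
    for layer in layers[::stride]:
        hidden = layer(hidden)
    return hidden


layers = [make_toy_layer(i) for i in range(12)]  # assumed depth, for illustration only
full_trace = run_encoder(layers, [])
halved_trace = run_encoder(layers, [], stride=2)
print(full_trace)
print(halved_trace)
```

The weights of the retained layers are unchanged, so the "halved" model still starts from the pretrained initialization; only the depth of the forward pass is reduced.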
Retrieving Top-k. We analyze different values for k for top-k retrieval of the BE component in Table 8. Selecting small values for k significantly decreases the retrieval latency, as fewer instances need to be cross-encoded. However, selecting k values that are too small can come at a cost of precision, as the true positive instance might not be among the top-k retrieved instances of the BE model. In our experiments, k = 20 achieves the best trade-off between precision and retrieval latency.
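The role of k in the two-stage pipeline can be sketched in a few lines. This is a minimal illustration under stated assumptions: `retrieve_and_rerank`, the toy embeddings, and the toy cross-encoder scores are hypothetical, and a real system would use the fine-tuned BE embeddings and CE model instead.

```python
import math
from typing import Callable, Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def retrieve_and_rerank(
    query_vec: Sequence[float],
    item_vecs: Dict[str, Sequence[float]],
    cross_score: Callable[[str], float],
    k: int,
) -> List[str]:
    """Stage 1: top-k by bi-encoder cosine. Stage 2: rerank those k by cross-encoder score."""
    candidates = sorted(
        item_vecs, key=lambda item: cosine(query_vec, item_vecs[item]), reverse=True
    )[:k]
    # Only the k candidates are cross-encoded, which is where the latency saving comes from.
    return sorted(candidates, key=cross_score, reverse=True)


# Hypothetical precomputed item embeddings and query embedding.
item_vecs = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [0.5, 0.5]}
query_vec = [1.0, 0.0]

# Hypothetical cross-encoder scores; item "c" is the true match but has a poor BE score.
toy_cross = {"a": 0.2, "b": 0.9, "c": 0.99, "d": 0.5}

top2 = retrieve_and_rerank(query_vec, item_vecs, lambda item: toy_cross[item], k=2)
top4 = retrieve_and_rerank(query_vec, item_vecs, lambda item: toy_cross[item], k=4)
print(top2, top4)
```

With k = 2 the true match never reaches the cross-encoder, while k = 4 recovers it at the price of twice as many cross-encoder passes, which mirrors the precision-latency trade-off described above.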
Combining Ranking. We evaluate the ranking score combination of the two components JOINT+BE and JOINT+CE in Table 9. We combine the rankings of the bi-encoder submodel and the cross-encoder submodel by summing their scores in two different variations: (1) We directly add the scores in a weighted sum $\mathrm{ADD}_\lambda(e, c) = \lambda e + (1 - \lambda) c$, where $e$ and $c$ are the embedding cosine similarity and the cross-encoder similarity score, respectively, and $\lambda$ is a weighting parameter. The cross-encoder scores have been processed with a sigmoid function so that both $e$ and $c$ are in the same value range. The final ranking is then defined by $\mathrm{ADD}_\lambda(e, c)$.
(2) We separately 0-1-normalize the scores for the top-k candidates of the bi- and cross-encoder before combining them for $\mathrm{NORM\_ADD}_\lambda(e, c)$, which is defined analogously to $\mathrm{ADD}_\lambda(e, c)$.
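Both combination variants can be sketched directly from the definitions above. The sketch makes two assumptions that the text does not confirm: 0-1-normalization is implemented as min-max scaling over the top-k candidates, and the cross-encoder emits raw logits that are squashed with a sigmoid. All function names and toy values are hypothetical.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def add_scores(e: Sequence[float], c: Sequence[float], lam: float) -> List[float]:
    """Weighted sum lam * e + (1 - lam) * c, applied per candidate."""
    return [lam * e_i + (1.0 - lam) * c_i for e_i, c_i in zip(e, c)]


def min_max(scores: Sequence[float]) -> List[float]:
    """Assumed form of 0-1-normalization: min-max scaling over the candidate list."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]  # all candidates tie; no ranking signal
    return [(s - lo) / (hi - lo) for s in scores]


def norm_add_scores(e: Sequence[float], c: Sequence[float], lam: float) -> List[float]:
    """Normalize each score list separately, then apply the same weighted sum."""
    return add_scores(min_max(e), min_max(c), lam)


# Hypothetical scores for three top-k candidates.
e = [0.8, 0.6, 0.4]                        # bi-encoder cosine similarities
c = [sigmoid(x) for x in (0.0, 2.0, -2.0)]  # cross-encoder logits after sigmoid

combined = add_scores(e, c, lam=0.5)
normalized = norm_add_scores(e, c, lam=0.5)
cross_only = add_scores(e, c, lam=0.0)  # lam = 0 ranks by the cross-encoder alone
print(combined, normalized, cross_only)
```

Setting `lam` to 0 reduces both variants to ranking by the cross-encoder score alone, and setting it to 1 reduces them to the bi-encoder ranking.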
However, we find that relying solely on the cross-encoder achieves the best results. This suggests that the scores by the bi-encoder are useful in the ''global'' scope with all data to retrieve strong candidates but in the ''local'' scope of the top-k candidates, the cross-encoder is superior.

Conclusion
We proposed a novel framework that converts pretrained multi-modal Transformers into effective and efficient cross-modal retrieval models. The framework is applicable to any pretrained model and combines the efficiency of bi-encoder (BE) approaches with the accuracy of computationally more demanding cross-encoding (CE) approaches. Their synergistic effect at retrieval is achieved through a cooperative retrieve-and-rerank regime, where the initial retrieval from a large collection is performed via efficient BE approaches, followed by another accuracy-driven step via a CE model. Moreover, we introduced a parameter-efficient joint fine-tuning regime that blends BE and CE into a single model with shared weights. Our results with state-of-the-art pretrained models across a range of standard monolingual and multilingual cross-modal retrieval tasks and setups validated the strong performance of such cooperative and joint approaches. At the same time, we demonstrated their retrieval efficiency, which makes them viable in realistic retrieval scenarios with large collections. In future work, we will put more focus on zero-shot and few-shot retrieval scenarios, and expand the approach to more languages, modalities, and tasks.