Abstract
Current state-of-the-art approaches to cross-modal retrieval process text and visual input jointly, relying on Transformer-based architectures with cross-attention mechanisms that attend over all words and objects in an image. While offering unmatched retrieval performance, such models 1) are typically pretrained from scratch and thus less scalable, and 2) suffer from huge retrieval latency and inefficiency issues, which makes them impractical in realistic applications. To address these crucial gaps towards both improved and efficient cross-modal retrieval, we propose a novel fine-tuning framework that turns any pretrained text-image multi-modal model into an efficient retrieval model. The framework is based on a cooperative retrieve-and-rerank approach that combines: 1) twin networks (i.e., a bi-encoder) to separately encode all items of a corpus, enabling efficient initial retrieval, and 2) a cross-encoder component for a more nuanced (i.e., smarter) ranking of the retrieved small set of items. We also propose to jointly fine-tune the two components with shared weights, yielding a more parameter-efficient model. Our experiments on a series of standard cross-modal retrieval benchmarks in monolingual, multilingual, and zero-shot setups demonstrate improved accuracy and huge efficiency benefits over the state-of-the-art cross-encoders.1
1 Introduction
Information-rich and efficient methods for dealing with large unstructured data in both computer vision and NLP are required to process and understand huge amounts of user-created content and beyond. In multi-modal contexts, such methods enable fundamental applications such as image retrieval. A typical efficient bi-encoder2 approach encodes images and text separately and then induces a shared high-dimensional multi-modal feature space. This enables cross-modal retrieval, where standard distance metrics identify the most similar examples for each query in the data collection via nearest-neighbor search (Arya et al., 1998; Kushilevitz et al., 2000; Liu et al., 2004; Andoni and Indyk, 2008; Hajebi et al., 2011).
These bi-encoder approaches have already been shown to achieve reasonable performance in search and retrieval applications, both monolingually for English (Nam et al., 2017; Faghri et al., 2018; Zheng et al., 2020; Wang et al., 2019a; Shi et al., 2019) and in multilingual contexts (Gella et al., 2017; Kádár et al., 2018; Kim et al., 2020; Wehrmann et al., 2019; Burns et al., 2020). However, they cannot match the performance of more recent attention-based methods. Here, a typical modus operandi is to apply a cross-attention mechanism between examples from the two modalities to compute their similarity score, relying on Transformer-based neural architectures (Vaswani et al., 2017). Such so-called multi-modal cross-encoders (CE) (Tan and Bansal, 2019; Lu et al., 2019; Chen et al., 2020; Li et al., 2020a; Gan et al., 2020; Li et al., 2020b; Ni et al., 2021) pass each text-image pair through the multi-modal encoder to compute their similarity; see Figure 1a.
While the results accomplished by the CE methods look impressive (Li et al., 2020b; Bugliarello et al., 2021; Ni et al., 2021), they come at a prohibitive cost. In particular, these methods have extremely high search latency: Processing a single text query against an image collection of 1M items may take up to 36 minutes on a single NVIDIA V100 GPU (see Table 3). Due to this issue, they are evaluated only on extremely small benchmarks: typical image collections for image retrieval tasks contain at most 5k images, and even then evaluation lasts ≈50 hours (see Table 4).3 In sum, cross-encoders are impractical for deployment in realistic application scenarios, while the use of small benchmarks results in inflated and thus misleading evaluation performance.
In unimodal text-only setups, Transformer-based architectures have recently been integrated with bi-encoder (BE) methods (Guo et al., 2018; Reimers and Gurevych, 2019; Humeau et al., 2020; Henderson et al., 2020; Feng et al., 2020, inter alia), yielding computationally more efficient sentence encoders. Instead of jointly encoding sentence pairs with cross-attention, a pretrained Transformer model (e.g., BERT [Devlin et al., 2019]) is fine-tuned within a twin network with shared Transformer weights, as illustrated in Figure 1b. In a nutshell, each sentence is passed through the encoder separately, and a loss function is defined on top of the two respective separately computed encodings. However, despite their strong performance on sentence retrieval and similarity tasks (Reimers and Gurevych, 2019; Litschko et al., 2021), these encoders cannot match the task performance of cross-encoders (Humeau et al., 2020).
Motivated by these insights, in this work we aim to leverage the best of both worlds towards improved and more efficient cross-modal search and retrieval: 1) efficiency and simplicity of BE approaches based on twin networks, as well as 2) expressiveness and cutting-edge performance of CE methods. We first provide a systematic comparative analysis on the effectiveness and efficiency of Transformer-based multi-modal BE and CE methods across a range of image search evaluation benchmarks. We then propose two novel models that aim to blend the main strengths of CE and BE. The idea behind the first model variant, termed cooperative (Sep+Coop), is to retrieve and rerank with two separate, independently trained retrieval models: 1) an initial top-k list of potentially relevant items (i.e., texts or images) is retrieved by the more efficient BE model, and then 2) this top-k list is reranked “smartly” by the more accurate CE model, as illustrated in Figure 1c. Our second, joint (Joint+Coop) model variant also operates in the same retrieve-and-rerank setup, but it now trains a multi-modal cross-encoder and a multi-modal BE model jointly with tied weights, as illustrated in Figure 1d. The retrieve step, where efficiency is paramount, is again executed by the BE sub-model, and the precision-oriented rerank step is conducted via the CE sub-model.
We propose a general framework for cross-modal search and retrieval, where the Joint+Coop and Sep+Coop models are independent of the chosen pretrained vision-language representation architecture. The experiments are based on the state-of-the-art vision-language architectures OSCAR (Li et al., 2020b) (for English experiments) and M3P (Ni et al., 2021) (for multilingual experiments), and we demonstrate consistent improvements over the original OSCAR model on the standard benchmarks MSCOCO and Flickr30k, as well as improvements over the original M3P in multiple languages on the Multi30k dataset. We also empirically validate the huge efficiency benefits of the proposed framework.
Contributions.
1) We construct and systematically evaluate twin networks combined with multi-modal Transformers (BE); they outperform all previous bi-encoder approaches, but lag behind their CE counterparts. 2) We evaluate BE and CE approaches within a cooperative retrieve-and-rerank approach; their combination outperforms the individual models, while offering substantial efficiency boosts compared to CE methods. 3) We propose a novel joint CE-BE model (Joint+Coop), which is trained to simultaneously cross-encode and embed multi-modal input; it achieves the highest scores overall while maintaining retrieval efficiency. 4) Finally, we propose a more realistic evaluation benchmark; we demonstrate drastic drops in overall cross-modal retrieval performance of all models in this more difficult scenario, calling for improved evaluation benchmarks and protocols in future work.
2 Related Work
Earlier efficient approaches to cross-modal image-text retrieval relied on the induction of shared multi-modal visual-semantic embedding spaces (VSEs) (Frome et al., 2013; Faghri et al., 2018; Shi et al., 2019; Mahajan et al., 2019). In a multilingual setup, all languages share the same embedding space along with the visual data (Kim et al., 2020; Wehrmann et al., 2019; Burns et al., 2020). More recently, attention-based cross-encoder models, typically based on Transformer architectures (Vaswani et al., 2017), have considerably outperformed the VSE-based approaches. However, this comes at a severe cost of decreased retrieval efficiency and increased latency (Lee et al., 2018; Wang et al., 2019b). The current state-of-the-art multi-modal models jointly encode and cross-attend over text tokens and image features (Lu et al., 2019; Tan and Bansal, 2019; Chen et al., 2020; Li et al., 2020a; Gan et al., 2020; Li et al., 2020b; Bugliarello et al., 2021; Ni et al., 2021, inter alia). These CE methods leverage image captioning datasets such as MSCOCO (Lin et al., 2014) and Flickr30k (Plummer et al., 2015) and train a classification head that learns to identify whether or not an (image, caption) input pair constitutes an aligned pair. Each image-text combination must be passed through the network, so the number of forward passes scales quadratically with the number of examples.
To handle this quadratic increase, we use a cooperative retrieve-and-rerank approach. Although to the best of our knowledge this has not been proposed for cross-modal settings, it has a long history in NLP, where Yates et al. (2021) date it back to the 1960s (Simmons, 1965). Until recently, bag-of-words methods (BoW; e.g., BM25) were commonly used for the first retrieval step. For the second step, pretrained language models (LMs) were fine-tuned to either rerank candidates (Nogueira and Cho, 2019; Nogueira et al., 2019) or—for question-answering tasks—directly generate the answer span (Yang et al., 2019). More recent work on text-based retrieval and QA tasks has moved away from BoW methods towards learned (neural) models for the first retrieval step (Karpukhin et al., 2020; Qu et al., 2021; Xiong et al., 2021).
Our work is inspired by recent BE-based approaches in unimodal text-only setups. Here, LMs are fine-tuned via twin-network architectures on auxiliary tasks such as semantic textual similarity (Reimers and Gurevych, 2019; Humeau et al., 2020), paraphrasing (Wieting et al., 2019), response retrieval (Yang et al., 2018; Henderson et al., 2019; Henderson et al., 2020; Humeau et al., 2020), or translation ranking (Chidambaram et al., 2019; Feng et al., 2020). This effectively turns the LMs into universal sentence encoders which can then be used off-the-shelf for efficient text-based monolingual and cross-lingual retrieval (Litschko et al., 2021). In this work, we first extend this idea to multi-modal setups, and then show that our cooperative and joint approaches yield improved cross-modal retrieval models while maintaining retrieval efficiency.
Joint approaches like our Joint+Coop model, which aim to align the retriever and the reranker, can be found in different forms: Boualili et al. (2020) “mark” exact matches from the bag-of-words retrieval for the reranker; Yan et al. (2021) share the parameters between a passage expander (which adds more relevant terms for a bag-of-words retriever) and the reranker; Hofstätter et al. (2020) distill knowledge from the reranker into the retriever model with soft labels generated by the teacher. Specifically for question answering—where a two-stage retriever-reader setup similar to the retrieve-and-rerank approach is common—research aims to synchronize the models through knowledge distillation from the reader to the retriever (Yang and Seo, 2020; Izacard and Grave, 2021) or by directly training both models end-to-end (Lee et al., 2019; Sachan et al., 2021a, b). The challenge here is that the reader and the retriever are coupled: the reader requires candidates from the retriever that contain the solution. Our proposed reranker side-steps this problem as it uses no candidates from the retriever during training and only learns whether a given input pair is (dis)similar. This way, we can train both components, the retriever and the reranker, side-by-side and align them by sharing their weights.
The work most closely related to ours includes contemporaneous models: ALBEF (Li et al., 2021), CLIP (Radford et al., 2021), ALIGN (Jia et al., 2021), and VisualSparta (Lu et al., 2021). ALBEF includes contrastive learning as one of its pretraining tasks but then uses a CE approach for downstream retrieval. CLIP and ALIGN use contrastive learning strategies similar to ours, but are cast as full-fledged pretraining architectures that learn from scratch and require orders of magnitude more data than our approach. We show that it is possible to fine-tune pretrained models with far less data, and we offer a general framework applicable to a spectrum of pretrained models. Further, unlike prior work, we demonstrate the benefits of combining BE-based (contrastive) learning with cross-encoders for improved and efficient retrieval.4 Finally, VisualSparta (Lu et al., 2021) fine-tunes OSCAR, but at the level of token (text) and image-region embeddings. This enables the use of extremely fast lookup tables for efficient retrieval. However, it comes with a major disadvantage: the model discards wider contextual information.5 Our cooperative methods, in contrast, do leverage this finer-grained contextual information at retrieval time.
3 Methodology
The predominant Transformer-based multi-modal text-vision architecture is a single-stream encoder: It shares the majority of weights between the two modalities, including the multi-head cross-attention (Chen et al., 2020; Li et al., 2020a; Gan et al., 2020; Li et al., 2020b; Ni et al., 2021). The Transformer weights and text embeddings are typically initialized with the weights of a pretrained LM (e.g., BERT [Devlin et al., 2019] for English, XLM-R [Conneau et al., 2020] for multilingual models), and the corresponding vocabulary and tokenizer are utilized. Images are preprocessed via object detection models such as Faster R-CNN (Ren et al., 2015) to extract feature representations for regions of interest (Anderson et al., 2018). The image features are passed through an affine-transformation layer that learns to align the vision input with the pretrained Transformer. The position of the region of interest (or, in some models, also the region's width and height) is used to generate positional embeddings. These two representations are combined, and each object region is passed into the Transformer as a separate input. The cross-attention mechanism of the Transformer attends over all text and image inputs at every layer, thus learning a joint representation of both modalities.
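For concreteness, the following minimal sketch shows how detector-based region features could be mapped into the encoder's input space; the class name, dimensions, and the 4-value box encoding are illustrative assumptions rather than the exact OSCAR or M3P implementation.

```python
import torch
import torch.nn as nn

class VisualInputEmbedding(nn.Module):
    """Project detected regions into the Transformer's hidden space (illustrative sketch)."""

    def __init__(self, region_dim: int = 2048, pos_dim: int = 4, hidden_dim: int = 768):
        super().__init__()
        self.feat_proj = nn.Linear(region_dim, hidden_dim)  # affine layer aligning vision features
        self.pos_proj = nn.Linear(pos_dim, hidden_dim)      # positional embedding from box coordinates

    def forward(self, region_feats: torch.Tensor, boxes: torch.Tensor) -> torch.Tensor:
        # region_feats: (num_regions, region_dim) pooled features from an object detector
        # boxes:        (num_regions, pos_dim) normalized region positions (optionally + width/height)
        return self.feat_proj(region_feats) + self.pos_proj(boxes)

# Each projected region then acts like an extra input token: it is concatenated with the
# text-token embeddings, and self-attention attends over both modalities at every layer.
```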
Similar to masked language modeling (MLM) in the text domain, multi-modal Transformer models are trained with self-supervised objectives. For pretraining, image-caption datasets (i.e., MSCOCO [Lin et al., 2014], Flickr30k [Plummer et al., 2015], Conceptual Captions (CC) [Sharma et al., 2018], and SBU [Ordonez et al., 2011]) are utilized. The pretrained multi-modal model is subsequently fine-tuned with multi-modal downstream task data.
We focus on different fine-tuning strategies of the pretrained models for the downstream task of image-text retrieval. We illustrate these approaches in Figure 1 and describe them in what follows.
3.1 Cross-Encoders
For image and text retrieval tasks, the prevailing approach with pretrained multi-modal Transformer models is to cross-encode each image-text combination (see Figure 1a).
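To illustrate why this is costly at retrieval time, the sketch below scores an entire image collection for a single text query; `cross_encoder`, `tokenizer`, and `region_features` are hypothetical stand-ins for a single-stream model with a binary alignment head and its preprocessing, not the actual API of the released code.

```python
import torch

def ce_rank(query: str, image_ids: list, cross_encoder, tokenizer, region_features) -> list:
    """Rank an image collection for one text query with a cross-encoder (sketch)."""
    inputs = tokenizer(query)                       # text tokens (identical for every pair)
    scores = []
    for img_id in image_ids:
        visual = region_features[img_id]            # detected region features of this image
        logit = cross_encoder(inputs, visual)       # joint forward pass over the (text, image) pair
        scores.append(torch.sigmoid(logit).item())  # probability that the pair is aligned
    # One forward pass per candidate image: the cost grows with the collection size for every query.
    ranked = sorted(zip(scores, image_ids), key=lambda t: t[0], reverse=True)
    return [img_id for _, img_id in ranked]
```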
Training.
Retrieval.
3.2 Bi-Encoders
Sampling Negative Examples.
Negative examples may have a profound impact on training and performance, and it has been shown that selecting hard negative examples typically yields improved performance (Faghri et al., 2018). However, detecting such hard negatives is only possible with BE-based approaches, as cross-encoding all instances is computationally infeasible. We rely on the In-Batch Hard Negatives (BHN) method (Hermans et al., 2017), a computationally efficient sampling of hard negative examples. In a nutshell, BHN randomly samples a set of N negative examples from the training corpus and then ranks them according to their distance to all positive examples; for each positive example, the closest negative example is selected as the hardest negative example. By scaling up N, the probability of sampling truly hard negatives increases.
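A minimal sketch of such training is given below; since Eq. (2) is not reproduced here, we assume the standard margin-based triplet formulation over L2-normalized caption and image embeddings of one batch.

```python
import torch
import torch.nn.functional as F

def bhn_triplet_loss(txt_emb: torch.Tensor, img_emb: torch.Tensor, margin: float = 0.1) -> torch.Tensor:
    """In-batch hard negatives (sketch): txt_emb[i] and img_emb[i] form the i-th positive pair."""
    sims = txt_emb @ img_emb.t()                           # (B, B) pairwise cosine similarities
    pos = sims.diag()                                      # similarities of the aligned pairs
    mask = torch.eye(sims.size(0), dtype=torch.bool, device=sims.device)
    off_diag = sims.masked_fill(mask, float("-inf"))
    hard_img = off_diag.max(dim=1).values                  # hardest negative image per caption
    hard_txt = off_diag.max(dim=0).values                  # hardest negative caption per image
    loss = F.relu(margin - pos + hard_img) + F.relu(margin - pos + hard_txt)
    return loss.mean()
```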
Retrieval.
The BE approach enables pre-encoding of all items for efficient retrieval look-up.8 For instance, a text query q is encoded with the bi-encoder, and the most similar pre-encoded instance from an image collection I is retrieved as î = arg max_{i ∈ I} cos(i, q), where cos(i, q) denotes the cosine similarity between the pre-computed image embedding and the query embedding.
This approach can scale to even billions of images (Johnson et al., 2021), but it cannot be guaranteed that the important idiosyncratic information necessary to distinguish truly relevant from related examples is sufficiently encoded in the embedding. Further, the approach might not generalize well in low-resource scenarios as the model is not required to learn finer-grained parts of the input if they are never demanded by the training data.
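The following sketch illustrates the pre-encoding and look-up just described; `encode_image` and `encode_text` are placeholders for the fine-tuned BE encoders.

```python
import torch

@torch.no_grad()
def precompute_image_index(images, encode_image) -> torch.Tensor:
    """Encode the whole collection once and store unit-length embeddings for cosine look-up."""
    embs = torch.stack([encode_image(img) for img in images])      # (n, d)
    return torch.nn.functional.normalize(embs, dim=-1)

@torch.no_grad()
def be_search(query: str, index: torch.Tensor, encode_text, k: int = 10) -> torch.Tensor:
    """Answer a query against the pre-computed index with a single matrix-vector product."""
    q = torch.nn.functional.normalize(encode_text(query), dim=-1)  # (d,)
    sims = index @ q                                               # (n,) cosine similarities
    return sims.topk(k).indices                                    # ids of the k most similar images
```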
3.3 Separate Training, Cooperative Retrieval
We combine the benefits of the two model types (CE and BE) within a cooperative retrieval approach (Sep+Coop), as illustrated in Figure 1c.
Training and Retrieval.
Two models, one CE (§3.1) and one BE (§3.2), are trained independently. Following that, the retrieval step is split into two stages. First, the efficient BE model is used to retrieve the top-k relevant items from the entire large collection, yielding a much smaller collection Ik: Ik = topk({cos(i, q) : ∀i ∈ I}), where topk(⋅) retrieves the set of the top-k most similar instances. Second, we rerank the instances from Ik with the more precise but computationally more expensive CE model, sorting all i ∈ Ik by their CE alignment probability from Eq. (1). This cooperative approach thus combines the benefits of both model types and is able to efficiently retrieve instances.9 However, given that it requires two models to be stored in memory, it is less parameter-efficient than the previous methods.
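A compact sketch of the two-stage procedure is shown below, reusing a pre-computed, L2-normalized BE image index as in §3.2; `encode_text` and `ce_score` are hypothetical interfaces for the BE and CE models.

```python
import torch

def retrieve_and_rerank(query: str, image_index: torch.Tensor, encode_text, ce_score, k: int = 20) -> list:
    """Sep+Coop retrieval (sketch): BE narrows the collection, CE reranks only the k candidates."""
    sims = image_index @ torch.nn.functional.normalize(encode_text(query), dim=-1)
    candidates = sims.topk(k).indices.tolist()                       # stage 1: cheap, over all n items
    candidates.sort(key=lambda i: ce_score(query, i), reverse=True)  # stage 2: expensive, but only k items
    return candidates
```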
3.4 Joint Training, Cooperative Retrieval
Training and Retrieval. Instead of relying on two fully separate models, we propose to train a single joint model that is able to both cross-encode and embed (i.e., ‘bi-encode’); see Figure 1d. The joint model, with shared parameters, is trained by alternating between the two sub-models and their respective input types. When cross-encoding, a dedicated prediction head is trained with the BCE loss (§3.1). To train the BE-based sub-model, we again rely on a twin architecture with the triplet loss from Eq. (2).
Retrieval proceeds with the same two-step retrieve-and-rerank procedure from §3.3: We first obtain the set Ik with the much cheaper BE-based submodel, and then rerank its items with the CE submodel. This combines the best traits of CE and BE while maintaining parameter efficiency. Trained with both learning objectives, the joint model is forced to observe the input from different viewpoints, which improves its generalization capability.
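The sketch below shows one way such an alternating schedule could look; `model.cross_encode`, `model.encode_text`, and `model.encode_image` are hypothetical interfaces of a single network with shared weights, and the 50/50 alternation is an illustrative assumption.

```python
import torch.nn.functional as F

def joint_training_step(model, ce_batch: dict, be_batch: dict, step: int, triplet_loss_fn):
    """One update of the joint model (sketch): shared weights, alternating objectives."""
    if step % 2 == 0:
        # CE branch: jointly encode (caption, image) pairs; the prediction head emits alignment logits.
        logits = model.cross_encode(ce_batch["captions"], ce_batch["images"])
        loss = F.binary_cross_entropy_with_logits(logits, ce_batch["labels"])  # float labels: 1.0 / 0.0
    else:
        # BE branch: encode each modality separately (twin network) and apply the triplet loss (Eq. (2)).
        txt = model.encode_text(be_batch["captions"])
        img = model.encode_image(be_batch["images"])
        loss = triplet_loss_fn(txt, img)            # e.g., the in-batch hard-negative loss from §3.2
    loss.backward()
    return loss.detach()
```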
4 Experimental Setup
Our fine-tuning framework from §3 can be applied to any pretrained multi-modal Transformer. In all the experiments, we opt for state-of-the-art pretrained multi-modal models for monolingual (English) and multilingual contexts: OSCAR (Li et al., 2020b) and M3P (Ni et al., 2021), respectively.
OSCAR is a single-stream multi-modal Transformer (Bugliarello et al., 2021), with its weights initialized with those of the pretrained BERT Base model and then subsequently fine-tuned on multi-modal data (see §3). Unlike prior work, OSCAR additionally uses object labels of detected regions: Those labels serve as anchors for visual grounding, yielding large improvements over prior work. M3P is a single-stream multilingual multi-modal Transformer. Its weights are initialized with those of pretrained XLM-R Base and then fine-tuned on multi-modal data (see §3) as well as multilingual text-only data.
Training and Test Data.
We primarily experiment with the English image-text retrieval benchmarks MSCOCO and Flickr30k. They comprise 123k and 31.8k images, respectively, with 5 captions describing each image. MSCOCO provides two test benchmarks of sizes 1k and 5k, where the smaller set is a subset of the 5k test set. The standard Flickr30k test set consists of 1k images. In addition, we use the development set of Conceptual Captions (CC) (Sharma et al., 2018) for zero-shot evaluation, and also to construct a larger and more difficult test set (see §6). The original CC dev set contained 15.8k images, but currently only 14k of them are still available online.
For multilingual experiments, we use the standard Multi30k dataset (Elliott et al., 2016, 2017; Barrault et al., 2018), which extends Flickr30k with five German captions and one French and one Czech caption per image. Its test set also comprises 1k images.
The evaluation metric is the standard Recall-at-M (R@M): It reports the proportion of queries for which the relevant target item is present within the top-M retrieved items.
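For reference, R@M can be computed as follows (a minimal sketch over per-query rankings):

```python
def recall_at_m(ranked_ids: list, gold_ids: list, m: int = 10) -> float:
    """Fraction of queries whose relevant item appears among the top-m retrieved items.

    ranked_ids: one ranked list of item ids per query
    gold_ids:   the single relevant item id for each query
    """
    hits = sum(gold in ranking[:m] for ranking, gold in zip(ranked_ids, gold_ids))
    return hits / len(gold_ids)
```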
Training Setup and Hyperparameters.
Our setup largely follows Li et al. (2020b) and Ni et al. (2021) unless noted otherwise.10 We experiment with learning rates of 5e−5 and 2e−5, and with between 25k and 125k update steps. One batch contains 128 positive pairs plus 128 negative pairs for ℒCE. We use the AdamW optimizer (Loshchilov and Hutter, 2019) with linear learning rate decay without warmup, and a weight decay of 0.05. We take model checkpoints every 5k steps and select the checkpoint with the best development set performance.
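A sketch of the corresponding optimizer setup, assuming the scheduler helper from HuggingFace Transformers (the released code may be organized differently):

```python
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

def build_optimizer(model, lr: float = 5e-5, total_steps: int = 100_000):
    """AdamW with weight decay 0.05 and linear learning-rate decay without warmup."""
    optimizer = AdamW(model.parameters(), lr=lr, weight_decay=0.05)
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=0, num_training_steps=total_steps
    )
    return optimizer, scheduler
```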
4.1 Baselines and Model Variants
CE. Our main baselines are OSCAR and M3P models used in the standard CE setting, described in §3.1. We fully fine-tune the Transformer weights along with a randomly initialized classification head.11 At retrieval, we cross-encode each text-image combination and rank them according to the corresponding probability, see Eq. (1).
BE.
We rely on BHN negative sampling, finding that training for 30k steps, with a learning rate of 5e − 5, and with a margin α = 0.1 works best.12
Sep+Coop.
Joint+Coop.
We alternate between the two objective functions while training the joint model (see §3.4). We find that training for 60k update steps with a learning rate of 2e−5 (OSCAR) or 5e−5 (M3P) works best; the rest of the hyperparameters are the same as for the separately trained models. For retrieval, we again set k = 20. To demonstrate the benefits of cooperative retrieval, we also evaluate two non-cooperative variants originating from the joint model: Joint+CE uses the CE sub-model for single-step CE-style retrieval, while Joint+BE operates in the fully BE retrieval setup.
The underlying pretrained Transformer is denoted with a superscript: For example, Joint+CoopOSCAR denotes that 1) the pretrained OSCAR model is 2) fine-tuned with the joint variant from §3.4 and 3) then used in the cooperative retrieval setup.
5 Results and Discussion
The main results on English-only monolingual datasets Flickr30k and MSCOCO are summarized in Table 1, and the scores on multilingual Multi30k are provided in Table 2.
Table 1: Image retrieval (IR) and text retrieval (TR) results on the MSCOCO 5k and Flickr30k 1k test sets.

| Group | Model | MSCOCO IR R@1 | MSCOCO IR R@5 | MSCOCO IR R@10 | MSCOCO TR R@1 | MSCOCO TR R@5 | MSCOCO TR R@10 | Flickr30k IR R@1 | Flickr30k IR R@5 | Flickr30k IR R@10 | Flickr30k TR R@1 | Flickr30k TR R@5 | Flickr30k TR R@10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| G1. Pre-Transformer | VSE++ (Faghri et al., 2018) | 43.9 | 59.4 | 72.4 | 41.3 | 71.1 | 81.2 | 39.6 | 70.1 | 79.5 | 52.9 | 80.5 | 87.2 |
| | SCAN (Lee et al., 2018) | 38.6 | 69.3 | 80.4 | 50.4 | 82.2 | 90.0 | 48.6 | 77.7 | 85.2 | 67.9 | 90.3 | 95.8 |
| | PFAN (Wang et al., 2019b) | – | – | – | – | – | – | 50.4 | 78.7 | 86.1 | 70.0 | 91.8 | 95.0 |
| | SCG (Shi et al., 2019) | 39.2 | 68.0 | 81.3 | 56.6 | 84.5 | 92.0 | 49.3 | 76.4 | 85.6 | 71.8 | 90.8 | 94.8 |
| G2. Cross-Encoders (Inefficient for retrieval) | CEUNITER (Chen et al., 2020) | 48.4 | 76.7 | 85.0 | 63.3 | 87.0 | 93.1 | 72.5 | 92.4 | 96.1 | 85.9 | 97.1 | 98.8 |
| | CEUnicoder-VL (Li et al., 2020a) | 46.7 | 76.0 | 85.3 | 62.3 | 87.1 | 92.8 | 71.5 | 90.9 | 94.9 | 86.2 | 96.3 | 99.0 |
| | CEVILLA (Gan et al., 2020) | – | – | – | – | – | – | 74.7 | 92.9 | 95.8 | 86.6 | 97.9 | 99.2 |
| | CEOSCAR † (Li et al., 2020b) | 54.0 | 80.8 | 88.5 | 70.0 | 91.1 | 95.5 | – | – | – | – | – | – |
| | CEOSCAR ‡ | 52.6 | 80.0 | 88.1 | 69.3 | 90.7 | 95.3 | 75.9 | 93.3 | 96.6 | 88.5 | 98.5 | 99.2 |
| G3. Bi-Encoders (Efficient for retrieval) | VisualSparta (Lu et al., 2021) | 44.4 | 72.8 | 82.4 | – | – | – | 57.4 | 82.0 | 88.1 | – | – | – |
| | BEOSCAR | 52.2 | 80.2 | 88.0 | 66.9 | 90.1 | 95.0 | 72.0 | 91.0 | 94.7 | 84.7 | 97.1 | 98.7 |
| | Sep+CoopOSCAR | 52.8 | 80.5 | 88.5 | 70.2 | 91.6 | 95.0 | 76.0 | 93.0 | 95.0 | 88.7 | 98.3 | 99.2 |
| | Joint+CoopOSCAR | 54.7 | 81.3 | 88.9 | 70.8 | 91.0 | 95.2 | 76.4 | 93.6 | 96.2 | 89.4 | 97.7 | 99.0 |
| | Joint+CEOSCAR | 54.6 | 81.1 | 88.8 | 70.6 | 91.0 | 95.1 | 76.5 | 93.4 | 96.3 | 89.0 | 97.9 | 99.1 |
| | Joint+BEOSCAR | 52.5 | 80.0 | 88.0 | 66.7 | 90.0 | 95.0 | 71.6 | 91.5 | 95.0 | 86.3 | 96.8 | 98.6 |
Table 2: Results on the multilingual Multi30k benchmark.

| Type | Model | en | de | fr | cs | mean |
|---|---|---|---|---|---|---|
| G1. PT | MULE | 70.3 | 64.1 | 62.3 | 57.7 | 63.6 |
| | S-LIWE | 76.3 | 72.1 | 63.4 | 59.4 | 67.8 |
| | SMALR | 74.5 | 69.8 | 65.9 | 64.8 | 68.8 |
| G2. CE | CEM3P † | 86.7 | 82.2 | 73.5 | 70.2 | 78.2 |
| | CEM3P ‡ | 83.7 | 79.4 | 76.5 | 74.6 | 78.6 |
| G3. BE | BEM3P | 82.8 | 78.0 | 75.1 | 73.6 | 77.4 |
| | Sep+CoopM3P | 84.8 | 80.5 | 77.5 | 75.6 | 79.6 |
| | Joint+CoopM3P | 83.0 | 79.2 | 75.9 | 74.0 | 78.0 |
As expected, all Transformer-based approaches (groups G2 and G3) substantially outperform the pre-Transformer (PT) models (G1). While this has already been established in prior work for CE methods, our findings confirm that the same holds also for the efficient BE approach. This validates the effectiveness of Transformer architectures pretrained on large corpora for the retrieval task. R@1 scores with BE lag slightly behind the CE scores, but the respective R@10 scores are mostly on-par. This suggests that the BE approach is “coarser-grained”, and mostly relies on “global” interactions between the modalities. We investigate this conjecture further in §6.
This is also illustrated by an example in Figure 2. When dealing with related target items, CE’s cross-attention mechanism is able to explicitly attend over each token and image region, capturing additional (non-global) information relevant to the query. Although the high-level “global” concept of a skiing person is present in (almost) every example, the additional important information related to what the person is wearing is not adequately represented in the embeddings. Therefore, the BE (sub)model does not rank this instance at the top position. The CE (sub)model then directly compares the instances, identifying that clothing is important and reranks the target examples accordingly.
Most importantly, the relative comparison of R@1 versus R@10 scores empirically hints at the necessity of the retrieve-and-rerank cooperative approach: The BE approach efficiently retrieves 20 relevant examples, but the increased expressiveness of CE is required to refine the initially retrieved list. Moreover, the results in the cooperative setup even without joint training (Sep+CoopOSCAR and Sep+CoopM3P) demonstrate that the two models support each other: Slight improvements are observed over the pure CE, while offering massive efficiency boosts over CE. Our speculation is that the BE model filters out false positives, which in turn makes the CE model more robust.
The results of the Joint+Coop variant indicate that it is indeed possible to maintain retrieval efficiency with improved parameter efficiency: This approach performs on par with, or even slightly outperforms, the standard state-of-the-art CE models. The results verify that the two objective functions do not interfere with each other and that a single model is able to both embed and cross-encode. We note that the Joint+Coop variant offers the best trade-off between parameter and retrieval efficiency, achieving the peak scores on the monolingual MSCOCO and Flickr30k benchmarks, and very competitive results on the multilingual Multi30k benchmark.
6 Further Analysis
We now discuss a series of additional experiments that further profile and analyze the proposed multi-modal retrieval approaches, focusing especially on the multiple efficiency aspects related to fine-tuning and retrieval stages.
Retrieval Efficiency.
We empirically validate the time efficiency of our cooperative approaches for retrieval in an image search scenario (Table 3) and for evaluation on huge datasets (Table 4). To allow for a fair comparison between the approaches, we implement the entire retrieval pipeline—from model to nearest-neighbor search—in PyTorch without additional optimization such as multi-processing or optimized nearest-neighbor search libraries like FAISS (Johnson et al., 2021).
Our measurements confirm the efficiency of BEs in comparison to CEs. The cooperative approaches, which only need to cross-encode a constant number of items regardless of the collection size, are close in retrieval latency to BE for image search and remain feasible even for large datasets.
Larger Benchmarks.
The results in Table 1 indicate that current top-performing models achieve very high scores in absolute terms on the standard retrieval benchmarks. However, this is partially an artifact of the small image collections, which contain only a few thousand instances; one undesired effect is that it becomes increasingly difficult to identify significant differences between models. Unfortunately, the inefficiency of CE models, as empirically validated in Tables 3–4, has prevented evaluation with larger collections. However, the more efficient fully BE-based and Sep+Coop methods now enable evaluation on larger collections and in realistic scenarios.
Table 4: Evaluation time on benchmarks of 1k, 5k, and 100k images (a = years; * extrapolated).

| Model | 1k | 5k | 100k |
|---|---|---|---|
| BE | 5s | 30s | 7min |
| Sep/Joint+Coop | 5min | 25min | 8.5h |
| CE | 2h | 50h | 2.3a* |
We thus increase the benchmark size by merging test instances from different available evaluation sets. In particular, we construct a collection spanning 20k images: It blends the test sets of MSCOCO (5k instances) and Flickr30k (1k) with the development set of CC (14k). Note that we only augment the searchable collections; the query set with labels for each standardized evaluation task remains unchanged. In other words, the instances from the other datasets are used as distractors that increase the search space and make the retrieval task more difficult. The results thus provide insights into model performance in the target domain, as well as its robustness to out-of-distribution data. In Table 5 we now observe more salient performance differences, which were lacking with the smaller benchmarks. The pure BE-based approach now substantially underperforms the Sep/Joint+Coop variants, while Joint+Coop remains the best-scoring variant overall.
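Constructing such a benchmark is straightforward, as only the searchable collection grows; the sketch below keeps the queries and relevance judgments fixed (function and argument names are illustrative):

```python
def add_distractors(target_collection: list, distractor_collections: list, queries: list):
    """Enlarge the search space with out-of-distribution distractors; gold labels stay untouched."""
    enlarged = list(target_collection)
    for extra in distractor_collections:   # e.g., the CC dev set and the other benchmark's test images
        enlarged.extend(extra)
    return enlarged, queries               # ~20k items to search, same queries and relevance judgments
```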
Table 5: Retrieval results on the enlarged benchmarks, where the original test collections are augmented with distractor images (20k items in total).

| Benchmark | Model | IR R@1 | IR R@5 | IR R@10 | TR R@1 | TR R@5 | TR R@10 |
|---|---|---|---|---|---|---|---|
| Flickr30k 1k + CC 14k + MSCOCO 5k | BEOSCAR | 45.8 | 69.1 | 76.1 | 71.1 | 90.9 | 94.9 |
| | Sep+CoopOSCAR | 55.5 | 75.8 | 80.1 | 80.5 | 93.8 | 95.4 |
| | Joint+CoopOSCAR | 55.9 | 77.5 | 82.9 | 81.0 | 92.9 | 94.9 |
| MSCOCO 5k + CC 14k + Flickr30k 1k | BEOSCAR | 40.6 | 68.5 | 78.1 | 62.5 | 87.7 | 93.3 |
| | Sep+CoopOSCAR | 43.7 | 72.1 | 81.2 | 68.2 | 90.4 | 94.3 |
| | Joint+CoopOSCAR | 45.6 | 73.0 | 82.3 | 69.0 | 90.3 | 94.7 |
Zero-Shot Performance.
Relying on multi-modal and multilingual representations fine-tuned for cross-modal retrieval, the proposed methods should also generalize to new unseen captions and images beyond the dataset used for fine-tuning. Therefore, we directly transfer the model fine-tuned on one dataset to the test data of another dataset (e.g., fine-tune on MSCOCO data, test on Flickr30k). As baselines, we use the reported zero-shot results of UNITER (Chen et al., 2020) for Flickr30k14 and we also evaluate the CLIP model.15
The zero-shot results in Table 6 reveal that the CE variant slightly outperforms the other approaches when transferring from Flickr30k to MSCOCO, while Joint+CoopOSCAR remains competitive. However, in the opposite direction, we achieve considerable performance gains with the Joint+CoopOSCAR variant. On CC, all variants considerably underperform CLIP; we speculate that this might be due to the more diverse set of images included in CC, including illustrations, which exist in neither MSCOCO nor Flickr30k. This gives CLIP a considerable advantage on CC due to its exposure to massive amounts of data during pretraining.
Table 6: Zero-shot retrieval results on MSCOCO (5k), Flickr30k (1k), and the CC dev set (14k). The first row reports in-domain Joint+CoopOSCAR results as a reference.

| Model | MSCOCO IR R@1 | MSCOCO IR R@5 | MSCOCO IR R@10 | MSCOCO TR R@1 | MSCOCO TR R@5 | MSCOCO TR R@10 | Flickr30k IR R@1 | Flickr30k IR R@5 | Flickr30k IR R@10 | Flickr30k TR R@1 | Flickr30k TR R@5 | Flickr30k TR R@10 | CC IR R@1 | CC IR R@5 | CC IR R@10 | CC TR R@1 | CC TR R@5 | CC TR R@10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Joint+CoopOSCAR (in-domain) | 54.7 | 81.3 | 88.9 | 70.8 | 91.0 | 95.2 | 76.4 | 93.6 | 96.2 | 89.4 | 97.7 | 99.0 | – | – | – | – | – | – |
| CEUNITER | – | – | – | – | – | – | 66.2 | 88.4 | 92.9 | 80.7 | 95.7 | 98.0 | – | – | – | – | – | – |
| CEOSCAR | 47.8 | 75.7 | 84.6 | 61.8 | 86.2 | 92.0 | 67.2 | 88.5 | 92.7 | 81.0 | 95.5 | 97.8 | – | – | – | – | – | – |
| CLIP | 30.4 | 56.1 | 66.9 | 50.1 | 74.8 | 83.6 | 61.1 | 85.9 | 91.8 | 81.9 | 95.0 | 97.5 | 30.8 | 52.7 | 61.3 | 32.1 | 53.9 | 63.0 |
| BEOSCAR | 37.6 | 64.4 | 75.0 | 52.0 | 78.1 | 86.3 | 63.3 | 86.4 | 91.6 | 78.2 | 94.0 | 97.3 | 13.8 | 29.4 | 37.9 | 14.4 | 29.6 | 37.6 |
| Sep+CoopOSCAR | 47.6 | 73.9 | 81.2 | 62.8 | 83.8 | 88.7 | 67.6 | 89.0 | 93.1 | 82.4 | 96.3 | 98.2 | 16.8 | 34.3 | 41.9 | 17.0 | 33.5 | 41.5 |
| Joint+CoopOSCAR | 47.6 | 74.5 | 82.6 | 63.9 | 85.7 | 91.0 | 70.0 | 90.2 | 94.1 | 83.7 | 96.8 | 97.9 | 16.7 | 34.7 | 43.6 | 17.5 | 34.6 | 43.5 |
Multilingual zero-shot results, where we fine-tune on the English Multi30k captions and test on the captions in the other languages, are shown in Table 7. Cooperative approaches again excel; the highest scores are achieved by Sep+CoopM3P.
Table 7: Multilingual zero-shot results on Multi30k (fine-tuned on the English captions only); Avg is computed over the zero-shot languages (de, fr, cs).

| Model | en | de | fr | cs | Avg |
|---|---|---|---|---|---|
| CEM3P (Ni et al., 2021) | 86.0 | 48.8 | 39.4 | 38.8 | 42.3 |
| BEM3P | 81.3 | 52.4 | 49.7 | 39.6 | 47.2 |
| CEM3P | 84.2 | 52.6 | 49.6 | 33.4 | 45.2 |
| Sep+CoopM3P | 84.4 | 55.6 | 52.2 | 39.8 | 49.2 |
| Joint+CoopM3P | 83.5 | 54.2 | 48.4 | 39.4 | 47.3 |
Sample Efficiency.
We also analyze how the amount of image-text data for fine-tuning impacts retrieval performance; we thus sample smaller datasets from the full MSCOCO training set, covering 1k, 10k, and 50k images with their captions (5 per image). The results in Figure 3 reveal that BE-based approaches are in general considerably less sample-efficient than cross-encoders. They particularly struggle in the lowest-data scenario with only 1k images available; this is also reflected in the lower performance of Joint+Coop in the 1k setup. A reason behind the more effective adaptation of CE to low-data regimes might be its richer “input consumption”: With 1k images and 5k captions, CE runs a whole grid of 1k×5k items through its network, which provides more learning signal when less data is available. In contrast, BE-based approaches are expected to learn effective encoders of both modalities separately, based solely on 1k images and 5k captions, without any cross-modal interaction.
Parameter Efficiency.
We also provide a simple parameter efficiency analysis by initializing the models with pretrained OSCAR weights, but only passing the representations through every second layer, effectively halving the total number of Transformer parameters. The results are shown in Figure 4. With the “halved” model, all approaches retain around 90% of the performance of the full Transformer. Overall, the Joint+Coop method again achieves the highest scores. This suggests that the proposed fine-tuning approaches are also applicable to smaller models, with similar relative trends in retrieval results.
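A minimal sketch of this ablation, assuming the encoder exposes its Transformer layers as an `nn.ModuleList` (as in typical BERT-style implementations):

```python
import torch.nn as nn

def keep_every_second_layer(layers: nn.ModuleList) -> nn.ModuleList:
    """Route representations through every second layer only, roughly halving the encoder parameters."""
    return nn.ModuleList([layer for i, layer in enumerate(layers) if i % 2 == 0])
```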
Retrieving Top-k.
We analyze different values for k for top-k retrieval of the BE component in Table 8. Selecting small values for k significantly decreases the retrieval latency, as fewer instances need to be cross-encoded. However, selecting k values that are too small can come at a cost of precision, as the true positive instance might not be among the top-k retrieved instances of the BE model. In our experiments, k = 20 achieves the best trade-off between precision and retrieval latency.
Table 8: Effect of the number of BE-retrieved candidates k on cooperative retrieval.

| Model | k | MSCOCO-1k IR R@1 | MSCOCO-1k IR R@5 | MSCOCO-1k IR R@10 | MSCOCO-1k TR R@1 | MSCOCO-1k TR R@5 | MSCOCO-1k TR R@10 | MSCOCO-5k IR R@1 | MSCOCO-5k IR R@5 | MSCOCO-5k IR R@10 | MSCOCO-5k TR R@1 | MSCOCO-5k TR R@5 | MSCOCO-5k TR R@10 | Flickr30k IR R@1 | Flickr30k IR R@5 | Flickr30k IR R@10 | Flickr30k TR R@1 | Flickr30k TR R@5 | Flickr30k TR R@10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Sep+Coop | 10 | 75.4 | 94.8 | 97.2 | 88.4 | 98.8 | 99.7 | 53.2 | 80.3 | 86.6 | 71.1 | 90.9 | 94.3 | 75.9 | 92.2 | 93.4 | 89.2 | 97.8 | 98.4 |
| | 20 | 75.3 | 95.2 | 98.1 | 87.9 | 98.9 | 99.8 | 52.8 | 80.5 | 88.5 | 70.2 | 91.6 | 95.0 | 76.0 | 93.0 | 95.0 | 88.7 | 98.3 | 99.2 |
| | 50 | 75.2 | 95.0 | 98.2 | 87.9 | 99.1 | 99.8 | 52.6 | 80.1 | 88.4 | 70.1 | 91.4 | 95.5 | 75.9 | 93.4 | 96.3 | 88.9 | 98.4 | 99.4 |
| Joint+Coop | 10 | 75.4 | 95.5 | 97.8 | 88.0 | 98.8 | 99.9 | 54.8 | 81.2 | 88.0 | 70.9 | 91.2 | 95.0 | 76.5 | 93.2 | 95.0 | 88.9 | 97.3 | 98.6 |
| | 20 | 75.5 | 95.4 | 98.2 | 88.1 | 98.6 | 99.5 | 54.7 | 81.3 | 88.9 | 70.8 | 91.0 | 95.2 | 76.4 | 93.6 | 96.2 | 89.4 | 97.7 | 99.0 |
| | 50 | 75.4 | 95.4 | 98.3 | 88.2 | 98.4 | 99.4 | 54.6 | 81.2 | 88.8 | 70.7 | 91.1 | 95.3 | 76.5 | 93.5 | 96.5 | 89.1 | 98.0 | 98.9 |
Combining Ranking.
We evaluate the ranking score combination of the two components, Joint+BE and Joint+CE, in Table 9. We combine the ranking of the bi-encoder submodel and the cross-encoder submodel by summing their scores in two different ways:
Table 9: Combining BE and CE ranking scores during reranking on Flickr30k (“–” denotes the default setup that relies on the CE scores only).

| Model | Sum | λ | IR R@1 | IR R@5 | IR R@10 | TR R@1 | TR R@5 | TR R@10 |
|---|---|---|---|---|---|---|---|---|
| Sep+Coop | – | – | 76.0 | 93.0 | 95.0 | 88.7 | 98.3 | 99.2 |
| | add | 0.1 | 76.0 | 92.7 | 94.8 | 86.4 | 98.7 | 99.2 |
| | | 0.5 | 75.7 | 92.6 | 94.7 | 85.9 | 98.5 | 99.2 |
| | | 0.9 | 74.5 | 92.5 | 94.7 | 85.1 | 98.3 | 99.2 |
| | norm_add | 0.1 | 70.8 | 90.2 | 93.8 | 86.2 | 98.5 | 99.2 |
| | | 0.5 | 70.7 | 90.3 | 93.7 | 85.4 | 98.4 | 99.2 |
| | | 0.9 | 70.3 | 90.1 | 93.7 | 83.8 | 97.6 | 98.8 |
| Joint+Coop | – | – | 76.4 | 93.6 | 96.2 | 89.4 | 97.7 | 99.0 |
| | add | 0.1 | 76.7 | 93.3 | 95.8 | 88.5 | 98.0 | 99.1 |
| | | 0.5 | 75.6 | 93.1 | 95.5 | 87.2 | 97.8 | 99.1 |
| | | 0.9 | 74.6 | 92.8 | 95.5 | 87.3 | 97.8 | 99.1 |
| | norm_add | 0.1 | 72.8 | 92.0 | 95.2 | 87.6 | 97.9 | 99.2 |
| | | 0.5 | 72.5 | 92.0 | 95.2 | 87.3 | 97.9 | 99.0 |
| | | 0.9 | 72.3 | 91.8 | 95.2 | 86.4 | 97.0 | 99.0 |
(1) We directly sum the raw scores, weighting the bi-encoder score e with λ and the cross-encoder score c with (1 − λ): add_λ(e,c) = λ·e + (1 − λ)·c. (2) We separately 0-1-normalize the scores for the top-k candidates of the bi- and cross-encoder before combining them for norm_add_λ(e,c), which is defined analogously to add_λ(e,c).
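A sketch of the two combination strategies, assuming (as in the definition above) that λ weights the bi-encoder score; function and variable names are illustrative:

```python
import torch

def combine_scores(be_scores: torch.Tensor, ce_scores: torch.Tensor, lam: float, normalize: bool) -> torch.Tensor:
    """Combine BE and CE scores for the same top-k candidates (add / norm_add, sketch)."""
    if normalize:  # norm_add: 0-1-normalize each score list over the k candidates first
        be_scores = (be_scores - be_scores.min()) / (be_scores.max() - be_scores.min() + 1e-8)
        ce_scores = (ce_scores - ce_scores.min()) / (ce_scores.max() - ce_scores.min() + 1e-8)
    return lam * be_scores + (1 - lam) * ce_scores  # candidates are then reranked by this combined score
```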
However, we find that relying solely on the cross-encoder achieves the best results. This suggests that the scores of the bi-encoder are useful in the “global” scope, over the full collection, for retrieving strong candidates, but that in the “local” scope of the top-k candidates the cross-encoder is superior.
7 Conclusion
We proposed a novel framework that converts pretrained multi-modal Transformers into effective and efficient cross-modal retrieval models. The framework is applicable to any pretrained model and combines the efficiency of bi-encoder (BE) approaches with the accuracy of computationally more demanding cross-encoder (CE) approaches. Their synergistic effect at retrieval is achieved through a cooperative retrieve-and-rerank regime, where the initial retrieval from a large collection is performed via the efficient BE approach, followed by an accuracy-driven reranking step via a CE model. Moreover, we introduced a parameter-efficient joint fine-tuning regime that blends BE and CE into a single model with shared weights. Our results with state-of-the-art pretrained models across a range of standard monolingual and multilingual cross-modal retrieval tasks and setups validated the strong performance of such cooperative and joint approaches. At the same time, we demonstrated their retrieval efficiency, which makes them viable in realistic retrieval scenarios with large collections. In future work, we will put more focus on zero-shot and few-shot retrieval scenarios, and expand the approach to more languages, modalities, and tasks.
Acknowledgments
The Ubiquitous Knowledge Processing Lab acknowledges the financial support of the German Federal Ministry of Education and Research (BMBF) under the promotional reference 13N15897 (MISRIK), the LOEWE initiative (Hesse, Germany) within the emergenCITY center, and the German Research Foundation (DFG) as part of the UKP-SQuARE project (grant GU 798/29-1). The work of Ivan Vulić has been supported by the ERC Consolidator grant LEXICAL (no. 648909), the ERC PoC grant MultiConvAI (no. 957356), and a research donation from Huawei.
We thank Kevin Stowe and Christopher Klamm for insightful feedback and suggestions on a draft of this paper and we thank the TACL reviewers and Action Editor for their valuable feedback and comments during the editing process.
Notes
We release the code and model weights at https://github.com/UKPLab/MMT-Retrieval.
Also frequently referred to as dual-encoder.
Consequently, it would be impossible to evaluate these CE approaches on newer larger benchmarks: e.g., the (extrapolated) evaluation time on a benchmark spanning 100,000 images exceeds 2 years with a single GPU.
For example, considering a query “two dogs and one cat”, the model is unable to match the numbers to the animals, likely yielding worse retrieval results.
Following Reimers and Gurevych (2019), we opt for mean pooling as the final “aggregated” embedding; it outperformed another standard variant, which uses the [CLS] token, in our preliminary experiments.
Note that pre-computing the embedding does come with increased storage and memory demands; e.g., with a base Transformer architecture this requires an additional ≈ 3kB of memory for each embedding. A corpus of 1M images would amount to ≈ 3GB of required storage.
Retrieval time for 1M images: 94ms (GPU), 13s (CPU).
Unlike Li et al. (2020b), we do not use object tags as additional input, as preliminary experiments suggested no improvement with object tags.
Training for 100k steps and a learning rate of 2e − 5 (OSCAR) or 5e − 5 (M3P) performed best.
We also experimented with Approximate-nearest-neighbor Negative Contrastive Estimation (ANCE) (Xiong et al., 2021); however, it did not yield performance benefits.
We provide an ablation study of different k values in §6. We have also experimented with training a CE model using hard negative samples from a pretrained BE model. However, the CE model is able to easily overfit on those negative examples, resulting in inferior performance.
They do not report results for MSCOCO.
We use the ViT-B/32 model variant. Retrieval results from Radford et al. (2021) Table 13 use the (larger) ViT-L/14 variant that has not been released to the public.
References
Author notes
Both authors contributed equally to this work.
Action Editor: Jimmy Lin