Table 3: Comparison of our proposed baseline to our multimodal transformer model (MMT).

                 Flickr30k (ZS)   Flickr30k (FT)   MSCOCO (ZS)
                 R1     R10       R1     R10       R1     R10
Baseline         25.4   64.9      40.9   81.8      13.0   44.5
− contrastive    21.7   61.0      39.0   80.6      10.2   40.9
+ BERT PT        24.8   65.1      39.9   79.9      12.7   43.1

MMT              41.9   79.0      59.1   91.5      21.3   57.9
ViLBERT          31.9   72.8      58.2   91.5      −      −