Comparison of our proposed baseline to our multimodal transformer model (MMT).
Model | Flickr30k ZS R1 | Flickr30k ZS R10 | Flickr30k FT R1 | Flickr30k FT R10 | MSCOCO ZS R1 | MSCOCO ZS R10 |
---|---|---|---|---|---|---|
Baseline | 25.4 | 64.9 | 40.9 | 81.8 | 13.0 | 44.5 |
− contrastive | 21.7 | 61.0 | 39.0 | 80.6 | 10.2 | 40.9 |
+ BERT PT | 24.8 | 65.1 | 39.9 | 79.9 | 12.7 | 43.1 |
MMT | 41.9 | 79.0 | 59.1 | 91.5 | 21.3 | 57.9 |
ViLBERT | 31.9 | 72.8 | 58.2 | 91.5 | − | − |