Table 4: MMT trained with coattention (Co), merged attention (Merge), language-query attention (L-12 and L-24), image-query attention (I-12 and I-24), and modality-specific attention (Mod. Spec.). The number in each language-query and image-query label is the number of attention heads.

R@1     Co     Merge  Asym. Attn.                 Mod. Spec.
                      L-12   I-12   L-24   I-24
F. ZS   41.9   40.0   24.4   31.3   33.6   31.6   16.9
F. FT   59.1   57.0   45.1   48.4   52.5   46.3   15.4
M. ZS   21.3   19.6   13.8   16.1   17.0   16.0   8.0