Table 5: Correlation on the concrete subsets (concr. ≥ 4) of the five evaluation benchmarks. Results in bold are the highest in the column. Results underlined are the highest among LV multimodal models.

model         input  concr.  Spearman ρ correlation (layer)
                             RG65         WS353        SL999        MEN          SVERB
BERT          L      ≥ 4     0.8321 (2)   0.6138 (1)   0.4864 (0)   0.7368 (2)   0.1354 (3)

LXMERT        LV     ≥ 4     0.8648 (27)  0.6606 (27)  0.5749 (21)  0.7862 (33)  0.1098 (21)
UNITER        LV     ≥ 4     0.8148 (18)  0.5943 (2)   0.4975 (2)   0.7755 (20)  0.1215 (10)
ViLBERT       LV     ≥ 4     0.8374 (20)  0.5558 (14)  0.5534 (16)  0.7910 (26)  0.1529 (14)
VisualBERT    LV     ≥ 4     0.8269 (2)   0.6043 (2)   0.4971 (4)   0.7727 (20)  0.1310 (10)

Vokenization  LV     ≥ 4     0.8708 (9)   0.6133 (3)   0.5051 (9)   0.8150 (10)  0.1390 (9)

# pairs (%)                  44 (68%)     121 (40%)    396 (41%)    1917 (65%)   210 (7%)
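
The numbers in the table are Spearman ρ correlations computed only over word pairs that pass the concreteness filter (concr. ≥ 4). The following is a minimal sketch of that kind of evaluation, not the authors' code. It assumes that a pair is kept when both of its words have a concreteness rating at or above the threshold; the source states only "concr. ≥ 4", so this pair-level rule is an assumption. The function names (`average_ranks`, `spearman`, `concrete_subset_correlation`) and all scores and ratings in the toy example are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rho: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    # Undefined (division by zero) if either input is constant.
    return cov / (var_x * var_y) ** 0.5


def concrete_subset_correlation(pairs, concreteness, threshold=4.0):
    """pairs: (word1, word2, human_score, model_score) tuples.

    ASSUMPTION: a pair counts as concrete when BOTH words are rated at or
    above the threshold; words missing from the ratings are dropped.
    Returns (rho, number_of_pairs_kept).
    """
    kept = [
        p for p in pairs
        if concreteness.get(p[0], 0.0) >= threshold
        and concreteness.get(p[1], 0.0) >= threshold
    ]
    human = [p[2] for p in kept]
    model = [p[3] for p in kept]
    return spearman(human, model), len(kept)


# Toy data with made-up scores and ratings, for illustration only.
toy_pairs = [
    ("cat", "dog", 8.0, 0.9),
    ("car", "truck", 7.0, 0.8),
    ("apple", "stone", 1.0, 0.1),
    ("idea", "hope", 5.0, 0.2),
    ("cat", "idea", 2.0, 0.7),
]
toy_concreteness = {
    "cat": 4.9, "dog": 4.8, "car": 4.9, "truck": 4.8,
    "apple": 5.0, "stone": 4.7, "idea": 1.6, "hope": 1.3,
}
rho, n_kept = concrete_subset_correlation(toy_pairs, toy_concreteness)
print(rho, n_kept)
```

In a setup like the one the table reports, the model score would be a similarity between the two words' representations taken from a given layer, and the layer in parentheses would be the one that yields the highest ρ; that selection step is not shown here.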