Correlation on the concrete subsets (concr. ≥ 4) of the five evaluation benchmarks. Results in bold are the highest in the column. Results underlined are the highest among LV multimodal models.
model | input | concr. | Spearman ρ correlation (layer) |
 | | | RG65 | WS353 | SL999 | MEN | SVERB |
BERT | L | ≥ 4 | 0.8321 (2) | 0.6138 (1) | 0.4864 (0) | 0.7368 (2) | 0.1354 (3) |
LXMERT | LV | ≥ 4 | 0.8648 (27) | 0.6606 (27) | 0.5749 (21) | 0.7862 (33) | 0.1098 (21) |
UNITER | LV | ≥ 4 | 0.8148 (18) | 0.5943 (2) | 0.4975 (2) | 0.7755 (20) | 0.1215 (10) |
ViLBERT | LV | ≥ 4 | 0.8374 (20) | 0.5558 (14) | 0.5534 (16) | 0.7910 (26) | 0.1529 (14) |
VisualBERT | LV | ≥ 4 | 0.8269 (2) | 0.6043 (2) | 0.4971 (4) | 0.7727 (20) | 0.1310 (10) |
Vokenization | LV | ≥ 4 | 0.8708 (9) | 0.6133 (3) | 0.5051 (9) | 0.8150 (10) | 0.1390 (9) |
# pairs (%) | | | 44 (68%) | 121 (40%) | 396 (41%) | 1917 (65%) | 210 (7%) |
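As a rough illustration of how one cell of the table could be computed, the sketch below filters word pairs by a concreteness threshold and then computes Spearman ρ between human ratings and model similarities. It is not the authors' evaluation code. The function names, the toy data, and the assumption that each pair carries a single concreteness score are all hypothetical; the source does not say how pair-level concreteness is derived or how the reported layer is selected.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman rho, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def concrete_subset_rho(pairs, threshold=4.0):
    """pairs: (human_score, model_similarity, concreteness) triples.

    Assumption: one concreteness value per pair; keeps pairs at or above
    the threshold and returns (rho, number_of_pairs_kept).
    """
    kept = [(h, m) for h, m, c in pairs if c >= threshold]
    human, model = zip(*kept)
    return spearman_rho(human, model), len(kept)


# Toy data, invented for illustration only (not taken from any benchmark).
toy_pairs = [
    (9.0, 0.91, 4.8),
    (7.5, 0.40, 4.2),
    (3.0, 0.55, 4.5),
    (1.0, 0.10, 4.9),
    (8.0, 0.20, 2.1),  # dropped: concreteness below 4
]
rho, n_pairs = concrete_subset_rho(toy_pairs)
print(f"rho = {rho:.4f} on {n_pairs} pairs")  # rho = 0.8000 on 4 pairs
```

In a per-layer setting like the one the table reports, a function of this kind would presumably be called once per layer, with the layer index recorded alongside the resulting ρ.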