Minimal pair accuracies for nouns and verbs for different model ablations. W2V Finet: wav2vec2 module finetuned; A Pretr: Audio encoder pretrained; V Pretr: Video encoder pretrained; Tmp Enc: Video encoder with temporal information (not static); Tmp Frames: Video frames in correct temporal order (not scrambled). Mean and standard deviation calculated over bootstrapped scores (100 re-samples), pooled over 4 training runs.
ID | W2V Finet. | Jitter | V Pretr. | A Pretr. | Tmp Enc. | Tmp Frames | Nouns | Verbs |
---|---|---|---|---|---|---|---|---|
0 | | | | | | | 0.80 ± 0.02 | 0.79 ± 0.02 |
1 | | | | | | | 0.72 ± 0.01 | 0.71 ± 0.01 |
2 | | | | | | | 0.72 ± 0.02 | 0.78 ± 0.01 |
3 | | | | | | | 0.56 ± 0.07 | 0.56 ± 0.07 |
4 | | | | | | | 0.69 ± 0.02 | 0.69 ± 0.01 |
5 | | | | | | | 0.75 ± 0.01 | 0.75 ± 0.01 |
6 | | | | | | | 0.78 ± 0.01 | 0.76 ± 0.01 |
7 | | | | | | | 0.79 ± 0.02 | 0.78 ± 0.02 |
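
The caption reports each accuracy as a mean and standard deviation over bootstrapped scores (100 re-samples), pooled over 4 training runs. The exact pooling procedure is not spelled out, so the following is only a minimal sketch of one plausible reading: each run's per-item outcomes are re-sampled with replacement 100 times, the resulting accuracies from all runs are pooled, and the mean and standard deviation are taken over the pooled scores. The function names `bootstrap_scores` and `pooled_mean_std`, the item count of 200, and the simulated outcomes are all hypothetical and do not come from the paper.

```python
import random
import statistics


def bootstrap_scores(correct, n_resamples=100, rng=None):
    """Return one accuracy per bootstrap re-sample of a list of 0/1 outcomes."""
    if rng is None:
        rng = random.Random(0)
    n = len(correct)
    scores = []
    for _ in range(n_resamples):
        # Draw n items with replacement and score the re-sample.
        sample = rng.choices(correct, k=n)
        scores.append(sum(sample) / n)
    return scores


def pooled_mean_std(runs, n_resamples=100, seed=0):
    """Pool bootstrap scores from several training runs, then summarise them.

    Assumption: pooling means concatenating the per-run bootstrap scores
    before computing the mean and (sample) standard deviation.
    """
    rng = random.Random(seed)
    pooled = []
    for correct in runs:
        pooled.extend(bootstrap_scores(correct, n_resamples, rng))
    return statistics.mean(pooled), statistics.stdev(pooled)


# Hypothetical data: 4 runs, each with 200 minimal pairs scored 1 (correct) or 0.
data_rng = random.Random(42)
runs = [[1 if data_rng.random() < 0.8 else 0 for _ in range(200)] for _ in range(4)]

mean, std = pooled_mean_std(runs)
print(f"{mean:.2f} ± {std:.2f}")
```

Under this reading, the reported standard deviation reflects both the sampling variability of the test items and the variation between training runs, which may explain why a less stable configuration (such as ID 3) shows a noticeably larger spread.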