Table 3: Minimal pair accuracies for nouns and verbs for different model ablations. W2V Finet: wav2vec2 module finetuned; A Pretr: Audio encoder pretrained; V Pretr: Video encoder pretrained; Tmp Enc: Video encoder with temporal information (not static); Tmp Frames: Video frames in correct temporal order (not scrambled). Mean and standard deviation calculated over bootstrapped scores (100 re-samples), pooled over 4 training runs.

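| ID | W2V Finet. | Jitter | V Pretr. | A Pretr. | Tmp Enc. | Tmp Frames | Nouns | Verbs |
|---|---|---|---|---|---|---|---|---|
|  |  |  |  |  |  |  | 0.80 ± 0.02 | 0.79 ± 0.02 |
|  |  |  |  |  |  |  | 0.72 ± 0.01 | 0.71 ± 0.01 |
|  |  |  |  |  |  |  | 0.72 ± 0.02 | 0.78 ± 0.01 |
|  |  |  |  |  |  |  | 0.56 ± 0.07 | 0.56 ± 0.07 |
|  |  |  |  |  |  |  | 0.69 ± 0.02 | 0.69 ± 0.01 |
|  |  |  |  |  |  |  | 0.75 ± 0.01 | 0.75 ± 0.01 |
|  |  |  |  |  |  |  | 0.78 ± 0.01 | 0.76 ± 0.01 |
|  |  |  |  |  |  |  | 0.79 ± 0.02 | 0.78 ± 0.02 |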
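
The caption reports each score as a mean and standard deviation over bootstrapped scores (100 re-samples), pooled over 4 training runs. The exact procedure is not spelled out here, so the following is only a minimal sketch of one plausible reading: for each run, the per-item correctness values are re-sampled with replacement 100 times, an accuracy is computed per re-sample, and the resulting scores from all runs are pooled before taking the mean and standard deviation. The function name `bootstrap_accuracy`, its parameters, and the toy data are hypothetical and not taken from the source.

```python
import numpy as np


def bootstrap_accuracy(correct_per_run, n_resamples=100, seed=0):
    """Pooled bootstrap mean and std of accuracy (assumed procedure).

    correct_per_run: one sequence per training run, holding 1/0 (or
    True/False) for whether each minimal pair was scored correctly.
    """
    rng = np.random.default_rng(seed)
    pooled = []
    for correct in correct_per_run:
        correct = np.asarray(correct, dtype=float)
        n = correct.size
        # Draw n_resamples index sets of size n, with replacement.
        idx = rng.integers(0, n, size=(n_resamples, n))
        # One accuracy per re-sample for this run.
        pooled.append(correct[idx].mean(axis=1))
    # Pool the bootstrapped scores of all runs (e.g. 4 runs x 100 = 400).
    pooled = np.concatenate(pooled)
    return pooled.mean(), pooled.std()


if __name__ == "__main__":
    # Toy data only: 4 synthetic "runs" of 500 minimal pairs each.
    toy_rng = np.random.default_rng(1)
    toy_runs = [toy_rng.random(500) < 0.8 for _ in range(4)]
    mean, std = bootstrap_accuracy(toy_runs)
    print(f"{mean:.2f} ± {std:.2f}")
```

Under this reading, the reported standard deviation reflects both item-level sampling noise within a run and variation between training runs, since the scores are pooled before the spread is computed.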