CiC-Language Generalization. NES on real-world language from the Chairs-in-Context (CiC) dataset. *SG architectures from Achlioptas et al. (2019) are the previously reported state-of-the-art methods. NES+ grounds sub-events on the feature grid input. The -SN suffix indicates features pre-trained on ShapeNet.
Method | Input | Listener Acc. |
---|---|---|
Majority | N/A | 0.333 |
*SG-NoAttn | VGG16-SN | 0.812 ± 0.008 |
*SG-Attn | VGG16-SN | 0.817 ± 0.008 |
LSTM-Attn | VGG16-SN | 0.731 ± 0.012 |
PoE | VGG16-SN | 0.752 ± 0.009 |
NMN | VGG16-SN | 0.763 ± 0.023 |
MAC | VGG16-SN | 0.818 ± 0.013 |
NES | VGG16 | 0.842 ± 0.005 |
NES | VGG16-SN | 0.856 ± 0.005 |
NES | Res101 | 0.853 ± 0.011 |
NES+ | Res101 | 0.870 ± 0.009 |