Figure 3:
Intermediate ConvRNNs best explain the object solution times (OST) of IT across images. (a) Comparing to primate OSTs. Mapping model layers to time points: to compare to primate IT object solution times (namely, the first time at which the neural decode accuracy for each image reaches the level of the pooled primate behavioral accuracy), we first need to define object solution times for the models. This involves identifying the IT-preferred layer(s) via a standard linear mapping to temporally averaged IT responses. Choosing a temporal mapping gradation: if the model is feedforward, these IT-preferred layer(s) are then mapped to 10 ms time bins from 70 to 260 ms in either a uniform or graded fashion; for ConvRNNs, the temporal mapping is always one-to-one with these 10 ms time bins. (b) Defining model OSTs. Once the temporal mapping is defined, we train a linear SVM at each 10 ms model time bin and compute the classifier's d' (black dots for a given example image). The first time bin at which the model d' matches the primate's accuracy is defined as the Model OST for that image (obtained via linear interpolation); this is the same procedure previously used (Kar et al., 2019) to determine the ground-truth IT OST (Primate OST, vertical dotted line). (c) Proper choices of recurrence best match IT OSTs. For each of the 1320 images, the mean and s.e.m. are computed across the train/test splits (N=10) in which that image was in the test set. The Spearman correlation is computed with the IT object solution times (analogously computed from the IT population responses) across the set of images solved by both the given model and IT, which constitutes the Fraction of IT Solved Images on the x-axis.
We start with either a shallow base feedforward model consisting of 5 convolutional layers and 1 readout layer (BaseNet in blue) or an intermediate-depth variant with 10 feedforward layers and 1 readout layer (BaseNet in purple), detailed in section A.2.1. Into these base feedforward models we embed recurrent circuits, yielding Shallow ConvRNNs and Intermediate ConvRNNs, respectively.
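The per-image model OST in panel (b) and the comparison in panel (c) can be sketched as follows. This is a minimal illustration, not the authors' code: the d' trajectories, the primate threshold value, and the helper names (`model_ost`, `ost_match`) are assumptions introduced here for clarity.

```python
import numpy as np
from scipy.stats import spearmanr

TIME_BINS = np.arange(70, 270, 10)  # 10 ms bins spanning 70-260 ms

def model_ost(dprimes, primate_dprime, time_bins=TIME_BINS):
    """First time at which the model's decode d' reaches the primate
    behavioral accuracy, linearly interpolated between the bracketing
    10 ms bins; NaN if the model never solves the image."""
    dprimes = np.asarray(dprimes, dtype=float)
    for i, d in enumerate(dprimes):
        if d >= primate_dprime:
            if i == 0:
                return float(time_bins[0])
            t0, t1 = time_bins[i - 1], time_bins[i]
            d0, d1 = dprimes[i - 1], dprimes[i]
            return float(t0 + (t1 - t0) * (primate_dprime - d0) / (d1 - d0))
    return np.nan  # image not solved by the model

def ost_match(model_osts, it_osts):
    """Spearman correlation over images solved by both model and IT,
    plus the fraction of IT-solved images the model also solves
    (the x-axis of panel (c))."""
    model_osts, it_osts = np.asarray(model_osts), np.asarray(it_osts)
    it_solved = ~np.isnan(it_osts)
    both = it_solved & ~np.isnan(model_osts)
    rho, _ = spearmanr(model_osts[both], it_osts[both])
    return rho, both.sum() / it_solved.sum()
```

For example, a d' trajectory of [0.0, 1.0, 2.0] over bins at 70, 80, and 90 ms crosses a threshold of 1.5 between the last two bins, giving an interpolated OST of 85 ms.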
