Abstract
Color and form information can be decoded in every region of the human ventral visual hierarchy, and at every layer of many convolutional neural networks (CNNs) trained to recognize objects, but how does the coding strength of these features vary over processing? Here, we characterize for these features both their absolute coding strength—how strongly each feature is represented independent of the other feature—and their relative coding strength—how strongly each feature is encoded relative to the other, which could constrain how well a feature can be read out by downstream regions across variation in the other feature. To quantify relative coding strength, we define a measure called the form dominance index that compares the relative influence of color and form on the representational geometry at each processing stage. We analyze brain and CNN responses to stimuli varying based on color and either a simple form feature, orientation, or a more complex form feature, curvature. We find that while the brain and CNNs largely differ in how the absolute coding strength of color and form vary over processing, comparing them in terms of their relative emphasis of these features reveals a striking similarity: For both the brain and for CNNs trained for object recognition (but not for untrained CNNs), orientation information is increasingly de-emphasized, and curvature information is increasingly emphasized, relative to color information over processing, with corresponding processing stages showing largely similar values of the form dominance index.
INTRODUCTION
Two of the most salient visual properties of an object are its color and its form. Primate neurophysiology and human fMRI decoding studies have now demonstrated that information about both color and form is available in every stage of the primate ventral visual pathway, spanning early visual regions V1, V2, and V3 (Shapley & Hawken, 2011; Conway et al., 2010; Seymour, Clifford, Logothetis, & Bartels, 2010; Conway, 2001; Ts'o, Roe, & Gilbert, 2001; Johnson, Hawken, & Shapley, 2001; Livingstone & Hubel, 1988; Gegenfurtner, Kiper, & Fenstemaker, 1996), the mid-level visual region V4 (Bushnell & Pasupathy, 2012; Seymour et al., 2010; Brouwer & Heeger, 2009; Brewer, Liu, Wade, & Wandell, 2005; Gallant, Connor, Rakshit, Lewis, & Van Essen, 1996), and color- and form-selective areas in macaque and human inferotemporal cortex (Taylor & Xu, 2022; Bao, She, McGill, & Tsao, 2020; Bannert & Bartels, 2013, 2018; Lehky & Tanaka, 2016; DiCarlo, Zoccolan, & Rust, 2012; McMahon & Olson, 2009; Komatsu & Ideura, 1993). The existence of color information in human inferotemporal cortex, encompassing lateral and ventral occipitotemporal cortex (VOT), is especially striking, because these brain regions are defined based on their univariate sensitivity to form and not color information (Orban, Van Essen, & Vanduffel, 2004; Kourtzi & Kanwisher, 2001; Grill-Spector, Kushnir, Edelman, Itzchak, & Malach, 1998; Malach et al., 1995), suggesting that they multiplex information about both features rather than serving as purely form-specific modules. In a recent comprehensive fMRI decoding study using carefully controlled stimuli, we confirmed that both form and color information are decodable in V1, V2, V3, V4, and lateral and ventral occipitotemporal regions defined based on univariate sensitivity to either form or color information (Taylor & Xu, 2022).
Despite these advances, how the absolute and relative coding strength of color and form may vary over the course of ventral stream processing has not been systematically documented. In our previous study (Taylor & Xu, 2022), we documented the absolute strength of color and form coding with multivoxel pattern decoding accuracy using a linear classifier. However, such an approach does not directly measure how strongly a feature is coded in a region, because a similar decoding accuracy may be obtained when the patterns are on the correct side of the decoding decision boundary regardless of the actual distance or the amount of separation between the patterns. In addition, beyond the absolute coding strength of form and color, how the relative coding strength of these two features may evolve over the course of ventral processing has never been documented. Understanding which feature may more strongly dominate the representational geometry at a given stage of processing could importantly constrain the information that downstream stages could read out from that representation. For example, one region might strongly encode color information but only weakly encode form information, making it easy to read out color information across variation in form, whereas another region might show the opposite profile. Explicitly comparing the relative coding strength of two features in this way can thus help to distinguish the possible functional roles of brain regions that multiplex multiple features. The first goal of the present study is therefore to provide a systematic documentation of the absolute and relative coding strengths of color and form features over the course of processing in human ventral visual regions. Because the human ventral visual pathway can be further subdivided into a lateral substream, culminating in an area in lateral occipitotemporal cortex (LOT), and a ventral substream, culminating in an area in VOT, we additionally separately analyzed how the strength of color and shape coding varies over these two substreams (Grill-Spector, Kushnir, Hendler, & Malach, 2000; Grill-Spector, Kourtzi, & Kanwisher, 2001; Kourtzi & Kanwisher, 2001; Grill-Spector et al., 1998).
Besides the primate ventral visual areas, color and form information has also been identified in every stage of processing in various deep convolutional neural networks (CNNs), a class of computer vision models that has attracted attention both for its success at object recognition and its potential (albeit debated) viability as a model of the primate ventral visual pathway (Taylor & Xu, 2021; Rafegas, Vanrell, Alexandre, & Arias, 2020; Serre, 2019; Flachot & Gegenfurtner, 2018; Rafegas & Vanrell, 2018; Rajalingham et al., 2018; Yamins & DiCarlo, 2016; Kriegeskorte, 2015). Previous studies have also shown that representations formed in lower and higher layers of the network track those of the human lower and higher visual processing regions, respectively (Xu & Vaziri-Pashkam, 2021a; Eickenberg, Gramfort, Varoquaux, & Thirion, 2017; Güçlü & van Gerven, 2017; Cichy, Khosla, Pantazis, Torralba, & Oliva, 2016; Khaligh-Razavi & Kriegeskorte, 2014). Although CNNs have achieved high performance on object recognition tasks, they remain “black boxes” in some respects, with the nature of the internal feature transformations that give rise to their performance remaining poorly understood. How do the absolute and relative coding strengths of color and form features vary over the course of CNN visual processing? Do they resemble those found in the human brain? The second goal of the present study is therefore to provide one of the first systematic documentations of color and form coding strengths in CNNs pretrained for object recognition and compare them with those from the human ventral regions. This would provide us with a unique opportunity to compare the algorithms used by these two systems for representing and transforming combinations of visual features over processing and deepen our understanding of the similarities and differences between these systems (Xu & Vaziri-Pashkam, 2021a; Serre, 2019).
To accomplish these goals and to account for the varying complexity of form features, in the present study, we examined the absolute and relative coding strength of color and form both for stimuli varying based on a low-level form feature, namely, orientation, and stimuli that varied with respect to a mid-level form feature, namely, curvature. We examined fMRI responses from human ventral visual areas across two experiments. We also examined responses from five CNNs, chosen based on their prevalence in the literature, architectural diversity, correspondence with neural measurements or behavior, and/or object recognition performance. Specifically, we examined AlexNet (widely used shallow architecture; Krizhevsky, Sutskever, & Hinton, 2012), VGG19 (shallow architecture with high performance; Simonyan & Zisserman, 2015), ResNet-50 (deep architecture with high performance; He, Zhang, Ren, & Sun, 2015), GoogLeNet (deep architecture with high performance; Szegedy et al., 2015), and CORNet-S (shallow, recurrent network showing relatively high correspondence with neural and behavioral measurements; Kubilius et al., 2018). To document the absolute coding strength of colors and form, we measured the normalized Euclidean distance between representations associated with changes in a feature as was done in a previous study examining the coding strength of object identity and nonidentity features in human ventral cortex (Xu & Vaziri-Pashkam, 2021b). To measure the relative coding strength of these two features, we define a unitless measure, the form dominance index, that quantifies the relative contribution of two features to the representational geometry in a brain region or CNN layer. Intuitively, this index compares the representational dissimilarity between stimuli with the same form but different colors, and the dissimilarity between stimuli with the same color but different form, following a similar logic to the index presented in Xu and Vaziri-Pashkam (2021b) that quantified the relative coding strength of object identity and identity-irrelevant features. We documented the absolute and relative coding strength of color and form in each region of the ventral pathway and across each stage of processing for various CNNs, and characterized how they vary within each system and between the brain and CNNs, both in terms of whether they increase or decrease over processing, and in terms of their overall magnitudes.
Broadly, we find an overall convergence between the ventral visual pathway and CNNs in their relative emphasis of color and form information: Although the absolute values of the coding strength of color and form follow different trajectories between the ventral visual pathway and the CNNs we examined, they tend to share the key similarity that orientation, a low-level form feature, is increasingly de-emphasized relative to color, whereas curvature, a mid-level form feature, is increasingly emphasized relative to color, with the relative emphasis of color and form being largely similar between the brain and trained CNNs at corresponding stages of processing. Moreover, this similar progression in the relative emphasis of color and form is not a byproduct of the intrinsic architecture of the CNNs, but is specifically induced by training them for object recognition, because we do not find this similarity in untrained CNNs with random weights.
METHODS
fMRI Experimental Details
Participants
Experiment 1 included 12 healthy, right-handed human adults (7 women, between 25 and 34 years old, average age 30.6 years old). Experiment 2 included 13 healthy adults (7 women, between 25 and 34 years old, average age 28.7 years old). Four participants partook in both experiments. To ensure adequate power, we chose the sample sizes to meet or exceed those of past fMRI studies that have successfully identified color and shape information in the human brain regions we examine. For instance, Seymour et al. (2010) successfully decoded color and form information in V1, V2, V3, and V4 with a sample size of five participants, and Lafer-Sousa, Conway, and Kanwisher (2016) found differential univariate activation to color and shape in ventral occipitotemporal regions with 13 participants.
All participants had normal color vision and normal or corrected-to-normal visual acuity. They gave informed consent before the experiments and received payment. The experiments were approved by the Committee on the Use of Human Subjects at Harvard University.
Stimuli
Experiment 1 used colored spiral stimuli (Figure 1A). These spiral stimuli were logarithmic spirals, which exhibit the property that the angle between each spiral arm and any line radiating from the spiral center is fixed, in this case at 45° (see also Seymour et al., 2010). The spirals had 20 colored arms. The spirals were either clockwise or counterclockwise, yielding two conditions with orthogonal local orientation contours at corresponding locations. Experiment 2 used colored tessellation stimuli. These tessellation stimuli consisted of either a curvy or spiky tessellation pattern, designed so as not to resemble any real-world entities. These stimuli differ with respect to their degree of curvature, an important mid-level visual feature (Srihasam, Vincent, & Livingstone, 2014; Yue, Pourladian, Tootell, & Ungerleider, 2014; Gallant, Braun, & Van Essen, 1993); specifically, the spiky tessellations contained only straight contours, whereas the curvy tessellations contained only curved contours. Both the spiral and tessellation stimuli were bounded by a circular aperture with a diameter of 9.7° of visual angle, and had a white fixation dot in the center. Stimuli were presented on a black background. The stimuli could be either red or green, with the specific shade of red and green calibrated to isoluminance in the fMRI experiment for each participant using the flicker-adjustment method. Thus, within each stimulus set, there were four different stimuli (2 forms × 2 colors). Stimuli could be presented in one of two phases, depending on which parts of the image within the aperture were colored and which were black; this was done to ensure that our results did not arise purely from different retinotopic footprints between the two form conditions in each experiment (see Taylor & Xu, 2022, for more details).
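The exact stimulus-generation code is not part of this article, but the geometry above is enough to sketch how such a spiral pattern could be produced. The sketch below is a minimal illustration in Python (image size, the number of sectors, and the `phase` argument are our own placeholder choices): for a logarithmic spiral r = exp(bθ), the angle between the curve and any radial line is arctan(1/b), so a 45° pitch corresponds to b = 1.

```python
import numpy as np

def spiral_mask(size=512, n_sectors=40, pitch_deg=45.0, clockwise=True, phase=0.0):
    """Boolean mask of a logarithmic-spiral arm pattern inside a circular aperture.

    n_sectors alternating sectors give n_sectors / 2 colored arms (here 20).
    """
    b = 1.0 / np.tan(np.deg2rad(pitch_deg))             # 45 deg pitch -> b = 1
    if clockwise:
        b = -b
    y, x = np.mgrid[-1:1:size * 1j, -1:1:size * 1j]     # image coordinates in [-1, 1]
    r = np.hypot(x, y)
    theta = np.arctan2(y, x)
    # Quantity that is constant along each logarithmic-spiral arm (r = exp(b * theta)).
    spiral_phase = theta - np.log(np.maximum(r, 1e-6)) / b + phase
    sector = np.floor(spiral_phase * n_sectors / (2 * np.pi)) % 2
    return (sector == 1) & (r <= 1.0)                   # colored arms inside the aperture

mask = spiral_mask(clockwise=True)   # pixels to color red or green; the rest stay black
```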
Procedure
The procedure for both experiments was identical other than the stimuli used. In each run, participants viewed blocks of the stimuli lasting 12 sec each, with a 9-sec interblock blank interval and a 12-sec blank period at the beginning of the run. During each block, the phase of the stimulus alternated once per second. The participant's task was to detect two 500-msec luminance changes in each block, responding to a luminance increase with their index finger and a luminance decrease with their middle finger. Each run lasted 180 sec and contained eight stimulus blocks. Participants viewed 12 such runs and thus viewed a total of 24 blocks of each of the four stimulus types over the whole session.
Localizer Experiments and ROIs Definitions
As ROIs, we included retinotopically defined regions V1, V2, V3, and V4 in early visual cortex, and form-sensitive regions LOT and VOT in lateral and ventral occipitotemporal cortex, respectively (example ROIs shown in Figure 1B).
To localize topographic visual field maps, we followed standard retinotopic mapping techniques (Sereno et al., 1995). A 72° polar angle wedge swept either clockwise or counterclockwise (alternating each run) across the entire screen, with a sweeping period of 36.4 sec and 10 cycles per run. The entire display subtended 23.4 × 17.6° of visual angle. The wedge contained a colored checkerboard pattern that flashed at 4 Hz. Participants were asked to detect a dimming in the polar angle wedge. Each participant completed four to six runs, each lasting 364 sec. Areas V1 through V4 were then localized on each participant's cortical surface by manually tracing the borders of these visual maps activated by the vertical meridian of visual stimulation (identified by locating the phase reversals in the phase-encoded mapping), following the procedure outlined in Sereno et al. (1995).
To localize the form-sensitive regions in lateral and ventral occipitotemporal cortex (LOT and VOT), we followed the procedure of Kourtzi and Kanwisher (2001), subsequently used in several of our own laboratory's studies (Jeong & Xu, 2017; Vaziri-Pashkam & Xu, 2017). Specifically, in a separate scanning session from the main experiment (usually the same one as the retinotopic mapping session), participants viewed black-and-white pictures of faces, places, common objects, arrays of four objects, phase-scrambled noise, and white noise in a block design paradigm, and responded with a button press whenever the stimulus underwent a slight spatial jitter, which occurred randomly twice per block. Each block contained 20 images from the same category, and each image was presented for 750 msec, followed by a 50-msec blank display, totaling 16 sec per block, with four blocks per stimulus category. Each run also contained a 12-sec fixation block at the beginning, and an 8-sec fixation block in the middle and end. Images subtended 9.5° of visual angle. Participants performed either two or three runs, each lasting 364 sec. LOT and VOT were then defined as the clusters of voxels in lateral and ventral occipitotemporal cortex, respectively, that respond more to photos of real-world objects than to phase-scrambled versions of the same objects (p < .001 uncorrected). LOT and VOT corresponded to the location of the shape-sensitive lateral occipital and posterior fusiform regions defined in previous studies (Kourtzi & Kanwisher, 2000; Grill-Spector et al., 1998; Malach et al., 1995) but extended further into the temporal cortex in our effort to include more object-selective voxels in the occipitotemporal cortex.
MRI Methods
MRI data were collected using a Siemens PRISMA 3 T scanner, with a 32-channel receiver array headcoil. Participants lay on their backs inside the scanner and viewed the back-projected display through an angled mirror mounted inside the headcoil. The display was projected using an LCD projector at a refresh rate of 60 Hz and a spatial resolution of 1280 × 1024. An Apple Macbook Pro laptop was used to create the stimuli and collect the motor responses. Stimuli were created using MATLAB (The MathWorks) and Psychtoolbox (Brainard, 1997).
A high-resolution T1-weighted structural image (1.0 × 1.0 × 1.3 mm) was obtained from each participant for surface reconstruction. All BOLD data were collected via a T2*-weighted EPI pulse sequence that employed multiband radio frequency pulses and simultaneous multislice (SMS) acquisition. For the two main experiments, 69 axial slices tilted 25° toward coronal from the AC-PC line (2 mm isotropic) were collected covering the whole brain (repetition time [TR] = 1.5 sec, echo time = 30 msec, flip angle = 75°, field of view = 208 mm, matrix = 104 × 104, SMS factor = 5). For the retinotopic mapping and lateral occipital cortex localizer sessions, 64 axial slices tilted 25° toward coronal from the AC-PC line (2.3 mm isotropic) were collected covering the whole brain (TR = 0.65 sec, echo time = 34.8 msec, flip angle = 52°, matrix = 90 × 90, SMS factor = 8). Different slice prescriptions were used here for the different localizers to be consistent with the parameters used in our previous studies. Because the localizer data were projected into the volume view and then onto individual participants' flattened cortical surface, the exact slice prescriptions used had minimal impact on the final results.
fMRI Data Analysis
fMRI data were analyzed using FreeSurfer (surfer.nmr.mgh.harvard.edu), FsFast (Dale, Fischl, & Sereno, 1999), and in-house Python scripts. The exact same analysis pipeline was used for the two experiments. Preprocessing was performed using FsFast. All functional data were motion-corrected to the first image of the run of the experiment. Slice-timing correction and smoothing (3-mm FWHM) were applied. A generalized linear model with a boxcar function convolved with the canonical hemodynamic response function was used to model the response of each trial, with the three motion parameters and a linear and quadratic trend used as covariates in the analysis. The first eight TRs of each run (before the presentation of the first stimulus) were included as nuisance regressors to remove them from further analysis. Beta values reflecting the brain response were extracted for each trial block. ROIs were defined on the cortical surface and then projected back to native functional space to extract the beta values from each ROI. To equate the number of voxels from each brain region, the 300 most responsive voxels from each ROI, defined using a stimulus > rest contrast across all task blocks, were extracted from the union of the two hemispheres for each ROI (e.g., left and right V1) and used in further analyses. The mean BOLD pattern for each condition (e.g., red clockwise spiral) in each ROI of each run was computed by first z-scoring the BOLD pattern from each trial to equate the mean and SD of each trial, and then averaging over all the trials of each condition to yield a mean pattern for that condition.
As our main analyses, we computed a Form Strength (dform) and a Color Strength (dcolor) value that quantified how strongly each feature was encoded in each ROI (Figure 2A). These values were then used to compute a Form Dominance Index that quantifies the relative coding strength of these two features (Figure 2B). Where applicable, statistical tests were corrected for multiple comparisons using the Benjamini-Hochberg procedure with the false discovery rate controlled at q < .05 (Benjamini & Hochberg, 1995); the set of tests across which this correction was applied is described for each analysis. These metrics were computed separately for the spiral and tessellation stimuli, such that, for the spiral stimuli, the form feature being manipulated was orientation (with the clockwise and counterclockwise spirals having opposite orientations), whereas for the tessellation stimuli, the form feature being manipulated was curvature (with one tessellation pattern being spiky and the other being curvy).
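As a concrete illustration of the correction procedure, the sketch below applies Benjamini-Hochberg FDR control to a small set of p values. The article does not state which software implemented this step, so the use of statsmodels here, and the p values themselves, are placeholders.

```python
from statsmodels.stats.multitest import multipletests

# Placeholder p values for the four tests run within one ROI
# (two features x two stimulus types).
pvals = [0.006, 0.0002, 0.01, 0.02]

# Benjamini-Hochberg procedure with the false discovery rate controlled at q < .05.
reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method='fdr_bh')
```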
Absolute Coding Strength of Color and Form
To account for differences in measurement noise among the different brain regions, following Xu and Vaziri-Pashkam (2021b), we used a split-half approach to compute the raw coding strength of color and form and then correct it by the reliability of each brain region (Figure 2A). Specifically, we first separately averaged the patterns within the odd and even runs to generate an average odd pattern and an average even pattern for each stimulus type in each brain region. To document the strength of form coding, we computed the raw Euclidean distance between the patterns for stimuli of the same color, but different form (dform (raw)) across the odd and even runs (e.g., red clockwise from the even runs and red counterclockwise from the odd runs, and vice versa, with the results then averaged between the two directions). To correct for reliability and account for the fact that Euclidean distance will be positive even for identical patterns in the presence of noise, we also computed the Euclidean distance between the patterns for the same stimulus across the odd and even runs for each of the two stimuli being compared (dsame) as our reliability measure for that brain region (e.g., red clockwise from the odd and even runs, and red counterclockwise from the odd and even runs, with the results then averaged between the two). The reliability measure was then subtracted from the raw Euclidean distance measure to generate a noise-corrected Euclidean distance measure to reflect the form coding strength for that brain region (i.e., dform = dform (raw) − dsame). The resulting value was computed for each possible condition pair (e.g., for red clockwise vs. red counterclockwise, and for green clockwise vs. green counterclockwise) and then averaged across the pairs. Finally, to enable direct comparison of results between the brain and CNNs, the resulting value was divided by twice the square root of the number of voxels in a given region such that the distance does not depend on the number of voxels included (as the distance between two opposite patterns would increase with the increasing number of voxels/dimensions; see also Xu and Vaziri-Pashkam, 2021b). dcolor was computed in an analogous way, by comparing stimuli with the same form, but different color. The resulting dform and dcolor have the properties that they will have an expected value of zero if there is no difference between BOLD patterns for different values of their respective features after accounting for reliability, and do not depend on the number of voxels included. Values for dform and dcolor were computed separately within each participant, region, and stimulus type.
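The sketch below illustrates this split-half, reliability-corrected distance for one ROI. The data structure and variable names are ours, not the authors' code: `patterns` maps a (color, form, half) triple to the mean z-scored voxel pattern (a NumPy array) for that stimulus in the odd or even runs.

```python
import numpy as np

def corrected_distance(patterns, feature='form'):
    """Reliability-corrected, size-normalized distance (dform or dcolor) for one ROI."""
    colors, forms = ['red', 'green'], ['A', 'B']
    n_vox = next(iter(patterns.values())).size
    if feature == 'form':
        pairs = [((c, forms[0]), (c, forms[1])) for c in colors]    # same color, different form
    else:
        pairs = [((colors[0], f), (colors[1], f)) for f in forms]   # same form, different color
    raw, same = [], []
    for a, b in pairs:
        # Raw distance: different stimuli, computed across the odd/even split (both directions).
        raw.append(np.linalg.norm(patterns[a + ('odd',)] - patterns[b + ('even',)]))
        raw.append(np.linalg.norm(patterns[a + ('even',)] - patterns[b + ('odd',)]))
        # Reliability baseline: the same stimulus, compared across the odd/even split.
        same.append(np.linalg.norm(patterns[a + ('odd',)] - patterns[a + ('even',)]))
        same.append(np.linalg.norm(patterns[b + ('odd',)] - patterns[b + ('even',)]))
    d = np.mean(raw) - np.mean(same)          # subtract the reliability measure
    return d / (2 * np.sqrt(n_vox))           # normalize by the number of voxels
```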
We conducted several statistical analyses to test how dform and dcolor vary across brain regions and form features. Within each ROI, one-sample, one-tailed t tests were used to test whether dform and dcolor were significantly above zero, with a positive result indicative of significant coding of a feature. One-tailed tests were used because below-zero values of dform and dcolor are not meaningful. These tests were corrected for multiple comparisons across the four tests conducted for each ROI (i.e., each combination of feature and stimulus type). The coding strength of each feature (form and color) was also compared between the spiral and tessellation stimuli within each ROI using two-tailed partially overlapping t tests (Derrick, Russ, Toher, & White, 2017), which took into account the fact that some participants partook in both experiments; these tests were corrected for multiple comparisons across two such tests (one each for color and form) conducted for each ROI. To aggregate across ROIs for additional sensitivity, a mixed effects analysis with brain region, stimulus type, and their interaction as factors was used to test whether each of dform and dcolor varies between the two types of stimuli (spirals and tessellations).
Across ROIs, several analyses were performed to characterize how dform and dcolor vary over the course of processing, with each analysis performed separately within each stimulus type. First, a one-way repeated-measures ANOVA was applied within each stimulus type to test whether each of dform and dcolor significantly varies across the six brain regions examined. After testing for the existence of differences in dform and dcolor across regions in this way, we then characterized in more detail how these quantities vary over processing. Specifically, based on anatomical segregation, we defined two visual processing substreams: a lateral substream comprising V1, V2, V3, and LOT, and a ventral substream comprising V1, V2, V3, V4, and VOT (see Figure 1B). To test for a linear increase or decrease in dform and dcolor over processing in each of these substreams within each stimulus type, each of dform and dcolor was correlated with the rank order of each region's place in the hierarchy for that substream (e.g., for the lateral substream, 1 for V1, 2 for V2, 3 for V3, and 4 for LOT) within each participant. A one-sample t test (two-tailed) was then used to compare the mean of the Fisher's z-transformed correlation coefficients against zero. These correlation tests were corrected for multiple comparisons across the four tests (two features by two stimulus types) conducted in each substream. To document in greater detail differences among the different ROIs, we also conducted within-participant pairwise t tests to test differences between brain regions at the beginning and end of processing (i.e., between V1 and LOT, between V1 and VOT, and between the mean of V1 and V2, and the mean of LOT and VOT), and between the two higher visual regions LOT and VOT. These tests were corrected for multiple comparisons within the four tests conducted for each ROI pair (i.e., for each combination of feature and stimulus type).
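A minimal sketch of this trend test for one substream follows (the numbers are placeholders, not real data): the coding-strength value is correlated with region rank within each participant, Fisher z-transformed, and the transformed correlations are tested against zero.

```python
import numpy as np
from scipy import stats

# dform for one substream: one row per participant, columns ordered V1, V2, V3, LOT
# (placeholder values).
d = np.array([[0.020, 0.022, 0.016, 0.011],
              [0.025, 0.021, 0.015, 0.012],
              [0.019, 0.020, 0.013, 0.009]])
ranks = np.arange(1, d.shape[1] + 1)                        # 1 = V1, ..., 4 = LOT

r = np.array([stats.pearsonr(row, ranks)[0] for row in d])  # per-participant correlation
z = np.arctanh(r)                                           # Fisher's z transform
t, p = stats.ttest_1samp(z, 0.0)                            # two-tailed test against zero
```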
Relative Coding Strength of Color and Form
Instead of computing the indices for each individual participant, we used a bootstrapping procedure (described in more detail below) to compute the average dform and dcolor from multiple participants to generate the form dominance indices and then assess statistical significance. This was done for two reasons. First, because of noise fluctuation, there were cases where both dform and dcolor were negative, making the index uninterpretable. Second, because the index is a ratio, it can be numerically unstable at the single-participant level, such that if both dform and dcolor are small, small fluctuations because of noise can lead to large fluctuations in the index. A bootstrapping procedure that involves averaging dform and dcolor across multiple participants before computing the index reduces the odds of both dform and dcolor being negative (which we empirically confirmed in our data), and increases the numerical stability of the index estimate by reducing the impact of noise. Neither of these problems arise for dform and dcolor themselves, justifying a different analysis procedure here for the dominance index.
The bootstrapping procedure for computing the form dominance index was as follows. Within each stimulus type (spiral and tessellation), 10,000 samples (with replacement) with a size equal to the sample size of the experiment for that stimulus type were drawn. Sampling was done at the level of individual participants, such that all of a participant's ROIs were sampled together, to preserve information about any relative differences among ROIs from that participant. Within each sample, each of dform and dcolor was then averaged across participants within each ROI, and the form dominance index for that sample was computed using these average values of dform and dcolor, yielding 10,000 such values for each combination of stimulus type and brain region. This procedure for aggregating dform and dcolor within samples almost never led to cases where both values were negative: Every stimulus-type/region combination had at least one positive value for dform and dcolor in all 10,000 samples, with the exception of V1 in the tessellation stimuli (both values negative in only 6 of 10,000 samples) and V2 in the tessellation stimuli (both values negative in only 2 of 10,000 samples). We decided to keep these samples rather than replacing them so as not to bias our sample selection. Thus, this resampling procedure successfully avoids the counterintuitive case where both dform and dcolor are negative, while incorporating data from all participants.
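The excerpt above does not spell out the index formula, but a definition consistent with its description as a unitless ratio that is positive when form dominates is (dform − dcolor)/(dform + dcolor), which also approximately reproduces the tabled values (e.g., V1, spirals: (.023 − .0088)/(.023 + .0088) ≈ .45, close to the reported .46; exact agreement is not expected because the reported indices come from bootstrapped averages). Under that assumption, the bootstrap step for one ROI could look like the following sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def dominance_index(d_form, d_color):
    # Assumed definition: unitless, positive when form dominates the geometry.
    return (d_form - d_color) / (d_form + d_color)

def bootstrap_dominance(d_form_by_subj, d_color_by_subj, n_boot=10_000):
    """Bootstrap distribution of the form dominance index for one ROI.

    Inputs are one value per participant; in the full analysis all ROIs of a
    resampled participant are carried along together (only one ROI shown here).
    """
    n = len(d_form_by_subj)
    idx = rng.integers(0, n, size=(n_boot, n))           # resample participants
    mean_form = np.asarray(d_form_by_subj)[idx].mean(axis=1)
    mean_color = np.asarray(d_color_by_subj)[idx].mean(axis=1)
    return dominance_index(mean_form, mean_color)        # 10,000 index values
```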
We conducted a number of statistical analyses on the bootstrapped form dominance indices, analogous to the analyses performed on the corrected Euclidean distance measures. Statistical significance was assessed using the inverted confidence interval method (Fox & Weisberg, 2018); specifically, considering a two-sided confidence interval with confidence level 1 − α constructed by taking the respective quantiles of the bootstrapped distribution for a given statistic of interest (e.g., the .025 and .975 quantiles for α = .05), the p value for a given test was defined as the smallest α for which the confidence interval included the statistic's value under the null hypothesis; for example, if the statistic's value under the null hypothesis is zero, and 4% of the bootstrapped values are less than zero, then the resulting two-tailed p value is .08. The resulting p values were corrected for multiple comparisons using the Benjamini-Hochberg procedure (Benjamini & Hochberg, 1995) controlled at a false discovery rate of q < .05, with the details of each correction described below.
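A minimal sketch of this p-value computation, given a bootstrap distribution such as the one produced by the sketch above:

```python
import numpy as np

def bootstrap_p(boot_values, null_value=0.0):
    """Two-tailed p value via the inverted confidence-interval method:
    twice the smaller tail proportion relative to the null value
    (e.g., 4% of bootstrapped values below zero -> p = .08)."""
    boot_values = np.asarray(boot_values)
    lower = np.mean(boot_values < null_value)
    upper = np.mean(boot_values > null_value)
    return 2 * min(lower, upper)
```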
First, to test whether each brain region shows a preference for one feature versus the other in each stimulus type, we tested whether the dominance index for each region significantly differs from zero; these tests were corrected for multiple comparisons across the two tests (one for each of the two stimulus types) within each ROI. Second, to test whether the form dominance index varies between stimulus types (i.e., spirals and tessellations) within each ROI, we took the difference in the dominance index between stimulus types for each ROI for each of the 10,000 bootstrapped samples, and tested whether this difference is significantly different from zero. Because just one such test was conducted for each ROI, no correction for multiple comparisons was applied for this test. Third, to test for differences between the beginning and end of processing, and for differences between LOT and VOT, we computed differences in the dominance index for a pair of ROIs in each of the 10,000 samples, and tested whether this difference was significantly different from zero. This was done separately within each stimulus type, using the same pairs of ROIs as were tested for their difference in dform and dcolor. For the comparison between the average of V1 and V2 and the average of LOT and VOT, averaging was performed on the index within each sample (rather than on dform and dcolor before computing the index). Correction for multiple comparisons was done within the set of two tests performed for each pair of ROIs (one test per stimulus type). Fourth, to test for an overall linear increase or decrease in the dominance index over processing in each substream, within each of the 10,000 samples, we calculated the Pearson correlation coefficient between the dominance index and the rank order of each ROI in its processing substream. We then tested whether the bootstrapped correlation significantly differed from zero. This test was corrected for multiple comparisons across the two correlations (one for each stimulus type) computed for each substream.
CNN Analysis Details
The absolute coding strength of form (dform) and color (dcolor), and the relative coding strength of these two features were computed for CNNs analogously to how they were computed for the fMRI data.
Five CNNs were chosen based on their high object recognition performance, architectural diversity, and prevalence in the literature. Specifically, we included AlexNet (Krizhevsky et al., 2012), VGG19 (Simonyan & Zisserman, 2015), GoogLeNet (Szegedy et al., 2015), ResNet-50 (He et al., 2015), and CORNet-S (Kubilius et al., 2018). AlexNet was included for its high object recognition performance, relative simplicity, and prevalence in the literature. VGG19, GoogLeNet, and ResNet-50 were chosen based on their high object recognition performance and architectural diversity. Both AlexNet and VGG19 have a shallower network structure, whereas GoogLeNet and ResNet-50 have a deeper network structure. CORNet-S is a shallow recurrent CNN designed to approximate the structure of the primate ventral visual pathway, and exhibits high correlation with neural and behavioral metrics. This CNN has recently been argued to be the current best model of the primate ventral visual regions (Kar, Kubilius, Schmidt, Issa, & DiCarlo, 2019). Table 1 shows the layers that we sampled from each network. Instead of sampling a few representative layers as we did in previous studies (Mocz, Vaziri-Pashkam, Chun, & Xu, 2022; Xu & Vaziri-Pashkam, 2021a, 2021b, 2022; Taylor & Xu, 2021), to better capture the evolution of color and form representation over the course of processing, here, we sampled all the layers with the following exclusions: We excluded layers before the second convolutional layer (because the representations in these initial layers are still dominated by the input image rather than the network's computations, and to make the layer sampling more analogous to the neural data, which excludes initial retinal representations), all dropout layers (because these are disabled after network training), and the parallel branching (“Inception”) layers of GoogLeNet, because these branches could vary in length, with no well-defined ordinal position in the network. For CORNet_S, all passes of the recurrent layers were sampled, so as to capture the results of each computation.
Network (Layers Used/Total Layers) | Layers Used
---|---
AlexNet (16/21) | All layers except before conv2 and dropout layers |
CORnet-S (57/61) | All layers except before conv2 (for recurrent layers, included all passes) |
GoogLeNet (9/130) | All layers except before conv2, layers inside the Inception (branching) modules, and dropout layers |
ResNet-50 (154/158) | All layers except before conv2 |
VGG19 (41/45) | All layers except before conv2 and dropout layers |
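As an illustration of the layer sampling summarized in Table 1, the sketch below enumerates the leaf modules of a torchvision model (assuming a recent torchvision) and drops dropout layers; the exclusion of layers before conv2 and the exact layer-counting convention used in the table are applied separately and are not reproduced here.

```python
import torch.nn as nn
from torchvision import models

net = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)

# Leaf modules in registration order, excluding dropout layers (disabled at inference).
layers = [(name, m) for name, m in net.named_modules()
          if not list(m.children()) and not isinstance(m, nn.Dropout)]
for name, m in layers:
    print(name, type(m).__name__)
```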
To examine the extent to which any results might depend on the network's training regime, versus its intrinsic architecture, we examined each network both with weights trained to recognize ImageNet images (Deng et al., 2009) and with random weights. For the random networks, to reduce the possibility that any findings were simply because of random variation in the weight initializations, we computed the average results for dform, dcolor, and the form dominance index from 10 random initializations of each network (with subsequent analyses remaining identical, and treating these averages the same as the single values for the trained networks). In addition, we examined a version of ResNet-50 trained entirely on “stylized” ImageNet images, where the original texture and color of every single image was replaced by the style of a randomly chosen painting, thereby removing the real-world color-form covariation in the natural objects (Geirhos et al., 2019). This training procedure induced ResNet-50 to rely more on global form than on local texture to classify images.
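For the untrained comparison, the ten random initializations could be generated as in the sketch below (the seeds and the use of the torchvision constructor are our own choices); the same dform, dcolor, and dominance-index analyses are then run on each instance and averaged.

```python
import torch
from torchvision import models

# Ten untrained instances of the same architecture, one per random seed.
random_nets = []
for seed in range(10):
    torch.manual_seed(seed)                       # make each initialization reproducible
    random_nets.append(models.resnet50(weights=None).eval())
```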
Because trained CNNs can have varying weights based on the random seed with which their weights were initialized, we also examined 10 additional instances of AlexNet trained on object recognition with different random seeds to assess the extent to which the absolute and relative tuning strengths in the trained networks may have been affected by the particular random seed that was used. The pretrained weights for these additional AlexNet instances were taken from open-source data provided by Mehrer, Spoerer, Kriegeskorte, and Kietzmann (2020), and were architecturally identical to the variant of AlexNet used in our main analyses, with the exception of lacking an average-pooling layer before the first fully connected layer, and having a different number of kernels in the penultimate convolutional layer (384 kernels instead of 256 kernels).
The same stimuli as those used in the fMRI experiment were fed to the CNNs, with minor modifications. The stimuli were truncated to square images (224 × 224 pixels), with the entire stimulus pattern located in the center and bounded by a circular aperture subtending 140 × 140 pixels. The specific shades of red and green for the CNN input images were taken from the calibrated isoluminance values for a sample participant.
The stimuli were fed into each network, and the resulting patterns were flattened into 1D vectors (no averaging across space). Because each stimulus has two different phases (depending on which image regions were colored vs. black), both phases of the stimulus were fed into each network, and the results were averaged, to roughly approximate the manner in which BOLD activity presumably reflected an average of the response to the two phases during the fMRI blocks. The resulting averaged patterns were then each z-normalized to remove amplitude differences between the different stimuli as was done with the fMRI response patterns. dform and dcolor were then computed in the same manner as for the fMRI data, except that no reliability adjustment was required because CNNs are noise-free. In the same way that the distances were normalized by the number of voxels for the fMRI data, distances for the CNNs were normalized by dividing by twice the square root of the number of units. Finally, the same form dominance index was computed for the CNN patterns as for the fMRI data for each sampled layer of each network. Here, the index could be calculated directly from dform and dcolor because these values will always be non-negative and because CNNs are noise-free.
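The sketch below illustrates this pipeline for one layer of one network: the two phase images of a stimulus are passed through the network, the layer's activations are captured with a forward hook, flattened, averaged over the phases, z-normalized, and then entered into the same size-normalized distance. File names, the chosen layer, and the minimal preprocessing (the article does not specify whether ImageNet mean/std normalization was applied) are placeholders.

```python
import numpy as np
import torch
from PIL import Image
from torchvision import models, transforms

net = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1).eval()
to_tensor = transforms.ToTensor()                     # stimuli are already 224 x 224

def layer_pattern(phase_files, layer):
    """Flattened, z-scored activation for one stimulus, averaged over its two phases."""
    acts = []
    hook = layer.register_forward_hook(
        lambda module, inputs, output: acts.append(output.detach().flatten().numpy()))
    with torch.no_grad():
        for path in phase_files:                      # the two phase images of the stimulus
            net(to_tensor(Image.open(path).convert('RGB')).unsqueeze(0))
    hook.remove()
    pattern = np.mean(acts, axis=0)                   # average over the two phases
    return (pattern - pattern.mean()) / pattern.std() # z-normalize

layer = net.features[3]                               # e.g., conv2 (index is illustrative)
red_cw  = layer_pattern(['red_cw_phase1.png', 'red_cw_phase2.png'], layer)
red_ccw = layer_pattern(['red_ccw_phase1.png', 'red_ccw_phase2.png'], layer)

# One pairwise contribution to dform: same color, different form. No reliability
# correction is needed because the network responses are noise free.
d_form_pair = np.linalg.norm(red_cw - red_ccw) / (2 * np.sqrt(red_cw.size))
```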
To test for each CNN whether there is a significant linear increase or decrease in dform, dcolor, and the form dominance index over the course of processing, we computed the Pearson correlation between each of these three quantities and the rank order of each sampled CNN layer within the network (e.g., for VGG19, a value of 1 for Conv2 and a value of 41 for FC3). Correction for multiple comparisons was applied within the tests conducted for each network, and separately for the distance measures (four tests per network, corresponding to two stimulus types by two features) and for the dominance indices (two tests per network; one for each of the two stimulus types). In addition, we tested for a difference in this correlation value for the dominance index between the two trained versions of ResNet-50 (ImageNet and Stylized images) using a test for the equivalence of correlations (Diedenhofen & Musch, 2015).
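The comparison of the two correlation values refers to Diedenhofen and Musch (2015), whose cocor toolbox implements several such tests; as one concrete (and not necessarily identical) example, Fisher's z test for two correlations from independent samples is sketched below, using the dominance-index correlations of the two trained ResNet-50 variants for the tessellation stimuli (values from Table 4) and the network's layer count as n.

```python
import numpy as np
from scipy import stats

def fisher_z_difference(r1, n1, r2, n2):
    """Fisher's z test for the difference between two independent correlations."""
    z = (np.arctanh(r1) - np.arctanh(r2)) / np.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    return z, 2 * stats.norm.sf(abs(z))               # two-tailed p value

# ResNet-50 (ImageNet) vs. ResNet-50 (Stylized), tessellations; n = 154 sampled layers.
z, p = fisher_z_difference(r1=0.06, n1=154, r2=-0.21, n2=154)
```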
Comparing Form Dominance in Brains and CNNs
Previous studies have shown that representations formed in lower and higher layers of CNNs trained for object recognition track those of the human lower and higher visual processing regions, respectively (Xu & Vaziri-Pashkam, 2021a; Eickenberg et al., 2017; Güçlü & van Gerven, 2017; Cichy et al., 2016; Khaligh-Razavi & Kriegeskorte, 2014). We therefore tested next whether the level of the dominance index differs between corresponding stages of processing in the brain and each of the CNNs examined. Specifically, for each network, we resampled the dominance index values from the CNN layers so as to have the same number of resampled “layers” as the number of the sampled brain regions in each of the two substreams we defined (lateral and ventral), allowing us to compare corresponding brain regions and CNN layers. This was done by using SciPy's ndimage.zoom function to apply a first-order (linear) spline interpolation to resample the number of values of a vector while preserving the overall shape. For example, in comparing VGG19 to the lateral substream (comprising V1, V2, V3, and LOT), the values for VGG19's 41 sampled layers were resampled to yield four values, one corresponding to each brain region in the lateral substream. We then tested whether the dominance index for the resampled CNN layer is significantly different from the dominance index for the corresponding brain region in the substream by once again using the inverted confidence interval method. These tests were corrected for multiple comparisons across the six comparisons (i.e., one for each trained network) conducted for each brain region.
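A minimal sketch of this resampling step (the layer values are placeholders):

```python
import numpy as np
from scipy import ndimage

# Form dominance index for each of VGG19's 41 sampled layers (placeholder values).
layer_values = np.linspace(-0.5, 0.8, 41)

# Linearly interpolate down to one value per region of the lateral substream
# (V1, V2, V3, LOT) using a first-order spline, preserving the overall shape.
n_regions = 4
resampled = ndimage.zoom(layer_values, zoom=n_regions / layer_values.size, order=1)
assert resampled.shape == (n_regions,)
```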
RESULTS
In this study, we quantified both the absolute and relative coding strengths of color and form features in a set of brain regions spanning the human ventral visual hierarchy, including V1 to V4, LOT and VOT. In two experiments, we examined the coding of color with both a simple form feature, orientation (Experiment 1), and a mid-level form feature, curvature (Experiment 2). We also examined the absolute and relative coding strengths of the same color and form features in five CNNs with varying network architecture, depth, and presence/absence of recurrent processing, and how coding for these features varies with training. Lastly, we compared how feature coding differs between the ventral pathway and the CNNs we examined.
Absolute Coding Strength of Color and Form
The absolute coding strength of form (dform) for a given brain region or CNN layer was measured as the average Euclidean representational distance between the z-normalized activity patterns corresponding to stimuli of the same color, but different form. For the fMRI data, a reliability correction was applied to the measurement to account for noise differences across the different brain regions (see Methods section). Correction for the number of voxels or CNN units included was also applied to make this measure comparable across differing numbers of voxels or units. The absolute coding strength of color (dcolor) was computed in an analogous manner. Distances were computed separately for the spiral and tessellation stimuli (i.e., such that dform reflects orientation coding strength for the spiral stimuli differing in their orientation, and reflects curvature coding strength for the tessellation stimuli differing in their level of curvature). Figure 3 shows the values of dform and dcolor for the brain and for all sampled CNNs. As LOT and VOT are both higher visual processing areas but located in distinct regions of lateral and ventral occipitotemporal cortex, respectively, we plotted the results separately for a lateral substream, including V1-V3 and LOT, and a ventral substream, including V1-V4 and VOT.
We first tested the presence of significant color or form information in each ROI using one-sample, one-tailed t tests to compare each of dform and dcolor against zero (see Table 2 for the detailed statistical results). Correction for multiple comparisons was applied using the Benjamini-Hochberg procedure (Benjamini & Hochberg, 1995) with false discovery rate controlled at q < 0.05 across the four such comparisons (two features by two stimulus types) conducted for each ROI. Every ROI exhibited significant color and form information for both stimulus types, consistent with results from a past study using a linear decoder (Taylor & Xu, 2022), rather than the corrected Euclidean distance measures as we used here. To test whether dform and dcolor might vary between stimulus types (i.e., spirals and tessellations), two-tailed partially overlapping t tests (to account for several overlapping participants between experiments) were used to compare dform and dcolor in each region between stimulus types, with correction for multiple comparisons applied across two such comparisons conducted for each ROI (see Table 2, right panel). V1 and V2 had a significantly higher dform for the spiral stimuli than the tessellation stimuli (indicative of higher coding strength for orientation than for curvature), whereas LOT had a significantly higher dform for the tessellation stimuli than the spiral stimuli, with V4 and VOT trending toward a higher dform for the tessellation stimuli than the spiral stimuli (indicative of higher coding strength for curvature than for orientation). V4 and VOT trended toward a higher dcolor for the spiral stimuli than the tessellation stimuli, with no other significant or trending differences in dcolor between stimulus types. To more sensitively test for an overall difference (i.e., a main effect) in dform or dcolor between stimulus types when aggregating across ROIs, a mixed-effects analysis (analogous to a two-way repeated-measures ANOVA, but allowing for only partial overlap in participants between conditions) with region, stimulus type, and their interaction as factors was conducted separately for dform and dcolor. Whereas there was a significant main effect of stimulus type for dform (Z = 3.33, p = .001), there was no significant main effect of stimulus type for dcolor (Z = −.31, p = .76), consistent with comparable overall color information for the two types of stimuli when aggregating across ROIs. Overall, we found that significant color and form information were present in all ROIs we examined, with the strength of form information showing a significant or trending difference between the spiral and tessellation stimuli in nearly every ROI we examined, consistent with stronger tuning preference for orientation in V1 and V2, and stronger tuning preference for curvature in LOT, VOT, and V4. Meanwhile, as the same color information was present in both types of stimuli, the coding strength of color information was overall comparable between the two stimulus types across the ROIs.
Region | Spirals: dcolor | dform | Form Dominance | Tessellations: dcolor | dform | Form Dominance | Spirals vs. Tessellations: dcolor | dform | Form Dominance
---|---|---|---|---|---|---|---|---|---
V1 | .0088 | .023 | .46 | .0044 | .0045 | −.01 | .0044 | .019 | .47 |
3.44 / .006 | 5.84 / .0002 | .0008 | 2.73 / .01 | 2.22 / .02 | .91 | 1.44 / .17 | 4.18 / .0013 | .047 | |
** | *** | *** | * | * | ** | * | |||
V2 | .0064 | .022 | .55 | .0055 | .0094 | .28 | .0009 | .012 | .26 |
2.67 / .01 | 6.00 / .0002 | .0004 | 2.28 / .02 | 4.13 / .001 | .08 | .23 / .82 | 2.69 / .03 | .24 | |
* | *** | *** | * | ** | † | * | |||
V3 | .012 | .015 | .14 | .0056 | .014 | .42 | .006 | .001 | −.28 |
2.83 / .01 | 4.86 / .0005 | .48 | 2.58 / .012 | 5.25 / .0004 | .01 | 1.19 / .50 | .48 / .64 | .26 | |
* | *** | * | *** | * | |||||
V4 | .018 | .0062 | −.47 | .007 | .014 | .32 | .01 | −.007 | −.79 |
3.71 / .004 | 2.62 / .01 | .002 | 2.98 / .008 | 3.97 / .004 | .007 | 2.21 / .07 | −1.91 / .073 | <.0001 | |
** | * | ** | ** | ** | ** | † | † | *** | |
LOT | .016 | .011 | −.14 | .0045 | .023 | .68 | .011 | −.012 | −.82 |
2.62 / .02 | 3.30 / .007 | .58 | 2.02 / .03 | 5.08 / .0005 | <.0001 | 1.64 / .12 | −2.37 / .03 | .004 | |
* | ** | * | *** | *** | * | ** | |||
VOT | .016 | .006 | −.46 | .006 | .014 | .40 | .01 | −.007 | −.87 |
5.26 / .0005 | 2.39 / .02 | .007 | 1.86 / .04 | 3.43 / .005 | .07 | 2.04 / .057 | −1.76 / .096 | .002 |
*** | * | ** | * | ** | † | † | † | ** |
Within each cell, the first row is the mean of that value across participants, or across bootstrapped samples for the dominance index, the second row is the t statistic and p value for the corresponding significance test, or just the p value for the dominance index (because the p value is derived from the distribution of bootstrapped samples rather than from a t test, see Methods section), and the third row indicates the degree of statistical significance. Tests are corrected for multiple comparisons across the tests of a given type performed within each ROI (see Methods section).
* p < .05.
** p < .01.
*** p < .001.
† p < .1.
Next, several analyses were conducted to test how dform and dcolor vary over processing within each stimulus type. First, to test whether dform and dcolor differ among ROIs, one-way repeated-measures ANOVAs were run within each stimulus type separately for dform and dcolor. dform significantly varied across regions in both stimulus types (Fs > 8.08, ps < .001), whereas dcolor varied across regions for the spiral stimuli, F(5, 55) = 2.88, p = .02, but not the tessellation stimuli, F(5, 60) = 0.36, p = .87. To characterize these differences across regions in more detail, we examined responses within the ventral and lateral substreams separately. To test for an overall change in dform and dcolor over the course of processing, matched-pairs t tests were used to compare dform and dcolor between ROIs at the beginning and end of processing in each substream, and between LOT and VOT (see Table 3 for the detailed statistical results). Corrections for multiple comparisons were performed within tests of the same type within each pair of ROIs that were compared (i.e., four tests for comparing dform and dcolor for the two stimulus types within each ROI pair). dform significantly differed between V1 and LOT, between V1 and VOT, and between the mean of V1 and V2 and the mean of LOT and VOT, for both stimulus types; by contrast, dcolor only showed a trending difference in the spiral stimuli when comparing V1 and VOT. LOT and VOT showed a trending difference in dform for the tessellation stimuli, with no other significant or trending difference for this ROI pair. As a further analysis, we tested for an overall linear increase or decrease in dform and dcolor over processing, by correlating each of dform and dcolor for each ROI with that ROI's rank order in each respective substream (see Table 4 for the detailed statistical results). Correction for multiple comparisons was applied across four such tests conducted for each substream (two features by two stimulus types). dcolor only showed a significant increase for the ventral substream for the spiral stimuli, but not for the lateral substream, and not for the tessellation stimuli. By contrast, dform significantly increased over processing in the tessellation stimuli in both substreams, but in the spiral stimuli, dform significantly decreased over processing in the ventral substream, with a trending decrease in the lateral substream. Thus, we find that color information increases over processing in the ventral substream for the spiral, but not the tessellation stimuli, and form information increases over processing for the tessellation stimuli in both substreams, but decreases or tends to decrease over processing for the spiral stimuli in both substreams.
Regions Compared | Spirals: dcolor | dform | Form Dominance | Tessellations: dcolor | dform | Form Dominance
---|---|---|---|---|---|---
V1 vs. LOT | −.007 | .012 | .60 | −.0001 | −.018 | −.69 |
−1.04 / .43 | 2.79 / .04 | .03 | −.07 / .95 | −4.81 / .0016 | .008 | |
* | * | ** | ** | |||
V1 vs. VOT | −.007 | .017 | .93 | −.002 | −.009 | −.41 |
−2.01 / .09 | 3.46 / .01 | <.0001 | −.61 / .55 | −3.37 / .01 | .07 | |
† | * | *** | * | † | ||
V1/V2 vs. LOT/VOT | −.008 | .014 | .81 | −.0004 | −.011 | −.40 |
−1.87 / .12 | 3.62 / .008 | <.0001 | −.16 / .88 | −3.89 / .008 | .02 | |
** | *** | ** | * | |||
LOT vs. VOT | −.001 | .005 | .43 | −.0016 | .009 | .28 |
−.20 / .84 | 2.14 / .11 | .005 | −.72 / .65 | 2.76 / .068 | .12 | |
** | † |
Within each cell, the first row is the mean difference in the value between the ROIs (first ROI in the pair minus the second) across participants, or across bootstrapped samples for the dominance index, the second row is the t statistic and p value for the t test comparing the two ROIs, or the p value corresponding to the bootstrapped distribution of the difference in the case of the dominance index, and the third row indicates statistical significance. Tests are corrected for multiple comparisons across the tests of a given type performed within each pair of ROIs (see Methods section).
* p < .05.
** p < .01.
*** p < .001.
† p < .1.
System | Spirals: dcolor | dform | Form Dominance | Tessellations: dcolor | dform | Form Dominance
---|---|---|---|---|---|---
Brain, lateral | r = .17 | r = −.45 | r = −.84 | r = .008 | r = .69 | r = .93 |
p = .31 | p = .09 | p = .018 | p = .93 | p = .004 | p = .008 | |
ns | † | * | ns | ** | ** | |
Brain, ventral | r = .48 | r = −.61 | r = −.91 | r = .09 | r = .47 | r = .64 |
p = .04 | p = .012 | p < .001 | p = .57 | p = .008 | p = .065 | |
* | * | *** | ns | ** | † | |
AlexNet (ImageNet) | r = −.89 | r = −.98 | r = −.88 | r = −.94 | r = −.39 | r = −.79 |
p < .001 | p < .001 | p < .001 | p < .001 | p = .13 | p < .001 | |
*** | *** | *** | *** | ns | *** | |
AlexNet (Random) | r = −.86 | r = −.95 | r = −.94 | r = −.84 | r = −.94 | r = −.92 |
p < .001 | p < .001 | p < .001 | p < .001 | p < .001 | p < .001 | |
*** | *** | *** | *** | *** | *** | |
CORNet_S (ImageNet) | r = .05 | r = −.59 | r = −.55 | r = .20 | r = .28 | r = .22 |
p = .61 | p < .001 | p < .001 | p = .09 | p = .02 | p = .03 | |
ns | *** | *** | † | * | * | |
CORNet_S (Random) | r = −.93 | r = −.91 | r = −.25 | r = −.93 | r = −.92 | r = −.17 |
p < .001 | p < .001 | p = .04 | p < .001 | p < .001 | p = .11 | |
*** | *** | * | *** | *** | ns | |
GoogLeNet (ImageNet) | r = .06 | r = −.58 | r = −.80 | r = .64 | r = .81 | r = .93 |
p = .88 | p = .14 | p = .009 | p = .12 | p = .04 | p < .001 | |
ns | ns | ** | ns | * | *** | |
GoogLeNet (Random) | r = −.61 | r = −.84 | r = −.88 | r = −.58 | r = −.89 | r = −.94 |
p = .10 | p = .009 | p = .002 | p = .10 | p = .005 | p < .001 | |
ns | ** | ** | ns | ** | *** | |
VGG19 (ImageNet) | r = −.77 | r = −.81 | r = −.46 | r = −.57 | r = .26 | r = .64 |
p < .001 | p < .001 | p = .002 | p < .001 | p = .099 | p < .001 | |
*** | *** | ** | *** | † | *** | |
VGG19 (Random) | r = −.91 | r = −.56 | r = −.03 | r = −.94 | r = −.83 | r = −.63 |
p < .001 | p < .001 | p = .83 | p < .001 | p < .001 | p < .001 | |
*** | *** | ns | *** | *** | *** | |
ResNet-50 (ImageNet) | r = −.11 | r = −.44 | r = −.21 | r = −.05 | r = .03 | r = .06 |
p = .37 | p < .001 | p = .02 | p = .73 | p = .75 | p = .46 | |
ns | *** | * | ns | ns | ns | |
ResNet-50 (Stylized) | r = .28 | r = −.12 | r = −.21 | r = .47 | r = .16 | r = −.21 |
p < .001 | p = .14 | p = .008 | p < .001 | p = .06 | p = .008 | |
*** | ns | ** | *** | † | ** | |
ResNet-50 (Random) | r = −.91 | r = −.93 | r = −.47 | r = −.92 | r = −.92 | r = −.12 |
p < .001 | p < .001 | p < .001 | p < .001 | p < .001 | p = .12 | |
*** | *** | *** | *** | *** | ns |
Each cell gives the Pearson correlation coefficient between the corresponding measure (dcolor, dform, or form dominance) and the ordinal position of each brain region or CNN layer within the system, together with the p value of the correlation; symbols denote statistical significance. Tests are corrected for multiple comparisons across the tests of a given type performed within each substream or CNN (see Methods section).
*p < .05. **p < .01. ***p < .001. †p < .1.
We then examined how coding strength for color and form varies over processing in CNNs, first examining CNNs trained on ImageNet to recognize objects (Figure 3). To statistically test for linear increases or decreases in color and form coding strength over processing, dform and dcolor were correlated with the rank order of each sampled layer in each CNN (see Table 4 for the detailed statistical results); corrections for multiple comparisons were performed across the tests performed within each network (i.e., four tests, corresponding to each combination of feature and stimulus type). For the spiral stimuli differing based on their orientation, dcolor significantly decreased over processing in two of the networks, with the others showing no significant change, whereas dform significantly decreased over processing in four of the five CNNs trained on ImageNet. For the tessellation stimuli differing based on curvature, dcolor significantly decreased over processing in two networks, showed a trending increase in one network, and showed no significant or trending changes over processing in the other networks. For dform, two networks showed a significant increase over processing and one showed a trending increase. Thus, when directly examining how the absolute coding strength of color and form varies over processing in trained CNNs, the most reliable effect was that the coding strength of orientation tended to decrease over processing, with all networks except one showing a significant or trending decrease and none showing an increase.
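The layer-wise trend test itself is simple; the following sketch (with hypothetical values and variable names) shows how a coding-strength measure can be correlated with layer rank for one network.

```python
# Minimal sketch of the layer-rank trend analysis (hypothetical values).
# For a CNN, d_form (or d_color) is a single value per sampled layer, so the
# linear trend over processing is the Pearson correlation between the measure
# and each layer's rank order.
import numpy as np
from scipy.stats import pearsonr

d_form = np.array([0.21, 0.18, 0.15, 0.11, 0.08])  # hypothetical values, one per sampled layer
layer_rank = np.arange(1, len(d_form) + 1)         # 1, 2, ..., n_layers

r, p = pearsonr(layer_rank, d_form)
print(f"r = {r:.2f}, p = {p:.3f}")                 # negative r: coding strength decreases over layers
```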
Next, we tested the extent to which these results could have arisen from the intrinsic architecture of the networks, rather than from their training regime, by examining untrained networks with random weights. To minimize the effect of the particular random initialization on the results, we computed dform and dcolor separately for 10 random initializations of each network and averaged the results. The highly consistent pattern was that the untrained variants of the networks exhibited much lower values of dform and dcolor than their trained counterparts, with these values always decreasing and never increasing over the course of processing (the decrease was significant in every case except for dcolor in GoogLeNet in both experiments). Thus, the color and form information we observe in the trained networks indeed appears to arise from their training regime and not merely as a byproduct of their intrinsic architecture. We additionally examined a variant of ResNet-50 trained on a "stylized" version of ImageNet images that encouraged it to de-emphasize texture information and emphasize global shape information (Geirhos et al., 2019). This version of ResNet showed a significant increase in dcolor over processing for both the spiral and tessellation stimuli (in contrast with the ImageNet version, which showed a nonsignificant decrease in dcolor over processing in both cases), a numerically smaller decrease in dform over processing for the spiral stimuli than the ImageNet version, and a small but numerically larger increase in dform over processing for the tessellation stimuli relative to the ImageNet version.
Comparing the results for the brain and the CNNs, for the spiral stimuli differing based on orientation, both the lateral and ventral substreams of the brain and many of the CNNs show a similar decrease in form information over processing. The ventral substream shows a significant increase in dcolor over processing; interestingly, the only network showing a comparable increase in color information was the version of ResNet-50 trained on Stylized ImageNet. For the tessellation stimuli differing based on their level of curvature, neither substream showed a significant change in dcolor over processing. By contrast, both the ventral and lateral substreams show a significant increase in dform over processing, as do CORNet_S and GoogLeNet, with VGG19 and stylized ResNet-50 exhibiting trending increases. Thus, although CNNs trained to recognize objects generally capture the decreasing coding strength for spiral orientation observed in the ventral substream, fewer networks capture the increase in color coding strength for the spiral stimuli seen in the ventral substream, or the increase in form coding strength for the curvy and spiky tessellation patterns observed in both the ventral and lateral substreams.
Relative Coding Strength of Color and Form
The analyses performed so far examined how the coding strength of color and form varied over processing in isolation from the other feature, but given that information about both features exists at every stage of processing in the brain and CNNs, a further important question is how the relative coding strength of these features varies over processing. To quantify this, we devised a form dominance index (Figure 2B) that compares how strongly color and form influence the representational geometry at a given processing stage, and examined how this index varies across brain regions and CNN layers (see Methods section for more details). This index ranges from −1 to 1, where −1 implies that only color, but not form, affects the representational geometry in a region, 0 implies an equal contribution of the two features, and 1 implies that only form, but not color, affects the representational geometry. For the neural data, we employed a bootstrapping procedure in which dform and dcolor were averaged within each of 10,000 bootstrapped samples of the data, with the index then computed within each sample; this was done to improve the numerical stability of the index and to reduce the likelihood of cases where both dform and dcolor were negative (which would yield an uninterpretable index). To assess statistical significance for each test below, we used the inverted confidence interval method; that is, we assessed what proportion of the bootstrapped sample statistics are at least as extreme as the value under the null hypothesis (e.g., if the value of a given statistic under a two-sided null hypothesis is zero and 4% of the bootstrapped samples have a value less than zero, the two-tailed p value would be .08). By contrast, for the CNNs, the index could be computed directly from dform and dcolor because CNN responses contain no noise.
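The following sketch illustrates one way to compute the index and the bootstrap test against zero. The normalized-difference formula shown is one definition consistent with the description above (the exact definition is given in the Methods section), and the data and variable names are hypothetical.

```python
# Sketch of the form dominance index and the bootstrap test against zero.
# The normalized-difference form below is one formula consistent with the text:
# -1 when only color separates the patterns, 0 for equal contributions, and +1
# when only form does. All values and variable names are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

def form_dominance(d_form, d_color):
    return (d_form - d_color) / (d_form + d_color)

# Hypothetical per-participant distances for one ROI.
d_form_subj = np.array([0.012, 0.020, 0.015, 0.018, 0.010, 0.016])
d_color_subj = np.array([0.009, 0.014, 0.011, 0.013, 0.012, 0.010])

n_boot = 10_000
idx = rng.integers(0, len(d_form_subj), size=(n_boot, len(d_form_subj)))
# Average d_form and d_color within each bootstrap sample before taking the ratio,
# which stabilizes the index when single-participant values are small or negative.
boot_index = form_dominance(d_form_subj[idx].mean(axis=1),
                            d_color_subj[idx].mean(axis=1))

# Two-tailed bootstrap p value against a null value of zero.
p_two_tailed = 2 * min((boot_index <= 0).mean(), (boot_index >= 0).mean())
print(f"median index = {np.median(boot_index):.2f}, p = {p_two_tailed:.3f}")
```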
Figure 4 shows the values of the dominance index for the brain (separately for the lateral and ventral substreams) and for each of the CNNs we examined. To examine whether each brain region shows a preference for form versus color information, we tested whether the index differed significantly from zero in each region (see Table 2 for the detailed statistical results; corrected for multiple comparisons across the two tests—one per stimulus type—conducted for each ROI). For the spiral stimuli varying based on their orientation, V1 and V2 were significantly form dominant, V4 and VOT were color dominant, and V3 and LOT showed no significant preference in either direction. For the tessellation stimuli varying based on their curvature, V3, V4, and LOT were significantly form dominant, V2 and VOT were trending form dominant, and V1 showed no significant feature preference. Next, we tested whether each region's form dominance differed significantly between the spiral stimuli varying based on their orientation and the tessellation stimuli varying based on their curvature (Table 2, right column). V1 was significantly more form dominant for the spiral stimuli, whereas V4, LOT, and VOT were significantly more form dominant for the tessellation stimuli. V2 and V3 showed no difference in dominance between the two stimulus types.
We also characterized how form dominance changes over the course of processing in the lateral and ventral substreams. First, we assessed whether there was an overall linear increase or decrease in the dominance index over processing. Specifically, we computed the Pearson correlation coefficient between the dominance index for each ROI and each ROI's position in one of the two substreams within each bootstrapped sample, and tested whether this correlation coefficient significantly differs from zero (Table 4). Correction for multiple comparisons was performed across the two tests (one per stimulus type) performed for each substream. Form dominance significantly decreases over processing in both the lateral and ventral substreams for the spiral stimuli. For the tessellation stimuli, form dominance significantly increases in the lateral substream, with a trending increase in the ventral substream. To examine these broad changes in more detail, and compare the lateral and ventral substreams, we compared the dominance index between ROIs at the beginning and end of processing, and between LOT and VOT (see Table 3 for the detailed statistical results). For the spiral stimuli varying based on their orientation, V1 was significantly more form dominant than LOT and VOT, and the mean of V1/V2 was significantly more form dominant than the mean of LOT/VOT. In addition, LOT was significantly more form dominant than VOT. For the tessellation stimuli varying in their level of curvature, form dominance was significantly higher in LOT than in V1, trended higher in VOT than in V1, and was significantly higher in the mean of LOT/VOT than the mean of V1/V2. LOT and VOT did not significantly differ in their form dominance for the tessellation stimuli. In summary, then, both the lateral and ventral substreams show decreasing form dominance over processing for the spiral stimuli but increasing form dominance for the tessellation stimuli, with the lateral substream becoming significantly more form dominant than the ventral substream by the end of processing for the spiral stimuli, but not the tessellation stimuli.
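A minimal sketch of this bootstrapped trend test is shown below, again using the normalized-difference form of the index described above; the ROI ordering, participant count, and data are hypothetical placeholders.

```python
# Sketch of the bootstrapped trend test across ROIs (hypothetical data).
# Within each bootstrap sample, d_form and d_color are averaged across resampled
# participants for every ROI, the dominance index is computed per ROI, and the
# Pearson correlation of the index with ROI rank gives one sample of the trend.
import numpy as np

rng = np.random.default_rng(0)
n_subj, rois = 13, ["V1", "V2", "V3", "V4", "VOT"]      # hypothetical substream ordering

# Hypothetical (n_subj x n_roi) arrays of per-participant distances.
d_form_all = rng.uniform(0.005, 0.02, size=(n_subj, len(rois)))
d_color_all = rng.uniform(0.005, 0.02, size=(n_subj, len(rois)))

rank = np.arange(1, len(rois) + 1)
boot_r = np.empty(10_000)
for b in range(boot_r.size):
    sample = rng.integers(0, n_subj, size=n_subj)       # resample participants with replacement
    d_f = d_form_all[sample].mean(axis=0)               # per-ROI mean within this sample
    d_c = d_color_all[sample].mean(axis=0)
    index = (d_f - d_c) / (d_f + d_c)
    boot_r[b] = np.corrcoef(rank, index)[0, 1]

p = 2 * min((boot_r <= 0).mean(), (boot_r >= 0).mean()) # two-tailed bootstrap p against r = 0
print(f"median r = {np.median(boot_r):.2f}, p = {p:.3f}")
```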
We then examined how the dominance index changes over processing in CNNs. To test for an overall linear increase or decrease in the dominance index over processing, we computed the Pearson correlation coefficient between the dominance index for each CNN layer and its rank order within the CNN (correction for multiple comparisons performed across the two tests—one per stimulus type—performed for each CNN). Similar to the brain, all trained CNNs showed a significant linear decrease in form dominance over processing for the spiral stimuli varying based on their orientation. For the tessellation stimuli varying based on their level of curvature, four trained CNNs showed a significant increase in form dominance over processing, like the brain; the version of ResNet-50 trained on ImageNet showed no change, and the version of ResNet-50 trained on Stylized ImageNet showed a significant decrease in form dominance over processing. To test whether this difference between the two training regimes of ResNet-50 was significant, we conducted a significance test for the equality of two correlation coefficients (Diedenhofen & Musch, 2015) for both the spiral and tessellation stimuli. The spiral stimuli showed no difference in the correlation coefficients between the two versions of ResNet-50 (Z = .078, p = .94), but the tessellation stimuli showed a significant difference between the two correlations (Z = 2.41, p = .02). For the random networks, the index was negative for every layer of every CNN and generally grew even more negative (i.e., more color dominant) over the course of processing for most of the networks for both the spiral and tessellation stimuli, with the exceptions that randomly initialized CORNet_S showed no significant change over processing for the tessellation stimuli, randomly initialized VGG19 showed no significant change for the spiral stimuli, and randomly initialized ResNet-50 showed no significant change for the tessellation stimuli. Thus, the networks trained to recognize objects on standard ImageNet images, but not the random networks or the version of ResNet-50 trained on stylized ImageNet images, were mostly similar to the brain with respect to whether the dominance index increases or decreases over processing.
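The sketch below shows the standard Fisher r-to-z test for two independent correlations, a test of the kind implemented in the cocor procedures of Diedenhofen and Musch (2015). The correlation values and the number of layers used here are hypothetical, so the output will not reproduce the statistics reported above.

```python
# Sketch of a test for the equality of two independent correlations using
# Fisher's r-to-z transformation (hypothetical inputs; the actual analysis
# followed Diedenhofen & Musch, 2015).
import numpy as np
from scipy.stats import norm

def compare_correlations(r1, n1, r2, n2):
    z1, z2 = np.arctanh(r1), np.arctanh(r2)   # Fisher r-to-z transform
    se = np.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    return z, 2 * norm.sf(abs(z))             # two-tailed p value

# e.g., layer-rank correlations of form dominance for two training regimes,
# each computed over a hypothetical number of sampled layers.
z, p = compare_correlations(r1=0.10, n1=20, r2=-0.60, n2=20)
print(f"Z = {z:.2f}, p = {p:.3f}")
```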
To assess the extent to which the feature tuning we observed in the trained CNNs may have been affected by the particular random seed used to initialize their weights, we computed the absolute and relative feature coding strength for 10 additional instances of AlexNet trained on object recognition (weights taken from open-source data provided by Mehrer et al., 2020). Figure 5 shows results for each individual new instance, for the average across the 10 new instances, and for the instance of AlexNet used in earlier analyses for comparison purposes. Results were strikingly consistent across all 10 instances, suggesting that different initializations of a network can nonetheless converge on final weights that yield similar feature tuning strengths.
Finally, we tested whether the brain differs significantly from the CNNs in the specific value of the form dominance index at corresponding stages of processing (Figure 6); this comparison is possible because the dominance index is a unitless measure that can be compared across different systems. To perform this test, we used spline interpolation to resample the dominance index values from the layers of each CNN so as to match the number of brain regions in either the lateral or ventral substream. We then tested for each brain region whether its dominance index differed significantly from that of each CNN at the corresponding stage of processing, with correction for multiple comparisons applied across the six networks compared with each brain region. CNN layers showing a significant difference are indicated in Figure 6 with a bold outline around the dot for that layer. For the most part, the values did not significantly differ between each brain region and the corresponding resampled CNN layers, with several notable exceptions: at the final stage of processing, LOT showed significantly higher form dominance for the tessellation stimuli varying based on their curvature than every network except VGG19; V4 showed significantly greater color dominance than every network for the spiral stimuli; and both trained versions of ResNet-50 were significantly more color dominant than the brain at multiple stages of processing for the tessellation stimuli. Overall, the CNN index values appear to be more aligned with those of both the lateral and ventral substreams for the spiral stimuli varying based on orientation than for the tessellation stimuli varying based on curvature.
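The resampling step can be implemented with standard spline interpolation, as in the sketch below; the index values, the number of ROIs, and the use of a cubic spline are hypothetical placeholders (the settings actually used are given in the Methods section).

```python
# Sketch of resampling a CNN's layer-wise dominance index to match the number of
# brain ROIs in a substream, using spline interpolation (hypothetical values).
import numpy as np
from scipy.interpolate import make_interp_spline

cnn_index = np.array([0.35, 0.22, 0.10, -0.05, -0.12, -0.20, -0.28, -0.31])  # one value per sampled layer
n_rois = 5                                                                   # e.g., ROIs in one substream

layer_pos = np.linspace(0.0, 1.0, len(cnn_index))       # normalized depth of each CNN layer
roi_pos = np.linspace(0.0, 1.0, n_rois)                  # normalized depth of each brain ROI

spline = make_interp_spline(layer_pos, cnn_index, k=3)   # cubic spline through the layer values
cnn_index_resampled = spline(roi_pos)                    # one interpolated value per brain ROI
print(np.round(cnn_index_resampled, 2))
```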
In summary, CNNs trained on ImageNet images generally have a similar response profile to the brain in terms of how the dominance index increases or decreases over processing, with the index decreasing for the spiral stimuli, indicative of color being increasingly emphasized relative to orientation, but increasing for the tessellation stimuli, indicative of curvature being increasingly emphasized relative to color, with generally similar magnitudes of the index at corresponding stages of processing.
DISCUSSION
In this study, we examined how the absolute and relative coding strength for color and form vary across regions in the ventral visual pathway (which we subdivided into ventral and lateral substreams) and how they vary across layers in CNNs with various architectures and training regimes, using two different stimulus sets: spirals that vary with respect to a simple form feature, orientation, and tessellations that vary with respect to a mid-level form feature, curvature. To examine absolute coding strength, we used normalized Euclidean distances, and to examine relative coding strength, we devised a novel form dominance index that quantitatively compares how strongly two features affect the representational geometry in a brain region or CNN layer. These metrics can be validly compared between brain regions and CNN layers, enabling a detailed investigation of how absolute and relative coding strength for these features vary both across processing within a given system, and across systems.
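As a rough illustration of the distance measure (the exact normalization used in this study is described in the Methods section), one common convention scales the Euclidean distance by the square root of the pattern dimensionality so that values are comparable across regions and layers of different sizes; the sketch below uses that convention with hypothetical patterns.

```python
# Sketch of one common definition of a normalized Euclidean distance between two
# condition patterns (the normalization actually used in this study is given in
# the Methods section; this version simply scales by pattern dimensionality).
import numpy as np

def normalized_euclidean(pattern_a, pattern_b):
    diff = np.asarray(pattern_a) - np.asarray(pattern_b)
    return np.sqrt(np.mean(diff ** 2))   # Euclidean distance / sqrt(n_units)

# Hypothetical voxel (or unit) patterns for two colors of the same form.
red_pattern = np.array([0.8, 1.2, -0.3, 0.5])
green_pattern = np.array([0.6, 1.0, -0.1, 0.9])
d_color = normalized_euclidean(red_pattern, green_pattern)
print(round(d_color, 3))
```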
In the human brain, we found that every region we examined had significant information about each feature for both the spiral stimuli varying based on their orientation and the tessellation stimuli varying based on their curvature, replicating results from past work that measured color and form information using decoding accuracy (Taylor & Xu, 2022). Because we used normalized Euclidean distance as a distance measure, which is metric-scale (unlike decoding accuracy reported in Taylor & Xu, 2022), we could further examine how coding strength for each feature changes over processing. We found that whereas the absolute coding strength for orientation exhibits either a significant or trending decrease over processing in both the lateral and ventral substreams, the absolute coding strength for curvature significantly increases over processing in both substreams. The absolute coding strength for color showed a significant increase for the spiral stimuli in the ventral substream, with no other significant changes in the coding strength of color over processing. LOT and VOT were comparable in their coding strength for both form and color for both stimulus types, although LOT exhibited a trending advantage in coding strength for form relative to VOT for the tessellation stimuli. Quantifying color and form information with a metric-scale method thus usefully illuminates how the strength of feature coding varies over the lateral and ventral substreams: coding for orientation decreases, coding for curvature increases, and coding for color stays roughly constant.
The form dominance index devised in this study revealed that for the spiral stimuli varying based on orientation, form becomes progressively less dominant than color over the course of processing in both the lateral and ventral substreams; by contrast, for the tessellation stimuli, form becomes progressively more dominant than color in the lateral substream, with a corresponding trend for the ventral substream. Moreover, LOT is significantly more form-dominant than VOT for the spiral stimuli but not the tessellation stimuli, suggesting an important representational difference across form-selective occipitotemporal cortex. Past work has found that VOT might be more tolerant to mirror-image transformations of stimuli than LOT (Dilks, Julian, Kubilius, Spelke, & Kanwisher, 2011); given that the two spiral stimuli are mirror-symmetrical, this could potentially account for the difference we observed here. Finally, whereas V1 is significantly more form-dominant for the spiral stimuli than for the tessellation stimuli, V4, LOT, and VOT are significantly more form dominant for the tessellation stimuli, consistent with V1's strong tuning to orientation, and the tuning of the latter regions to mid-to-high level form features. Thus, the lateral and ventral substreams we examined are broadly similar in terms of their relative coding strength for color and form, where both substreams increasingly de-emphasize orientation relative to color but increasingly emphasize curvature relative to color, with the only difference being that LOT showed higher form dominance than VOT when orientation was the manipulated form feature.
Examining the absolute coding strength of color and form in CNNs revealed a heterogeneous profile of results: All but one of the ImageNet-trained CNNs showed a significant decrease in orientation information over the course of processing, similar to the lateral and ventral pathways, but no such reliable similarity was found for curvature coding, with curvature coding increasing in some networks and showing no change in others. For color processing, whereas the human lateral and ventral pathways showed either no significant change or an increase in color information (in the case of the spiral stimuli in the ventral pathway), CNNs varied widely: Several showed a significant decrease in color coding strength over processing, some showed no significant change, and one network (the version of ResNet-50 trained on stylized ImageNet) showed a significant increase in color coding strength over processing. This last effect is somewhat difficult to explain: On the one hand, the retextured images in the stylized version of ImageNet no longer have color as a diagnostic category feature, such that one might have predicted the network to have reduced color tuning, but on the other hand, the introduction of colored textures to many of the images may have made it more important to identify color-defined contours; either way, however, this distinctive result illustrates the importance of examining networks with different training regimes. Overall, in examining how the absolute coding strength of color and form features varies over processing, ImageNet-trained CNNs tend to capture the decrease in orientation coding strength seen in the brain, less reliably capture the increase in curvature coding strength, and sometimes exhibit a decrease in color coding strength that is not observed in the brain.
Despite these differences, examining color and form coding in terms of their relative strength revealed more consistent patterns between the brain and CNNs and among the different CNNs. First, all trained CNNs became significantly less form dominant over processing for the spiral stimuli varying based on their orientation, similar to both the lateral and ventral pathways of the brain. Second, four of the five ImageNet-trained CNNs became significantly more form dominant over processing for the tessellation stimuli varying based on their curvature, once again similar to both the lateral and ventral pathways. Thus, although the absolute coding profiles for color and form features varied between the brain and CNNs and among the different CNN architectures, the relative coding profiles were strikingly similar. Similar results were obtained in another recent study in which the coding of object identity and configuration information during scene processing was examined in four different CNNs (Tang, Chin, Chun, & Xu, 2022). In that study, despite differences in the CNNs' absolute coding strength for object identity and configuration information in a scene, the relative coding strength of these two features followed a very similar profile across the different CNNs. Third, most CNNs were comparable to the brain not just in terms of whether the dominance index increased or decreased over processing, but also in terms of the specific values of the index at corresponding stages of processing in both the ventral and lateral substreams. It is worth noting that LOT was significantly more form dominant for the tessellation stimuli than every network except VGG19 at its final stage of processing, likely because of strong tuning for curvature in this region that does not appear to emerge automatically in CNNs trained for object recognition. In addition, V4 was significantly more color dominant than all of the CNNs at the corresponding stage of processing for the spiral stimuli, reflecting its strong tuning for color.
We note that unlike the other networks we examined, neither trained variant of ResNet-50 showed an increase in form dominance over processing for the tessellation stimuli varying based on their curvature: the ImageNet-trained version showed no significant change, and the version trained on Stylized ImageNet uniquely showed a significant decrease in form dominance over processing, with the two networks showing significantly different linear trends in this respect. This decrease in form dominance plausibly arises from the stylized version of ResNet-50 showing a slight increase in form information over processing but a much larger increase in color information, as discussed earlier. Another interesting difference is that, for the trained variants of ResNet-50, the absolute and relative feature coding strengths vary much more widely over processing than in the other networks. Although it is unclear why this is the case, we note that ResNet-50's architecture was unique among the networks we examined in that it contains skip connections, so an interesting question for further work would be to clarify how the topology of a network affects its representations.
To examine the extent to which the CNN results arise from their intrinsic architecture versus their training, we also examined untrained CNN variants with random weights. All such networks showed low levels of both color and form information for both stimulus types, and these levels tended only to decrease further over processing, demonstrating that the color and form information we observed in the trained networks emerged as a consequence of their training regime and not merely as a byproduct of their architecture. The random networks also exhibited a negative (color-biased) form dominance index at every stage of processing, a bias that tended only to increase over the course of processing. One possible explanation is that a small amount of "color" tuning could arise from incidental differences in the weights associated with the R, G, and B channels of the input image, whereas meaningful tuning for form features is far less likely to emerge with purely random weights. An interesting question for further research is the extent to which the feature tuning that emerged with training was due to the specific task of object recognition versus the particular stimulus set used for training; this could be examined using data sets such as the Taskonomy data set (Zamir et al., 2018), which varies a network's training task. That said, the version of ResNet-50 trained on a "stylized" version of ImageNet (with retextured images) showed decreasing form dominance over processing for the tessellation stimuli that was not observed in the regular ImageNet-trained CNNs, suggesting that the training data set can indeed influence the feature tuning that emerges with training, rather than similar feature tuning emerging across any kind of visual task.
Because the representations in a network can vary based on not only the network's architecture and training regime but also on the particular random seed used to initialize its weights (Mehrer et al., 2020), we also examined 10 additional instances of AlexNet trained for object recognition starting from different random seeds. We found that both the absolute and relative coding strength of color and form are remarkably consistent across these seeds across all network layers—in stark contrast to the completely different tuning profiles found when comparing the trained versus untrained networks—suggesting that training networks of the same architecture on the same task can induce similar profiles of feature tuning strength despite different starting weights.
Overall, our results reveal a striking and consistent correspondence between trained CNNs and the brain in terms of how their relative emphasis of color and form information changes over processing, which is not present when examining the absolute coding strength for these features: For both the brain and trained CNNs, form information is generally increasingly de-emphasized relative to color information for stimuli varying based on orientation, but increasingly emphasized relative to color information for stimuli varying based on curvature. These changes in relative feature coding strength occur across nearly every network architecture we examined and across 10 additional initializations of AlexNet trained on object recognition from different random seeds, suggesting that this pattern is not merely a chance byproduct of the particular random initializations used during training. This convergence raises the question of whether this property of the multifeature representational geometry might play some computational role, such as facilitating readout of one feature across variation in the other; future work could examine whether form dominance correlates with a network's object recognition performance, or whether encouraging a network to emphasize certain features relative to others in its loss function might improve its performance. We note that these results are agnostic to the question of how color might affect shape processing and vice versa, because they could arise from a neural population in which color-tuned and shape-tuned neurons are intermingled but non-interacting, or from a population in which individual neurons are jointly tuned to both features, as long as both cases give rise to the same representational distances.
Although we examined color and form coding in this study, with two values of each feature, the form dominance index can in principle be applied to any pair of features that vary in a factorial design within a stimulus type. Given the extensive evidence that visual brain regions and CNN layers often contain population codes that multiplex multiple features (Tang et al., 2022; Taylor & Xu, 2021, 2022; Xu & Vaziri-Pashkam, 2021b; Chang, Bao, & Tsao, 2017; Hong, Yamins, Majaj, & DiCarlo, 2016), it may be a useful tool for characterizing how these multifeature population codes vary across regions and systems. An important consideration when interpreting the form dominance index is that it intrinsically depends on the range of variation sampled within each feature: For example, if two highly similar orientations (rather than the orthogonal orientations used in our study) were presented, the form-based dissimilarity would decrease, thereby increasing the measured relative dominance of color; the opposite would occur if two highly similar colors (e.g., two slightly different shades of red) were used. Our stimuli were therefore chosen to maximize variation in the features tested: First, the two spiral orientations were orthogonal to each other; second, the tessellation stimuli contained either entirely straight or entirely curved contours; and third, the red and green colors used are opposites on the color wheel, thus greatly varying the parameter of hue. We thus believe our stimuli reasonably sampled the full range of values for each feature. We note as well that this consideration is not unique to this particular metric: Any attempt to compare how variation in two different features affects neural responses must ensure that the two features vary in a comparable way. That being said, given a set of chosen feature values, valid comparisons can still be made across brain regions and CNN layers.
In conclusion, despite differences in how the absolute coding strength of color and form features may vary over the course of visual processing in the human brain and CNNs, our approach identifies an important similarity in how the brain and CNNs process color and form: Both systems increasingly emphasize color relative to a simple form feature, orientation, but increasingly de-emphasize color relative to a more complex form feature, curvature.
Reprint requests should be sent to JohnMark Taylor, Zuckerman Institute, Columbia University, 3227 Broadway, New York, NY 10027, or via e-mail: [email protected].
Data Availability Statement
We make our data (specifically, the beta values from each ROI for each run, subject, and experiment for the fMRI data, and all saved CNN activations for the CNN data) freely available via the Open Science Framework at https://osf.io/ebktz/.
Author Contributions
JohnMark Taylor: Conceptualization; Data curation; Formal analysis; Methodology; Software; Visualization; Writing—Original draft; Writing—Review & editing. Yaoda Xu: Conceptualization; Funding acquisition; Resources; Supervision; Writing—Original draft; Writing—Review & editing.
Funding Information
JohnMark Taylor, National Science Foundation (https://dx.doi.org/10.13039/100000001), grant number: DGE1745303. JohnMark Taylor, National Institutes of Health, grant number: 1F32EY033654. Yaoda Xu, National Institutes of Health, grant number: 1R01EY022355. Yaoda Xu, National Institutes of Health, grant number: 1R01EY030854. National Institutes of Health, grant number: S10OD020039.
Diversity in Citation Practices
Retrospective analysis of the citations in every article published in this journal from 2010 to 2021 reveals a persistent pattern of gender imbalance: Although the proportions of authorship teams (categorized by estimated gender identification of first author/last author) publishing in the Journal of Cognitive Neuroscience (JoCN) during this period were M(an)/M = .407, W(oman)/M = .32, M/W = .115, and W/W = .159, the comparable proportions for the articles that these authorship teams cited were M/M = .549, W/M = .257, M/W = .109, and W/W = .085 (Postle and Fulvio, JoCN, 34:1, pp. 1–3). Consequently, JoCN encourages all authors to consider gender balance explicitly when selecting which articles to cite and gives them the opportunity to report their article's gender citation balance. The authors of this article report its proportions of citations by gender category to be as follows: M/M = .588; W/M = .147; M/W = .118; W/W = .147.