Animacy and real-world size are properties that describe any object and thus bring basic order into our perception of the visual world. Here, we investigated how the human brain processes real-world size and animacy. For this, we applied representational similarity analysis to fMRI and MEG data to yield a view of brain activity with high spatial and temporal resolutions, respectively. Analysis of fMRI data revealed that a distributed and partly overlapping set of cortical regions extending from occipital to ventral and medial temporal cortex represented animacy and real-world size. Within this set, parahippocampal cortex stood out as the region representing animacy and size more strongly than most other regions. Further analysis of the detailed representational format revealed differences among regions involved in processing animacy. Analysis of MEG data revealed overlapping temporal dynamics of animacy and real-world size processing starting at around 150 msec and provided the first neuromagnetic signature of real-world object size processing. Finally, to investigate the neural dynamics of size and animacy processing simultaneously in space and time, we combined MEG and fMRI with a novel extension of MEG–fMRI fusion by representational similarity. This analysis revealed partly overlapping and distributed spatiotemporal dynamics, with parahippocampal cortex singled out as a region that represented size and animacy persistently when other regions did not. Furthermore, the analysis highlighted the role of early visual cortex in representing real-world size. A control analysis revealed that the neural dynamics of processing animacy and size were distinct from the neural dynamics of processing low-level visual features. Together, our results provide a detailed spatiotemporal view of animacy and size processing in the human brain.
Object properties that are universally applicable to any object are of primary importance to our understanding of the visual world by providing basic ordering and categorization to our perceptions. Animacy and real-world size are two such universal properties that are pertinent to perception and of high behavioral relevance (Konkle & Oliva, 2011; Chao, Haxby, & Martin, 1999). Every object has a physical size, from small scales, such as a strawberry we can pick up with our fingers, to large scales, such as a building that we can use as a landmark (Konkle & Oliva, 2011, 2012; Mullally & Maguire, 2011). Similarly, every object is either animate or inanimate (Shutts, Markson, & Spelke, 2009). Animacy and real-world size are independent of each other, in that animate objects can range from small to large and so can inanimate objects.
Previous research has shown that animacy and object size processing is carried out in a distributed and partly overlapping network of brain regions in the occipital, temporal, and parietal lobes (animacy: Sha et al., 2015; Bell, Hadj-Bouziane, Frihauf, Tootell, & Ungerleider, 2009; Mahon, Anzellotti, Schwarzbach, Zampini, & Caramazza, 2009; Wiggett, Pritchard, & Downing, 2009; Kriegeskorte, Mur, & Bandettini, 2008; Kriegeskorte, Mur, Ruff, et al., 2008; Downing, Chan, Peelen, Dodds, & Kanwisher, 2006; Chao et al., 1999; size: Konkle & Caramazza, 2013, 2017; Fabbri, Stubbs, Cusack, & Culham, 2016; Bainbridge & Oliva, 2015; Konkle & Oliva, 2012; Murray, Boyaci, & Kersten, 2006). Some previous studies have also looked at the temporal dynamics of animacy processing (Clarke, Devereux, Randall, & Tyler, 2015; Carlson, Simmons, Kriegeskorte, & Slevc, 2014; Cichy, Pantazis, & Oliva, 2014; Carlson, Tovar, Alink, & Kriegeskorte, 2013; Liu, Agam, Madsen, & Kreiman, 2009; Thorpe, Fize, & Marlot, 1996), showing that the information related to animacy is processed rapidly within the first 100 msec after stimulus onset.
However, crucial questions about the spatial and temporal neural dynamics underlying animacy and real-world size processing remain unanswered. Concerning the spatial dynamics, it remains unknown how exactly animacy and size information are represented. For example, do all regions that process animacy use the same representational format, or does the representational format differ across regions? Similarly, it remains unknown how animacy and size information interact and are integrated across the overlapping set of regions: Do all regions process animacy and size in parallel in the same fashion, or does a single region stand out? Concerning the temporal dynamics, it remains unknown how animacy and real-world size processing is orchestrated in time. Given that real-world size and animacy are independent of each other, one might expect them to be processed independently in the brain, engaging different cortical sites at different moments. Alternatively, given the universality and importance of those two properties, the brain might use a distributed code where real-world size and animacy are processed in parallel and redundantly in the same cortical sites and with similar temporal dynamics.
To reveal the processing of animacy and real-world size in the human brain, we recorded brain data with fMRI and MEG while participants viewed a set of object images that differed in animacy and real-world size independently. To reveal the spatial dynamics, we applied representational similarity analysis (RSA; Nili et al., 2014; Kriegeskorte & Kievit, 2013; Kriegeskorte, Mur, & Bandettini, 2008; Kriegeskorte, Mur, Ruff, et al., 2008) to fMRI data to investigate the loci of animacy and size processing. To reveal the temporal dynamics, we applied multivariate pattern classification and RSA to MEG data in a time-resolved fashion (Cichy et al., 2014; Carlson et al., 2013). Finally, we combined MEG and fMRI data using a novel extension of MEG–fMRI fusion by representational similarity (Cichy, Pantazis, & Oliva, 2016; Cichy et al., 2014) to reveal the neural dynamics specific to the processing of animacy and size resolved simultaneously in space and time.
The MEG and fMRI data have been fully described in a previous publication (Cichy, Pantazis, et al., 2016) and are available upon request. Here, we repeat the pieces of information that are necessary for reproduction.
Fifteen healthy volunteers (five women; age: mean = 26.60 years, SD = 5.18 years) participated in this experiment. All participants were right-handed with normal or corrected-to-normal vision and provided written consent. The studies were conducted in accordance with the Declaration of Helsinki and approved by the local ethics committee (institutional review board of the Massachusetts Institute of Technology).
Stimulus Set and Stimulus Presentation Parameters
The stimulus set contained 118 real-world object images on real backgrounds. The full image set is available at http://userpage.fu-berlin.de/rmcichy/Khaligh_Razavi_et_al_2018JoCN/118_visual_stimuli.mat. The image set contained 27 animate objects (nine small, three medium, and 15 large) and 91 inanimate objects (33 small, 32 medium, and 26 large). For examples, see Figure 1A. For both MEG and fMRI recordings, images were presented at the center of a screen at 4.0° visual angle overlaid with a gray fixation cross. The presentation duration was 500 msec per image.
fMRI Data Acquisition and Preprocessing
We used the fMRI data as reported in Cichy, Pantazis, et al. (2016), Experiment 2. The experiment consisted of two sessions of 9–11 runs of 486 sec in duration each. In each run, every image was presented once. Image order was randomized with the restriction that the same condition was not presented on consecutive trials. Twenty-five percent of the trials were null trials during which only a gray background was presented and the fixation cross changed luminance for 100 msec. Participants were instructed to respond to the change in luminance with a button press. SOA was 3 sec, or 6 sec with a preceding null trial.
MRI data were acquired on a 3-T Trio scanner (Siemens) with a 32-channel head coil. Structural images were obtained using a standard T1-weighted sequence (192 sagittal slices, field of view [FOV] = 256 mm², repetition time = 1900 msec, echo time = 2.52 msec, flip angle = 9°). Functional data were obtained with the following parameters: gradient-echo EPI sequence: repetition time = 750 msec, echo time = 30 msec, flip angle = 61°, FOV read = 192 mm, FOV phase = 100% with a partial fraction of 6/8, through-plane acceleration factor of 3, bandwidth = 1816 Hz/Px, resolution = 3 mm³, slice gap = 20%, slices = 33, and ascending acquisition.
The fMRI data were preprocessed using SPM8 (www.fil.ion.ucl.ac.uk/spm/). For each participant, fMRI data from both sessions were realigned and coregistered to the T1 structural scan acquired in the first MRI session. Then, MRI data were normalized to the standard Montreal Neurological Institute template. A general linear model (GLM) was used to estimate the fMRI response to the 118 image conditions. Image onsets and durations entered the GLM as regressors and were convolved with a hemodynamic response function. Movement parameters were included as nuisance parameters. Additional regressors modeling the two sessions were included in the GLM. The estimated condition-specific GLM parameters were converted into t values by contrasting each condition estimate against the implicitly modeled baseline. The t values were used in a later analysis for constructing fMRI dissimilarity matrices.
MEG Data Acquisition and Preprocessing
Participants completed one session of 15 runs of 314 sec in duration each. In each run, every image was shown twice, and the sequence of image presentations was randomized. Trial onset asynchrony varied between 0.9 and 1 sec. Every three to five trials, an image of a paper clip was shown to which participants were asked to respond with an eye blink and a button press. Participants were instructed to not blink at other times.
The MEG signals were acquired from 306 channels (204 planar gradiometers, 102 magnetometers; Elekta Neuromag TRIUX, Elekta) at a sampling rate of 1 kHz, filtered between 0.03 and 330 Hz.
To denoise data, temporal signal space separation (maxfilter software, Elekta; Taulu & Simola, 2006; Taulu, Kajola, & Simola, 2004) was applied before further analyzing data with Brainstorm (Tadel, Baillet, Mosher, Pantazis, & Leahy, 2011). In detail, for each trial, we extracted peristimulus data from −100 to +700 msec, removed the prestimulus baseline mean from the signals, and smoothed data with a 20-msec sliding window.
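The epoch-level steps above (baseline removal and temporal smoothing) can be illustrated with a minimal NumPy sketch. It is not the Brainstorm pipeline used in the study: the function name `preprocess_trial`, the channels × samples array layout, and the choice of a centered moving average for the 20-msec window are assumptions made for illustration.

```python
import numpy as np

def preprocess_trial(trial, times_ms, window_ms=20):
    """Baseline-correct and smooth one epoched trial (channels x samples).

    Assumes 1 kHz sampling, so that one sample corresponds to 1 msec.
    """
    trial = np.asarray(trial, dtype=float)
    # Remove the mean of the prestimulus interval (t < 0) from every channel.
    baseline = trial[:, times_ms < 0].mean(axis=1, keepdims=True)
    corrected = trial - baseline
    # Smooth each channel with a centered moving average (assumed window shape).
    kernel = np.ones(window_ms) / window_ms
    return np.vstack([np.convolve(ch, kernel, mode="same") for ch in corrected])

# Toy epoch from -100 to +700 msec: 3 channels carrying a constant offset of 5.
times_ms = np.arange(-100, 701)
toy_trial = np.full((3, times_ms.size), 5.0)
clean = preprocess_trial(toy_trial, times_ms)
```

In this toy case the constant offset is removed entirely by the baseline correction, so the smoothed output is zero everywhere.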
Definition of fMRI ROIs
We parcellated the whole cortex into 60 nonoverlapping ROIs in each hemisphere. ROIs were defined using two probabilistic atlases of the human brain (Wang, Mruczek, Arcaro, & Kastner, 2015; Tzourio-Mazoyer et al., 2002). First, we used the retinotopically defined ROIs from the Wang et al. (2015) probabilistic atlas. As this atlas does not cover the whole cortex, for the remaining cortex, we used the anatomically defined ROIs of the automated anatomical labeling toolbox (Tzourio-Mazoyer et al., 2002). To create nonoverlapping ROIs from probabilistic maps, we assigned each voxel to the ROI of highest probability; the aggregate probability over all ROIs was ≥33%.
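The voxel assignment rule can be sketched as follows. This is an illustrative reading rather than the original implementation: treating the ≥33% aggregate probability as an inclusion threshold, the function name `assign_voxels`, and the use of −1 for unassigned voxels are all assumptions.

```python
import numpy as np

def assign_voxels(prob_maps, min_total=0.33):
    """Return one ROI index per voxel, or -1 where no label is assigned.

    prob_maps: array (n_rois, n_voxels) of per-ROI probabilities in [0, 1].
    """
    prob_maps = np.asarray(prob_maps, dtype=float)
    labels = prob_maps.argmax(axis=0)            # ROI of highest probability
    enough = prob_maps.sum(axis=0) >= min_total  # aggregate probability >= 33%
    return np.where(enough, labels, -1)

# Two toy ROIs and three toy voxels.
toy_maps = np.array([[0.50, 0.10, 0.20],
                     [0.30, 0.10, 0.40]])
labels = assign_voxels(toy_maps)
```

Here the middle voxel stays unassigned because its summed probability (0.2) falls below the assumed threshold.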
All analyses and statistics were conducted for all ROIs, but for ease of visualization, the results are shown only for those ROIs that had any significant effect after accounting for multiple comparisons across all tests. From the ROIs discussed in the article, the following ROIs were defined using the Wang et al. (2015) atlas: early visual cortex (EVC), lateral occipital cortex (LO), temporal occipital cortex (TO), ventral occipital cortex (VO), parahippocampal cortex (PHC), intraparietal sulcus (IPS), and superior parietal lobe (SPL). EVC was defined as the combination of V1, V2, and V3 masks. We combined these three regions because visual stimuli were presented at 4° visual angle, which is an eccentricity at which the distinction between V1, V2, and V3 is difficult as they are located close together in the foveal confluence. IPS and SPL corresponded to IPS0 and SPL1 in the nomenclature of Wang et al.'s (2015) atlas. LO, TO, VO, and PHC were combinations of masks LO1 and LO2, TO1 and TO2, VO1 and VO2, and PHC1 and PHC2, respectively. Fusiform gyrus (Fusi), inferior temporal gyrus (ITG), and middle temporal (MT) were defined using the automated anatomical labeling atlas (Tzourio-Mazoyer et al., 2002).
RSA enables relating representations obtained from different modalities such as computational models, MEG, and fMRI data (Kietzmann, Gert, Tong, & König, 2017; Cichy, Khosla, Pantazis, Torralba, & Oliva, 2016; Cichy, Pantazis, et al., 2016; Cichy et al., 2014; Khaligh-Razavi & Kriegeskorte, 2014; Nili et al., 2014; Kriegeskorte & Kievit, 2013; Kriegeskorte, Mur, & Bandettini, 2008; Kriegeskorte, Mur, Ruff, et al., 2008). The basic analytical tool of RSA is the representational dissimilarity matrix (RDM). An RDM is a square symmetric matrix, defined in rows and columns by experimental conditions, in which off-diagonal elements indicate the dissimilarity between the activation patterns associated with two different conditions. The diagonal of the RDM reflects comparisons between identical conditions and is thus set to 0 and excluded from the analysis. An RDM thus summarizes the representational geometry in terms of dissimilarity between condition-specific activation patterns.
Definition of RDMs
The analysis was conducted separately for each participant. For each time point t of the peristimulus epoch, we computed the pairwise dissimilarity of MEG data for all 118 conditions (images) to construct a 118 × 118 RDM. The measure of dissimilarity was cross-validated decoding accuracy (as in Cichy, Pantazis, et al., 2016; Cichy et al., 2014) between the 306-channel MEG response patterns.
For each participant and each brain area, the fMRI RDMs were constructed by computing pairwise dissimilarities (1 minus Pearson's correlation) between condition-specific voxel t value patterns.
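As a minimal sketch of this step, the correlation-distance RDM for one ROI can be computed from a conditions × voxels matrix of t values. The function name `correlation_distance_rdm` and the random toy patterns are illustrative assumptions.

```python
import numpy as np

def correlation_distance_rdm(patterns):
    """RDM of 1 - Pearson's r between the rows of (n_conditions, n_voxels)."""
    rdm = 1.0 - np.corrcoef(np.asarray(patterns, dtype=float))
    np.fill_diagonal(rdm, 0.0)  # identical conditions: dissimilarity set to 0
    return rdm

rng = np.random.default_rng(0)
toy_patterns = rng.standard_normal((118, 50))  # 118 conditions, 50 toy voxels
rdm = correlation_distance_rdm(toy_patterns)
```

The result is a symmetric 118 × 118 matrix whose entries lie between 0 (identical patterns) and 2 (perfectly anticorrelated patterns).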
Animacy model RDM.
The animacy model RDM is binary, with entry 1 if two conditions differed in animacy and 0 if both conditions were animate or both were inanimate.
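A binary model RDM of this kind can be built directly from a vector of animacy labels. The sketch below assumes a hypothetical ordering in which the 27 animate images precede the 91 inanimate ones; the actual image order in the stimulus set may differ.

```python
import numpy as np

def animacy_model_rdm(is_animate):
    """Entry 1 where two conditions differ in animacy, 0 otherwise."""
    flags = np.asarray(is_animate, dtype=bool)
    return (flags[:, None] != flags[None, :]).astype(float)

# 27 animate and 91 inanimate images (the ordering here is an assumption).
is_animate = np.array([True] * 27 + [False] * 91)
model = animacy_model_rdm(is_animate)
```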
Size model RDM.
The size model RDM was constructed based on behavioral ratings of size judgment. In detail, we asked 10 participants to arrange the 118 objects into three real-world size categories: (a) small: finger size or one-hand size; (b) medium: size of two hands; and (c) large: body size or larger. Each image was then assigned to the size category chosen by most participants. We used these size categories to construct the size model RDM. RDM elements corresponding to comparisons of images of the same size were assigned the entry 0; RDM elements corresponding to comparisons of images of different sizes were assigned the entry 2 if they were large versus small and 1 otherwise.
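The two steps (majority vote, then graded dissimilarity) can be sketched as below. Coding the categories as 0, 1, and 2 and taking the absolute difference reproduces the entries 0, 1, and 2 described above. The helper names, the toy votes, and the handling of tied votes (the first category encountered wins) are assumptions; the procedure for ties is not specified in the text.

```python
from collections import Counter
import numpy as np

SIZE_CODE = {"small": 0, "medium": 1, "large": 2}

def majority_size(votes):
    """Size category named by most raters for one image."""
    return Counter(votes).most_common(1)[0][0]

def size_model_rdm(categories):
    """0 for same size, 1 for adjacent sizes, 2 for large versus small."""
    codes = np.array([SIZE_CODE[c] for c in categories])
    return np.abs(codes[:, None] - codes[None, :]).astype(float)

# Toy ratings from 10 raters for three images.
toy_votes = [
    ["small"] * 8 + ["medium"] * 2,
    ["medium"] * 6 + ["large"] * 4,
    ["large"] * 10,
]
categories = [majority_size(v) for v in toy_votes]
model = size_model_rdm(categories)
```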
Gist model RDM.
To evaluate the low-level visual feature content of the images, we computed an RDM based on the Gist descriptor (Oliva & Torralba, 2001, 2006). The Gist code is available at people.csail.mit.edu/torralba/code/spatialenvelope/. In detail, we computed image-specific Gist features by subdividing the image into 16 bins, and then for each bin, we determined the filter energy of Gabor filters in eight orientations over four different spatial scales. We then compared (Spearman's r) image-specific Gist features pairwise to construct a Gist model RDM.
We conducted three standard RSAs differing in the modalities compared: (i) MEG-to-category model, (ii) fMRI-to-category model, and (iii) MEG-to-fMRI.
MEG-to-category model RSA.
We correlated participant-specific MEG RDMs at each time point to the reference category model RDM (size or animacy model RDM). This yielded an MEG-to-category model similarity time course for each participant and category model.
fMRI-to-category model RSA.
We correlated participant-specific fMRI RDMs of each ROI to the reference category model RDMs. This yielded an fMRI-to-category model similarity value for each participant, category model, and ROI.
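Both model comparisons reduce to correlating the off-diagonal entries of a data RDM with those of a model RDM. The sketch below assumes a Spearman rank correlation over the upper triangle; the correlation type is not stated in this section, and the helper names and toy data are illustrative.

```python
import numpy as np

def rankdata(x):
    """Ranks starting at 1; tied values share their average rank."""
    _, inverse, counts = np.unique(x, return_inverse=True, return_counts=True)
    return (np.cumsum(counts) - (counts - 1) / 2.0)[inverse]

def rdm_similarity(rdm_a, rdm_b):
    """Spearman correlation between the upper triangles of two RDMs."""
    upper = np.triu_indices_from(rdm_a, k=1)  # skip the diagonal
    return np.corrcoef(rankdata(rdm_a[upper]), rankdata(rdm_b[upper]))[0, 1]

# Toy example: a binary model RDM and a noisy "data" RDM derived from it.
rng = np.random.default_rng(1)
flags = np.array([True] * 5 + [False] * 7)
model = (flags[:, None] != flags[None, :]).astype(float)
noise = rng.standard_normal(model.shape)
data = model + 0.1 * (noise + noise.T)
similarity = rdm_similarity(data, model)
```

For MEG, applying `rdm_similarity` to the RDM at every time point gives a similarity time course; for fMRI, applying it to the RDM of every ROI gives one value per region.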
For MEG-to-fMRI RSA, also termed MEG–fMRI fusion (Cichy, Pantazis, et al., 2016), we correlated participant-specific fMRI RDMs for each ROI with the participant-averaged MEG RDM at each time point. This yielded an MEG-to-fMRI similarity map for each participant and time point.
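A compact sketch of this fusion step for a single participant is given below: every ROI-specific fMRI RDM is correlated with every time-point-specific MEG RDM, giving an ROI × time map. The Spearman correlation, the function names, and the random toy RDMs are assumptions for illustration.

```python
import numpy as np

def rankdata(x):
    """Ranks starting at 1; tied values share their average rank."""
    _, inverse, counts = np.unique(x, return_inverse=True, return_counts=True)
    return (np.cumsum(counts) - (counts - 1) / 2.0)[inverse]

def fusion(fmri_rdms, meg_rdms):
    """fmri_rdms: (n_rois, n, n); meg_rdms: (n_times, n, n) -> (n_rois, n_times)."""
    upper = np.triu_indices(fmri_rdms.shape[-1], k=1)
    fmri_ranks = np.array([rankdata(r[upper]) for r in fmri_rdms])
    meg_ranks = np.array([rankdata(r[upper]) for r in meg_rdms])
    n_rois = fmri_ranks.shape[0]
    # Correlate every ROI vector with every time-point vector in one call.
    return np.corrcoef(fmri_ranks, meg_ranks)[:n_rois, n_rois:]

rng = np.random.default_rng(2)

def random_rdm(n):
    """Symmetric toy RDM with a zero diagonal."""
    a = rng.random((n, n))
    rdm = a + a.T
    np.fill_diagonal(rdm, 0.0)
    return rdm

fmri_rdms = np.array([random_rdm(10) for _ in range(4)])  # 4 toy ROIs
meg_rdms = np.array([random_rdm(10) for _ in range(6)])   # 6 toy time points
meg_rdms[3] = fmri_rdms[2]  # plant a perfect match at ROI 2, time point 3
fusion_map = fusion(fmri_rdms, meg_rdms)
```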
To account for the fact that different brain areas have different levels of noise, we estimated the noise ceiling for each brain area (Nili et al., 2014) and the fMRI-to-category model RSA results for each ROI were normalized with the corresponding noise ceiling. The noise ceiling is defined as the mean correlation of each participant's RDM with the participant-averaged RDM, where the average RDM can be thought of as an estimate of the true model's RDM. The noise ceiling indicates the expected correlation of the true model given the noise in the data.
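The noise ceiling as defined here can be sketched in a few lines. Pearson correlation over the upper triangle is an assumption (the correlation type is not given in this passage), as are the function names and the toy data.

```python
import numpy as np

def noise_ceiling(subject_rdms):
    """Mean correlation of each participant's RDM with the group-average RDM."""
    rdms = np.asarray(subject_rdms, dtype=float)
    upper = np.triu_indices(rdms.shape[-1], k=1)
    vectors = rdms[:, upper[0], upper[1]]   # (n_participants, n_pairs)
    group_mean = vectors.mean(axis=0)
    return np.mean([np.corrcoef(v, group_mean)[0, 1] for v in vectors])

def normalize_by_ceiling(model_correlations, ceiling):
    """Express model correlations as a fraction of the noise ceiling."""
    return np.asarray(model_correlations, dtype=float) / ceiling

# Toy data: 15 participants sharing one underlying RDM plus individual noise.
# Only the upper triangle of each matrix is used.
rng = np.random.default_rng(3)
base = rng.random((8, 8))
subject_rdms = np.array([base + 0.2 * rng.random((8, 8)) for _ in range(15)])
ceiling = noise_ceiling(subject_rdms)
normalized = normalize_by_ceiling([0.1, 0.2], ceiling)
```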
Onset and peak latency.
We calculated the onset and peak latencies for participant-averaged results of the MEG-to-category model and MEG-to-fMRI RSA. Peak latencies and their standard errors were obtained by bootstrap resampling of participants (10,000 resamples).
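The bootstrap estimate of peak latency can be sketched as follows: participants are resampled with replacement, the peak of the resampled average is recorded, and the standard deviation of those peaks serves as the standard error. The function name, the Gaussian toy time courses, and the reduced number of resamples in the example call are assumptions.

```python
import numpy as np

def bootstrap_peak_latency(time_courses, times_ms, n_boot=10_000, seed=0):
    """Peak latency of the participant average and its bootstrap standard error.

    time_courses: (n_participants, n_times) similarity time courses.
    """
    rng = np.random.default_rng(seed)
    time_courses = np.asarray(time_courses, dtype=float)
    n = time_courses.shape[0]
    peak = times_ms[time_courses.mean(axis=0).argmax()]
    boot_peaks = np.empty(n_boot)
    for b in range(n_boot):
        # Resample participants with replacement and recompute the peak.
        sample = time_courses[rng.integers(0, n, size=n)]
        boot_peaks[b] = times_ms[sample.mean(axis=0).argmax()]
    return peak, boot_peaks.std(ddof=1)

# Toy data: 15 participants with Gaussian bumps centered near 180 msec.
times_ms = np.arange(-100, 701, 10)
centers = np.tile([170, 180, 190], 5)
toy_courses = np.exp(-0.5 * ((times_ms[None, :] - centers[:, None]) / 40.0) ** 2)
peak, sem = bootstrap_peak_latency(toy_courses, times_ms, n_boot=1000)
```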
For each of these three standard RSAs, namely, MEG-to-model, fMRI-to-model, and MEG-to-fMRI, statistical significance was obtained by a random effects analysis over the participant-specific results. In detail, following the statistical procedure in Nili et al. (2014), we used the Wilcoxon signed-rank test for assessing statistical significance of RDM correlations. This test is nonparametric (Gibbons & Chakraborti, 2011; Hollander & Wolfe, 1999) and thus does not make assumptions about the distribution of the data. When comparing peak and onset latencies, we used a two-sided bootstrap resampling test. In other cases that involved testing whether a correlation is significantly above zero, we conducted a one-sided signed-rank test with p < .05. We used the false discovery rate (FDR; Simmons, Nelson, & Simonsohn, 2011; Benjamini & Hochberg, 1995) to correct for multiple comparisons: fMRI-to-category model RSA results were corrected across brain regions, MEG-to-category model RSA was corrected across time points, and MEG-to-fMRI RSA was corrected across both brain regions and time points.
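The FDR correction step can be illustrated with a short implementation of the Benjamini and Hochberg (1995) step-up procedure. The function name `fdr_bh` and the toy p values are assumptions; the function accepts p values of any shape, so the same call covers correction across regions, across time points, or across a region × time map.

```python
import numpy as np

def fdr_bh(p_values, q=0.05):
    """Boolean mask of tests that survive Benjamini-Hochberg FDR control at q."""
    p = np.asarray(p_values, dtype=float).ravel()
    m = p.size
    order = np.argsort(p)
    thresholds = q * np.arange(1, m + 1) / m   # rank-dependent thresholds
    passed = p[order] <= thresholds
    significant = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.nonzero(passed)[0].max()        # largest rank meeting its threshold
        significant[order[:k + 1]] = True      # reject everything up to that rank
    return significant.reshape(np.shape(p_values))

toy_p = [0.001, 0.008, 0.039, 0.041, 0.2]
mask = fdr_bh(toy_p)
```

With these toy values only the two smallest p values survive, because 0.039 exceeds its rank-specific threshold of 0.03.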
Content-dependent MEG–fMRI Fusion Algorithm
Standard MEG–fMRI fusion reveals the whole spatiotemporal neural dynamics after visual stimulation. To further characterize spatiotemporal neural dynamics specific to the processing of a particular type of information, such as animacy or real-world size (referred to here as “content”), additional algorithmic constraints are required. Toward this aim, we used a statistical procedure called “conjunction inference” (Nichols, Brett, Andersson, Wager, & Poline, 2005) to combine the results of the three standard RSA procedures explained above, namely, (i) MEG-to-fMRI, (ii) MEG-to-category model RSA, and (iii) fMRI-to-category model RSA. In particular, statistical inference as described above yielded significant effects for each procedure alone. To reveal the spatiotemporal dynamics specific to a particular content, we then combined the results of the three procedures with an ‘AND’ operation. The resulting combined statistical map indicates MEG time points linked with fMRI locations that have a similar representational geometry to each other and to the reference model RDM. Visualizing the obtained 3-D time–location–content statistical map at each millisecond results in a movie of MEG–fMRI correspondence with respect to a prespecified content (e.g., size or animacy), thus revealing the spatiotemporal neural dynamics underlying the processing of the particular content.
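The conjunction described above amounts to a logical AND of three significance masks, broadcast over regions and time points. The sketch below assumes boolean masks that have already been thresholded; the function name and array layout are illustrative.

```python
import numpy as np

def content_dependent_fusion(sig_fusion, sig_meg_model, sig_fmri_model):
    """AND-combine three significance masks into an ROI x time map.

    sig_fusion:     (n_rois, n_times) MEG-to-fMRI significance
    sig_meg_model:  (n_times,)        MEG-to-category model significance
    sig_fmri_model: (n_rois,)         fMRI-to-category model significance
    """
    return sig_fusion & sig_meg_model[None, :] & sig_fmri_model[:, None]

# Toy masks: 2 ROIs and 3 time points.
sig_fusion = np.ones((2, 3), dtype=bool)
sig_meg_model = np.array([False, True, True])
sig_fmri_model = np.array([True, False])
content_map = content_dependent_fusion(sig_fusion, sig_meg_model, sig_fmri_model)
```

A cell of the resulting map is true only if the ROI matches the model, the time point matches the model, and the ROI and time point match each other.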
General Experimental Rationale
We present the overall experimental rationale to investigate the spatiotemporal neural dynamics of processing the object properties animacy and real-world size in Figure 1. The analyses were based on fMRI and MEG data collected by Cichy, Pantazis, et al. (2016) on 118 different natural images of everyday objects. The object set comprised both animate and inanimate objects in one of three sizes, namely, small, medium, and large. Figure 1A displays examples of stimuli in each subgroup. During the study, observers watched the object images presented for 0.5 sec each while their brain activity was recorded with fMRI and MEG in separate sessions, to resolve neuronal activity with high spatial and high temporal resolution, respectively. Participants performed a task orthogonal to the experimental conditions presented and were neither told to attend to specific properties of the objects nor informed about the hypotheses of the study. This ensures that the brain responses measured are not specific to a particular task but reflect task-independent and automatic processing of these properties.
To model the effect of size and animacy on brain representations independently of each other, we constructed two RDMs (Figure 1B). These RDMs model the assumption that dissimilarity is higher between objects from different categories than between objects from the same category.
To reveal the spatiotemporal neural dynamics of real-world size and animacy processing, we used RSA in three different ways. First, to identify the brain regions involved in processing animacy and object size, we correlated the model RDMs with 60 fMRI RDMs, one for each of the 60 nonoverlapping regions in each hemisphere (Figure 1C). This analysis yielded one value of representational similarity between each model and each region. Second, to identify when in time during perception animacy and object size are processed, we correlated model RDMs with time-resolved MEG RDMs built from sensor activation patterns in 306 MEG sensors (Figure 1D). This analysis yielded one value of representational similarity between each model and each time point. Third, for a view on brain activity resolved simultaneously in space and time, we conducted MEG-to-fMRI RSA (termed “MEG–fMRI fusion”; Cichy, Pantazis, et al., 2016) by correlating all region-specific RDMs with all time-point-specific RDMs (Figure 1E). To identify the aspects of neural activity specific to the processing of animacy and size, we conducted a conjunction test on significant results from MEG-to-model RSA, fMRI-to-model RSA, and MEG-to-fMRI RSA for the animacy and size model separately. This content-dependent fusion approach yielded a spatiotemporally resolved picture of brain activity for animacy and object size processing.
A Distributed and Partly Overlapping Network of Brain Regions Represents Animacy and Real-world Object Size
To identify the network of brain regions representing animacy and object size, we conducted fMRI-to-model RSA (Figure 1C). Figure 2 shows the brain regions that were significantly correlated with at least one of the model RDMs (animacy: Figure 2A; physical size: Figure 2B; n = 15, one-sided signed-rank test, with results FDR-corrected across all brain parcels at p < .05).
We found a network of brain areas primarily located in the ventral temporal cortex, with the set of brain regions representing animacy being a subset of the regions representing object size. In detail, animacy was encoded in LO, TO, VO, PHC, Fusi, ITG, MT, and the left IPS. Size was encoded in all those regions as well and, in addition, in the left EVC and right SPL.
Within the identified network, PHC stood out through a unique combination of characteristics. First, PHC in both hemispheres encoded both animacy and object size. Second, the correlation with the size model was higher than in most other regions (for all regions in both hemispheres except left LO and Fusi, where the difference was not statistically significant; n = 15, two-sided signed-rank test, FDR-corrected at p < .05 across all pairs of ROIs). Third, the correlation with the animacy model was also higher than in most other regions (except left and right VO).
To evaluate the robustness of the results, we investigated the processing of size and animacy independent of each other by repeating the fMRI-to-model RSA within each level of the factors size and animacy separately. In detail, we conducted fMRI-to-size model RSA twice separately for animate and inanimate objects (Figure 3C–E) and fMRI-to-animacy model RSA thrice separately for each size level (Figure 3A and B). In all cases, we found ROIs with a significant correlation with the reference model RDM. However, for small objects, the effect of animacy was particularly small and only observed in the right PHC.
In addition, to ameliorate the concern that the scene background might bias the results related to the object size, we determined whether there was a significant correlation between object size and the retinal size of the object. For each image, we calculated the number of pixels covered by the objects and evaluated whether this number differed across the three real-world size categories (i.e., small, medium, and large). Results are reported in Supplementary Figure 3. There was no difference between the three real-world size categories, thus demonstrating that object size effect is not confounded by the pixel size of the objects.
In summary, the fMRI-to-model RSA revealed a distributed and partly overlapping set of brain regions encoding animacy and size, with PHC representing both object properties more strongly than most other regions. Our results corroborate and extend previous results of fMRI studies using univariate methods (Konkle & Caramazza, 2013, 2017; Bainbridge & Oliva, 2015; Konkle & Oliva, 2012) that reported similar brain networks of animacy and real-world size processing.
Fine-grained Differences in the Representation of Animacy across Brain Regions
The finding of a distributed and partly overlapping set of regions for animacy and size poses the question about the nature of animacy and size representations in those regions. One possibility is that size and animacy are represented the same way in all regions with a highly redundant code. Another possibility is that different regions represent animacy or size in different ways.
To investigate this issue qualitatively, we visualized the representational geometry of each of the 10 regions involved in either size or animacy processing. For this, we conducted multidimensional scaling (MDS; Borg & Groenen, 2005; Kruskal & Wish, 1978) on the region-specific RDMs and plotted the results in the first two dimensions (Figure 4A). Visual inspection suggested that ROIs differed in the representational geometry for animacy. For example, TO showed a tight cluster for animate objects but not for inanimate objects. PHC revealed the opposite pattern, that is, a tight cluster for inanimate but not animate objects. LO had two equally tight clusters for animate and inanimate objects.
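A two-dimensional embedding of an RDM can be sketched with classical (Torgerson) MDS, shown below. The text does not state which MDS variant was applied, so the classical variant is used here purely for brevity; the function name and toy data are assumptions.

```python
import numpy as np

def classical_mds(rdm, n_dims=2):
    """Embed conditions in n_dims dimensions from a dissimilarity matrix."""
    d2 = np.asarray(rdm, dtype=float) ** 2
    n = d2.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ d2 @ centering   # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(gram)
    top = np.argsort(eigvals)[::-1][:n_dims]   # largest eigenvalues first
    return eigvecs[:, top] * np.sqrt(np.clip(eigvals[top], 0, None))

# Toy check: distances between 2-D points are recovered by the embedding.
rng = np.random.default_rng(4)
points = rng.standard_normal((8, 2))
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
embedding = classical_mds(dist)
embedded_dist = np.linalg.norm(embedding[:, None, :] - embedding[None, :, :], axis=-1)
```

Plotting the two columns of `embedding`, colored by animacy, would give a cluster view comparable in spirit to the one described for Figure 4A.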
To quantify these qualitative observations, we defined three model RDMs (Figure 4B) exploring different assumptions about animacy representations. Model 1 assumes equally tight clustering for animate and inanimate objects. Model 2 assumes tighter clustering for animate objects; and Model 3, for inanimate objects. We then correlated the region-specific fMRI RDMs for each of the 10 regions with the Model 1–3 RDMs. This analysis (Figure 4C) confirmed the qualitative observations quantitatively for LO, TO, and PHC and did not reveal any significant effects that were consistent across hemispheres in any other region. Thus, different ROIs have different representational geometries of animacy: Some regions (e.g., LO) did not show a specific preference for one of the representational geometries, whereas TO and PHC each preferred a different representational geometry of animacy (Model 2 and Model 3, respectively). This suggests that, although all those regions represent animacy, they do so in different ways, potentially because they weigh features that define animacy differently.
In equivalent fashion, we further explored different representational geometries for object size. However, none of the brain areas showed a consistent and significant preference for one representational geometry over the others.
Temporal Dynamics of Animacy and Object Size Processing during Object Vision
Using MEG-to-model RSA, we investigated when during visual perception animacy and real-world object size are processed (Figure 1D). The results are shown in Figure 5A. We determined peak and onset latencies to describe the resulting time courses. Onset latencies were similar for size and animacy but slightly shorter for size (size: 144 ± 1 msec [SEM]; animacy: 156 ± 1 msec; p < .01, bootstrap resampling of participants). There was no significant difference between the peak latencies for size and animacy (size: 193 ± 29 msec; animacy: 184 ± 23 msec; bootstrap resampling, p > .05).
To ascertain the robustness of the results, we investigated the processing of size and animacy independent of each other by repeating the MEG-to-model RSA within each level of the factors size and animacy separately. In detail, we conducted MEG-to-size model RSA twice separately for animate and inanimate objects (Figure 5B) and MEG-to-animacy model RSA thrice separately for each size level (Figure 5C). The analyses revealed significant effects in all cases, demonstrating the robustness of our findings.
Content-dependent MEG–fMRI Fusion Resolves Processing of Animacy and Size Simultaneously in Space and Time
To resolve the neural dynamics of animacy and size processing simultaneously in space and time, we used an extension of MEG–fMRI fusion that made it content dependent (Figure 1E). The general rationale of MEG–fMRI fusion by RSA is this: If an RDM based on MEG patterns at time point t is similar to an RDM based on fMRI patterns at location l, location l might be the source of the signal observed at t. Conducted for all combinations of time points and locations, this yields a spatiotemporally resolved map of brain activity. Here, we extend this formulation by the constraint that this spatiotemporal map must have a representational geometry significantly correlated with the theoretical model of animacy or the theoretical model of size (Figure 1E), thus turning it into content-dependent MEG–fMRI fusion. This analysis has the potential to reveal the dynamics of size and animacy processing beyond what MEG or fMRI alone can do by disambiguating the many-to-many mapping between locations and time points.
The full content-dependent MEG–fMRI fusion results for size and animacy processing are shown in Movie 1. Figure 6 shows the results at key time points. The main findings were threefold. First, we found a spatiotemporally distributed and partly overlapping network of brain areas involved in processing both size and animacy from ∼156 msec onward. Second, PHC showed distinct behavior from other regions within this network, highlighting it as a key region in animacy and size processing. Third, EVC proved to be a differentiator between animacy and size processing in coding only real-world object size (starting at 144 msec), but not animacy. We detail each main finding below.
Distributed and Partly Overlapping Spatiotemporal Dynamics Underlie Animacy and Real-world Size Processing
We found that the spatiotemporal dynamics underlying animacy and real-world size processing were distributed and partly overlapping (Figure 6). In detail, starting at ∼144 msec, several brain regions represented real-world object size, including the left EVC, bilateral PHC, LO, VO, and Fusi. From ∼156 to ∼350 msec, we found a distributed and partially overlapping set of brain areas including PHC, ITG, VO, LO, Fusi, and TO. In particular, we inspected the regions active at the peak times of animacy and size processing as indicated by MEG-to-model analysis above (Figure 5A). At the peak time for animacy processing, that is, 178 msec, bilateral LO, VO, Fusi, PHC, and left MT and ITG were active.
At the peak time for object size processing, that is, 193 msec, bilateral PHC, right VO, and left MT were active. Moving ahead to the period of 220–260 msec, we found that PHC alone represented both size and animacy. At 350 msec, activity related to both size and animacy processing in several regions reemerged. Finally, at 655 msec, we found that EVC represented real-world size.
In summary, these results delineate the neural dynamics in the overlapping set of brain regions that represent size and animacy, showing that they represent both object properties with similar temporal dynamics.
PHC Is a Key Region for Representing Animacy and Real-world Size
The fMRI-to-model RSA (Figure 2) had singled out PHC as a key region in processing both size and animacy. The results of content-dependent MEG–fMRI fusion extend this finding by showing that PHC simultaneously codes for both size and animacy over several hundreds of milliseconds from ∼156 to ∼430 msec.
A supplementary analysis of standard ROI-based MEG fusion further emphasized the role of PHC in size and animacy processing by showing that peak and onset latencies (Supplementary Table 1) in PHC were later than for many ventral stream regions (bootstrap test, FDR-corrected at p < .05). Specifically, in the left hemisphere, PHC's onset latency was significantly later than those of VO, LO, and Fusi, and PHC's peak latency was significantly later than that of LO. In the right hemisphere, PHC's peak latency was significantly later than those of LO and Fusi, and PHC's onset latency was significantly later than that of LO.
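A bootstrap comparison of peak latencies between two regions can be sketched as follows. This is a minimal illustration under assumptions, not the exact procedure used here: it assumes participant-specific fusion time courses for each ROI, resamples participants with replacement, and derives a two-sided p value from the bootstrap distribution of latency differences. Function and parameter names are hypothetical, and FDR correction across comparisons is not shown.

```python
import numpy as np

def peak_latency(time_course, times):
    """Time of the maximum of a single time course."""
    return times[np.argmax(time_course)]

def bootstrap_peak_difference(roi_a, roi_b, times, n_boot=1000, seed=0):
    """Compare peak latencies of two ROIs by resampling participants.

    roi_a, roi_b: arrays of shape (n_subjects, n_times), same participants.
    Returns the observed difference (a minus b) in the peak latency of the
    group-average time courses and a two-sided bootstrap p value.
    """
    rng = np.random.default_rng(seed)
    n_subjects = roi_a.shape[0]
    observed = (peak_latency(roi_a.mean(axis=0), times)
                - peak_latency(roi_b.mean(axis=0), times))
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        # Draw participants with replacement, using the same draw for both ROIs.
        idx = rng.integers(0, n_subjects, size=n_subjects)
        diffs[b] = (peak_latency(roi_a[idx].mean(axis=0), times)
                    - peak_latency(roi_b[idx].mean(axis=0), times))
    # Two-sided p value: twice the smaller tail of the bootstrap distribution at zero.
    p_value = 2.0 * min(np.mean(diffs <= 0), np.mean(diffs >= 0))
    return observed, min(p_value, 1.0)
```

Onset latencies could be compared in the same way by swapping `peak_latency` for a function that returns the first time point exceeding a significance criterion.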
In summary, we found that PHC was more strongly engaged in both animacy and size processing than other visual regions and that peak activity in PHC occurred later than in other ventral visual stream regions. Together, this suggests that PHC may be a cortical hub where object-related information is aggregated after being processed in regions lower in the visual processing hierarchy.
The EVC: From Representing Low-level Visual Features to High-level Real-world Size Information
We observed that the left EVC represented real-world size information starting at 144 msec (±1 SEM; Figure 6) but did not code for animacy at any time point. This singles out EVC as a differentiator between real-world size and animacy processing.
This result raises the question of whether the observed real-world size-related activity reflects feedforward processing of low-level visual information or feedback processing of high-level visual information. To investigate this, we conducted a supplementary standard ROI-based fusion analysis, revealing the temporal dynamics of EVC independent of content (Supplementary Figure 1, Supplementary Table 1). We reasoned that standard content-independent analysis of EVC activity would reveal the dynamics of low-level visual information processing and thus expose potential differences in timing that might differentiate between the two options. We found that EVC had an early onset latency of 78 msec (±4 SEM) and a peak latency of 128 msec (±2 SEM) in the left hemisphere (for the right EVC: onset = 72 msec [±4 SEM], peak = 129 msec [±12 SEM]), likely reflecting feedforward processing of low-level visual information (for further evidence, see also the next section). The onset and peak latencies in the content-independent analysis preceded those in the content-dependent analysis (bootstrap test, p < .05). This temporal dissociation argues against feedforward processing of low-level features as the source of the real-world object size signal in EVC. It instead favors the view that the observed representation of real-world size in EVC reflects high-level visual information fed back from higher visual areas.
In summary, we find evidence of two distinct processing phases in EVC: one early phase of processing low-level visual information and a later phase (starting at ∼144 msec) of processing higher-level visual information such as real-world size.
The Spatiotemporal Dynamics of Processing Animacy and Size Are Distinct from Processing Low-level Visual Features
Animacy and real-world size are high-level visual properties that are not trivially encoded in low-level visual features, as demonstrated by the complex computational transformations of low- to high-level visual features necessary for successful image categorization in artificial systems (Kietzmann, McClure, & Kriegeskorte, 2017; Krizhevsky, Sutskever, & Hinton, 2012; Rajaei, Khaligh-Razavi, Ghodrati, Ebrahimpour, & Abadi, 2012; Serre, Oliva, & Poggio, 2007). Under this assumption, the neural activity related to animacy and object-size processing described here reflects processing of high- rather than low-level visual properties.
To test whether this assumption in fact holds in our experimental setting, we investigated whether the spatiotemporal dynamics of low-level visual processing are distinct from the dynamics of processing animacy and size. To capture the low-level visual feature content of the stimulus set, we used the Gist descriptor (Oliva & Torralba, 2001, 2006). Gist is a computer vision image descriptor based on localized responses of Gabor filters at different scales and orientations. It has previously been shown to predict fMRI responses of early and midlevel visual areas (Khaligh-Razavi, Henriksson, Kay, & Kriegeskorte, 2017), making it a reasonable model of low-level visual processing in the cortex.
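The idea behind such a descriptor can be sketched in a few lines of Python. The code below is a simplified, Gist-like illustration and not the Oliva and Torralba implementation used here: the filter wavelengths, the number of orientations, the envelope width, and the 4 × 4 spatial grid are arbitrary assumptions, and all function names are hypothetical.

```python
import numpy as np

def gabor_kernel(wavelength, theta, sigma):
    """Complex Gabor: Gaussian envelope times an oriented complex sinusoid."""
    half = int(np.ceil(2 * sigma))
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    along = x * np.cos(theta) + y * np.sin(theta)  # coordinate along the carrier
    envelope = np.exp(-(x ** 2 + y ** 2) / (2.0 * sigma ** 2))
    return envelope * np.exp(2j * np.pi * along / wavelength)

def gist_like(image, wavelengths=(4, 8, 16), n_orientations=4, grid=4):
    """Average Gabor magnitude in each cell of a grid, per scale and orientation."""
    height, width = image.shape
    image_ft = np.fft.fft2(image)
    features = []
    for wavelength in wavelengths:
        for k in range(n_orientations):
            theta = np.pi * k / n_orientations
            kernel = gabor_kernel(wavelength, theta, sigma=wavelength / 2.0)
            half = kernel.shape[0] // 2
            # Circular convolution via the FFT; the roll re-centers the kernel origin.
            response = np.fft.ifft2(image_ft * np.fft.fft2(kernel, s=(height, width)))
            energy = np.roll(np.abs(response), (-half, -half), axis=(0, 1))
            cropped = energy[:height - height % grid, :width - width % grid]
            cells = cropped.reshape(grid, height // grid, grid, width // grid)
            features.append(cells.mean(axis=(1, 3)).ravel())
    return np.concatenate(features)
```

Under these assumptions, a grayscale image yields a vector of 3 scales × 4 orientations × 16 grid cells = 192 values, and a Gist-based model RDM would then be built from pairwise distances between the vectors of all images.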
Importantly, as the Gist model has been found to predict several scene layout-related properties such as openness, expanse, and indoor/outdoor (Oliva & Torralba, 2001, 2006; Oliva, 2005), it also serves as a control for the scene layout. We constructed a model RDM based on Gist features for the 118 object images and compared it with the size and animacy model RDMs. We found that the correlations were very low (i.e., Gist-size model correlation was .045 and Gist-animacy model correlation was .008), which makes it unlikely that the size or animacy effect observed here was driven by low-level to midlevel visual features or the scene layout. To further investigate the role of low-level visual features, we calculated size/animacy correlations with MEG after regressing out the Gist effect (Supplementary Figure 4). Comparing the time courses where the Gist RDM was partialed out with the time courses where it was not (same as reported in Figure 5) shows a significant difference only early, between 93 and 130 msec for animacy and between 79 and 150 msec for size. This further demonstrates that the later components of the size and animacy effects are unlikely to be affected by low-level/midlevel visual features as captured by the Gist features.
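Partialing out the Gist RDM can be sketched as a partial Spearman correlation. The Python code below is a minimal illustration under assumptions, not necessarily the procedure used here: it rank-transforms the vectorized MEG, model, and Gist RDMs, regresses the Gist ranks out of the other two, and correlates the residuals. Function names are hypothetical.

```python
import numpy as np

def average_ranks(values):
    """Rank a 1-D array, assigning tied values their average rank."""
    _, inverse, counts = np.unique(values, return_inverse=True, return_counts=True)
    last = np.cumsum(counts)       # highest 1-based rank in each tie group
    first = last - counts + 1      # lowest 1-based rank in each tie group
    return ((first + last) / 2.0)[inverse]

def residualize(y, x):
    """Residuals of y after least-squares regression on x with an intercept."""
    design = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return y - design @ beta

def partial_spearman(neural, model, control):
    """Rank correlation of neural and model RDM vectors, controlling for control.

    All inputs are 1-D vectors of RDM entries (e.g., the upper triangle),
    with control holding the Gist RDM entries in the same order.
    """
    ranks = [average_ranks(np.asarray(v, dtype=float)) for v in (neural, model, control)]
    neural_r, model_r, control_r = ranks
    return np.corrcoef(residualize(neural_r, control_r),
                       residualize(model_r, control_r))[0, 1]
```

Applied at every MEG time point, this yields a model time course from which the variance shared with Gist has been removed, which can then be compared with the uncorrected time course.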
We also conducted the fMRI-to-model RSA (Figure 1C), MEG-to-model RSA (Figure 1D), and content-dependent MEG–fMRI fusion (Figure 1E) using a model RDM constructed from the Gist descriptor of the stimulus set. The fMRI-to-Gist-model RSA revealed significant correlations with bilateral EVC and left V4, but no other visual areas (Figure 7A). The MEG-to-Gist-model RSA revealed a time course different from those revealed for the animacy and size models (as shown in Figure 5A). In particular, for Gist, both the onset latency at 67 msec (±6 msec [SEM]) and the peak latency at 120 msec (±7 msec [SEM]) were significantly earlier than the onset and peak latencies for either animacy or real-world size (bootstrap resampling test, p < .001).
Finally, content-dependent MEG–fMRI fusion for Gist further revealed that the spatiotemporal dynamics of processing Gist were different from those of processing animacy and size (Figure 7C and Movie 2). The identified dynamics started in EVC at 84 msec, followed by V4 20 msec later, and persisted in EVC for almost the entire percept duration until 500 msec. This is in stark contrast to the onset of animacy and size processing at ∼150 msec in a set of high-level visual regions.
A further important difference was the timing and role of EVC activation. In EVC, Gist processing started at 84 msec, which was earlier than size processing that started at around 144 msec (Figures 5 and 6; p < .05, based on bootstrap resampling test). This strengthens the idea that the observation of size processing in EVC is driven by feedback from higher visual areas.
We identified the spatiotemporal neural dynamics of animacy and object size by applying RSA to fMRI and MEG data. Investigation of the cortical regions representing animacy and size using fMRI revealed a distributed and partly overlapping set of cortical regions. This set of regions extended from occipital to ventral and medial temporal cortex and parietal cortex. Within the set, PHC was singled out as the region representing animacy and size more strongly than most other regions. Fine-grained analysis of representational geometry revealed that different regions processed animacy in distinct ways. Investigation of the time course of animacy and size processing using MEG revealed overlapping temporal extent starting at around 150 msec and provided the first time-resolved neuromagnetic signature of real-world object size processing. Using an extension of MEG–fMRI fusion that shows neural dynamics specific to particular content, we revealed simultaneously in space and time the neural dynamics related to processing animacy and size. We found partly overlapping and distributed dynamics, with PHC singled out as a region that represented size and animacy persistently when other regions did not. A control analysis revealed that the neural dynamics of processing animacy and size were distinct from the neural dynamics of processing low-level visual features.
Limitations of the Content-dependent MEG–fMRI Fusion Method
Although MEG–fMRI fusion in general and in the content-dependent formulation is a promising method to resolve neural activity in space and time, acknowledgment of its limitations is necessary to guide interpretation. For one, the method in its formulation here compares neural activity measured at very different spatial scales: Whereas sensor-space MEG patterns reflect activity from the entire cortex, fMRI searchlight patterns reflect activity in local neural populations. Thus, depending on how neural signals from different sources impact MEG sensor-level activation patterns, correspondences might go unnoticed. Similarly, for MEG–fMRI fusion to detect spatiotemporal correspondences, relevant representational structure must be present in both MEG and fMRI. However, there is no guarantee that conditions evoking significant and robustly different activation patterns in one imaging modality will do so in the other imaging modality, too. For example, Proklova, Kaiser, and Peelen (2017) found that, whereas the animacy of objects in a particularly controlled stimulus set was clearly detectable from fMRI patterns, it was not detectable from MEG sensor patterns. In summary, as MEG–fMRI fusion depends on two methods with particular sensitivities and limitations that both establish correlations rather than causality, both positive and negative findings have to be interpreted cautiously.
A Distributed and Partly Overlapping Set of Cortical Regions Represents Object Animacy and Real-world Size
Our multivariate analyses based on representational similarity showed that object animacy and size are represented in a distributed and partly overlapping set of regions, rather than a single region. This result is consistent with previous work demonstrating shared and distributed networks of brain regions involved in processing object properties and stimulus domains (Huth, de Heer, Griffiths, Theunissen, & Gallant, 2016; Behrmann & Plaut, 2013; de Beeck, Brants, Baeck, & Wagemans, 2010; O'Toole, Jiang, Abdi, & Haxby, 2005; Spiridon & Kanwisher, 2002; Haxby et al., 2001; Ishai, Ungerleider, Martin, Schouten, & Haxby, 1999). Thus, our results further strengthen the view that the brain processes information in a distributed rather than localized fashion. The advantages of such a visual architecture might be manifold: For example, distributed representations are more robust to impairments, and importantly, distributed but integrated large-scale networks are essential to ensure rapid and accurate visual object recognition (Behrmann & Plaut, 2013; Catani, 2007; Mesulam, 1990).
Our results go beyond previous work in two ways. First, we described the detailed representational geometry used by the brain to represent animacy. Although TO, LO, and PHC all represented animacy, we showed that they did so in different ways. This qualifies the notion of distributed networks for visual information processing by showing that different nodes of the network conduct related but different processing. Our results thus allow detailed predictions about behavioral deficits that might result from damage to those regions or disturbance in processing therein. Future neuropsychological or brain stimulation studies using TMS might test these predictions.
Our results also go beyond previous work in revealing the full spatiotemporal dynamics in the network of regions processing animacy and size (Konkle & Caramazza, 2013, 2017; Fabbri et al., 2016; Bainbridge & Oliva, 2015; Konkle & Oliva, 2012). We demonstrated that, starting from 160 msec after stimulus onset, both size and animacy are represented in partially overlapping brain areas with similar temporal dynamics. Within this spatiotemporally shared network, PHC was singled out among other regions in three ways: It simultaneously coded for both size and animacy from ∼160 to ∼430 msec, it was the only region representing both size and animacy from 220 to 260 msec, and it had the highest RDM correlation with the reference animacy and size model RDMs. These results combined indicate a critical role for PHC in representing size and animacy in the human brain. They further suggest that PHC is a key cortical node that combines the outputs of two parallel processing streams for size and animacy.
Previous studies did not report a special role for PHC in processing animacy and real-world size but rather in processing scenes (Epstein, 2005; Epstein & Kanwisher, 1998). This discrepancy might be explained by differences in the nature of the stimulus sets used to probe the visual system. Previous studies investigating animacy and size processing used silhouette images without background (Konkle & Caramazza, 2013, 2017; Cichy et al., 2014; Carlson et al., 2013; Connolly et al., 2012; Konkle & Oliva, 2011, 2012; Bell et al., 2009; Wiggett et al., 2009; Kriegeskorte, Mur, & Bandettini, 2008; Kriegeskorte, Mur, Ruff, et al., 2008). In contrast, here, we presented images on natural backgrounds as they appear in the real world. Furthermore, we found that (after controlling for low-level visual features) PHC indeed differentiated between scenes that contain animates versus scenes that contain inanimates. Thus, size and animacy processing in PHC might depend on the presence of a real-world background. We hypothesize that animacy processing in PHC interacts with the presence of a natural background, that is, that PHC may process animacy only when objects appear within a background scene. This observation further highlights the importance of using naturalistic stimulus sets for understanding the workings of the visual system under real-world conditions.
Finally, we found that processing of low-level visual features as captured by the Gist descriptor occurred predominantly in low-level and midlevel visual areas rather than midlevel and high-level visual areas. This result is consistent with the notion that animacy and size are high- rather than low-level visual properties (Khaligh-Razavi et al., 2017; Konkle & Oliva, 2012; Kriegeskorte, Mur, & Bandettini, 2008; Kriegeskorte, Mur, Ruff, et al., 2008) and are thus processed at later stages in the hierarchical processing cascade of the ventral visual system.
Time Courses of Processing Animacy and Real-world Object Size
Our results concur with previous studies investigating the time course of object processing (Clarke et al., 2015; Carlson et al., 2013, 2014; Cichy et al., 2014; Liu et al., 2009; Thorpe et al., 1996) in revealing a similar time course for the processing of animacy. The peak of animacy processing at 178 msec fell within the peak confidence interval reported in previous MEG studies at 152–302 msec (Cichy et al., 2014) and at 150–240 msec (Carlson et al., 2013).
Our results go beyond previous studies by reporting, to our knowledge for the first time, the time course of real-world object size processing. Importantly, we were able to directly compare the speed of processing real-world size with that of animacy with high precision, as we studied these two object properties using the same stimulus set and the same imaging technique and analysis pipeline. Compared with animacy processing, size processing started slightly earlier. In detail, the onset latency for size processing at 144 msec was ∼10 msec earlier than the onset latency for animacy processing at 156 msec. However, the peak latency of size processing was not significantly different from that of animacy. Together with the observation that a partly overlapping set of brain regions underlies animacy and size processing, this shows that neural activity underlying animacy and size processing is distributed and overlapping in both space and time.
Compared with the timing of processing low-level visual features as captured by the Gist descriptor, onset and peak latencies for animacy and size processing were significantly later. This result further strengthens the notion that size and animacy are object properties processed at a late stage of the hierarchical processing cascade of the ventral visual system. Together with the fMRI results, this differentiates the processing of animacy and size from low-level visual processing in both space and time.
The Role of Feedback in the EVC Representation of Real-world Size
Representing real-world size, as opposed to retinal size, is a critical and useful feature of our visual system, referred to as object size constancy (Sperandio, Chouinard, & Goodale, 2012). As objects move with respect to one another, their retinal size varies continuously. If our visual system were to solely represent objects based on their retinal size, without an understanding of their real-world size, the world around us would appear distorted and unstable.
Previous studies have shown that real-world object size is already represented at the first stage of cortical visual processing, that is, primary visual cortex (Sperandio et al., 2012; Schwarzkopf, Song, & Rees, 2011; Murray et al., 2006). Our content-dependent fusion results extend those studies by detailing the temporal dynamics and the role of feedback in the representation of object size in EVC through three observations.
First, we observed that the left EVC represented real-world object size starting at ∼150 msec, which is significantly later than the onset at 78 msec and peak at 128 msec of overall visual activity in EVC (Supplementary Table 1). This temporal delay strongly suggests that size representations in EVC emerged through feedback processing rather than feedforward processing.
Second, we observed that EVC represented Gist features from 84 msec onward, that is, earlier than it represented object size. Given that the Gist model assesses low-level visual processing in early visual areas, whereas object size was most strongly represented in high-level visual cortex, this further suggests that size representations in EVC emerged through feedback. The temporal succession of processing low- and high-level visual information in EVC observed here broadly matches results in nonhuman primates (Supèr, Spekreijse, & Lamme, 2001; Lee, Mumford, Romero, & Lamme, 1998). Single-cell electrophysiology in monkeys revealed that recurrent dynamics in the primary visual cortex emerge in the afterburst starting at 100 msec and have a role in processing higher-level visual information (Lamme & Roelfsema, 2000; Lee et al., 1998).
Finally, examining brain regions active after stimulus offset revealed that, at 655 msec, only the left EVC was representing real-world size. This might also suggest involvement of feedback given that there is no incoming feedforward visual signal at this time.
Future Directions and Limitations
In this study, we used an already existing neuroimaging data set (Cichy, Pantazis, et al., 2016) to showcase the potential of the content-dependent fusion method for studying the representation of visual properties, in particular, real-world size and animacy. Although the data and stimulus set were not primarily designed to investigate real-world size or animacy and were thus not fully controlled for those two object properties, they did allow for a set of analyses that revealed novel insights into the neural representations of these properties.
The stimulus set here consisted of images of real-world objects on natural backgrounds. Although this ensures some ecological validity of our findings, it precludes a systematic analysis of which features drive object size and animacy processing. For this, fully controlled stimulus sets are required that are designed with the intention to differentiate between candidate features. Univariate fMRI analyses (Julian, Ryan, Hamilton, & Epstein, 2016; Bainbridge & Oliva, 2015; Konkle & Oliva, 2012) suggest that the cortical representation of object size might reflect object properties that correlate with size, such as the interaction envelope, or whether objects are used as landmarks in navigation (Julian et al., 2016). A recent study (Long, Yu, & Konkle, 2017) also suggests that midlevel visual features contribute significantly to the large-scale organization of the ventral visual stream when processing object size and animacy. We see great potential for future studies with fully controlled experimental designs that use content-dependent fusion to unravel the features that underlie animacy and size processing.
We thank Caitlin Mullin and Santani Teng for comments on the article. This work was funded by a National Eye Institute grant EY020484 (to A. O.), Emmy Noether award CI-241/1-1 to R. M. C., the Vannevar Bush Faculty Fellowship through ONR grant N00014-16-1-3116 (to A. O.), NSF award 1532591, and the McGovern Institute Neurotechnology Program (to A. O. and D. P.). The study was conducted at the Athinoula A. Martinos Imaging Center, MIBR, MIT.
Reprint requests should be sent to Seyed-Mahdi Khaligh-Razavi, Department of Brain and Cognitive Sciences, Cell Science Research Center, Royan Institute for Stem Cell Biology and Technology, ACECR, Tehran, Iran, or via e-mail: firstname.lastname@example.org.
This paper is part of a Special Focus deriving from a symposium at the 2017 annual meeting of Cognitive Neuroscience Society, entitled, “The Dynamics of Cognitive Processes: Multivariate Approaches.”