Neural encoding and decoding provide perspectives for understanding neural representations of sensory inputs. Recent functional magnetic resonance imaging (fMRI) studies have succeeded in building prediction models for encoding and decoding numerous stimuli by representing a complex stimulus as a combination of simple elements. While arbitrary visual images were reconstructed using a modular model that combined the outputs of decoder modules for multiscale local image bases (elements), the shapes of the image bases were heuristically determined. In this work, we propose a method to establish mappings between the stimulus and the brain by automatically extracting modules from measured data. We develop a model based on Bayesian canonical correlation analysis, in which each module is modeled by a latent variable that relates a set of pixels in a visual image to a set of voxels in an fMRI activity pattern. The estimated mapping from a latent variable to pixels can be regarded as an image basis. We show that the model estimates a modular representation with spatially localized multiscale image bases. Further, using the estimated mappings, we derive encoding and decoding models that produce accurate predictions for brain activity and stimulus images. Our approach thus provides a novel means of revealing neural representations of stimuli by automatically extracting modules, which can be used to generate effective prediction models for encoding and decoding.
Predictive models provide a basis for our understanding of the neural representation of sensory inputs. Previous studies have been conducted based on the concept of neural encoding and decoding (Dayan & Abbott, 2001; Pereira, Mitchell, & Botvinick, 2009; Naselaris, Kay, Nishimoto, & Gallant, 2011). An encoding model is formulated to predict brain activity from a stimulus, whereas a decoding model is formulated to predict a stimulus from brain activity. Recent functional magnetic resonance imaging (fMRI) studies using statistical learning-based encoding models have succeeded in predicting fMRI activity patterns from visual images (Kay, Naselaris, Prenger, & Gallant, 2008). The use of decoding models has also been successful in predicting visual features and images from fMRI activity patterns (Kamitani & Tong, 2005, 2006; Miyawaki et al., 2008).
A challenge in such data-driven modeling is how to generalize the model to an arbitrary stimulus or numerous possible stimuli that were not used to train the model. It is not feasible to measure brain activity for all possible stimuli. The utility of a model is limited if it provides a mapping only between a limited set of stimuli and their measured neural responses. A solution has been proposed using a modular representation of the mapping between a stimulus and brain activity. Mitchell et al. (2008) presented an encoding model that predicted brain activity for arbitrary nouns represented by modular semantic features defined by frequently associated verbs. Once the mapping between each semantic feature and brain activity was learned using a set of nouns, the model was generalized to arbitrary nouns by decomposing them into semantic features that were combined to predict the brain activity patterns. Miyawaki et al. (2008) demonstrated that arbitrary visual images can be reconstructed from human fMRI using a modular decoding model. Modular decoders were trained to predict the mean contrasts of predefined multiscale image bases (small pixel patches at several scales, covering an entire image) using a set of random visual images and the neural responses recorded by fMRI. As each image basis has a small number of possible states, a small subset of all the possible random images was able to provide sufficient data to train the modular decoders. Once the decoders were trained, a visual image was reconstructed by linearly combining the image bases with the predicted contrasts. Note that in the neuroimaging literature, a module often refers to a large contiguous region with functional specificity. Here, however, modules can be much smaller units corresponding to elemental features of a stimulus or behavior.
The performance of such a modular model is critically dependent on the selection of the elemental features used to compose a stimulus. In particular, although the multiscale image bases used in Miyawaki et al. (2008) outperformed alternative basis sets, they were heuristically selected, and thus not necessarily optimal. If image bases are automatically estimated from training data, they may reveal elemental features used by the brain, leading to better prediction performance.
In this letter, we propose a method to build encoding and decoding models while automatically determining image bases based on measured fMRI activity patterns related to visual images. We employ the framework of canonical correlation analysis (CCA) in which two multidimensional observations are related using a common coordinate system to maximize the correlations between transformed variables (Anderson, 2003). When applied to the mapping between a visual image and an fMRI activity pattern, CCA finds multiple correspondences from a weighted sum of pixels to a weighted sum of voxels, which constitute functional modules for visual image representation. The estimated pixel weights for each module can be regarded as an image basis, and the combination of the multiple modules allows us to represent a variety of visual images.
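As a rough illustration of the classical CCA step described above, the following sketch whitens both data sets and takes a singular value decomposition of the whitened cross-covariance. It is a minimal textbook version, not the Bayesian model developed in this letter; the function name `cca`, the ridge term `reg`, and the synthetic data are assumptions introduced only for this example.

```python
import numpy as np

def cca(X, Y, reg=1e-6):
    """Classical CCA between X (n_samples, n_pixels) and Y (n_samples, n_voxels).

    Returns pixel weights A, voxel weights B, and canonical correlations s.
    `reg` is a small ridge term (an assumption, added for numerical stability).
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Sxx = X.T @ X / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Y.T @ Y / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = X.T @ Y / (n - 1)
    Lx = np.linalg.cholesky(Sxx)          # Sxx = Lx @ Lx.T
    Ly = np.linalg.cholesky(Syy)          # Syy = Ly @ Ly.T
    M = np.linalg.solve(Lx, Sxy)          # Lx^{-1} Sxy
    M = np.linalg.solve(Ly, M.T).T        # Lx^{-1} Sxy Ly^{-T}
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    A = np.linalg.solve(Lx.T, U)          # each column: weighted sum of pixels
    B = np.linalg.solve(Ly.T, Vt.T)       # each column: weighted sum of voxels
    return A, B, s

# Synthetic data: one shared latent signal drives one "pixel" and one "voxel".
rng = np.random.default_rng(0)
n = 500
z = rng.standard_normal(n)
X = np.outer(z, [1.0, 0.0, 0.0, 0.0]) + 0.1 * rng.standard_normal((n, 4))
Y = np.outer(z, [0.0, 1.0, 0.0]) + 0.1 * rng.standard_normal((n, 3))
A, B, s = cca(X, Y)
print(s)  # canonical correlations, largest first
```

In this toy setting the first column of `A` plays the role of an image basis and the first column of `B` the role of the matching voxel weights.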
Because the early visual cortex is retinotopically organized, we can assume that a small set of pixels is represented by a small set of voxels. To facilitate the mapping between small sets of variables, we extend CCA to Bayesian CCA (Wang, 2007) with sparseness priors. Bayesian CCA treats multiple correspondences as latent variables, with two mapping matrices relating to two sets of observations. One matrix maps the latent variables to the visual image serving as a set of image bases, and the other matrix maps the latent variables to the fMRI voxels serving as a set of fMRI voxel weights. The matrices are assumed to be random variables with hyperparameters. We introduce a sparseness prior into each element of the matrices to ensure that only small subsets of voxels and pixels are related to nonzero matrix elements.
Once the mapping matrices are estimated, the Bayesian CCA model can be used to generate both encoding and decoding models. The encoding model can be derived by first inverting the mapping from the latent variables to the visual image and then combining the inverted mapping with that from the latent variables to the fMRI voxels. The decoding model can be derived in a reverse fashion. Using the encoding and decoding models derived by this framework, we can predict an fMRI activity pattern from a novel visual image and predict a visual image from a novel fMRI activity pattern.
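The derivation just described can be sketched in a simplified, deterministic form. The example below ignores noise terms and posterior uncertainty, so it is only a point-estimate caricature of the predictive distributions; the pseudoinverse is a stand-in for the inversion step, and the names `encode`, `decode`, `W_I`, and `W_r` (image bases and voxel weights as matrices with one column per latent variable) are assumptions for illustration.

```python
import numpy as np

def encode(image, W_I, W_r):
    """Predict voxel activity from an image: image -> latent -> voxels."""
    z = np.linalg.pinv(W_I) @ image   # invert the latent-to-image mapping
    return W_r @ z                    # then apply the latent-to-voxel mapping

def decode(activity, W_I, W_r):
    """Predict an image from voxel activity: voxels -> latent -> image."""
    z = np.linalg.pinv(W_r) @ activity
    return W_I @ z

# Noise-free toy example with hypothetical sizes.
rng = np.random.default_rng(0)
W_I = rng.standard_normal((100, 10))  # 100 pixels, 10 latent variables
W_r = rng.standard_normal((50, 10))   # 50 voxels, 10 latent variables
z_true = rng.standard_normal(10)
image = W_I @ z_true
activity = W_r @ z_true
predicted_activity = encode(image, W_I, W_r)
predicted_image = decode(activity, W_I, W_r)
```

With noise-free data and full-column-rank matrices, both directions recover the counterpart exactly; with real data the noise model determines how the two mappings are weighted.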
We applied this framework to a data set consisting of random visual images and corresponding fMRI data used in our previous work (Miyawaki et al., 2008). The visual images were contrast-defined checkerboard patterns, and the fMRI signals were collected from area V1. Because the pixels in random images are not spatially correlated, the estimated image bases are expected to reflect intrinsic mappings between the visual field and brain activity rather than the correlations between pixels derived from a particular set of visual images. We show that Bayesian CCA estimates a modular representation consisting of spatially localized image bases and fMRI voxel weights. The estimated image bases had similar shapes to those used in our previous study (Miyawaki et al., 2008), although different shapes were also found. The fMRI voxel weights corresponding to the spatially localized image bases were spatially localized in the brain, consistent with the retinotopic organization of the visual cortex. The performance of the encoding and decoding models was evaluated by predicting fMRI activity patterns and visual images, respectively. We also demonstrate that the sparseness priors introduced into the Bayesian CCA model were important for estimating localized image bases and voxel weights. These results suggest that Bayesian CCA can effectively estimate the modular mapping between the stimulus and the brain, which can be used to generate prediction models for both encoding and decoding. A preliminary version of this study has been published in conference proceedings (Fujiwara, Miyawaki, & Kamitani, 2009), where we outlined the method for estimating image bases using Bayesian CCA.
We constructed a model to relate an fMRI activity pattern and a visual image via latent variables (see Figure 1). A mapping from each latent variable to a set of visual image pixels can be regarded as an image basis. The latent variable also has links to a set of fMRI voxels, which serves as a voxel weight. Hence, the latent variable bundles a subset of fMRI voxels responding to a specific element of a visual image. We used the framework of canonical correlation analysis (CCA) to build the Bayesian CCA model for relating visual image pixels and fMRI voxels via latent variables. In the following, we introduce CCA and probabilistic CCA, a probabilistic reformulation of CCA (Bach & Jordan, 2005). Next, we derive our Bayesian CCA model, which allows sparse selection of visual image pixels and fMRI voxels for each link.
2.1. Canonical Correlation Analysis
2.2. Probabilistic CCA
2.3. Bayesian CCA
We derive a Bayesian CCA model (Wang, 2007) based on probabilistic CCA that selects the relevant image pixels and fMRI voxels automatically. In addition to the visual image I, fMRI activity pattern r, and latent variables z, Bayesian CCA treats the image basis set WI and the voxel weight set Wr as random variables. This assumption provides a Bayesian estimation of image bases and voxel weights, and each variable is evaluated using a posterior distribution. We also introduce hyperprior distributions for the inverse variance of each element of the image bases and the voxel weights. The relationship between these parameter variables in Bayesian CCA is shown in Figure 1B. The image bases, the voxel weights, and the inverse variances are estimated by employing the variational Bayesian method (Attias, 1999). After the parameters are determined, encoding and decoding models can be derived as predictive distributions.
This configuration of the prior and hyperprior distribution is known as automatic relevance determination (ARD) and has the effect that irrelevant parameters are automatically driven to zero (MacKay, 1994; Neal, 1996). In the current case, these priors and hyperpriors lead to a sparse selection of links from each latent variable to pixels and voxels. The use of this sparse prior may be validated by the fact that a spatially localized visual stimulus evokes activity in only a small number of voxels in the early visual area. The sparse parameter estimation avoids overfitting the training data by pruning irrelevant features, thereby helping to achieve high prediction performance (Tipping, 2001; Yamashita, Sato, Yoshioka, Tong, & Kamitani, 2008).
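The pruning effect of ARD can be seen in a much simpler setting than Bayesian CCA. The sketch below applies ARD to ordinary linear regression using the standard fixed-point updates from the relevance determination literature; it is not the variational Bayesian procedure used in this letter, and the function name `ard_regression`, the cap `alpha_max`, and the synthetic data are assumptions made for this example.

```python
import numpy as np

def ard_regression(X, y, n_iter=100, alpha_max=1e8):
    """Linear regression with one inverse-variance hyperparameter per weight."""
    n, d = X.shape
    alpha = np.ones(d)   # inverse prior variance of each weight
    beta = 1.0           # inverse noise variance
    for _ in range(n_iter):
        Sigma = np.linalg.inv(beta * X.T @ X + np.diag(alpha))  # posterior covariance
        m = beta * Sigma @ X.T @ y                              # posterior mean
        # gamma[i] measures how well the data determine weight i (0 to 1).
        gamma = np.clip(1.0 - alpha * np.diag(Sigma), 1e-12, 1.0)
        alpha = np.minimum(gamma / (m ** 2 + 1e-300), alpha_max)
        resid = y - X @ m
        beta = max(n - gamma.sum(), 1e-12) / (resid @ resid + 1e-300)
    return m, alpha

# Synthetic data: only weights 0 and 3 are relevant.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
w_true = np.zeros(10)
w_true[0], w_true[3] = 2.0, -1.5
y = X @ w_true + 0.1 * rng.standard_normal(200)
m, alpha = ard_regression(X, y)
print(np.round(m, 3))
```

Irrelevant weights receive large inverse variances and are shrunk toward zero, while the two relevant weights keep small inverse variances and stay close to their true values.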
2.4. Parameter Estimation of Bayesian CCA Employing the Variational Bayesian Method
2.5. Predictive Distribution for the Encoding and Decoding Models
2.6. fMRI Data
We used the data set obtained from Miyawaki et al. (2008) that contained fMRI signals of two subjects observing visual images consisting of contrast-defined checkerboard patches (the data set is available at our web site). Each patch was either a flickering checkerboard (spatial frequency, 1.74 cycles/deg; temporal frequency, 6 Hz) or a homogeneous gray area. The data set consisted of two independent sessions. One was a random image session, in which a spatially random pattern was presented for 6 s, followed by a 6 s rest period. A total of 440 different random patterns were presented to each subject. The other was a figure image session, in which a letter of the alphabet or a simple geometric shape was presented for 12 s, followed by a 12 s rest period. Five letters of the alphabet and five geometric shapes were presented six or eight times per subject. To estimate the model parameters, we used a set of single fMRI volumes acquired every 2 s from V1 during the stimulus presentation period (stimulus labels were shifted by 4 s to compensate for the hemodynamic delay). The prediction performance of the model was tested using block-averaged data (average voxel intensities of 3 volumes [6 s] or 6 volumes [12 s]).
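The label shift and block averaging described above amount to simple index arithmetic: with one volume every 2 s, a 4 s shift is two volumes, and a 6 s block is three volumes. The following sketch shows one way this could be done; the function names, the use of -1 for unlabeled volumes, and the toy arrays are assumptions for illustration.

```python
import numpy as np

TR = 2.0      # seconds per volume (from the text)
DELAY = 4.0   # hemodynamic shift in seconds (from the text)

def shift_labels(labels, tr=TR, delay=DELAY):
    """Shift volume-wise stimulus labels later in time by `delay` seconds."""
    labels = np.asarray(labels)
    k = int(round(delay / tr))                 # shift in volumes
    shifted = np.full(len(labels), -1)         # -1 marks volumes without a label
    shifted[k:] = labels[:len(labels) - k]
    return shifted

def block_average(volumes, block_len):
    """Average consecutive groups of `block_len` volumes (rows)."""
    n_blocks = volumes.shape[0] // block_len
    trimmed = volumes[:n_blocks * block_len]
    return trimmed.reshape(n_blocks, block_len, -1).mean(axis=1)

labels = np.array([1, 1, 1, 0, 0, 0])          # toy stimulus labels per volume
shifted = shift_labels(labels)
volumes = np.arange(12, dtype=float).reshape(6, 2)   # 6 volumes, 2 voxels
averaged = block_average(volumes, block_len=3)       # 3 volumes = 6 s
```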
2.7. Evaluation of Model Performance
We conducted 10-fold cross-validation analysis to evaluate the prediction performance of the encoding and decoding models derived from our Bayesian CCA approach. We used the data from the random image session to avoid possible biases due to specific shapes contained in the presented visual images. The data were divided into 10 groups, each consisting of 44 different random images and the corresponding fMRI data. The model parameters, including image bases and fMRI voxel weights, were estimated with 9 groups, and the predictions of the encoding and decoding models were tested on the one remaining group. The prediction models trained on random images were also tested with the data from the figure image session. The results were used only for the illustration of reconstruction quality.
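The partition used in this cross-validation can be sketched as follows: 440 images split into 10 groups of 44, with 396 images for training and 44 for testing in each fold. Whether the groups were formed by random shuffling is not stated, so the shuffle and the seed here are assumptions, as is the function name `ten_fold_splits`.

```python
import random

def ten_fold_splits(n_items=440, n_folds=10, seed=0):
    """Yield (train_indices, test_indices) for each fold."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)           # assumed: random group assignment
    fold_size = n_items // n_folds
    folds = [idx[i * fold_size:(i + 1) * fold_size] for i in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, f in enumerate(folds) if j != k for i in f]
        yield train, test

splits = list(ten_fold_splits())
for train, test in splits:
    # Model parameters would be estimated on `train` and evaluated on `test`.
    pass
```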
To evaluate the encoding model, we performed an image identification analysis (Kay et al., 2008) in which the presented image was identified from among a candidate image set based on the fMRI activity patterns predicted by the encoding model. For each candidate image, the encoding model predicted an fMRI activity pattern, and the correlation coefficient between the predicted and measured fMRI activity patterns was calculated. The candidate image corresponding to the predicted fMRI activity pattern that best correlated with the measured pattern was selected as the identified image. The candidate set consisted of a presented image (true image) and a variable number of randomly generated images that were not used in the experiment. The number of randomly generated images in the set was increased from 10 to 1000 in steps of one image. We repeated the identification 200 times for each image and each set size to obtain the percentage of correct identifications. We then calculated the mean performance across images and the confidence interval for each set size.
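A single identification trial of this kind can be sketched as below: predict a pattern for every candidate, correlate each prediction with the measured pattern, and pick the best match. The linear `toy_encoder`, the image and voxel counts, and the noise level are hypothetical stand-ins, not the models or data of this study.

```python
import numpy as np

def identify(measured, candidates, encoder):
    """Return the index of the candidate whose predicted pattern best matches."""
    corrs = [np.corrcoef(encoder(c), measured)[0, 1] for c in candidates]
    return int(np.argmax(corrs))

rng = np.random.default_rng(0)
n_pixels, n_voxels = 100, 200                      # hypothetical sizes
W = rng.standard_normal((n_voxels, n_pixels))      # hypothetical linear encoder

def toy_encoder(image):
    return W @ image

true_image = rng.integers(0, 2, n_pixels).astype(float)
measured = toy_encoder(true_image) + rng.standard_normal(n_voxels)  # noisy response
candidates = [true_image] + [
    rng.integers(0, 2, n_pixels).astype(float) for _ in range(99)
]
chosen = identify(measured, candidates, toy_encoder)
correct = (chosen == 0)   # the true image sits at index 0 in this toy set
```

Repeating such trials over images and candidate set sizes gives the percentage of correct identifications.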
To evaluate the performance of the decoding model, we reconstructed visual images from the fMRI activity patterns (Miyawaki et al., 2008). Reconstruction performance was quantified by the mean squared error in pixel values between the presented and reconstructed images. The performance was compared with that of the previous model using the multiscale fixed image bases (Miyawaki et al., 2008).
3.1. Estimated Image Bases and fMRI Voxel Weights
Figure 2A shows representative image bases estimated by Bayesian CCA. The sign of the pixel values (darker or brighter) is not important here, because the same CCA model holds when the pixel and voxel weights have flipped signs. A black image basis should be interpreted as being an image basis associated with the voxel weights of flipped signs. The shapes were similar to those used in the previous study (Miyawaki et al., 2008) (see the first and second rows of Figure 2A). However, we also found image bases with other shapes (e.g., an L-shape; see the third row of Figure 2A) that were not assumed in the previous study.
To evaluate the relationship between the size of the image bases and the corresponding locations in the visual field, we calculated the distribution of the image bases over eccentricity for different sizes (see Figure 2A, right). The image bases of the smallest size were the most frequent, and most of them were found within 3 degrees of eccentricity. Larger-sized image bases were more frequently found in the peripheral visual field than in the foveal visual field. These results are consistent with the larger receptive field of visual cortical neurons or voxels for peripheral visual fields (Hubel & Wiesel, 1974; Smith, Singh, Williams, & Greenlee, 2001; Dumoulin & Wandell, 2008).
In addition to image bases, Bayesian CCA simultaneously estimates fMRI voxel weights corresponding to each estimated image basis. Figure 2B shows two representative voxel weights corresponding to image bases located to the left and right of the fixation point (the corresponding image bases are shown in the top left of each cortical surface map). Voxel weights were mapped to the cortical surface using the retinotopic map for eccentricity (white lines) obtained in a separate retinotopy experiment using a conventional procedure (Sereno et al., 1995; Engel, Glover, & Wandell, 1997). Large voxel weights were localized in the cortical region corresponding to small eccentricity contralateral to the location of the image basis, consistent with the retinotopic location of the image basis.
The distribution of absolute weight values on the cortical surface for image bases is summarized in Figure 2C. Image bases were sorted by the eccentricity and the polar angle of the central position in the visual field, and voxels were sorted by the corresponding eccentricity and polar angle identified in the retinotopic mapping experiment. The absolute voxel weights were normalized by the maximum value and then averaged for each eccentricity or polar angle bin of the image basis. Large values were distributed along the diagonal, indicating that the cortical location of the voxel weights was congruent to the retinotopic map of the image bases.
3.2. Evaluation of the Encoding Model by Identification Analysis
To evaluate the performance of the encoding model derived from Bayesian CCA, we used an image identification analysis (Kay et al., 2008) in which the encoding model produced predicted brain patterns for candidate visual images, and they were compared with the measured fMRI activity pattern. The visual image whose predicted fMRI activity pattern was best correlated with the measured fMRI activity pattern was selected. If the presented image was selected among the candidate images, the identification was successful (correct identification). We repeated this procedure for all presented images and a variable candidate set size. The performance achieved with the Bayesian CCA encoding model was compared to that with the fixed-basis model and the voxel RF model. In contrast to the Bayesian CCA model, the fixed-basis model lacks the ability to estimate image bases from data. The voxel RF model is trained to treat each voxel independently and is thus unable to take voxel correlations into account.
The identification performance of the Bayesian CCA model remained at a high level, exceeding 30% even for a set size of 1000 candidates (see Figure 3). That is, using the encoding model, we were able to correctly identify the presented visual image from among 1000 candidates with a probability of above 30%, whereas the chance level is 0.1% (=1/1000). Note that when we trained the Bayesian CCA model with shuffled data (the correspondence between visual images and fMRI volumes was shuffled), the performance was at the chance level. The fixed-basis model showed a comparable performance with the Bayesian CCA model for small set sizes but significantly lower performance for larger set sizes (more than 200). The voxel RF model showed much lower performance than the other models for all set sizes. We further extrapolated the identification performance by fitting the sigmoid function with logarithmic scales of set sizes. The extrapolation analysis suggests that the Bayesian CCA model could achieve an accuracy of 10% with a set size of 10^4.3, exceeding the performance of the fixed-basis and voxel RF models (10% accuracy at set sizes of 10^4 and 10^3.2, respectively). These results indicate that the data-driven estimation of both image bases and voxel weights contributed to the prediction performance of the encoding model.
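To make the extrapolation step concrete, the sketch below fits a logistic curve in the logarithm of the set size and solves for the set size at which accuracy would fall to 10%. The exact sigmoid form and fitting procedure are not specified here, so a straight-line fit in logit space is used as a simple stand-in; the function names and the accuracy values are synthetic assumptions, not results from this study.

```python
import numpy as np

def fit_logit_line(set_sizes, accuracy):
    """Fit logit(accuracy) = intercept + slope * log10(set size)."""
    x = np.log10(set_sizes)
    y = np.log(accuracy / (1.0 - accuracy))
    slope, intercept = np.polyfit(x, y, 1)   # polyfit returns highest degree first
    return intercept, slope

def log10_set_size_at(target, intercept, slope):
    """log10 of the set size at which the fitted curve reaches `target` accuracy."""
    return (np.log(target / (1.0 - target)) - intercept) / slope

# Synthetic accuracies generated from a known logistic curve (not measured data).
sizes = np.array([10.0, 100.0, 1000.0])
true_intercept, true_slope = 5.0, -2.0
acc = 1.0 / (1.0 + np.exp(-(true_intercept + true_slope * np.log10(sizes))))

intercept, slope = fit_logit_line(sizes, acc)
x10 = log10_set_size_at(0.10, intercept, slope)   # set size is 10 ** x10
chance_at_1000 = 1.0 / 1000                       # chance level for 1000 candidates
```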
3.3. Evaluation of the Decoding Model by Visual Image Reconstruction
We also derived a decoding model from Bayesian CCA analysis and used it to reconstruct visual images from fMRI activity patterns. Reconstructed images from the figure image session, including letters of the alphabet and geometric shapes, are shown in Figure 4A. Images reconstructed by the Bayesian CCA decoding model captured the essential features of the presented images (see Figure 4A, second row). In particular, they showed fine reconstruction for figures consisting of thin lines, such as small frames and letters of the alphabet. However, the reconstruction was noisier than that of the previous model using multiscale fixed image bases (Miyawaki et al., 2008; see Figure 4A, third row). This is presumably because few image bases were estimated in the peripheral regions by Bayesian CCA (see Figure 2A, right). To evaluate the reconstruction performance quantitatively, we calculated the mean square errors in pixel contrast between the presented and reconstructed images using the data obtained in the random image session at each eccentricity (see Figure 4B). Whereas there was a small difference in the errors of the Bayesian CCA and fixed-basis models in foveal pixels, the fixed-basis model outperformed the Bayesian CCA model in parafoveal pixels. Both models made large errors in peripheral pixels, although the difference between the two was small.
3.4. Effects of Sparseness Priors
Bayesian CCA extracted localized image bases and voxel weights using sparseness priors. To examine the advantages of using sparseness priors, we compared the image bases estimated by Bayesian CCA with those estimated under the following three conditions: CCA with sparseness priors for image bases alone, CCA with sparseness priors for voxel weights alone, and CCA without any sparseness priors. Figure 5A shows representative image bases estimated under these conditions. Overall, most of the image bases estimated by CCA lacking either or both of the sparseness priors were less spatially localized. A quantitative evaluation of the sparseness using kurtosis showed that the image bases estimated by Bayesian CCA were significantly sparser than those under the three conditions (see Figure 5B; Kruskal-Wallis test, p<0.01, Bonferroni corrected for multiple comparisons). These results indicate that both sparseness priors contribute to the estimation of localized image bases.
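Kurtosis as a sparseness measure can be illustrated with a toy comparison between a localized basis and a dense one. The precise kurtosis definition used for Figure 5B is not given here, so the excess-kurtosis formula below is an assumption, as are the function name and the two example bases.

```python
import numpy as np

def excess_kurtosis(v):
    """Fourth standardized central moment minus 3 (zero for a Gaussian)."""
    v = np.asarray(v, dtype=float).ravel()
    c = v - v.mean()
    var = np.mean(c ** 2)
    return np.mean(c ** 4) / var ** 2 - 3.0

rng = np.random.default_rng(0)
localized = np.zeros(100)
localized[[44, 45, 54, 55]] = 1.0          # a few active pixels, the rest zero
dense = rng.standard_normal(100)           # weights spread over all pixels

k_localized = excess_kurtosis(localized)
k_dense = excess_kurtosis(dense)
```

A basis with a few large weights and many near-zero weights has a heavy-tailed weight distribution and therefore a high kurtosis, which is why larger values indicate sparser, more localized bases.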
We also examined the reconstruction performance using the nonlocalized image bases and compared it with the reconstruction performance using localized image bases estimated by Bayesian CCA. All three of the CCA models lacking one or both sparseness priors had significantly poorer reconstruction performance than the Bayesian CCA model (see Figure 5C; ANOVA, p<0.01, Bonferroni corrected for multiple comparisons). The CCA model without any sparseness priors exhibited the worst performance. Thus, the sparseness priors for both image bases and voxel weights also contribute to accurate reconstruction.
We have proposed a new method for estimating encoding and decoding models based on a unified framework using Bayesian canonical correlation analysis (CCA). Bayesian CCA finds multiple correspondences between visual image pixels and fMRI voxels via latent variables. Using this model, we were able to estimate spatially localized image bases and fMRI voxel weights. Based on these estimates, we derived an encoding model that predicted brain activity patterns given stimulus images. Our model outperformed the fixed-basis and voxel RF models. We also derived a decoding model that succeeded in reconstructing visual images from brain activity patterns, though its performance was lower than that of the fixed-basis model. Sparseness priors for the image bases and voxel weights facilitated the selection of spatially localized, sparse pixels as image bases, and contributed to improving the reconstruction accuracy.
Bayesian CCA estimated spatially localized image bases around the foveal region. However, only a small number of image bases were estimated outside the foveal region (see Figure 2A). This may be due to the small cortical magnification factor of the visual cortex for the peripheral visual field (Tootell, Silverman, Switkes, & De Valois, 1982; Engel et al., 1997). Since our model uses sparseness priors under the assumption that a small number of voxels are associated with a small number of pixels, the model may fail to relate a small number of peripheral voxels to a large number of pixels in the peripheral visual field. Adjusting the degree of sparseness according to the eccentricity and the cortical magnification factor may help to improve basis estimation outside the foveal region.
The encoding model derived from Bayesian CCA demonstrated a higher identification performance than the fixed-basis and voxel RF models. The superior performance of the Bayesian CCA model over the fixed-basis model indicates that the data-driven estimation of image bases does indeed contribute to the prediction of brain activity. The much poorer performance of the voxel RF model may be due to its lack of ability to exploit correlation structures among voxels. fMRI voxels are generally highly correlated, and the correlation can carry relevant information about stimuli or tasks, even in the absence of information in individual voxels (Yamashita et al., 2008). Both the Bayesian CCA and fixed-basis models exploit voxel correlations to estimate weight parameters (voxel correlations are taken into account in the iterative calculation of the trial distributions and in equations 2.27 to 2.33), thus producing predictions that reflect voxel correlations. In contrast, the voxel RF model makes predictions in a voxel-by-voxel manner, ignoring any correlation among voxels. This difference may account for the poor performance of the voxel RF model.
Whereas the encoding model derived from Bayesian CCA was more accurate than the fixed-basis encoding model, the decoding model did not outperform the previously proposed fixed-basis decoding model (Miyawaki et al., 2008; see also Figure 4). As noted above, one reason for the inferior performance with Bayesian CCA may be that fewer image bases were estimated outside the foveal region (see Figure 2A), while the fixed-basis decoding model enforced image bases that covered the whole region. The lack of image bases outside the foveal region is likely to have a more profound impact on the decoding performance than on the encoding performance. The calculation of reconstruction error in the decoding model treated all pixels equally, whereas the evaluation of the encoding model using predicted fMRI activity patterns should depend more heavily on foveal pixels, which have a larger cortical representation than peripheral pixels due to the cortical magnification factor. These issues may underlie the discrepancy in performance between the encoding and decoding models.
It is known that fMRI voxels for foveal representation often suffer from signal dropout because of the proximity to the superior sagittal sinus (Dagli, Ingeholm, & Haxby, 1999). We indeed found lower signal-to-noise ratios (SNRs) in foveal voxels using data from the retinotopy experiment (see Figure 6; SNR was calculated as the ratio of the power at the frequency of the periodically moving stimuli to the power at other frequencies). The high level of accuracy for foveal pixels, despite the low SNR in retinotopic voxels, suggests that the pattern of many voxels, including those extending beyond the conventional retinotopy, contributed to the prediction of this region.
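The SNR measure described above can be sketched with a discrete Fourier transform of a voxel time series. The text does not say how the power at other frequencies is pooled, so the mean over all other nonzero frequency bins is an assumption here, as are the function name, the run length, and the synthetic time series.

```python
import numpy as np

def spectral_snr(ts, stim_cycles):
    """Power at the stimulus frequency divided by mean power at other frequencies.

    `stim_cycles` is the number of stimulus cycles in the run, which is also
    the index of the stimulus frequency bin in the real-valued FFT.
    """
    ts = np.asarray(ts, dtype=float)
    power = np.abs(np.fft.rfft(ts - ts.mean())) ** 2
    signal = power[stim_cycles]
    others = np.delete(power, [0, stim_cycles])   # drop DC and the stimulus bin
    return signal / others.mean()

rng = np.random.default_rng(0)
n_vol, cycles = 120, 6                            # hypothetical run: 120 volumes, 6 cycles
t = np.arange(n_vol)
responsive = np.sin(2 * np.pi * cycles * t / n_vol) + 0.5 * rng.standard_normal(n_vol)
noise_only = 0.5 * rng.standard_normal(n_vol)

snr_responsive = spectral_snr(responsive, cycles)
snr_noise = spectral_snr(noise_only, cycles)
```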
It should also be noted that whereas Bayesian CCA could exploit the correlation structure of input variables, stimulus images used for model training consisted of spatially uncorrelated random patterns. Therefore, the encoding and decoding models did not take full advantage of the spatial correlation structure inherent in natural visual images (Olshausen & Field, 1996; Bell & Sejnowski, 1997). Using images with natural correlations, Bayesian CCA may be able to extract image bases and voxel weights reflecting natural image statistics. It remains to be seen in future work how they differ from those derived from random images.
A similar method, known as sparse CCA, has been proposed in the field of biostatistics. This has been applied to find a relationship between gene expression and DNA copy number (Witten, Tibshirani, & Hastie, 2009; Parkhomenko, Tritchler, & Beyene, 2009). Sparse CCA has penalized terms (e.g., an L1 constraint) for vectors of weight matrices instead of the sparseness priors assumed in Bayesian CCA. Depending on the penalized term, sparse CCA is expected to produce similar results to Bayesian CCA. Thus, sparse CCA may be used to estimate spatially localized image bases and voxel weights and to derive encoding and decoding models.
The sparseness priors for image bases and voxel weights played an important role in estimating spatially localized image bases (see Figure 5A). If either was lacking, the estimated bases were not localized, and the reconstruction performance declined. Thus, the sparseness priors effectively pruned off irrelevant mappings, leaving only relevant mappings between a small number of pixels and a small number of voxels. Although we did not explicitly incorporate knowledge of the retinotopic map, the voxels and pixels were linked in a manner consistent with this map (see Figures 2B and 2C). These results suggest that the sparseness priors not only improve model performance, but may also help to find physiologically meaningful relationships between a visual image and brain activity.
Our approach provides a general procedure for estimating a modular representation of perceptual or behavioral elements. Our previous study (Miyawaki et al., 2008) used multiple predefined image bases that assumed a putative image representation of the early visual cortex. Such a predefined model may work in brain areas for which the underlying neural representation is well known. In contrast, Bayesian CCA allows us to estimate modular representation without explicit assumptions about elemental features. Although our results obtained with visual image data are rather confirmatory, demonstrating the proof of concept for the automatic extraction of functional modules by Bayesian CCA modeling, the model can be applied to many different domains. For instance, when applied to higher visual areas, our approach may uncover elemental representations of visual objects or scenes (Fujita, Tanaka, Ito, & Cheng, 1992; Grill-Spector & Malach, 2004; Yamane, Carlson, Bowman, Wang, & Connor, 2008). Another possible application would be to estimate the relationship between motor behaviors and corresponding brain activity patterns (Poggio & Bizzi, 2004). If motor behavior consists of a combination of “synergies,” putative components of motor control (d’Avella, Saltiel, & Bizzi, 2003; Graziano, 2006), Bayesian CCA may allow specific synergies to be found as modular neural representations underlying a great variety of motor behaviors. Thus, our approach provides a new analysis tool for investigating neural representations across multiple cortical areas and modalities.
Appendix: Derivation of Variance Parameters
We thank M. Takemiya, O. Yamashita, and T. Shimokawa for helpful comments on the manuscript. This study was supported by the Nissan Science Foundation, SCOPE (SOUMU), and SRPBS (MEXT).