Abstract

We consider a hierarchical two-layer model of natural signals in which both layers are learned from the data. Estimation is accomplished by score matching, a recently proposed estimation principle for energy-based models. If the first-layer outputs are squared and the second-layer weights are constrained to be nonnegative, the model learns responses similar to complex cells in primary visual cortex from natural images. The second layer pools a small number of features with similar orientation and frequency, but differing in spatial phase. For speech data, we obtain analogous results. The model unifies previous extensions to independent component analysis such as subspace and topographic models and provides new evidence that localized, oriented, phase-invariant features reflect the statistical properties of natural image patches.

1.  Introduction

A variety of methods like independent component analysis (ICA; see Comon, 1994) and sparse coding (Olshausen & Field, 1997) have been applied to model the statistical structure of natural signals such as images and sounds. In computational neuroscience, the goal of modeling these signals with unsupervised learning methods is to gain a better understanding of sensory processing, which is assumed to be linked to the statistics of ecologically valid stimuli (Barlow, 1961; Hyvärinen, Hurri, & Hoyer, 2009).

Linear ICA is limited in scope and cannot capture arbitrary dependencies, so more recent models use a nonlinear representation to better capture the structure of the data. In particular, there is a growing number of hierarchical models with two weight layers. These include direct extensions to ICA such as independent subspace analysis (ISA) and topographic ICA (TICA; see Hyvärinen, Hoyer, & Inki, 2001; Hyvärinen & Hoyer, 2000), which can be viewed as employing a manually selected, fixed second layer that pools over first-layer features modeling dependencies that cannot be removed by a linear transform. The fixed second layer in these models has the advantage that the probability density function (pdf) can still be normalized in closed form, or approximations for the likelihood can be found. Thus, a straightforward estimation of these models by maximizing likelihood is possible.

More recently, models where the second layer is also learned from the data have received attention. However, this comes at the expense of a more complicated estimation, since these models in general cannot be normalized in closed form, making maximum likelihood learning very difficult. Two recent models of this kind are the hierarchical Bayesian model by Karklin and Lewicki (2005, 2006) and the hierarchical product of experts (Osindero, Welling, & Hinton, 2006). The first is a generative model in which the components are not independent and identically distributed (i.i.d.), but the variance is given by hidden variables. The second model is an energy-based model with an intractable partition function, similar to the one we consider here, and it is estimated using contrastive divergence (CD; Hinton, 2002).

We present a two-layer model of natural stimuli where both layers are estimated from the data and analyze the resulting pooling patterns in the second layer. Following the classical energy model of complex cells (Adelson & Bergen, 1985; Spitzer & Hochstein, 1985), linear filter outputs from the first layer are squared and then pooled, where the pooling is learned from the data.

In our analysis, we focus on two points in particular. We compare the results obtained by estimating both layers of the model simultaneously, with a simplified model where the second layer is estimated on top of a fixed ICA basis in the first layer, and report differences in tuning of the linear filters as well as the higher-order units. Furthermore, we analyze the effect of L1-normalization of the second layer on the resulting outputs and show that this normalization plays a significant role in obtaining pooling patterns in line with previous complex cell models.

The model is estimated with score matching (Hyvärinen, 2005, 2007a), a consistent estimation method for energy-based models that cannot be normalized in closed form. Traditionally these energy-based models would have to be estimated with Markov chain Monte Carlo (MCMC) methods, which are computationally expensive and make it hard to evaluate convergence. While recent methods like contrastive divergence are computationally more efficient, it is still necessary to set up a Markov chain, the choice of which may greatly influence the convergence properties. Score matching, in contrast, gives an objective function that can simply be optimized by gradient methods.

This letter is organized as follows. In section 2, we discuss previous models of natural images, focusing on ICA and its extensions. In section 3, the two-layer model is presented, including details of our implementation and the score matching estimation. In addition, we review how the score matching objective function is derived. We test the model and estimation on synthetic data in section 4. In section 5, we apply the method to natural image data. We present models estimated with different constraints and analyze the tuning statistics of the model cells for their complex cell properties. We compare the effect of different normalization methods for the second layer of the model and compare the results with a complete model to those with an overcomplete first layer. We focus on models with a nonnegative second layer but also consider models without the nonnegativity constraint. In section 6 we perform similar experiments with speech data, where we obtain a sparse pooling in the second layer that is very similar to the results for natural images. In section 7 we discuss how our work compares to other recently developed two-layer models, highlighting the principled estimation with score matching and the analysis of the complex cell–like properties of the outputs in our model. Finally we conclude with section 8. Preliminary results have been published in Köster and Hyvärinen (2007).

2.  Modeling of Natural Images

Ever since mammalian visual receptive fields were described by Hubel and Wiesel in the 1960s (see, e.g., Hubel & Wiesel, 1959, 1962), efforts have been made to understand why receptive fields have the observed properties. One successful approach is based on the idea that neural processing should be matched to statistics of ecologically valid stimuli, that is, natural images. This led to the development of statistical models like sparse coding (Olshausen & Field, 1996) and ICA (Jutten & Hérault, 1991; Comon, 1994), which result in basis functions with a strong resemblance to the receptive fields of simple cells.

The approach we use in this letter is inspired by the classical ICA model, so we will briefly look at ICA and its application to natural images. For the ICA model, we suppose that a vector of independent components, or sources s, is mixed to generate the observed data vector x. This can be written as
$$ x = A s \tag{2.1} $$
In the simplest case, which is usually considered, the dimensionality of the source vector and the data vector is the same, so A is a square, invertible mixing matrix. Thus, the components can be recovered from the data using the filter matrix $W = A^{-1}$ as
$$ s = W x \tag{2.2} $$
There is a range of methods for the estimation of this model; here we focus on the likelihood-based approach. The distribution of the individual components is modeled by densities $p_i$, so by independence, we have
$$ p(s) = \prod_i p_i(s_i) \tag{2.3} $$
This allows us to write the pdf of the data as
$$ p(x) = |\det W| \prod_i p_i\big(w_i^T x\big) \tag{2.4} $$
where the $w_i^T$ are the rows of W, and the determinant is a normalization factor due to the transformation of the density. Thus, we obtain the log likelihood of the parameters for a finite sample of data as
$$ \log L(W) = \sum_{t=1}^{T} \left[ \sum_i \log p_i\big(w_i^T x(t)\big) + \log |\det W| \right] \tag{2.5} $$
where x(t) runs over T samples from the data. The likelihood can easily be maximized with regard to the filters W by gradient ascent. Estimating an ICA model for natural image patches results in filters that are localized, oriented, and bandpass, resembling the spatial receptive fields of simple cells in the primary visual cortex (V1).
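As a concrete illustration, the following sketch evaluates the log likelihood of equation 2.5 and its gradient for a Laplacian source prior, $\log p_i(s) = -|s| + \text{const.}$; the prior choice and the variable names are our assumptions for this example, not taken from the original implementation.

```python
import numpy as np

def ica_loglik(W, X):
    """Log likelihood of eq. 2.5 under a Laplacian source prior.
    X holds one data sample per column (shape n x T)."""
    S = W @ X                                   # filter outputs s = Wx for all samples
    T = X.shape[1]
    return -np.abs(S).sum() + T * np.log(np.abs(np.linalg.det(W)))

def ica_loglik_grad(W, X):
    """Gradient of eq. 2.5 with respect to W: the prior term gives
    -sign(Wx) x^T per sample, the determinant term gives T * W^{-T}."""
    S = W @ X
    return -np.sign(S) @ X.T + X.shape[1] * np.linalg.inv(W).T

# one gradient-ascent step with a small constant step size:
# W += 1e-3 * ica_loglik_grad(W, X)
```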
However, a large fraction of cells in V1 is not well described by a linear response. In particular, complex cells, which are insensitive to spatial phase, cannot be modeled with a linear transform. To account for these responses, the ICA model can be extended by adding a fixed second layer on top of squared linear filter responses. Methods such as independent subspace analysis (ISA; see Hyvärinen & Hoyer, 2000) and topographic ICA (TICA; see Hyvärinen et al., 2001) employ a pooling of linear filter responses to model residual dependencies between linear filters. In ISA, the vector of components s is projected onto a number of subspaces. Squared norms of projections onto subspaces are then computed as
$$ u_j = \sum_{i \in S_j} s_i^2 \tag{2.6} $$
where the index set $S_j$ contains all the components that belong to the jth subspace. The pdf of the model then takes the form
$$ p(x) = |\det W| \prod_{j} \exp\!\big(f(u_j)\big) \tag{2.7} $$
where the scalar nonlinearity f(.) defines the overall shape of the pdf. The pooling can also be viewed as a second linear transformation, or weight layer, where a number of first-layer units converge into one higher-order unit. Since independence is assumed only for the higher-order units uj, the linear features that are projected onto one subspace may have dependencies. Applied to natural images, this results in a pooling of features with similar frequency, orientation, and location but a different spatial phase. Thus, it can be argued that complex cells are tuned to capture dependencies, in particular correlations in the variance of linear filters (Schwartz & Simoncelli, 2001), which the model makes explicit by computing squared norms.
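A minimal sketch of equations 2.6 and 2.7 for contiguous, equally sized subspaces is given below; the particular choice f(u) = -sqrt(u) is one common option and is our assumption here for illustration, not a prescription from the text.

```python
import numpy as np

def isa_logpdf(W, x, subspace_size, f=lambda u: -np.sqrt(u)):
    """Log-density of the ISA model: subspace energies (eq. 2.6) are passed
    through the nonlinearity f (eq. 2.7), plus the log-determinant term."""
    s = W @ x                                              # linear filter outputs
    u = (s ** 2).reshape(-1, subspace_size).sum(axis=1)    # squared norm per subspace
    return f(u).sum() + np.log(np.abs(np.linalg.det(W)))
```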

3.  The Model and Estimation

3.1.  The Two-Layer Model.

While the ISA model described above gives important insights into the interpretation of simple and complex cells as feature detectors tuned to the statistics of natural stimuli, it is somewhat limited as an explanation of why the specific pooling takes place, since only a single linear transformation is learned from the data and the additional connectivity is prespecified. This rules out certain types of connectivity that might provide a better model of the data, in favor of architectures that have been hypothesized from theoretical principles. It would be preferable to estimate a full two-layer model to allow us to evaluate whether the kind of connectivity used in earlier models is actually valid from the point of view of statistical optimality. A conceptually simple extension of the ISA model described in the previous section would be to retain the basic structure, but learn the second layer from the data rather than fixing it. Thus we define a pdf that can be viewed as describing a two-layer network,
$$ \log p(x; W, V) = \sum_{h} f\!\left( \sum_{i} v_{hi}\, g\big(w_i^T x\big) \right) - \log Z(W, V) \tag{3.1} $$
where the first term in the log probability is given by a sum over the outputs of individual second-layer units. Here Z(W, V) is the partition function of the model—a function of the model parameters that ensures that the pdf integrates to unity. The $v_h^T$ are rows of the second-layer weight matrix V, while the first layer W has been retained from the ICA model. The two weight matrices need not be square, so in general the number of rows of W (the number of linear filters) may differ from the data dimensionality, and the number of rows of V (the number of higher-order units) may differ from the number of filters. We have two scalar nonlinearities g(.) and f(.), the first of which computes nonlinear features from the data, whereas the second shapes the overall pdf. Such a model cannot be normalized in closed form, since the normalization constant Z is given by an intractable integral. Therefore, we use score matching for the estimation, which provides a straightforward method for learning in energy-based models.
For the results presented in this work, we have defined g to be a squaring operation unless otherwise specified. In addition, the second layer was constrained to be nonnegative. This is a natural choice for a model of complex cells, where outputs are computed by pooling over squared or rectified simple cell responses (Pollen & Ronner, 1983; Spitzer & Hochstein, 1985). The second nonlinear function is chosen to be of the form
$$ f(u) = -\sqrt{u + 1} \tag{3.2} $$
which ensures that the overall distribution of the model is supergaussian. Again, this nonlinearity was used in all the simulations unless otherwise mentioned. When these nonlinearities are used, the model distribution becomes
$$ \log p(x; W, V) = -\sum_{h} \sqrt{\sum_{i} v_{hi}\, \big(w_i^T x\big)^2 + 1} \; - \log Z(W, V) \tag{3.3} $$
We further constrained the vectors $v_h^T$ to be normalized to unit L2-norm or, alternatively, to unit L1-norm; the former corresponds to constraining the second-layer units to have unit output energy, while the latter encourages sparse connectivity.
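The unnormalized log-density of equation 3.3 is simple to write down in code. The sketch below uses g(u) = u^2 and a smooth square-root form for f; this specific form of f is our reading of equation 3.2 and should be treated as an assumption.

```python
import numpy as np

def two_layer_energy(W, V, x):
    """Unnormalized log-density of the two-layer model (eq. 3.3): squared
    first-layer outputs are pooled by the nonnegative rows of V and passed
    through a supergaussian-inducing nonlinearity f."""
    g = (W @ x) ** 2                 # first layer: squared filter outputs
    u = V @ g                        # second layer: pooled energies
    return -np.sqrt(u + 1.0).sum()   # f(u) = -sqrt(u + 1); the log Z term is omitted
```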

3.2.  Score Matching.

Score matching (Hyvärinen, 2005, 2007a, 2007b) is an estimation method that allows learning statistical models that are specified only up to a multiplicative normalization constant (partition function). Consider samples of a random vector x that follows a pdf $p_x(\cdot)$ and to which we would like to fit a model. We define a parameterized model density $p(\cdot\,; \theta)$, which includes the true pdf, and where $\theta$ is a parameter vector that we would like to estimate. Suppose that the normalization constant Z of the pdf cannot be computed in closed form, and we use q to denote the unnormalized distribution. In the form of a log probability, we have the model
$$ \log p(x; \theta) = \log q(x; \theta) - \log Z(\theta) \tag{3.4} $$
The model score function, which we define as the gradient of the log probability with respect to the data, is obviously identical for q and p and given by
$$ \psi(x; \theta) = \nabla_x \log q(x; \theta) = \nabla_x \log p(x; \theta) \tag{3.5} $$
The score function of the observed data is denoted by
$$ \psi_x(x) = \nabla_x \log p_x(x) \tag{3.6} $$
Working with the score function thus has the advantage that it does not depend on the normalization constant Z. The model can now be estimated by minimizing the squared distance between the model score function $\psi(x; \theta)$ and the data score function $\psi_x(x)$. This objective function is defined by
$$ J(\theta) = \frac{1}{2} \int p_x(x)\, \big\| \psi(x; \theta) - \psi_x(x) \big\|^2 \, dx \tag{3.7} $$
This may not appear to be very useful at first sight, because estimating the data score function is a nonparametric problem and would require no less effort than estimating the normalization constant. However, a much simpler form of the objective function can be obtained. The full proof can be found in Hyvärinen (2005). We start by expanding the squared term to
$$ J(\theta) = \frac{1}{2} \int p_x(x) \Big[ \|\psi(x; \theta)\|^2 + \|\psi_x(x)\|^2 - 2\, \psi_x(x)^T \psi(x; \theta) \Big] \, dx \tag{3.8} $$
Here we note that the first term does not depend on the data score function, so by rewriting it with the squared norm expanded as a sum, we get
$$ \frac{1}{2} \int p_x(x)\, \|\psi(x; \theta)\|^2 \, dx = \frac{1}{2} \sum_{i} \int p_x(x)\, \psi_i(x; \theta)^2 \, dx \tag{3.9} $$
where the $\psi_i$ are the elements of the score function. The second term is constant with regard to $\theta$, so we simply set
$$ \frac{1}{2} \int p_x(x)\, \|\psi_x(x)\|^2 \, dx = \text{const.} \tag{3.10} $$
Thus, we focus on the third term, where we start by writing out the inner product,
$$ -\int p_x(x)\, \psi_x(x)^T \psi(x; \theta)\, dx = -\sum_{i} \int p_x(x)\, \psi_{x,i}(x)\, \psi_i(x; \theta)\, dx \tag{3.11} $$
and consider a single element of the sum. We now use the definition of the data score function, $\psi_{x,i}(x) = \partial \log p_x(x) / \partial x_i$, so when the chain rule is used, the term becomes
$$ -\int p_x(x)\, \frac{\partial \log p_x(x)}{\partial x_i}\, \psi_i(x; \theta)\, dx = -\int \frac{\partial p_x(x)}{\partial x_i}\, \psi_i(x; \theta)\, dx \tag{3.12} $$
We then use multivariate partial integration (Hyvärinen, 2005) to obtain the ith term as
$$ -\int \frac{\partial p_x(x)}{\partial x_i}\, \psi_i(x; \theta)\, dx = \int p_x(x)\, \frac{\partial \psi_i(x; \theta)}{\partial x_i}\, dx + D \tag{3.13} $$
where the constant D from the partial integration is zero since $p_x(x)\,\psi_i(x; \theta)$ vanishes as $\|x\| \to \infty$. Working with a finite sample of data, we can replace the exact expectations with sample averages. Collecting the terms, we then obtain the expression
$$ \tilde{J}(\theta) = \frac{1}{T} \sum_{t=1}^{T} \sum_{i} \left[ \frac{\partial \psi_i(x(t); \theta)}{\partial x_i} + \frac{1}{2}\, \psi_i(x(t); \theta)^2 \right] + \text{const.} \tag{3.14} $$
which is easy to evaluate since it contains only terms depending on the model pdf. Score matching has been shown to provide a consistent estimator (Hyvärinen, 2005), so if the data follow the model, the objective function $\tilde{J}$ is asymptotically minimized for the true parameters.
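To make equation 3.14 concrete, the sketch below evaluates the sample version of the objective for any model that can supply its score function and the diagonal of the Jacobian of that score; it is a generic illustration, not the authors' code.

```python
import numpy as np

def score_matching_objective(score_fn, score_diag_jac_fn, X):
    """Sample version of eq. 3.14: for each data point, sum over dimensions
    the derivative d psi_i / dx_i plus half the squared score element."""
    J = 0.0
    T = X.shape[1]
    for t in range(T):
        psi = score_fn(X[:, t])               # model score psi(x; theta), shape (n,)
        dpsi = score_diag_jac_fn(X[:, t])     # d psi_i / dx_i for each i, shape (n,)
        J += np.sum(dpsi + 0.5 * psi ** 2)
    return J / T
```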

3.3.  Estimating the Model.

We can now apply the score matching framework to the model defined in equation 3.3. The score function of the two-layer network is given by
$$ \psi(x; W, V) = \sum_{h} f'\!\big(v_h^T g(Wx)\big) \sum_{i} v_{hi}\, g'\big(w_i^T x\big)\, w_i \tag{3.15} $$
so we can write the score matching objective—the squared distance between model and data score function—as
$$ \tilde{J}(W, V) = \frac{1}{T} \sum_{t=1}^{T} \sum_{j} \Bigg[ \sum_{h} f''\!\big(v_h^T g(Wx(t))\big) \bigg( \sum_{i} v_{hi}\, g'\big(w_i^T x(t)\big)\, w_{ij} \bigg)^{\!2} + \sum_{h} f'\!\big(v_h^T g(Wx(t))\big) \sum_{i} v_{hi}\, g''\big(w_i^T x(t)\big)\, w_{ij}^2 + \frac{1}{2} \bigg( \sum_{h} f'\!\big(v_h^T g(Wx(t))\big) \sum_{i} v_{hi}\, g'\big(w_i^T x(t)\big)\, w_{ij} \bigg)^{\!2} \Bigg] \tag{3.16} $$

Optimizing this objective is straightforward by gradient descent, which requires the gradients of the above expression with respect to the elements of the weight matrices W and V. These gradients are given in the appendix. The nonnegativity and norm constraints were implemented by projecting onto the constraint set after each gradient step.
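A sketch of the optimization loop described above is given below: plain gradient descent on the score matching objective, with the nonnegativity and unit-norm constraints on V enforced by projection after every step. The gradient routines are placeholders standing in for the expressions in the appendix, and any constraint on W is omitted here.

```python
import numpy as np

def project_V(V, norm="l1"):
    """Project rows of V onto the constraint set: clip to nonnegative values,
    then rescale each row to unit L1 or L2 norm."""
    V = np.maximum(V, 0.0)
    p = 1 if norm == "l1" else 2
    norms = np.linalg.norm(V, ord=p, axis=1, keepdims=True)
    return V / np.maximum(norms, 1e-12)

def fit(W, V, X, grad_W, grad_V, step=1e-3, n_iter=10000, norm="l1"):
    """Projected gradient descent on the score matching objective (eq. 3.16);
    grad_W and grad_V are placeholders for the appendix gradients."""
    for _ in range(n_iter):
        W = W - step * grad_W(W, V, X)
        V = V - step * grad_V(W, V, X)
        V = project_V(V, norm)
    return W, V
```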

4.  Experiments on Simulated Data

To verify the identifiability of the model, we estimated it for simulated data with a known higher-order structure. We generated data following the ISA model, which is a special case of the proposed two-layer model and easy to sample from. The data generated in this way contain higher-order dependencies in the form of common variances for groups of source variables, which cannot be captured by ICA. Samples from the ISA model were generated as follows. To obtain T observations of an n-dimensional vector that contains k subspaces, we first create a matrix M of observations from an i.i.d. gaussian with unit variance, and a matrix B of variance parameters from a uniform distribution. We introduce dependencies within groups of the gaussians by multiplying them with a common variance parameter from the uniform distribution, $u_i(t) = m_i(t)\, b_j(t)$, where j indexes the subspace to which component i belongs. The supergaussian variables produced in this way are then multiplied with a mixing matrix A that is also generated randomly, so the data matrix is X = AU. Before the estimation, the data are whitened. For the experiments shown below, we set T = 5000, n = 21, and k = 7, so each subspace has three elements.
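The sampling procedure can be written out as follows; this is our reconstruction of the description above (gaussian matrix M, uniform variance parameters B shared within each subspace, random mixing matrix A), with the whitening step omitted and the function name our own.

```python
import numpy as np

def sample_isa(T=5000, n=21, k=7, rng=np.random.default_rng(0)):
    """Generate data following the ISA model: i.i.d. gaussians are multiplied
    by a variance parameter common to each subspace, then linearly mixed."""
    subspace_size = n // k
    M = rng.standard_normal((n, T))                  # unit-variance gaussians
    B = rng.uniform(size=(k, T))                     # one variance parameter per subspace
    U = M * np.repeat(B, subspace_size, axis=0)      # common variance within each group
    A = rng.standard_normal((n, n))                  # random mixing matrix
    return A @ U, A                                  # data X = AU and ground-truth A
```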

The experiments with artificial data mainly served the purpose of confirming the consistency of the estimation, but also to try out various initialization and normalization procedures for the experiments on natural stimuli. We compare L1 and L2 normalization of the second-layer matrix V, and we compare randomly initializing V and initializing it with an identity matrix, which allows prelearning of W as an ICA model.

For the visualization of the results, note that in ICA, one can simply multiply the mixing matrix A with the estimated filter matrix W to obtain a permuted diagonal matrix if the components are identified correctly. Thus, visual inspection of the product WA can be used to determine convergence. The ISA model is identifiable only up to subspaces due to the rotational symmetry within each subspace, and the second layer determines which subspace each first-layer unit belongs to. By multiplying the second-layer matrix V with the matrix of elementwise squared entries of WA, which we denote Z, a block-diagonal matrix with permuted rows should be obtained if the algorithm converges correctly.

Results are presented in Figure 1. In each of the four experiments (a–d) we performed, the top row shows the second layer V on the left and the product VZ on the right. In the bottom row, on the left, we show V with its rows permuted in such a way that similar rows are next to one another. This is purely for visualization purposes and does not affect the objective function. Again, on the right, we plot the product VZ. If this results in a permuted block-diagonal matrix, the second layer has correctly identified the dependency structure in the first layer.
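The convergence check described above can be coded in a few lines. In this sketch, Z is taken to be the elementwise square of WA, matching how the second layer acts on squared first-layer outputs, and the row permutation is only a rough heuristic for display; both are our assumptions.

```python
import numpy as np

def convergence_check(W, V, A):
    """Apply the learned second layer to the squared product WA; a permuted
    block-diagonal result indicates that the subspaces were identified."""
    Z = (W @ A) ** 2                              # how estimated filters relate to true sources
    P = V @ Z                                     # pooling of true-source contributions
    order = np.argsort(np.argmax(V, axis=1))      # group similar rows, for visualization only
    return V[order], P[order]
```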

Figure 1:

Simulations with generated data following the ISA model. For each of the four plots, we show the second-layer matrix V on the top left and the product VZ on the top right. The bottom row contains the same matrices as the top row, but with the vectors permuted for visualization purposes. (a) Both layers estimated simultaneously with W initialized with gaussian white noise and V with an identity matrix. The rows of V are constrained to unit L1-norm. (b) Like a, but both weight layers initialized with white noise. Convergence takes nearly an order of magnitude longer, but the global minimum is found nevertheless. (c) Like a, but with rows of V constrained to unit L2-norm. Note that the second layer converged to a local minimum. (d) The estimation can be simplified by estimating only W first, with V held constant. In the second step, both layers are learned. The quality of the optimum does not change, but the speed of convergence is increased.


First, the comparison between Figures 1a and 1b shows that convergence is possible both from V initialized with the identity matrix and from a random V. However, the number of iterations is about an order of magnitude greater when starting from random. Second, between Figures 1a and 1c, we compare the effect of L1 and L2 normalization. In this case, the L2-normalized model has converged to a local minimum and has not identified all the components correctly. In general, however, there was no major difference between the L1 and L2 normalization for these data. Finally, in Figures 1a and 1d, we analyze the effect of estimating the layers sequentially, which means that W alone is learned while V is fixed to identity, after which W is held fixed and learning is continued with V only. In this simple example, the model converges to the correct solution, but as we will see, it is preferable to estimate both layers simultaneously.

5.  Experiments on Natural Images

5.1.  Methods.

All experiments were performed on images taken from P. O. Hoyer's ImageICA package (www.cs.helsinki.fi/patrik.hoyer), using 20,000 image patches of size 16 × 16 pixels. The whole images were preprocessed by approximate whitening, assuming a fixed average power spectrum, and by contrast gain control with a gaussian neighborhood of 16 pixels diameter. Details of this preprocessing can be found in Hyvärinen and Köster (2007). The preprocessing can be given a physiological justification in terms of the processing in the retina and lateral geniculate nucleus, or it can be viewed more pragmatically as slightly simplifying the statistical structure of the images. We then randomly sampled patches from the images and removed the DC component from the patches. We also discarded any image patches with low variance, since they contribute little to the gradient and slow down learning. Finally, we whitened the patches and simultaneously reduced the dimensionality from 256 to 120 using principal component analysis. The dimensionality reduction corresponds to low-pass filtering and eliminates aliasing artifacts due to the rectangular sampling grid of the image patches. Both weight matrices W and V were chosen to be square, of size 120 × 120, unless otherwise noted.
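The per-patch part of this pipeline (DC removal, rejection of low-variance patches, PCA whitening with dimensionality reduction) might look as follows. The whole-image whitening and contrast gain control steps described in Hyvärinen and Köster (2007) are not reproduced here, and the variance threshold and function name are placeholders of our own.

```python
import numpy as np

def preprocess_patches(P, out_dim=120, var_threshold=1e-3):
    """P: image patches as columns, e.g. 256 x 20000 for 16x16-pixel patches.
    Remove the DC component, drop low-variance patches, then whiten with PCA
    while reducing the dimensionality to out_dim."""
    P = P - P.mean(axis=0, keepdims=True)             # remove DC per patch
    P = P[:, P.var(axis=0) > var_threshold]           # discard low-variance patches
    C = np.cov(P)                                     # pixel covariance matrix
    d, E = np.linalg.eigh(C)
    idx = np.argsort(d)[::-1][:out_dim]               # keep leading principal components
    whitener = np.diag(d[idx] ** -0.5) @ E[:, idx].T
    return whitener @ P, whitener
```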

The matrix W was initialized with gaussian white noise, and V was initialized as an identity matrix for the experiments in section 5.2 and with noise from a uniform distribution for the experiments in section 5.3. The models were optimized using gradient descent with a constant step size. To increase the speed of convergence of the experiments in section 5.2, we initialized by estimating the first layer only, keeping the second layer fixed. After the convergence of the first layer to an ICA basis, we performed two experiments. In the first, both layers were estimated simultaneously. In the second, W was held fixed after the initial convergence to an ICA basis, and only V was estimated. In all experiments, the output units (rows of V) were normalized to unit L1- or L2-norm after every step. Convergence was determined by visual inspection and took about 300 hours on a Pentium IV workstation.

To analyze the tuning properties of the filters in the first layer, we fit Gabor functions to the basis functions obtained by inverting the filter matrix W. For each of the first-layer responses, we used a least-squares fit (adapted from Hyvärinen et al., 2001) to determine location, orientation, size, phase, and frequency of the optimal Gabor. To compute tuning curves of the second-layer outputs, we followed the model by taking squares of the first-layer filter responses and summing them, weighted by rows of V. The second-layer nonlinearity is required in the estimation to define a supergaussian pdf, but it is not considered part of our complex cell model, so we analyzed the second-layer outputs without any further nonlinearity. To obtain tuning curves, we designed the test stimulus for each higher-order unit as a Gabor function constructed from a weighted average of the constituent first-layer Gabor parameters. One of the parameters (location, orientation, phase, frequency) was then varied while the others were held at the optimum value.
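The phase-tuning curve of a higher-order unit, computed as described above (squared first-layer responses pooled by a row of V, with no output nonlinearity), might be sketched as follows. Here `gabor_patch` is a hypothetical helper that renders a Gabor stimulus from a parameter dictionary; it and the function name are our assumptions.

```python
import numpy as np

def phase_tuning_curve(W, v_row, gabor_params, gabor_patch, n_phases=32):
    """Response of one second-layer unit to its optimal Gabor while the
    spatial phase is varied and all other parameters are held fixed."""
    phases = np.linspace(0.0, 2.0 * np.pi, n_phases)
    responses = []
    for phase in phases:
        params = dict(gabor_params, phase=phase)   # vary the phase only
        x = gabor_patch(**params)                  # hypothetical stimulus generator
        responses.append(v_row @ (W @ x) ** 2)     # pooled squared filter outputs
    responses = np.array(responses)
    return phases, responses / responses.max()     # normalized tuning curve
```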

5.2.  Results.

In Figure 2 we analyze the first-layer features and how they differ from those of an ICA model. Figure 2a shows a random selection of 48 of the 120 basis functions for models estimated with L1 and L2 normalization, as well as the ICA basis functions obtained with V fixed to identity, for comparison. It can be seen that all the filters are Gabor-like and tuned in orientation, position, frequency, and phase, resembling the responses of simple cells. The basis functions from the L1 model appear slightly more localized, but less frequency and orientation selective than the filters from the L2 model, with ICA falling between the two extremes. Comparing the score matching objective function for the three models, we observe that the L2 model gives the best fit to the data, followed by the L1 model, while the ICA model gives the worst fit, which improves very little if the second layer is learned on top of the ICA basis without simultaneously allowing the basis functions in W to adapt to the new pooling patterns.

Figure 2:

(a) A subset of 48 randomly selected linear basis functions in the first weight layer. On the left and in the middle, features for models with an L1-normalized and an L2-normalized second layer are shown. On the right we show a model with the second layer fixed to identity, which corresponds to an ICA model. The L2-norm model was initialized with this ICA basis, so the filters are similar. The L1-norm model shows somewhat more location selectivity, whereas the L2-norm model has more precise frequency and orientation tuning. (b) Change in tuning properties of the linear filters in W as the first layer adapts to the pooling patterns in the second layer. The scatter plots show how the tuning of the individual Gabors changes as we go from the ICA model to the L2-normalized model. The horizontal axis shows the value of the parameter in the ICA model and the vertical axis the value after learning both layers. While orientation tuning changes very little, and position as well as frequency also remain relatively stable, the spatial phase changes more dramatically.


In Figure 2b, we investigate how the ICA basis functions in W change when the first layer adapts to the pooling patterns in the second layer. For this comparison, the L2-norm model was estimated from the same random seed as the ICA model, where V was fixed to identity. As can already be seen in Figure 2a, the features change only a little, but a scatter plot of the changes in the Gabor parameters for each linear filter reveals some systematic changes. The orientation tuning is least affected by the estimation of both layers, so this parameter does not change significantly. Positions and frequencies change slightly for most of the features, but strong changes in the tuning are rare. The phase tuning is very different, however, with many linear filters completely changing their phase to better adapt to the pooling in the second layer. This gives some idea why the improvement in model fit is so small if V is estimated on top of a fixed ICA basis, without allowing the features in W to adapt.

In Figure 3 we show the second layer of the model and the emerging pooling patterns in more detail. In Figures 3a and 3b, we show a subset of the pooling patterns in a representation adapted from Hyvärinen, Gutmann, and Hoyer (2005), which also allows easy comparison with Karklin and Lewicki (2005). For each higher-order unit, the linear filters that contribute to the output are represented by ellipses. The location and the orientation of each ellipse correspond to the location and orientation of the underlying first-layer basis function. Frequency is represented by the size of the ellipse, where larger corresponds to lower frequencies. The shading of the ellipse represents the connection strength, light gray being close to zero and black corresponding to maximal contribution. For the L1 model, most of the outputs pool over a small number of linear filters, which share similar orientations and positions, while the pooling is more heterogeneous for the L2 model. While the sparseness of the pooling is more pronounced with L1 regularization, the average fraction of significantly active linear filters is still well below 10% for the L2-norm model.
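One simple way to quantify the sparseness of the pooling discussed here is to count, for each row of V, the fraction of weights that exceed some fraction of the row maximum; the 10% threshold below is an arbitrary choice for illustration, not the criterion used in the original analysis.

```python
import numpy as np

def pooling_sparseness(V, rel_threshold=0.1):
    """Fraction of 'significantly active' first-layer inputs per output unit,
    defined here as weights above rel_threshold times the row maximum."""
    active = V > rel_threshold * V.max(axis=1, keepdims=True)
    return active.mean(axis=1)       # one value per second-layer unit
```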

Figure 3:

(a, b) A random selection of 24 higher-order features, corresponding to individual rows of V. Each feature is represented by a number of ellipses corresponding to individual first-layer basis functions with the same orientation and position as the ellipse. Spatial phase is not shown in this representation. Each unit can be seen to pool over a small number of basis functions that tend to be iso-oriented and co-localized. This is typical behavior for complex cell receptive fields. While the L1-norm penalty in a leads to a relatively uniform population of outputs, the features with an L2-norm constraint in b show a distinct splitting into two subpopulations. Some features pool over a larger number of inputs and lose much of the location selectivity, while the rest pool over fewer inputs than with the L1-norm. (c, d) Left: Pooling patterns visualized in more detail by plotting the most active linear filters contributing to some randomly selected higher-order units. Each row corresponds to one output, and the black bars represent the relative strength of the linear filter inputs. Right: Plot of the second-layer matrix V.


In Figures 3c and 3d, this is further analyzed for the two models: the plot on the left shows the most active features for a random selection of second-order units. The units can be seen to share similar frequency, orientation, and location but differ in spatial phase. On the right-hand side, the second-layer matrix V is shown directly. Again, the connectivity can be seen to be sparse, with only a few first-layer features contributing to each row of V. Here the linear filter inputs were sorted by frequency and the output rows by sparseness. The pooling is quite homogeneous for the L1 case, but for L2 normalization, large groups of high-frequency filters are pooled into one output. There are several nearly identical copies of these outputs, indicating a comparatively large contribution to the model pdf.

In Figure 4 we analyze the complex cell properties of the higher-order outputs for the different models, following Hyvärinen and Hoyer (2001). One parameter of the fitted Gabor was changed at a time, and the normalized response was plotted as a function of the tuning parameter. The solid line shows the mean response of 120 tested cells, and the dashed lines give the 10% and 90% quantiles. We further investigate the effect of simultaneous versus sequential estimation of the weight layers and the difference between L1 and L2 normalization. The tuning curves are computed by taking the optimal Gabor stimulus for each higher-order unit and changing one of the parameters (phase, position, orientation, and frequency) at a time. In Figure 4a, only the first layer was learned and the second fixed to the identity matrix, so the model corresponds to ICA. This results in simple cell behavior with strong phase selectivity. Sequential estimation of the two weight layers with L1 normalization is shown in Figure 4b, so the first-layer filters are not adapted to the pooling patterns. There is a decrease in selectivity to spatial phase, indicating complex cell properties. In Figure 4c, both layers have been estimated simultaneously with L1 normalization. The adaptation of the phase of the linear filters to the pooling patterns leads to a striking decrease in phase selectivity; that is, the second-layer outputs become more complex cell-like. In particular, the upper 10% quantile of the outputs becomes essentially completely phase invariant, whereas in the sequential estimation, there is still a 40% modulation in this quantile. At the same time, the selectivity for position, orientation, and frequency is not affected considerably, with only a slight broadening. In Figure 4d, we show the responses for a model with simultaneous estimation of both layers and L2 normalization. Due to the heterogeneous pooling patterns, much of the selectivity, in particular for position, is lost. At the same time, the large number of simple cell-like outputs with only a single strongly active linear filter leads to a loss of phase invariance. Regularization with an L1 norm thus appears to be an important requirement for obtaining complex cell-like responses.

Figure 4:

Analysis of complex cell properties of the second-layer outputs. (a) Only the first layer, W, was estimated, and V was fixed to identity. (b) After W had converged, it was held constant, and V was estimated using this constant first layer. (c) W was initialized as above, but then both layers were estimated simultaneously. This shows significantly less phase sensitivity in the tuning curves, indicating that W has adapted to the pooling imposed by V. (d) Responses obtained with L2 normalization under simultaneous optimization. Not only is some of the phase invariance lost, but position and orientation tuning are significantly worse than for the L1 case.


5.3.  Estimation of an Overcomplete Model.

To generalize our experiments to an overcomplete model, we propose a model in which the number of linear filters is higher than the data dimensionality, but the dimensionality is reduced again for the higher-order units. This is motivated by the observation that with no normalization on the second layer, many of the outputs go to zero (experiments not shown). Since we do not need to take the normalizability of the model into account, it is straightforward to make the set of filters in W overcomplete. Here we analyze a model that is overcomplete by a factor of 4, with 240 filters in W estimated on data with the dimensionality reduced to 60, but otherwise identical to the image data used in the previous section.

In order to make the model overcomplete, we need to drop two simplifications used so far. First, we cannot use an ICA initialization since this would require as many output units as linear filters. Instead of the initialization with an identity matrix, we initialize V randomly with uniform noise. Second, the matrix W can no longer be constrained to be an orthogonal rotation, so it is estimated with rows constrained to unit L2 norm.

To analyze the effect of the random initialization separately from that of overcompleteness, we compare the results from the overcomplete model with those from a complete model with a randomly initialized second layer. Both models were estimated with L2 normalization on the second layer. Figure 5 shows the results in the same way as the previous plots—the first- and second-layer features and pooling patterns for these models. For the overcomplete model, it can be seen that some of the features, in particular at higher frequencies, are less localized than in the models with ICA initialization. The reason is evident when considering the pooling patterns: there are many higher-order units pooling over a large number of linear filters, so selectivity in these features is reduced, and more global pooling patterns emerge. It is also worth pointing out that some of the basis functions in the overcomplete case contain multiple Gabor functions, indicating convergence to a local minimum. This problem is not due to the random initialization, as can be seen from the complete model, which in fact attains a score matching objective value not significantly different from that of the diagonally initialized model we considered earlier. Rather, we suggest that the orthogonality of W is an important requirement to avoid convergence to local minima.

Figure 5:

Comparison of a complete and a four times overcomplete model, both with randomly initialized V and L2 normalization on the second layer. (a, b) A subset of the filters in W for the two models. Random initialization does not lead to qualitative differences, but the overcomplete model learns features that are less localized and more frequency selective. (c, d) A representation of the higher-order features (see Figure 3 for details). The overcomplete model can be seen to give rise to a population of highly selective outputs which include only a single linear filter as well as a population of largely invariant higher-order units pooling over a large fraction of inputs and retaining only orientation selectivity. (e, f) The second-layer weight matrix V with inputs arranged by frequency and outputs ordered by sparseness of the pooling.


5.4.  Estimation Without Nonnegativity Constraints.

In addition to the experiments with a nonnegative second layer, we also estimated the model with two different nonlinearities that did not require a nonnegativity constraint, allowing us to model negative energy correlations in addition to positive ones. First, we analyzed the case of a symmetric nonlinearity f(u) = log cosh(u), which leads to an output distribution that is sparse for both negative and positive outputs. Additionally, we report on the results with an asymmetric nonlinearity, where the negative half of the nonlinearity corresponds to a low-variance gaussian distribution and the positive half follows a logistic distribution. In both cases, an L2-norm constraint was imposed on the rows of V, which was initialized randomly with gaussian white noise. Moreover, the first-layer nonlinearity was set to g(u) = log cosh(u) instead of the squaring operation, in order to better model a heavy-tailed distribution in conjunction with the second nonlinearity.

In the symmetric model (results not shown), we found sparse connectivity of the second layer, but with higher-order features very different from the complex cell-like responses of the nonnegative model. For some of the outputs, the first layer forms pairs of features that each contain two Gabors in their receptive field: one Gabor that is identical in both features, and one that appears with opposite sign in the two features, as in Lindgren and Hyvärinen (2006). Individual higher-order outputs pool one such pair of features with one strong negative and one positive weight. Using the identity $a^2 - b^2 = (a - b)(a + b)$, this can be interpreted as taking the product of the sum and difference of the two linear filters, which turns out to be the product of two individual Gabor filters. Thus the model is effectively taking products of linear filter outputs, with results very similar to those observed previously in quadratic ICA (Lindgren & Hyvärinen, 2006). In the model with sparse positive and gaussian negative outputs (results not shown), we find higher-order features with one or a small number of highly active positive inputs and a larger number of small negative inputs. This could possibly be interpreted as surround inhibition or a gain control phenomenon.

Thus we see that the results depend strongly on the choice of the nonlinearity f, and very different results can be obtained by changing the nonlinearity. It seems difficult to make a principled choice of f because the usual measures of sparseness do not easily generalize to two-layer models. In future research, f could possibly be estimated from the data. In the current work, we choose to analyze only the results from the nonnegative model, because it seems to be most in line with the visual processing in mammalian visual cortex, that is, complex cell responses. Furthermore, we view our model as a direct extension to models with nonnegative second layers such as ISA and topographic ICA. These have previously been motivated by the observation of strong positive energy correlations between linear filter outputs, so nonnegativity seems to be a reasonable constraint.

6.  Experiments on Audio Data

In order to demonstrate the general applicability of our model to a variety of data sets, we also tested it on speech data from the TIMIT database. The model was estimated in exactly the same way as for image data, and the data were preprocessed as follows. We took 100 short utterances from the database and resampled them to 8 kHz. The data were high-pass-filtered with a cutoff at 100 Hz and then normalized to unit variance. We sampled 10,000 random sound windows of 16 ms length, which corresponds to 128-dimensional data, and removed the DC component. We also applied our standard preprocessing consisting of whitening and contrast gain control, as described above for image data. Simultaneously we reduced the dimensionality from 128 to 120, which amounts to low-pass filtering and serves to eliminate artifacts due to the windowing procedure. The rows of the second layer were constrained to unit L1-norm.
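The audio pipeline mirrors the image pipeline. A sketch of the windowing step (16 ms windows at 8 kHz, i.e. 128 samples, with per-window DC removal) is given below, assuming the utterances have already been resampled, high-pass filtered, and normalized; the function name is ours.

```python
import numpy as np

def sample_windows(signals, n_windows=10000, win_len=128, rng=np.random.default_rng(0)):
    """Draw random 16 ms windows (128 samples at 8 kHz) from a list of
    1-D speech signals and remove the DC component of each window."""
    X = np.empty((win_len, n_windows))
    for j in range(n_windows):
        s = signals[rng.integers(len(signals))]         # pick a random utterance
        start = rng.integers(0, len(s) - win_len)       # random window position
        X[:, j] = s[start:start + win_len]
    return X - X.mean(axis=0, keepdims=True)            # remove DC per window
```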

The results we obtained from the speech data are remarkably similar to those from image data and are presented in Figure 6. In Figure 6a, we show the first-layer features in the time domain, which are tuned to specific frequency bands as well as onset time and duration. The second layer in Figure 6b shows that a sparse connectivity has been learned between groups of first-layer features. This is analyzed further in Figure 6c, where we show which first-layer features are pooled in the second layer. We obtain higher-order features in which similar frequencies with slightly different temporal onsets are pooled. Interestingly, the pooling size is considerably smaller than for image data; some of the outputs have as few as three contributing first-order features. This indicates that individual linear filter outputs are closer to independent for audio data than for images and that residual dependencies after the first layer are smaller.

Figure 6:

Experiments for speech data from the TIMIT database. (a) The first layer gives outputs localized in both frequency and time. (b) The second layer shows connections between features with dependencies of squares. (c) A random selection of output units. Each row shows the active first-layer filters in one row of V.


7.  Discussion

Hierarchical models have a long history in computational neuroscience and machine learning, and they are a promising approach to modeling the complicated structure of natural signals. However, it has not been feasible until recently to estimate multiple layers of these models from the data. The estimation can be difficult in different ways, such as the need to integrate over latent variables in generative models or the computation of the intractable partition function in energy-based models. In this work, we have used score matching for estimating an energy-based model while avoiding computation of the partition function, and we have shown how both layers of a two-layer model can be estimated from the data.

Two recent models have a similar hierarchical structure but are estimated in a different way from the model considered here. In particular, the hierarchical product of experts (PoE; Osindero et al., 2006) is closely related to our work. Instead of using the independent component point of view, the model is defined as an overcomplete PoE model with experts following Student-t distributions. The estimation is performed using contrastive divergence, which was recently shown (Hyvärinen, 2007a) to be related to score matching. The results obtained are similar to those reported here. Estimating the model on natural images also leads to pooling of linear filters with similar position and orientation but a different spatial phase. However, the authors do not show the pooling patterns, so it is not clear whether they also observed sparse connectivity in the second layer. A surprising difference from our model is that the authors observe little difference when the second layer is estimated on top of a fixed, predetermined first layer. However, model comparison is not straightforward with CD, which does not provide an objective function, so it is not clear whether there really is no further change in the linear filters in the PoE model, or whether the authors simply did not investigate this further. In any case, the change in the first-layer units, which leads to a significantly more complex cell-like behavior of the output units, shows the importance of estimating the layers simultaneously in our model. The first-layer features adapt to the structure of the second layer by changing their spatial phase tuning, which results in a better model fit as gauged by the score matching objective function, as well as in increased phase invariance of the output units.

The hierarchical Bayesian model by Karklin and Lewicki (2005, 2006) is also related to our work in that the authors estimate two layers of a hierarchical model of natural images, but it is different in that the model is generative rather than energy based. It is related to the generative topographic ICA model (Hyvärinen et al., 2001), where sources are not generated independently but have dependencies introduced by multiplication with hidden variance variables, which are themselves given by a linear mixing of higher-order sources. An exact estimation of such a generative model is not tractable because it requires integration over the possible states of the higher-order variance variables. The authors thus resort to using a maximum a posteriori (MAP) approximation for the higher-order sources. Similar to our results, the authors obtain Gabor-like filters in the first layer, but the second layer, which describes patterns of variance dependencies between linear filters, is quite different from the results of our energy-based models. The authors report broadly tuned features, encoding global properties of the data, and do not obtain simple pooling patterns with, for example, common orientations and spatial frequencies. In agreement with our results, the linear filters in the first layer are reported to change depending on the pooling patterns in the second layer.

While the sparseness of the connectivity in the second layer is an important factor in obtaining complex cell-like responses, it is important to note that sparseness was not explicitly enforced in our model. Experiments with L2 normalization and random initialization of V still result in sparse connectivity, even though the output units lose many of their complex cell properties. The nonnegativity constraint and the L1 normalization appear to be important factors in obtaining a uniform population of phase-invariant higher-order cells, which pool over a small number of first-order units. The L1-norm can be related to energy-efficient coding and wiring length constraints, which have been proposed to play a role in shaping the receptive fields of retinal cells (Vincent, Baddeley, Troscianko, & Gilchrist, 2005) and may be of similar importance for early visual cortex. With the L2-norm, we observed a fragmentation into two populations of output cells with different pooling patterns. In addition to one very sparse population, a second population pools over a larger number of inputs and loses much of its selectivity.

8.  Conclusion

We have presented an energy-based model of natural images and sounds with two layers of weights estimated from the data. The two-layer model was estimated using score matching, which allows estimation without knowledge of the partition function. On natural images, the estimation of both layers with an L1-norm penalty on the second layer leads to the emergence of complex cell properties for the higher-order units. We analyze how the model parameters differ if the second layer is estimated on top of a fixed ICA basis and report that the phase tuning of the linear filters changes when both layers are estimated simultaneously, which results in an increase in the phase invariance of the outputs. We also performed experiments with a randomly initialized second layer, with L2 normalization, and without nonnegativity constraints, all of which lead to sparse pooling but less complex cell-like outputs.

Appendix:  Derivatives of the Objective Function.

We need to evaluate the gradients of the objective function (equation 3.16) with respect to the elements of W and V. Since the expression can readily be separated into a sum of three terms, which we call A, B, and C, we treat these separately. This gives six gradient terms in total, the derivatives of A, B, and C with respect to W and with respect to V. Writing these out, we get for the first term
formula
A.1
formula
A.2
formula
A.3
formula
A.4
For the second term, we get
formula
A.5
formula
A.6
formula
A.7
formula
A.8
and for the third term, the derivative is
formula
A.9
For better readability, we substitute in all subsequent equations:
formula
A.10
formula
A.11
formula
A.12
Next we evaluate the derivatives for V, for the first term:
formula
A.13
formula
A.14
the second term:
formula
A.15
formula
A.16
formula
A.17
and similar for the third term:
formula
A.18
formula
A.19
formula
A.20

Acknowledgments

We were supported by the Academy of Finland, project 111786, and the Centre-of-Excellence in Algorithmic Data Analysis (Algodan). U.K. was supported by a scholarship from the Alfried Krupp von Bohlen und Halbach-Stiftung.

References

Adelson, E., & Bergen, J. (1985). Spatiotemporal energy models for the perception of motion. J. Opt. Soc. Am. A, 2, 284–299.

Barlow, H. (1961). Possible principles underlying the transformation of sensory messages. Cambridge, MA: MIT Press.

Comon, P. (1994). Independent component analysis—a new concept? Signal Processing, 36, 287–314.

Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8), 1771–1800.

Hubel, D. H., & Wiesel, T. N. (1959). Receptive fields of single neurones in the cat's striate cortex. J. Physiol., 148, 574–591.

Hubel, D. H., & Wiesel, T. N. (1962). Receptive fields, binocular interaction and functional architecture in cat's visual cortex. J. Physiol., 160, 106–154.

Hyvärinen, A. (2005). Estimation of non-normalized statistical models using score matching. Journal of Machine Learning Research, 6, 695–709.

Hyvärinen, A. (2007a). Connections between score matching, contrastive divergence, and pseudolikelihood for continuous-valued variables. IEEE Transactions on Neural Networks, 18(5), 1529–1531.

Hyvärinen, A. (2007b). Some extensions of score matching. Computational Statistics and Data Analysis, 51, 2499–2512.

Hyvärinen, A., Gutmann, M., & Hoyer, P. (2005). Statistical model of natural stimuli predicts edge-like pooling of spatial frequency channels in V2. BMC Neuroscience, 6(12).

Hyvärinen, A., & Hoyer, P. (2000). Emergence of phase and shift invariant features by decomposition of natural images into independent feature subspaces. Neural Computation, 12(7), 1705–1720.

Hyvärinen, A., & Hoyer, P. O. (2001). A two-layer sparse coding model learns simple and complex cell receptive fields and topography from natural images. Vision Research, 41, 2413–2423.

Hyvärinen, A., Hoyer, P., & Inki, M. (2001). Topographic independent component analysis. Neural Computation, 13(7), 1527–1558.

Hyvärinen, A., Hurri, J., & Hoyer, P. O. (2009). Natural image statistics. Berlin: Springer-Verlag.

Hyvärinen, A., & Köster, U. (2007). Complex cell pooling and the statistics of natural images. Network: Computation in Neural Systems, 18, 81–100.

Jutten, C., & Hérault, J. (1991). Blind separation of sources, part I: An adaptive algorithm based on neuromimetic architecture. Signal Processing, 24, 1–10.

Karklin, Y., & Lewicki, M. S. (2005). A hierarchical Bayesian model for learning nonlinear statistical regularities in nonstationary natural signals. Neural Computation, 17(2), 397–423.

Karklin, Y., & Lewicki, M. S. (2006). Is early vision optimized for extracting higher-order dependencies? In Y. Weiss, B. Schölkopf, & J. Platt (Eds.), Advances in neural information processing systems, 18 (pp. 625–642). Cambridge, MA: MIT Press.

Köster, U., & Hyvärinen, A. (2007). A two-layer ICA-like model estimated by score matching. In Artificial Neural Networks—ICANN 2007, Lecture Notes in Computer Science (pp. 798–807). Berlin: Springer-Verlag.

Lindgren, J. T., & Hyvärinen, A. (2006). Emergence of conjunctive visual features by quadratic independent component analysis. In B. Schölkopf, J. Platt, & T. Hofmann (Eds.), Advances in neural information processing systems, 19 (pp. 897–904). Cambridge, MA: MIT Press.

Olshausen, B., & Field, D. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381, 607–609.

Olshausen, B. A., & Field, D. J. (1997). Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37(23), 3311–3325.

Osindero, S., Welling, M., & Hinton, G. E. (2006). Topographic product models applied to natural scene statistics. Neural Computation, 18, 381–414.

Pollen, D., & Ronner, S. (1983). Visual cortical neurons as localized spatial frequency filters. IEEE Trans. on Systems, Man, and Cybernetics, 13, 907–916.

Schwartz, O., & Simoncelli, E. P. (2001). Natural signal statistics and sensory gain control. Nature Neuroscience, 4(8), 819–825.

Spitzer, H., & Hochstein, S. (1985). A complex-cell receptive-field model. J. Neurophysiol., 53, 1266–1286.

Vincent, B., Baddeley, R., Troscianko, T., & Gilchrist, I. (2005). Is the early visual system optimised to be energy efficient? Network, 16(2–3), 175–190.