## Abstract

We consider a hierarchical two-layer model of natural signals in which both layers are learned from the data. Estimation is accomplished by score matching, a recently proposed estimation principle for energy-based models. If the first-layer outputs are squared and the second-layer weights are constrained to be nonnegative, the model learns responses similar to complex cells in primary visual cortex from natural images. The second layer pools a small number of features with similar orientation and frequency, but differing in spatial phase. For speech data, we obtain analogous results. The model unifies previous extensions to independent component analysis such as subspace and topographic models and provides new evidence that localized, oriented, phase-invariant features reflect the statistical properties of natural image patches.

## 1. Introduction

A variety of methods like independent component analysis (ICA; see Comon, 1994) and sparse coding (Olshausen & Field, 1997) have been applied to model the statistical structure of natural signals such as images and sounds. In computational neuroscience, the goal of modeling these signals with unsupervised learning methods is to gain a better understanding of sensory processing, which is assumed to be linked to the statistics of ecologically valid stimuli (Barlow, 1961; Hyvärinen, Hurri, & Hoyer, 2009).

Linear ICA is limited in scope and cannot capture arbitrary dependencies, so more recent models use a nonlinear representation to better capture the structure of the data. In particular, there is a growing number of hierarchical models with two weight layers. These include direct extensions to ICA such as independent subspace analysis (ISA) and topographic ICA (TICA; see Hyvärinen, Hoyer, & Inki, 2001; Hyvärinen & Hoyer, 2000), which can be viewed as employing a manually selected, fixed second layer that pools over first-layer features modeling dependencies that cannot be removed by a linear transform. The fixed second layer in these models has the advantage that the probability density function (pdf) can still be normalized in closed form, or approximations for the likelihood can be found. Thus, a straightforward estimation of these models by maximizing likelihood is possible.

More recently, models where the second layer is also learned from the data have received attention. However, this comes at the expense of a more complicated estimation, since these models in general cannot be normalized in closed form, making maximum likelihood learning very difficult. Two recent models of this kind are the hierarchical Bayesian model by Karklin and Lewicki (2005, 2006) and the hierarchical product of experts (Osindero, Welling, & Hinton, 2006). The first is a generative model in which the components are not independent and identically distributed (i.i.d.), but the variance is given by hidden variables. The second model is an energy-based model with an intractable partition function, similar to the one we consider here, and it is estimated using contrastive divergence (CD; Hinton, 2002).

We present a two-layer model of natural stimuli where both layers are estimated from the data and analyze the resulting pooling patterns in the second layer. Following the classical energy model of complex cells (Adelson & Bergen, 1985; Spitzer & Hochstein, 1985), linear filter outputs from the first layer are squared and then pooled, where the pooling is learned from the data.

In our analysis, we focus on two points in particular. We compare the results obtained by estimating both layers of the model simultaneously, with a simplified model where the second layer is estimated on top of a fixed ICA basis in the first layer, and report differences in tuning of the linear filters as well as the higher-order units. Furthermore, we analyze the effect of *L*_{1}-normalization of the second layer on the resulting outputs and show that this normalization plays a significant role in obtaining pooling patterns in line with previous complex cell models.

The model is estimated with score matching (Hyvärinen, 2005, 2007a), a consistent estimation method for energy-based models that cannot be normalized in closed form. Traditionally these energy-based models would have to be estimated with Markov chain Monte Carlo (MCMC) methods, which is computationally expensive and makes it hard to evaluate convergence. While recent methods like contrastive divergence are computationally more efficient, it is still necessary to set up a Markov chain, the choice of which may greatly influence the convergence properties. Score matching, in contrast, gives an objective function that can simply be optimized by gradient methods.

This letter is organized as follows. In section 2, we discuss previous models of natural images, focusing on ICA and its extensions. In section 3, the two-layer model is presented, including details of our implementation and the score matching estimation. In addition, we review how the score matching objective function is derived. We test the model and estimation on synthetic data in section 4. In section 5, we apply the method to natural image data. We present models estimated with different constraints and analyze the tuning statistics of the model cells for their complex cell properties. We compare the effect of different normalization methods for the second layer of the model and compare the results with a complete model to those with an overcomplete first layer. We focus on models with a nonnegative second layer but also consider models without the nonnegativity constraint. In section 6 we perform similar experiments with speech data, where we obtain a sparse pooling in the second layer that is very similar to the results for natural images. In section 7 we discuss how our work compares to other recently developed two-layer models, highlighting the principled estimation with score matching and the analysis of the complex cell–like properties of the outputs in our model. Finally we conclude with section 8. Preliminary results have been published in Köster and Hyvärinen (2007).

## 2. Modeling of Natural Images

Ever since mammalian visual receptive fields were described by Hubel and Wiesel in the 1960s (see, e.g., Hubel & Wiesel, 1959, 1962), efforts have been made to understand why receptive fields have the observed properties. One successful approach is based on the idea that neural processing should be matched to statistics of ecologically valid stimuli, that is, natural images. This led to the development of statistical models like sparse coding (Olshausen & Field, 1996) and ICA (Jutten & Hérault, 1991; Comon, 1994), which result in basis functions with a strong resemblance to the receptive fields of simple cells.

In ICA, a vector of independent source signals, **s**, is mixed to generate the observed data vector **x**. This can be written as

$$\mathbf{x} = \mathbf{A}\mathbf{s}.$$

In the simplest case, which is usually considered, the dimensionality of the source vector and the data vector is the same, so **A** is a square, invertible mixing matrix. Thus, the components can be recovered from the data using the filter matrix **W** = **A**^{−1} as

$$\mathbf{s} = \mathbf{W}\mathbf{x}.$$

There is a range of methods for the estimation of this model; here we focus on the likelihood-based approach. The distribution of the individual components is modeled by densities *p*_{i}, so by independence, we have

$$p(\mathbf{s}) = \prod_i p_i(s_i).$$

This allows us to write the pdf of the data as

$$p(\mathbf{x}) = |\det \mathbf{W}| \prod_i p_i(\mathbf{w}_i^T \mathbf{x}),$$

where the **w**_{i}^{T} are the rows of **W**, and the determinant is a normalization factor due to the transformation of the density. Thus, we obtain the log likelihood of the parameters for a finite sample of data as

$$\log L(\mathbf{W}) = \sum_{t=1}^{T} \sum_i \log p_i\big(\mathbf{w}_i^T \mathbf{x}(t)\big) + T \log |\det \mathbf{W}|,$$

where **x**(*t*) runs over *T* samples from the data. The likelihood can easily be maximized with regard to the filters **W** by gradient ascent. Estimating an ICA model for natural image patches results in filters that are localized, oriented, and bandpass, resembling the spatial receptive fields of simple cells in the primary visual cortex (V1).
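As a concrete illustration of this likelihood-based estimation (our own sketch, not code from the letter), the following fits ICA filters by plain gradient ascent, using the standard supergaussian choice log *p*_{i}(*s*) = −log cosh(*s*) + const., whose gradient gives the classic rule ∂log *L*/∂**W** ∝ (**W**^{−1})^{T} − tanh(**WX**)**X**^{T}/*T*:

```python
import numpy as np

# Maximum-likelihood ICA by gradient ascent, assuming the supergaussian
# density log p_i(s) = -log cosh(s) + const. for every component.
# Helper names (ica_log_likelihood, ica_gradient_step) are our own.

def ica_log_likelihood(W, X):
    """Average log likelihood per sample of data X (n x T) under filters W (n x n)."""
    n, T = X.shape
    S = W @ X
    return (-np.log(np.cosh(S)).sum() + T * np.log(np.abs(np.linalg.det(W)))) / T

def ica_gradient_step(W, X, lr=0.01):
    """One ascent step on the average log likelihood with respect to W."""
    S = W @ X
    grad = np.linalg.inv(W).T - np.tanh(S) @ X.T / X.shape[1]
    return W + lr * grad
```

In practice one would use the natural gradient or whitened data for speed, but the plain gradient suffices to illustrate the estimation principle.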

In independent subspace analysis, the source vector **s** is projected onto a number of subspaces. Squared norms of projections onto subspaces are then computed as

$$u_j = \sum_{i \in S(j)} (\mathbf{w}_i^T \mathbf{x})^2,$$

where the index *i* runs over all the components that belong to the *j*th subspace *S*(*j*). The pdf of the model then takes the form

$$p(\mathbf{x}) = |\det \mathbf{W}| \prod_j \exp\big(f(u_j)\big),$$

where the scalar nonlinearity *f*(.) defines the overall shape of the pdf. The pooling can also be viewed as a second linear transformation, or weight layer, where a number of first-layer units converge into one higher-order unit. Since independence is assumed only for the higher-order units *u*_{j}, the linear features that are projected onto one subspace may have dependencies. Applied to natural images, this results in a pooling of features with similar frequency, orientation, and location but a different spatial phase. Thus, it can be argued that complex cells are tuned to capture dependencies, in particular correlations in the variance of linear filters (Schwartz & Simoncelli, 2001), which the model makes explicit by computing squared norms.

## 3. The Model and Estimation

### 3.1. The Two-Layer Model.

We define the two-layer model in terms of its pdf,

$$p(\mathbf{x}; \mathbf{W}, \mathbf{V}) = \frac{1}{Z(\mathbf{W}, \mathbf{V})} \exp\left( \sum_h f\Big( \sum_i v_{hi}\, g(\mathbf{w}_i^T \mathbf{x}) \Big) \right),$$

where *Z*(**W**, **V**) is the partition function of the model—a function of the model parameters that ensures that the pdf integrates to unity. The **v**_{h}^{T} are rows of the second-layer weight matrix **V**, while the first layer **W** has been retained from the ICA model. The two weight matrices need not be square, so in general, **W** will be of size *m* × *n* and **V** of size *k* × *m*. We have two scalar nonlinearities *g*(.) and *f*(.), the first of which computes nonlinear features from the data, whereas the second shapes the overall pdf. Such a model cannot be normalized in closed form, since the normalization constant *Z* is given by an intractable integral. Therefore, we use score matching for the estimation, which provides a straightforward method for learning in energy-based models.

In all the simulations, we chose *g* to be a squaring operation, *g*(*u*) = *u*^{2}, unless otherwise specified. In addition, the second layer was constrained nonnegative. This is a natural choice for a model of complex cells, where outputs are computed by pooling over squared or rectified simple cell responses (Pollen & Ronner, 1983; Spitzer & Hochstein, 1985). The second nonlinear function is chosen to be of the form

$$f(u) = -\sqrt{u + 1},$$

which ensures that the overall distribution of the model is supergaussian. Again, this nonlinearity was used in all the simulations unless otherwise mentioned. When these nonlinearities are used, the model distribution becomes

$$p(\mathbf{x}; \mathbf{W}, \mathbf{V}) = \frac{1}{Z(\mathbf{W}, \mathbf{V})} \exp\left( -\sum_h \sqrt{1 + \sum_i v_{hi}\, (\mathbf{w}_i^T \mathbf{x})^2} \right).$$

We further constrained the vectors **v**_{h}^{T} to be normalized to unit *L*_{2}-norm or, alternatively, to unit *L*_{1}-norm, which corresponds to constraining the second-layer units to have unit output energy and encourages sparse connectivity, respectively.
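The energy (unnormalized log-density) of this model is simple to write down; what is hard is only the partition function. The following minimal sketch assumes the nonlinearities described in the text, *g*(*u*) = *u*^{2} and a square-root *f*, where the exact form *f*(*u*) = −√(*u* + 1) is our assumption; the helper name is ours:

```python
import numpy as np

# Unnormalized log-density of the two-layer model, assuming g(u) = u^2 and
# f(u) = -sqrt(u + 1) (the exact form of f is an assumption here).
# The intractable partition function Z(W, V) is deliberately left out:
# score matching never needs it.

def two_layer_log_q(x, W, V):
    """log q(x) = sum_h f( sum_i v_hi * g(w_i^T x) ), without the -log Z term."""
    s = W @ x                 # first-layer linear filter outputs
    u = V @ (s ** 2)          # nonnegative pooling of squared outputs
    return -np.sqrt(u + 1.0).sum()
```

With **V** nonnegative, each pooled value is nonnegative, so every term −√(*u*+1) is at most −1 and the resulting density has heavier-than-gaussian tails.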

### 3.2. Score Matching.

The normalization constant *Z* of the pdf cannot be computed in closed form, and we use *q* to denote the unnormalized distribution. In the form of a log probability, we have the model

$$\log p(\mathbf{x}; \theta) = \log q(\mathbf{x}; \theta) - \log Z(\theta).$$

The model score function, which we define as the gradient of the log probability with respect to the data, is obviously identical for *q* and *p* and given by

$$\psi(\mathbf{x}; \theta) = \nabla_{\mathbf{x}} \log q(\mathbf{x}; \theta) = \nabla_{\mathbf{x}} \log p(\mathbf{x}; \theta).$$

The score function of the observed data is denoted by

$$\psi_x(\cdot) = \nabla_{\mathbf{x}} \log p_x(\cdot).$$

Working with the score function thus has the advantage that it does not depend on the normalization constant *Z*. The model can now be estimated by minimizing the squared distance between the model score function ψ(.; θ) and the data score function ψ_{x}(.). This objective function is defined by

$$J(\theta) = \frac{1}{2} \int p_x(\mathbf{x})\, \big\| \psi(\mathbf{x}; \theta) - \psi_x(\mathbf{x}) \big\|^2\, d\mathbf{x}.$$

This may not appear to be very useful at first sight, because estimating the data score function is a nonparametric problem and would require no less effort than estimating the normalization constant. However, a much simpler form of the objective function can be obtained. The full proof can be found in Hyvärinen (2005). We start by expanding the squared term to

$$J(\theta) = \frac{1}{2} \int p_x(\mathbf{x}) \|\psi(\mathbf{x}; \theta)\|^2 d\mathbf{x} + \frac{1}{2} \int p_x(\mathbf{x}) \|\psi_x(\mathbf{x})\|^2 d\mathbf{x} - \int p_x(\mathbf{x})\, \psi(\mathbf{x}; \theta)^T \psi_x(\mathbf{x})\, d\mathbf{x}.$$

Here we note that the first term does not depend on the data score function, so by rewriting it with the squared norm expanded as a sum, we get

$$\frac{1}{2} \int p_x(\mathbf{x}) \sum_i \psi_i(\mathbf{x}; \theta)^2\, d\mathbf{x},$$

where the ψ_{i} are elements of the score function. The second term is constant with regard to θ, so we simply set

$$\frac{1}{2} \int p_x(\mathbf{x}) \|\psi_x(\mathbf{x})\|^2 d\mathbf{x} = \text{const.}$$

Thus, we focus on the third term, where we start by writing out the inner product,

$$-\int p_x(\mathbf{x}) \sum_i \psi_i(\mathbf{x}; \theta)\, \psi_{x,i}(\mathbf{x})\, d\mathbf{x},$$

and consider a single element of the sum. We now use the definition of the score function ψ_{x,i}(**x**) = ∂ log *p*_{x}(**x**)/∂*x*_{i}, so when the chain rule is used, the term becomes

$$-\int p_x(\mathbf{x})\, \psi_i(\mathbf{x}; \theta)\, \frac{\partial \log p_x(\mathbf{x})}{\partial x_i}\, d\mathbf{x} = -\int \psi_i(\mathbf{x}; \theta)\, \frac{\partial p_x(\mathbf{x})}{\partial x_i}\, d\mathbf{x}.$$

We then use multivariate partial integration (Hyvärinen, 2005) to obtain the *i*th term as

$$\int p_x(\mathbf{x})\, \frac{\partial \psi_i(\mathbf{x}; \theta)}{\partial x_i}\, d\mathbf{x} + D,$$

where the integration constant *D* is zero as *p*_{x}(**x**)ψ_{i}(**x**; θ) → 0 for ‖**x**‖ → ∞. Working with a finite sample of data, we can replace the exact expectations with sample averages. Collecting the terms, we then obtain the expression

$$\tilde{J}(\theta) = \frac{1}{T} \sum_{t=1}^{T} \sum_i \left[ \frac{\partial \psi_i(\mathbf{x}(t); \theta)}{\partial x_i} + \frac{1}{2}\, \psi_i(\mathbf{x}(t); \theta)^2 \right] + \text{const.},$$

which is easy to evaluate since it contains only terms depending on the model pdf. Score matching has been shown to provide a consistent estimator (Hyvärinen, 2005), so if the data follow the model, $\tilde{J}$ is asymptotically minimized for the true parameters.
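The final sample-average objective can be checked on the simplest possible case: a one-dimensional zero-mean gaussian model with unknown precision λ, whose unnormalized log-density is log *q*(*x*) = −λ*x*^{2}/2. Its score is ψ(*x*) = −λ*x* with derivative −λ, so the objective has a closed-form minimizer λ = 1/mean(*x*^{2}). This toy example (our own illustration) confirms that minimizing the objective recovers the true parameter without ever computing *Z*:

```python
import numpy as np

# Score matching for a 1D zero-mean gaussian with unknown precision lam.
# Model: log q(x) = -lam * x^2 / 2, so psi(x) = -lam * x, dpsi/dx = -lam.
# The sample objective is J(lam) = mean[ dpsi/dx + 0.5 * psi^2 ].

def sm_objective(lam, x):
    return np.mean(-lam + 0.5 * (lam * x) ** 2)

np.random.seed(0)
x = np.random.randn(100000) * 2.0        # true variance 4, precision 0.25
lams = np.linspace(0.05, 1.0, 500)
best = lams[np.argmin([sm_objective(l, x) for l in lams])]
# The minimizer should match the closed-form solution 1 / mean(x^2),
# i.e. the sample precision.
```

Setting d*J*/dλ = −1 + λ mean(*x*^{2}) = 0 gives the closed-form minimizer directly, which the grid search reproduces.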

### 3.3. Estimating the Model.

Optimizing this objective is straightforward by gradient descent, which requires the gradients of the above expression with respect to the elements of the weight matrices **W** and **V**. These gradients are given in the appendix. The nonnegativity and norm constraints were implemented by projecting onto the constraint set after each gradient step.
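The projection step described above can be sketched as follows (a minimal illustration with our own helper name, assuming the projection is done row-wise: clip to nonnegative values, then rescale each row of **V** to unit norm):

```python
import numpy as np

# Projection onto the constraint set after each gradient step:
# nonnegativity, then unit L1 (or L2) norm for every row of V.

def project_V(V, norm="l1", eps=1e-12):
    V = np.maximum(V, 0.0)                            # nonnegativity
    if norm == "l1":
        scale = V.sum(axis=1, keepdims=True)
    else:
        scale = np.sqrt((V ** 2).sum(axis=1, keepdims=True))
    return V / np.maximum(scale, eps)                 # unit-norm rows
```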

## 4. Experiments on Simulated Data

To verify the identifiability of the model, we estimated it for simulated data with a known higher-order structure. We generated data following the ISA model, which is a special case of the proposed two-layer model and easy to sample from. The data generated in this way contain higher-order dependencies in the form of common variances for groups of source variables, which cannot be captured by ICA. Samples from the ISA model were generated as follows. To obtain *T* observations of an *n*-dimensional vector that contains *k* subspaces, we first create a matrix **M** of observations from an i.i.d. gaussian with unit variance and a matrix **B** of variance parameters from a uniform distribution. We introduce dependencies within groups of the gaussians by multiplying each group of rows of **M** elementwise with a common variance parameter from **B**, giving the source matrix **U**. The supergaussian variables produced in this way are then multiplied with a mixing matrix **A** that is also generated randomly, so the data matrix is **X**=**AU**. Before the estimation, the data are whitened. For the experiments shown below, we set *T*=5000, *n*=21, and *k*=7, so each subspace has three elements.
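The sampling procedure above can be sketched directly (our own illustration; the names **M**, **B**, **U**, **A** follow the text, and equal subspace sizes *d* = *n*/*k* are assumed):

```python
import numpy as np

# Sample from the ISA model: i.i.d. gaussians M, per-subspace variance
# parameters B drawn uniformly, sources U sharing one variance within each
# subspace, then mixed as X = A U.

def sample_isa(T=5000, n=21, k=7, rng=None):
    rng = np.random.default_rng(rng)
    d = n // k                                # subspace dimension (3 here)
    M = rng.standard_normal((n, T))
    B = rng.uniform(0.0, 1.0, size=(k, T))    # one variance per subspace and sample
    U = M * np.repeat(B, d, axis=0)           # shared variance within a subspace
    A = rng.standard_normal((n, n))
    return A @ U, A, U
```

Components within a subspace are uncorrelated but have strongly correlated energies, which is exactly the dependency structure the second layer must discover.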

The experiments with artificial data mainly served the purpose of confirming the consistency of the estimation, but also to try out various initialization and normalization procedures for the experiments on natural stimuli. We compare *L*_{1} and *L*_{2} normalization of the second-layer matrix **V**, and we compare randomly initializing **V** and initializing it with an identity matrix, which allows prelearning of **W** as an ICA model.

For the visualization of the results, note that in ICA, one can simply multiply the mixing matrix **A** with the estimated filter matrix **W** to obtain a permuted diagonal matrix if the components are identified correctly. Thus, visual inspection of the product **Z** = **WA** can be used to determine convergence. The ISA model is identifiable only up to subspaces due to rotational symmetry, where the second layer determines the subspace ownership of each element of **Z**. By multiplying the second-layer matrix **V** with **Z**, a block-diagonal matrix with permuted rows should be obtained if the algorithm converges correctly.

Results are presented in Figure 1. In each of the four experiments (a–d) we performed, the top row shows the second layer **V** on the left and the product **VZ** on the right. In the bottom row, on the left, we show **V** with its rows permuted in such a way that identical rows are next to one another. This is purely for visualization purposes and does not affect the objective function. Again, on the right, we plot the product **VZ**. If this results in a permuted block-diagonal matrix, the second layer has correctly identified the dependency structure in the first layer.

First, the comparison between Figures 1a and 1b shows that convergence is possible both from **V** initialized with the identity matrix and from a random **V**; however, the number of iterations is about an order of magnitude greater when starting from random. Second, between Figures 1a and 1c, we compare the effect of *L*_{1} and *L*_{2} normalization. In this case, the *L*_{2}-normalized model has converged to a local minimum and has not identified all the components correctly. In general, however, there was no major difference between the *L*_{1} and *L*_{2} normalization for these data. Finally, in Figures 1a and 1d, we analyze the effect of estimating the layers sequentially, which means that **W** is learned while **V** is fixed to identity, after which **W** is held fixed and learning is continued with **V** only. In this simple example, the model converges to the correct solution, but as we will see, it is preferable to estimate both layers simultaneously.

## 5. Experiments on Natural Images

### 5.1. Methods.

All experiments were performed on images taken from P. O. Hoyer's ImageICA package (www.cs.helsinki.fi/patrik.hoyer), using 20,000 image patches of size 16 × 16 pixels. The whole images were preprocessed by approximate whitening assuming a power-law power spectrum and contrast gain control with a gaussian neighborhood of 16 pixels diameter. Details of this preprocessing can be found in Hyvärinen and Köster (2007). The preprocessing can be given a physiological justification in terms of the processing in the retina and lateral geniculate nucleus, or it can be viewed more pragmatically as simplifying the statistical structure of the images slightly. We then randomly sampled patches from the images and removed the DC component from the patches. We also discarded any image patches with low variance, since they contribute little to the gradient and slow down learning. Finally, we whitened the patches and simultaneously reduced the dimensionality from 256 to 120 using principal component analysis. The dimensionality reduction corresponds to low-pass filtering and eliminates aliasing artifacts due to the rectangular sampling grid from the image patches. Both weight matrices **W** and **V** were chosen to be square, of size 120 × 120, unless otherwise noted.
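The final whitening-with-dimension-reduction step can be sketched as follows (our own illustration, not the authors' code; the helper name `pca_whiten` is an assumption):

```python
import numpy as np

# PCA whitening with dimensionality reduction: remove the DC component of
# each patch, then project onto the leading principal components and rescale
# them to unit variance (a form of low-pass filtering for image patches).

def pca_whiten(X, n_keep=120):
    """X: patches as columns (e.g. 256 x T). Returns whitened data (n_keep x T)."""
    X = X - X.mean(axis=0, keepdims=True)     # remove DC per patch
    C = np.cov(X)
    d, E = np.linalg.eigh(C)                  # eigenvalues in ascending order
    d, E = d[::-1][:n_keep], E[:, ::-1][:, :n_keep]
    return (E / np.sqrt(d)).T @ X             # D^{-1/2} E^T x
```

After this step the retained components have identity covariance, so the orthogonality constraint on **W** used later is well defined.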

The matrix **W** was initialized with gaussian white noise, and **V** was initialized as an identity matrix for the experiments in section 5.2 and with noise from a uniform distribution for the experiments in section 5.3. The models were optimized using gradient descent with a constant step size. To increase the speed of convergence of the experiments in section 5.2, we initialized by estimating the first layer only, keeping the second layer fixed. After the convergence of the first layer to an ICA basis, we performed two experiments. In the first, both layers were estimated simultaneously. In the second, **W** was held fixed after initial convergence to an ICA basis, and only **V** was estimated. In all experiments, the output units (rows of **V**) were normalized to unit *L*_{1} or *L*_{2} norm after every step. Convergence was determined by visual inspection and took about 300 hours on a Pentium IV workstation.

To analyze the tuning properties of the filters in the first layer, we fit Gabor functions to the basis functions obtained by inverting the filter matrix **W**. For each of the first-layer responses, we used a least-squares fit (adapted from Hyvärinen et al., 2001) to determine location, orientation, size, phase, and frequency of the optimal Gabor. To compute tuning curves of the second-layer outputs, we followed the model by taking squares of the first-layer filter responses and summing them, weighted by rows of **V**. The second-layer nonlinearity is required in the estimation to define a supergaussian pdf, but it is not considered part of our complex cell model, so we analyzed the second-layer outputs without any further nonlinearity. To obtain tuning curves, we designed the test stimulus for each higher-order unit as a Gabor function constructed from a weighted average of the constituent first-layer Gabor parameters. One of the parameters (location, orientation, phase, frequency) was then varied while the others were held at the optimum value.
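The energy mechanism that produces phase-invariant tuning curves can be illustrated in one dimension (our own sketch with hypothetical filter parameters, not the authors' stimulus code): pooling the squared responses of a quadrature pair of Gabor-like filters gives an output that barely changes with stimulus phase, whereas a single linear filter is fully modulated.

```python
import numpy as np

# Phase-tuning comparison of a linear ("simple") filter versus an energy
# ("complex") unit pooling a quadrature pair, for 1D Gabor-like filters.
# Filter parameters (envelope width, frequency) are illustrative choices.

t = np.linspace(-1.0, 1.0, 256)
envelope = np.exp(-t ** 2 / 0.1)
freq = 8.0 * np.pi
g_even = envelope * np.cos(freq * t)       # linear filter
g_odd = envelope * np.sin(freq * t)        # its quadrature partner

phases = np.linspace(0.0, 2.0 * np.pi, 64, endpoint=False)
stimuli = np.array([envelope * np.cos(freq * t + p) for p in phases])

simple = stimuli @ g_even                                  # linear response
energy = (stimuli @ g_even) ** 2 + (stimuli @ g_odd) ** 2  # pooled squares

simple_mod = (simple.max() - simple.min()) / np.abs(simple).max()
energy_mod = (energy.max() - energy.min()) / energy.max()
```

The modulation depth of the energy response is close to zero, while the linear response swings through its full range as the phase varies, mirroring the tuning curves in Figure 4.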

### 5.2. Results.

In Figure 2 we analyze the first layer features and how they differ from those of an ICA model. Figure 2a shows a random selection of 48 of the 120 basis functions for models estimated with *L*_{1} and *L*_{2} normalization, as well as the ICA basis functions obtained with **V** fixed to identity, for comparison. It can be seen that all the filters are Gabor-like and tuned in orientation, position, frequency, and phase, resembling the responses of simple cells. The basis functions from the *L*_{1} model appear slightly more localized, but less frequency and orientation selective than the filters from the *L*_{2} model, with ICA falling between the two extremes. Comparing the score matching objective function for the three models, we observe that the *L*_{2} model has the best fit to the data, followed by the *L*_{1} model, while the ICA model achieves the worst fit, which improves very little if the second layer is learned on top of the ICA basis without simultaneously allowing the basis functions in **W** to adapt to the new pooling patterns.

In Figure 2b, we investigate how the ICA basis functions in **W** change when the first layer adapts to the pooling patterns in the second layer. For this comparison, the *L*_{2}-norm model was estimated from the same random seed as the ICA model, where **V** was fixed to identity. As can already be seen in Figure 2a, the features change only a little, but a scatter plot of the changes in the Gabor parameters for each linear filter reveals some systematic changes. The orientation tuning is least affected by the joint estimation of both layers, so this parameter does not change significantly. Positions and frequencies change slightly for most of the features, but strong changes in tuning are rare. The phase tuning, however, is very different: many linear filters completely change their phase to better adapt to the pooling in the second layer. This gives some idea of why the improvement in model fit is so small if **V** is estimated on top of a fixed ICA basis, without allowing the features in **W** to adapt.

In Figure 3 we show the second layer of the model and the emerging pooling patterns in more detail. In Figures 3a and 3b, we show a subset of the pooling patterns in a representation adapted from Hyvärinen, Gutmann, and Hoyer (2005), which also allows easy comparison with Karklin and Lewicki (2005). For each higher-order unit, the linear filters that contribute to the output are represented by ellipses. The location and the orientation of each ellipse correspond to the location and orientation of the underlying first-layer basis function. Frequency is represented by the size of the ellipse, where larger corresponds to lower frequencies. The shading of the ellipse represents the connection strength, light gray being close to zero and black corresponding to maximal contribution. For the *L*_{1} model, most of the outputs pool over a small number of linear filters, which share similar orientations and positions, while the pooling is more heterogeneous for the *L*_{2} model. While the sparseness of the pooling is more pronounced with *L*_{1} regularization, the average number of significantly active linear filters is still well below 10% for the *L*_{2} norm model.

In Figures 3c and 3d, this is further analyzed for the two models: the plot on the left shows the most active features for a random selection of second-order units. The units can be seen to share similar frequency, orientation, and location but differ in spatial phase. On the right-hand side, the second-layer matrix **V** is shown directly. Again, the connectivity can be seen to be sparse, with only a few first-layer features contributing to each row of **V**. Here the linear filter inputs were sorted by frequency and output rows by sparseness. The pooling is quite homogeneous for the *L*_{1} case, but for *L*_{2} normalization, large groups of high-frequency filters are pooled into one output. There are several nearly identical copies of these outputs, indicating a comparatively large contribution to the model pdf.

In Figure 4 we analyze the complex cell properties of the higher-order outputs for the different models, following Hyvärinen and Hoyer (2001). Starting from the optimal Gabor stimulus for each higher-order unit, one parameter of the fitted Gabor (phase, position, orientation, or frequency) was changed at a time, and the normalized response was plotted as a function of the tuning parameter. The solid line shows the mean response of 120 tested cells, and the dashed lines give 10% and 90% quantiles. We further investigate the effect of simultaneous versus sequential estimation of the weight layers and the difference between *L*_{1} and *L*_{2} normalization. In Figure 4a, only the first layer was learned and the second fixed to the identity matrix, so the model corresponds to ICA. This results in simple cell behavior with strong phase selectivity. Sequential estimation of the two weight layers with *L*_{1} normalization is shown in Figure 4b, so the first-layer filters are not adapted to the pooling patterns. There is a decrease in selectivity to spatial phase, indicating complex cell properties. In Figure 4c, both layers have been estimated simultaneously with *L*_{1} normalization. The adaptation of the phase of the linear filters to the pooling patterns leads to a striking decrease in phase selectivity; that is, the second-layer outputs become more complex cell-like. In particular, the upper 10% quantile of the outputs becomes essentially completely phase invariant, whereas in the sequential estimation, there is still a 40% modulation in this quantile. At the same time, the selectivity for position, orientation, and frequency is not affected considerably, with only a slight broadening. In Figure 4d, we show the responses for a model with simultaneous estimation of both layers and *L*_{2} normalization.
Due to the heterogeneous pooling patterns, much of the selectivity, in particular for position, is lost. At the same time, the large number of simple cell-like outputs with only a single strongly active linear filter leads to a loss of phase invariance. The regularization with an *L*_{1} norm seems to be an important requirement to obtain complex cell-like responses.

### 5.3. Estimation of an Overcomplete Model.

To generalize our experiments to an overcomplete model, we propose a model in which the number of linear filters is higher than the data dimensionality, but the dimensionality is reduced again for the higher-order units. This is motivated by the observation that with no normalization on the second layer, many of the outputs go to zero (experiments not shown). Since we do not need to take the normalizability of the model into account, it is straightforward to make the set of filters in **W** overcomplete. Here we analyze a model that is overcomplete by a factor of 4, with 240 filters in **W** estimated on data with the dimensionality reduced to 60, but otherwise identical to the image data used in the previous section.

In order to make the model overcomplete, we need to drop two simplifications used so far. First, we cannot use an ICA initialization since this would require as many output units as linear filters. Instead of the initialization with an identity matrix, we initialize **V** randomly with uniform noise. Second, the matrix **W** can no longer be constrained to be an orthogonal rotation, so it is estimated with rows constrained to unit *L*_{2} norm.

To analyze the effect of the random initialization separately from that of overcompleteness, we compare the results from the overcomplete model with those from a complete model with a randomly initialized second layer. Both models were estimated with *L*_{2} normalization on the second layer. Figure 5 shows the results in the same way as the previous plots—the first- and second-layer features and pooling patterns for these models. For the overcomplete model, it can be seen that some of the features, in particular at higher frequencies, are less localized than in the models with ICA initialization. The reason is evident when considering the pooling patterns: there are many higher-order units pooling over a large number of linear filters, so selectivity in these features is reduced, and more global pooling patterns emerge. It is also worth pointing out that some of the basis functions in the overcomplete case contain multiple Gabor functions, indicating convergence to a local minimum. This problem is not due to the random initialization, as can be seen from the complete model, which in fact reaches an objective value not significantly different from that of the diagonally initialized model we considered earlier. Rather, we suggest that the orthogonality of **W** is an important requirement to avoid convergence to local minima.

### 5.4. Estimation Without Nonnegativity Constraints.

In addition to the experiments with a nonnegative second layer, we also estimated the model with two different nonlinearities that did not require a nonnegativity constraint, allowing us to model negative energy correlations in addition to positive ones. First, we analyzed the case of a symmetric nonlinearity *f*(*u*)=log cosh(*u*), which leads to an output distribution that is sparse for both negative and positive outputs. Additionally we report on the results with an asymmetric nonlinearity, where the negative half of the nonlinearity corresponds to a low-variance gaussian distribution, and the positive half follows a logistic distribution. In both cases, an *L*_{2}-norm constraint was imposed on rows of **V**, which was initialized randomly with gaussian white noise. Moreover, the first-layer nonlinearity was set as *g*(*u*)=log cosh(*u*) instead of the squaring, in order to be able to better model a heavy-tailed distribution in conjunction with the second nonlinearity.

In the symmetric model (results not shown), we found sparse connectivity of the second layer, but with higher-order features very different from the complex cell-like responses of the nonnegative model. For some of the outputs, the first layer forms pairs of features that both contain two Gabors in their receptive field—one that is identical in both features and one that is identical but of opposite sign, as in Lindgren and Hyvärinen (2006). Individual higher-order outputs pool one such pair of features with one strong negative and one positive weight. Using the identity *a*^{2}−*b*^{2}=(*a*−*b*)(*a*+*b*), this can be interpreted as taking the product of the sum and difference of the two linear filters, which turns out to be the product of two individual Gabor filters. Thus the model is effectively taking products of linear filter outputs, with results very similar to those observed previously in quadratic ICA (Lindgren & Hyvärinen, 2006). In the model with sparse positive and gaussian negative outputs (results not shown), we report higher-order features with one or a small number of highly active positive inputs and a larger number of small negative inputs. This could possibly be interpreted as surround inhibition or a gain control phenomenon.
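The difference-of-squares identity invoked above is easy to verify numerically (our own illustration with arbitrary random filters): a higher-order unit with weights (+1, −1) on two squared filter outputs computes the product of a "sum" filter and a "difference" filter.

```python
import numpy as np

# Check of the identity a^2 - b^2 = (a - b)(a + b) for filter outputs
# a = w1.x and b = w2.x: the unit output equals the product of the
# responses of the difference filter (w1 - w2) and the sum filter (w1 + w2).

rng = np.random.default_rng(0)
w1, w2, x = rng.standard_normal((3, 16))
a, b = w1 @ x, w2 @ x
lhs = a ** 2 - b ** 2
rhs = ((w1 - w2) @ x) * ((w1 + w2) @ x)
```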

Thus we see that the results depend strongly on the choice of the nonlinearity *f*, and very different results can be obtained by changing the nonlinearity. It seems difficult to make a principled choice of *f* because the usual measures of sparseness do not easily generalize to two-layer models. In future research, *f* could possibly be estimated from the data. In the current work, we choose to analyze only the results from the nonnegative model, because it seems to be most in line with the visual processing in mammalian visual cortex, that is, complex cell responses. Furthermore, we view our model as a direct extension to models with nonnegative second layers such as ISA and topographic ICA. These have previously been motivated by the observation of strong positive energy correlations between linear filter outputs, so nonnegativity seems to be a reasonable constraint.

## 6. Experiments on Audio Data

In order to demonstrate the general applicability of our model to a variety of data sets, we also tested it on speech data from the TIMIT database. The model was estimated in exactly the same way as for image data, and the data were preprocessed as follows. We took 100 short utterances from the database and resampled them to 8 kHz. The data were high-pass-filtered with a cutoff at 100 Hz and then normalized to unit variance. We sampled 10,000 random sound windows of 16 ms length, which corresponds to 128-dimensional data, and removed the DC component. We also applied our standard preprocessing consisting of whitening and contrast gain control, as described above for image data. Simultaneously, we reduced the dimensionality from 128 to 120, which amounts to low-pass filtering and serves to eliminate artifacts due to the windowing procedure. The rows of the second layer were constrained to unit *L*_{1}-norm.
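The whitening and dimensionality-reduction steps above can be sketched roughly as follows. This is a minimal sketch, not the exact pipeline: the resampling and high-pass filtering stages are omitted, and contrast gain control is simplified to per-window normalization, which is an assumption on our part.

```python
import numpy as np

def preprocess(windows, keep_dims=120):
    """Whiten 128-dim sound windows and reduce to `keep_dims` dims by PCA.

    `windows` has shape (n_samples, 128): 16 ms windows at 8 kHz.
    """
    X = windows - windows.mean(axis=1, keepdims=True)          # remove DC per window
    X = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-8)  # crude contrast gain control
    C = np.cov(X, rowvar=False)                                # data covariance
    eigvals, eigvecs = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1][:keep_dims]              # keep leading components
    V = eigvecs[:, order] / np.sqrt(eigvals[order])            # whitening matrix
    return X @ V                                               # whitened, 120-dim data

rng = np.random.default_rng(0)
Z = preprocess(rng.standard_normal((10000, 128)))
print(Z.shape)   # (10000, 120)
```

Dropping the trailing eigenvalue directions acts as a soft low-pass filter, which is what removes the windowing artifacts mentioned in the text.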

The results we obtained from the speech data are remarkably similar to those from image data and are presented in Figure 6. In Figure 6a, we show the first-layer features in the time domain, which are tuned to specific frequency bands as well as onset time and duration. The second layer in Figure 6b shows that a sparse connectivity has been learned between groups of first-layer features. This is analyzed further in Figure 6c, where we show which first-layer features are pooled in the second layer. We obtain higher-order features where similar frequencies with slightly different temporal onset are pooled. Interestingly, the pooling size is considerably smaller than for image data; some of the outputs have as few as three contributing first-order features. This indicates that individual linear filter outputs are closer to independent for audio data than for images and that residual dependencies after the first layer are smaller.

## 7. Discussion

Hierarchical models have a long history in computational neuroscience and machine learning, and they are a promising approach to modeling the complicated structure of natural signals. However, it has not been feasible until recently to estimate multiple layers of these models from the data. The estimation can be difficult in different ways, such as the need to integrate over latent variables in generative models or the computation of the intractable partition function in energy-based models. In this work, we have used score matching for estimating an energy-based model while avoiding computation of the partition function, and we have shown how both layers of a two-layer model can be estimated from the data.
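As a toy illustration of the score matching principle (not the two-layer model itself), consider a one-dimensional energy E(x) = x²/(2σ²), whose unnormalized density needs no partition function: minimizing the sample version of J(σ²) = E[½ψ(x)² + ψ′(x)] with score ψ(x) = −x/σ² recovers σ² equal to the data variance.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(scale=2.0, size=100_000)   # data with true variance 4

def sm_objective(s2, x):
    # Score psi(x) = -x / s2 for energy E(x) = x^2 / (2 s2);
    # J = E[0.5 * psi^2 + psi'], computable without the partition function.
    psi = -x / s2
    dpsi = -1.0 / s2
    return np.mean(0.5 * psi**2 + dpsi)

grid = np.linspace(1.0, 8.0, 1000)
best = grid[np.argmin([sm_objective(s2, x) for s2 in grid])]
print(best)   # close to the sample variance of x, i.e. about 4
```

Setting dJ/dσ² = 0 gives σ² = E[x²] analytically, which the grid search confirms; in the full model, the same objective is minimized by gradient descent over **W** and **V**.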

Two recent models have a similar hierarchical structure but are estimated in a different way from the model considered here. In particular, the hierarchical product of experts (PoE; Osindero et al., 2006) is closely related to our work. Instead of taking the independent component point of view, the model is defined as an overcomplete PoE model with experts following Student-*t* distributions. The estimation is performed using contrastive divergence (CD), which was recently shown (Hyvärinen, 2007a) to be related to score matching. The results obtained are similar to those reported here: estimating the model on natural images also leads to pooling of linear filters with similar position and orientation but different spatial phase. However, the authors do not show the pooling patterns, so it is not clear whether they also observed sparse connectivity in the second layer. A surprising difference from our model is that the authors observe little difference whether the second layer is estimated on top of a fixed, predetermined first layer or both layers are estimated together. However, model comparison is not straightforward with CD, which does not provide an objective function, so it is unclear whether the linear filters in the PoE model truly do not change further or whether the authors simply did not investigate this. In any case, the change in first-layer units in our model, which leads to significantly more complex cell-like behavior of the output units, shows the importance of estimating the layers simultaneously. The first-layer features adapt to the structure of the second layer by changing their spatial phase tuning, which results in a better model fit as gauged by the score matching objective function, as well as in increased phase invariance of the output units.

The hierarchical Bayesian model by Karklin and Lewicki (2005, 2006) is also related to our work in that the authors estimate two layers of a hierarchical model of natural images, but it is different in that the model is generative rather than energy based. It is related to the generative topographic ICA model (Hyvärinen et al., 2001), where sources are not generated independently but have dependencies introduced by multiplication with hidden variance variables, which are themselves given by a linear mixing of higher-order sources. An exact estimation of such a generative model is not tractable because it requires integration over the possible states of the higher-order variance variables. The authors thus resort to using a maximum a posteriori (MAP) approximation for the higher-order sources. Similar to our results, the authors obtain Gabor-like filters in the first layer, but the second layer, which describes patterns of variance dependencies between linear filters, is quite different from the results of our energy-based models. The authors report broadly tuned features, encoding global properties of the data, and do not obtain simple pooling patterns with, for example, common orientations and spatial frequencies. In agreement with our results, the linear filters in the first layer are reported to change depending on the pooling patterns in the second layer.

While the sparseness of the connectivity in the second layer is an important factor in obtaining complex cell-like responses, it is important to note that sparseness was not explicitly enforced in our model. Experiments with *L*_{2} normalization and random initialization of **V** still result in sparse connectivity, even though the output units lose many of their complex cell properties. The nonnegativity and the *L*_{1}-norm penalty appear to be the important factors in obtaining a uniform population of phase-invariant higher-order cells that pool over a small number of first-order units. The *L*_{1}-norm can be related to energy-efficient coding and wiring length constraints, which have been proposed to play a role in shaping the receptive fields of retinal cells (Vincent, Baddeley, Troscianko, & Gilchrist, 2005) and may be of similar importance for early visual cortex. With the *L*_{2}-norm, we observed a fragmentation into two populations of output cells with different pooling patterns: in addition to one very sparse population, a second population pools over a larger number of inputs and loses much of its selectivity.
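The qualitative difference between the two norms can be illustrated on a simple penalized least-squares problem, where the *L*_{1} penalty drives small coefficients exactly to zero (soft thresholding) while the *L*_{2} penalty only shrinks them uniformly. This is a generic illustration of the penalties, not the model's actual optimization:

```python
import numpy as np

v = np.array([3.0, 0.5, -0.2, 2.0, 0.1])   # hypothetical unconstrained weights
lam = 0.6                                    # penalty strength

# L1 penalty: argmin_w 0.5*(w - v)^2 + lam*|w|  ->  soft thresholding, exact zeros
w_l1 = np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

# L2 penalty: argmin_w 0.5*(w - v)^2 + 0.5*lam*w^2  ->  uniform shrinkage, no zeros
w_l2 = v / (1.0 + lam)

print(w_l1)   # small weights are zeroed out; only 2.4 and 1.4 survive
print(w_l2)   # all entries merely scaled by 1/1.6, none exactly zero
```

The hard zeros produced by the *L*_{1} penalty correspond to the sparse pooling patterns, while the *L*_{2} penalty's uniform shrinkage is consistent with the less selective population we observed.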

## 8. Conclusion

We have presented an energy-based model of natural images and sounds with two layers of weights estimated from the data. The model was estimated using score matching, which allows estimation without knowledge of the partition function. On natural images, estimating both layers with an *L*_{1}-norm penalty on the second layer leads to the emergence of complex cell properties in the higher-order units. We analyzed how the model parameters differ if the second layer is estimated on top of a fixed ICA basis and found that the phase tuning of the linear filters changes when both layers are estimated simultaneously, which results in an increase in the phase invariance of the outputs. We also performed experiments with a randomly initialized second layer, with *L*_{2} normalization, and without nonnegativity constraints, all of which lead to sparse pooling but less complex cell-like outputs.

## Appendix: Derivatives of the Objective Function

In this appendix, we give the derivatives of the objective function with respect to **W** and **V**. Since the expression can readily be separated into a sum of three terms, which we call *A*, *B*, and *C*, we treat these separately, obtaining six terms: the derivatives of *A*, *B*, and *C* with respect to each of **W** and **V**.

## Acknowledgments

We were supported by the Academy of Finland, project 111786, and the Centre-of-Excellence in Algorithmic Data Analysis (Algodan). U.K. was supported by a scholarship from the Alfried Krupp von Bohlen und Halbach-Stiftung.