We study a dynamical model of processing and learning in the visual cortex, which reflects the anatomy of V1 cortical columns and properties of their neuronal receptive fields. Based on recent results on the fine-scale structure of columns in V1, we model the activity dynamics in subpopulations of excitatory neurons and their interaction with systems of inhibitory neurons. We find that a dynamical model based on these aspects of columnar anatomy can give rise to specific types of computations that result in self-organization of afferents to the column. For a given type of input, self-organization reliably extracts the basic input components represented by neuronal receptive fields. Self-organization is very noise tolerant and can robustly be applied to different types of input. To quantitatively analyze the system's component extraction capabilities, we use two standard benchmarks: the bars test and natural images. In the bars test, the system shows the highest noise robustness reported so far. If natural image patches are used as input, self-organization results in Gabor-like receptive fields. In quantitative comparison with in vivo measurements, we find that the obtained receptive fields capture statistical properties of V1 simple cells that algorithms such as independent component analysis or sparse coding do not reproduce.
Self-organizing systems are commonly used to study learning in cortical circuitries. Classical examples are self-organization of afferent fibers in primary visual cortex (von der Malsburg, 1973), systems that learn input categories (Carpenter & Grossberg, 2003), and systems that are learning the neighborhood relationship of the input data (Kohonen, 1995). In this letter we show that the anatomical fine structure of cortical columns can give rise to a self-organization process of neuronal receptive fields (RFs) that extracts components of the presented input. Our model of the cortical column is based on populations of excitatory neurons as found by Yoshimura, Dantzker, and Callaway (2005) and builds on advances in the analysis of columnar cortical microcircuits (see, e.g., Douglas & Martin, 2004, for a review). We focus on the feedforward influence of afferents to the column along with long-term changes of neuronal receptive fields. Columnar processing stages important for this feedforward pathway are abstractly modeled. Feedback from higher cortical areas and longer-range lateral interactions are not considered in this letter but can be incorporated in future extensions of the model.
In our model, neural activity is abstractly modeled using stochastic differential equations that describe activities in populations of neurons and their coupling to feedforward and lateral inhibition. In this way, we show that cortical connectivity can give rise to columnar processing that, if coupled to synaptic plasticity, results in a particular type of self-organization of afferent fibers. The distinguishing feature of this self-organization is its robustness to noise and different types of input. We quantitatively evaluate the system's functional capabilities on the basis of the bars benchmark test (Földiák, 1990) and by using natural images patches. This allows a comparison with a number of bottom-up and functional modeling approaches that can usually be applied to one of the two benchmarks.
2. System Dynamics
In experiments using focal uncaging of glutamate combined with intracellular recordings, Yoshimura et al. (2005) found evidence for a particular fine-scale organization within V1 cortical columns. The experiments showed a fine structure of functionally relatively disjoint networks of pyramidal cells in cortical layers 2/3. That is, neurons were found to form clusters of different populations, each defined by a strong functional coupling among its members. Two neurons of the same population were found to be connected with high probability, whereas connectivity between two neurons of different populations was very low. The functional disjointness of these populations was, furthermore, shown to be reflected by population-specific inputs from layer 4: any two populations in layer 2/3 were found to receive input from different populations of excitatory cells in layer 4. The findings suggest a functional role of these excitatory subpopulations. Together with their coupling to inhibitory systems in the column, they are likely to represent an essential part of columnar processing in the cortex.
In this section, we model the dynamic implications of such subpopulations and address the following questions:
What are the potential neurodynamic consequences of disjoint populations of pyramidal cells in layer 2/3 (see section 2.1)?
What are the possible computational consequences of different types of inhibitory coupling among these populations (see section 2.2)?
How can we model afferent input that affects layer 2/3 populations via layer 4 (see section 2.3)?
How can the resulting computational properties be used for learning (see section 2.4)?
Later, we discuss simulations of model columns with up to 100 subpopulations. If we assume 10,000 to 20,000 neurons per cortical column in this case, a subpopulation would contain 100 to 200 neurons. Note that the same number of neurons is estimated to constitute cortical minicolumns (see, e.g., Mountcastle, 1978), which might be related to the functionally disjoint populations in layer 2/3 (see section 4 for a discussion). In somatosensory cortex, it was estimated that roughly 80 minicolumns make up larger-scale functional units the size of cortical columns (see, e.g., Favorov & Diamond, 1990, or Mountcastle, 1997, for a review). In functional applications to visual object recognition (e.g., Lades et al., 1993), similar numbers of subunits were found to be necessary to appropriately represent local image textures. One hundred subpopulations might therefore be a reasonable choice. Note, however, that the data in Yoshimura et al. (2005) do not allow estimating the total number of subpopulations within a column.
2.1. Dynamic Properties of Excitatory Populations.
One possibility for studying the dynamic properties of populations of excitatory neurons is to directly model them as sets of interconnected model neurons. Such an approach was pursued by Lücke and von der Malsburg (2004), who suggested subnets of excitatory neurons as basic functional units for columnar processing. In this letter, we follow a more abstract approach by asking: What are the potential qualitative consequences for the dynamic behavior of such populations?
To answer this question, let us first consider a single population of excitatory neurons. As not much is known about a structured connectivity within the subpopulations of layer 2/3, we assume random connectivity (compare Braitenberg & Schüz, 1998; Maass, Natschläger, & Markram, 2002; Kalisman, Silberberg, & Markram, 2005; Amirikian, 2005; Muresan & Savin, 2007). In the absence of any external input, we expect such a randomly interconnected population to show the following qualitative behavior:
Because of the excitatory recurrent connections, we expect that from a critical number of active neurons on, the activity within the population increases in a chain reaction. That is, if a number of active neurons in one time interval are activating a larger number of neurons in the next interval, an autocatalytic process is evoked that results in a rapid increase of neural activity.
We expect a limit of maximal activity within the population.
If only a very few neurons within the population are active, they will not be able to excite a larger number of other neurons in the population. In this case, we expect the overall activity of the population to decrease to a resting state of very low or zero activity.
To model these expectations, let us denote the activity within a population of excitatory neurons by p(t). p(t) can be thought of as a percentage of neurons that have spiked during a time interval [t − Δt, t] (for fixed Δt). In a single and isolated population, the change of this activity, , is a function of the activity itself, f(p). A function f that models our qualitative expectations about the dynamic behavior (see items 1 to 3) is depicted in Figure 1. Starting from a critical level, the activity p is increased (positive f(p)) until a maximal activity level is reached. An activity level smaller than the critical value decays to zero. If we want to take spontaneous activity into account, we can use a version of the function f in Figure 1 that is slightly shifted to the right. The dynamic abstraction of a single population can be regarded as a mean-field approach (compare, e.g., Wilson & Cowan, 1973; Amit, 1989; van Vreeswijk & Sompolinsky, 1998; Latham, Richmond, Nelson, & Nirenberg, 2000) for which activation functions f with at least one inflection point are a hallmark. Note, however, that in contrast to these approaches, we will model a system of coupled excitatory populations.
For the functional capabilities of the neural dynamics, the interplay between the self-excitatory properties of the disjoint populations and their coupling via systems of inhibitory interneurons is of crucial importance. We will not model inhibitory populations explicitly but will implicitly model their effect on the activities of the excitatory populations within a column. Inhibitory neurons behave similarly to excitatory neurons but induce inhibitory postsynaptic potentials in the neurons they target. They can thus prevent a postsynaptic neuron from firing. Depending on the location of their synapses on the postsynaptic neurons, their influence can be approximately linear or modulatory. Cortical inhibitory neurons project mainly to targets in their immediate vicinity. The cortex comprises more excitatory neurons than inhibitory ones, but inhibitory neurons seem morphologically and physiologically more diverse (see, e.g., DeFelipe, 1993, for a review; and Douglas & Martin, 2004, for critical discussions). This suggests a greater diversity of functional roles for inhibitory neurons than for their excitatory counterparts. The activity of an excitatory subpopulation in our model can essentially be influenced by three types of inhibition: (1) self-inhibition within the population, (2) lateral inhibition from other populations in layer 2/3, and (3) inhibition originating from external sources.
Self-inhibition can be captured by the form of f and is the reason we expect the activity in a population to be limited. Ultimately the refractory periods of neurons within a population lead to self-inhibition, but inhibitory interneurons, which are restricted to the individual populations, can significantly contribute, for example, in keeping the population in a state of intermediate levels of ongoing activity.
We consider lateral inhibition to be driven by the activity of all excitatory populations in layer 2/3 of the column. We describe the effect of all populations, , on the activity of a single population, pα, by an inhibition function . If none of the neurons in a given population is active, such an inhibition is without effect. However, the more neurons are active, the more efficient the inhibition becomes. We therefore model the effect of lateral inhibition on the activity change of pα to be proportional to .
Inhibition from external sources will be considered in section 2.3 as part of the input to the populations.
2.2. Forms of Lateral Inhibition.
The nonlinear differential equation system 2.1 describes the evolution of activities p1, …, pk of the excitatory populations in layer 2/3 of a cortical column with k populations. The activities and potential attractor states (compare Amit, 1989) will depend on external inputs to the populations and, crucially, on the type of lateral inhibitory coupling. The neuron populations pα will play the role of hidden units that respond to different inputs with different columnar activity distributions.1 These activities determine the computational capabilities of the system. For example, Douglas and Martin (2004) associate a soft-winner-take-all (soft-WTA) computation with cortical layer 2/3. Other prominent computations assigned to hidden layers are K-WTA mechanism (e.g., O'Reilly, 2001) or hard WTA mechanisms as promoted, for example, by Riesenhuber and Poggio (1999). Implementations of learning algorithms such as ICA (Bell & Sejnowski, 1997; van Hateren & van der Schaaf, 1998) and sparse coding (Olshausen & Field, 1996; Olshausen, 2002) require specific types of anticorrelation among hidden unit activities. Many more types of computations associated with hidden layers can be found in the literature and include more complex computations as used in adaptive resonance theory (ART) of Carpenter and Grossberg (2003) or in liquid (Maass et al., 2002) and echo state machines (Jaeger & Haas, 2004).
In our model, the interplay between the self-excitatory populations and their inhibitory coupling determines the type of computation performed. We will study the net effect of different functions on the mean-field-like activities pα and neglect effects of more complex dynamic interactions.
2.2.1. Static Linear Inhibition.
If we assume the inhibition function to be proportional to the mean activity, , we can analytically compute the stationary points of equation 2.1 and the stabilities of these points. Figure 2A shows the stable and unstable points of the system with k = 2 and their dependence on the proportionality factor ν (see Wong & Wang, 2006, and Roxin & Ledberg, 2008, for similar dynamics applied to psychophysical two-choice experiments). The parameter ν plays the role of a bifurcation parameter. For small ν (i.e., for low levels of inhibition), either both subpopulations are active or just one of the two. For ν larger than a critical value, there are just two stable stationary points, with each point corresponding to one subpopulation being active and suppressing the other. For k>2, it can be shown that multiple critical points exist and that the value of ν determines how many populations can be kept active simultaneously (compare Lücke, von der Malsburg, & Würtz, 2002, for a similar single-neuron-based system). If we now consider a small coupling (small κ>0) of the system to inputs , the individual input strengths can determine which of the attractor states is stabilized. If populations are initially equally active, the dynamics is pushed to point attractors, leaving populations active that receive more input than the others. If ν is set such that, for example, at most four of k populations can be active simultaneously, the four populations with strongest input survive whereas the others are deactivated. Dynamics 2.1 with linear inhibition function is thus well suited to implement a K-WTA mechanism as required by numerous models in the literature (e.g., O'Reilly, 2001). However, to appropriately approximate such a computation, the populations' activities have to be relatively equal at the onset of input.
2.2.2. Dynamic Linear Inhibition.
An interesting type of computation emerges if instead of a fixed ν, we use a ν that increases in time. If, for example, for k = 2 both populations are active initially (attractor state with p1 = p2), an increase of ν results in a spontaneous symmetry breaking as soon as ν crosses the critical value (see Figure 2A). If we now consider input to the populations as perturbation, the same process results in a symmetry breaking induced by the difference between the inputs. The population that receives more input remains active and suppresses activity in the other. The input differences and the inputs itself can be very small in this case. As for static inhibition, the dynamics is well suited to implement a K-WTA mechanism. However, a dynamic change of ν has the advantage of resulting in a more sensitive and better way to implement such mechanisms. For a static ν, much of the finally stabilized state depends on the initial population activities. These activities would have to be equal for all populations to allow a precise K-WTA, but note that this activity state is unstable for the high ν values required for K-WTA selectivity. In contrast, dynamic control can use small values of ν to reset the system to an initially stable state of equal activities and can subsequently, via processes of symmetry breakings, select sensitively and robustly the k populations with strongest input.
2.2.3. Dynamic Maximum Inhibition.
For the dynamics 2.1 with 2.2 as inhibition function, we can analytically compute all stationary points and their stabilities. As shown by Lücke (2005), the dynamics can stabilize any subset of active populations if ν is smaller than a critical value νc. For ν greater than νc, just one population can be kept active.2 The inhibition function, equation 2.2, with ν>νc can thus be used to implement a hard WTA mechanism. As in the linear case, the use of a static ν would require initially equal activities pα in order to implement a WTA mechanism precisely. Unfortunately, such a state is unstable for ν>νc. However, following the same arguments as for the linear case, a dynamic ν that increases from ν < νc to ν>νc results in a robust and precise implementation of a WTA procedure.
Regarding the maximum inhibition, a dynamic ν has a further advantage: if ν is linearly increased, the time that each population remains active is a measure for the input strength it receives relative to the other populations. Even for a very small coupling constant κ, we can in simulations observe that the populations pα are deactivated according to their received input strengths. Populations with small inputs are deactivated the earliest. If we, during this process, integrate the population activities over time, we obtain a graded response to the input. Instead of a hard WTA mechanism, a particular type of soft WTA mechanism is obtained. In simulations, such a soft WTA response will later be shown to result in a robust self-organization process with component extraction capabilities. In the following, we will therefore apply inhibition function 2.2 with a dynamic increase of ν. This particular choice is motivated by its functional implications for processing and learning. As we will see later, we obtain for this type of inhibition (but not for the others), Gabor-like RFs if natural images are used as input. The other discussed choices of inhibition result in hard WTA and K-WTA mechanisms with other interesting functional consequences. For more details, we refer to systems such as those of Riesenhuber and Poggio (1999) and O'Reilly (2001) that rely on these types of computations.
Dynamics 2.4 models layer 2/3 populations with equal or at least approximately equal properties. Numerical simulations of the equation system 2.4 show that for small ν, small positive inputs are sufficient to activate the corresponding populations. Such a behavior could have been expected from the bifurcation diagrams in Figure 2. In Figures 2B and 2C, the area between the unstable points (dotted lines) and the symmetric stationary point (pi = pj for all i, j) is a projection of the phase space, which is attracted by the symmetric point. This indicates that for decreasing ν, an increasingly large volume of the phase space is attracted by the symmetric stationary point. For dynamics 2.4, this area extends to the region around zero activity (see Lücke, 2005, for explicit phase space plots). Thus, if all layer 2/3 populations are inactive (e.g., before stimulus onset), small, positive inputs are sufficient to activate the layer 2/3 populations. Instead of resetting all populations to their active state we can, alternatively, activate just the populations that receive the strongest inputs. Such a partial reset is shown in Figure 3 for a numerical simulation of equation 2.4 at νmin = 0.4. In the figure it can be seen that from a state of equally low activity in all populations, just the three populations with the strongest inputs are activated. Instead of continuous input, short impulses of excitation to some of the populations can also serve for a selective population activation (compare, e.g., Swadlow, 2002). Once some or all populations are activated, their activities quickly reach approximately the same values (see Figure 3). This is due to the self-stabilizing dynamics 2.4 and due to the external input being small compared to the input from within a population. Note that from the physiological point of view, partially resetting the system at stimulus onset represents a more plausible mode of operation as populations with small input remain relatively inactive. In numerical simulations of equation 2.4, together with synaptic plasticity as described later, it has turned out that for RF self-organization it is, except for a short initial phase, sufficient to activate just populations that receive input well above average (compare Figure 3).
2.3. Integration of Afferent Input.
Note that our assumption of linear computations in the first stages of columnar processing is surely an oversimplification of the actual computations. More realistic and advanced models (see, e.g., Miller, Pinto, & Simons, 2001, for an overview) are explaining neural response properties in layer 4 much more accurately. In this work, we restrict modeling to capture the minimal functional requirements for the first stages of processing that are anatomically consistent and still result in a robust self-organization of afferent fibers.
2.4. Self-Organization of Afferent Fibers.
Synaptic plasticity is ubiquitous in the cortex. It is presumably crucial for most aspects of cortical structuring, including the ontogenesis and maintenance of basic cortical microcircuits and the structuring of their interactions within and between cortical areas. In our model of a single cortical column, we do not model these types of plasticity. Instead we study the emergence of neuronal RFs based on local properties (e.g., of natural images). We therefore focus on the plasticity of afferents (e.g., from LGN). Note that systems that make use of top-down input or model longer-range correlations in the input would, with all probability, also require synaptic plasticity of lateral connections and plasticity of connections between columns in different areas (e.g., between V1 and V2).
We start in a state with all afferents initialized to the same value: . After an input is presented, ν is increased and the noise in equation 2.4 breaks the symmetry among the activities pα (compare Lücke, 2005), and hidden units are deactivated. RFs of populations that remain active are modified as soon as P(t) < χ. Because of equally initialized RFs, the deactivation of populations is random and results in some RFs being modified relatively strongly and others being not or just weakly modified for a given input. During the first ν-cycles, random selection dominates and results in small differences between RFs. These initial differences represent the structural seeds for later specialization.
An RF with similarity to a certain input component results in relatively high input to its corresponding layer 2/3 population if such a component is present in a given input. The population in this case is likely to remain active relatively long during a ν-cycle, and its RF is thus likely to further specialize to this component.
For our dynamics, the control of lateral inhibition represents the crucial part. If we learn after inhibition selects a single unit, as in hard WTA networks, a distributed encoding of the presented input is not observed even if it consists of easy-to-identify combinations of basic components. The same applies if we learn using an inhibition that selects out, for example, k units (K-WTA). In contrast, dynamics equation 2.7 in conjunction, with equations 2.3 and 2.4 modifies RFs during the process of deactivation of hidden units. As indicated above, such a competition implements a particular type of soft-WTA computation that will prove advantageous for learning. Note that this type of soft-WTA is different from the classical soft-max (Yuille & Geiger, 2003).
Equations 2.3 to 2.8 represent our model of a cortical column. Before applying the dynamics to input, we have to find a set of parameters that lets the system operate in the interesting nonlinear regime between no competition and a hard WTA competition. Choosing parameters is straightforward, and good results can be obtained for a large range of different values. To determine the particular set of parameters used in this letter, we have tuned them using the bars benchmark test we discuss below. Once the set of parameters is chosen, the system is extraordinarily robust with respect to different types of input. We use the same set of parameters throughout this letter (see the appendix).
3. Simulation Results
A column in V1 encodes a texture in its corresponding region of the visual field distributedly. That is, different subfeatures of a presented texture (e.g., color or spatial orientation) are encoded by different units. In this way, the combinatorial richness of real-world textures can be represented without an excessive use of neurons. Before we study natural images, let us examine learning of distributed representations in the context of a benchmark test for component extraction.
3.1. Nonlinear Component Extraction.
The so-called bars test (Földiák, 1990) is a standard benchmark test to qualitatively and quantitatively study the ability of systems to learn to represent input patterns distributedly. According to the test, an input pattern consists of superpositions of disjoint horizontal and disjoint vertical bars on a square gray-level image. In a generated image, each of the b different bars appears with probability po (usually ). The appearance probabilities are taken to be equal for all bars (see Figure 5A for some examples). The bars test models patterns that are generated by a nonlinear superposition of multiple hidden causes. The test was used in different versions with different numbers of bars and different systems (Földiák, 1990; Dayan & Zemel, 1995; Hinton, Dayan, Frey, & Neal, 1995; Hinton & Ghahramani, 1997; Charles & Fyfe, 1998; Hochreiter & Schmidhuber, 1999; Spratling & Johnson, 2002; Charles, Fyfe, MacDonald, & Koetsier, 2002; Lücke & von der Malsburg, 2004; Lücke, 2004; Spratling, 2006; Butko & Triesch, 2007). In Figure 5C, self-organization of afferents for a bars test with b = 16 bars and on a 16 × 16 pixel input image is shown for our system. As can be observed, the system has essentially found all bars after 100 × 103 ν-cycles. Remaining ambiguities dissolve thereafter, and RFs further refine. Note that the actual input used for the simulation in Figure 5C is input of a noisy bars test. For no noise, the system needs for b = 16 bars and k = 20 fewer than 9 × 103 ν-cycles to represent all bars in 50% of 100 simulations. The system's reliability, that is, the probability of successfully finding all bars in a simulation, is above 99% (all bars were found in all 100 simulations) and is hence higher than in most other systems. The system is, furthermore, robust against various changes of bars test parameters such as the number of bars and bar appearance probability po. The bars test is a nonlinear task because the components (i.e., the bars), do not combine linearly in their regions of overlap. The smaller the number of bars, the larger is the overlap, that is, the more pronounced is the nonlinearity. In the literature, b = 8 or b = 10 bars are therefore often used. We have also tested for these numbers of bars and used for both tests an appearance probability of —there are on average two bars per input (the common choice in the literature). For b = 8 bars and k = 10, learning needs fewer than 6 × 103 ν-cycles to extract all bars in 50% of 100 simulations. For b = 10 and k = 12, learning needs fewer than 7 × 103 ν-cycles in 50% of 100 simulations. A larger λχ in equation 2.8 results in faster learning times but decreases the robustness of the system. As was the case for b = 16 bars, we find all bars in all the simulations for b = 8 and b = 10.
A very common way to probe the robustness of a system is the use of noisy input images (Dayan & Zemel, 1995; Hinton & Ghahramani, 1997; Charles & Fyfe, 1998; Hochreiter & Schmidhuber, 1999; Spratling & Johnson, 2002; Charles et al., 2002; Lücke & von der Malsburg, 2004). This also addresses the fact that natural input can be expected to be relatively noisy. For our system, we apply the same bars test setup as used by Spratling and Johnson (2002). In 16 × 16 gray-level images, b = 16 bars appear with probability each, and pixels of a bar are set to gray value 1 (white) with background 0 (black). As by Spratling and Johnson (2002), k = 20 units are used for the test. In this case, our system finds all bars in all 100 simulations for noise levels with variance up to (σn)2 = 3.0 (see Table 1). The system discussed by Spratling and Johnson (2002) needs fewer learning examples but achieves, with especially tuned parameters, just 60% reliability for (σn)2 = 0.4 after 4 × 103 cycles. The system that has reported the highest noise robustness so far is described by Charles and Fyfe (1998). It achieves 70% reliability for (σn)2 = 1.0 (see Table 1) and 20% reliability if bit flip noise with % flipped pixels is used instead of gaussian noise. The column dynamics still finds all bars in all 100 simulations for up to 38% flipped pixels per input image. Being significantly more robust than the model of Charles and Fyfe (1998), the system presented here is thus the most noise-robust system in the bars benchmark test that has been suggested so far.
|.||Bit Flip Noise .||Gaussian Noise .|
|Model .||Noise Level .||Reliability .||Noise Level .||Reliability .|
|Spratling and Johnson (2002)||-||-||0.4||60%|
|Charles and Fyfe (1998)||33.3%||20%||1.0||70%|
|Cortical column dynamics||≤38.0%||>99%||≤3.0||>99%|
|.||Bit Flip Noise .||Gaussian Noise .|
|Model .||Noise Level .||Reliability .||Noise Level .||Reliability .|
|Spratling and Johnson (2002)||-||-||0.4||60%|
|Charles and Fyfe (1998)||33.3%||20%||1.0||70%|
|Cortical column dynamics||≤38.0%||>99%||≤3.0||>99%|
Note: Systems that report the highest noise tolerance so far, are compared on the basis of bit flip noise (noise level is given in percent of flipped pixels) and gaussian noise (noise level is given as variance).
3.2. Natural Images.
To model input to our dynamics that resembles input received by cortical columns in V1, we use gray-level image patches as input. We expect the model column to self-organize its RFs such that the activities of hidden units can represent the input distributedly. Input is taken from 20 randomly selected images of the van Hateren database (van Hateren & van der Schaaf, 1998) that do not contain man-made structures. We use difference of gaussians (DoG) transformed versions of these images to emulate the preprocessing of visual input by the retina and LGN. A standard deviation of σ+ = 1.0 pixel for the positive part and σ− = 3.0 pixels for the negative part is used, which is consistent with the biologically measured ratio of (compare Somers, Nelson, & Sur, 1995). From the 20 images, we randomly selected patches of N = 20 × 20 pixels whose values were scaled linearly to lie in the interval [0, 1]. For each ν-cycle, a different patch was selected. In Figure 6A, the RFs of a system with k = 100 hidden units are shown for DoG preprocessed input after being trained for 2 × 106 ν-cycles.3 Such a large number of patches taken from 20 images without man-made structures can be expected to be representative of natural images. The RFs in Figure 6A have the familiar Gabor-like shape (see also Figure 6C) and represent filters acting on the already DoG transformed images. For comparison with physiological data as obtained in simple-cell recordings, we first have to compute the corresponding filters that act on the raw images. For DoG preprocessed input, this amounts to a convolution of the RFs in Figure 6A using the same DoG kernel as for the preprocessing of the image (see the appendix). In Figure 6D, the resulting filters are shown for three examples. The filters are, in theory, infinitely large but are virtually zero for all pixels outside a central region of 40 × 40 pixels. For further analysis and as is customary in the literature, we match the real filters using Gabor wavelet functions (see, e.g., Jones & Palmer, 1987; Ringach, 2002). Matching works well in most cases, but the artificial rectangular shape of the patches can result in notable artifacts. The effect of these artifacts on the later analysis can be well understood, however, and they will be discussed below using RFs 4 and 31 in Figure 6D.
An analysis of the parameters of the matched Gabors shows a relatively even distribution of RF positions (data not shown). Plotting orientation versus frequency (see Figure 7A) shows a distribution similar to the ones obtained by using independent component analysis (ICA; Bell & Sejnowski, 1997; van Hateren & van der Schaaf, 1998), and sparse coding (Olshausen & Field, 1996; Olshausen, 2002). That is, the orientation preferences are relatively evenly distributed apart from a small preference of vertical and horizontal orientations. As for sparse coding and ICA, and unlike RFs measured in vivo, the RFs obtained in our model are clustered around a preferred frequency, which is given by (see Figure 7B). In our case, the frequency distribution is largely explained by the bandpass properties of the DoG preprocessing, which prefers frequencies in the range of the obtained filters.
The most notable feature of the RFs in Figure 6A is the variation in the widths and lengths of their gaussian envelopes. An analysis of properties of the envelopes relative to the wavelets' frequencies was suggested by Ringach (2002) as a means for comparing the RFs of computational models with in vivo measurements. For quantitative comparison, the RF parameters are plotted in the nx/ny-plane with nx = σxf and ny = σyf, where f is the frequency of the matched wavelet and σx and σy are the standard deviations of the gaussian envelope (compare Figure 6B). Figure 8A shows the distribution obtained by our model together with in vivo measurements of simple-cell RFs in macaque primary visual cortex (Ringach, 2002). Both distributions have similar variance and show a similar correlation between nx and ny, with a preference for RFs to be elongated in the ny-direction if they are distant from the origin. Note that the RF distribution in cat primary visual cortex also shows this property (Jones & Palmer, 1987; Ringach, 2002). The data show that our model RFs as well as RFs measured in vivo possess a continuous spectrum of gaussian envelopes. It stretches from gaussian envelopes that are circular or slightly stretched in nx-direction to RFs that are elongated in ny-direction. Two notable differences between the measured distribution and the RFs of the model are the absence of model RFs with values close to the origin and with values far away from it.
RFs with values near the origin are associated with what was called globular RFs by Ringach (2002), i.e., RFs with no or weak orientation preference. Although RFs with weak orientation preference are actually developed by our system (see, e.g., RF 31 in Figure 6D), the plot in Figure 8 does not show any points near the origin. The reason is that we use Gabor wavelets to match localized RFs whose positive and negative parts sum to zero. Even in the case of perfectly radial symmetric RFs, Gabor matching would break the symmetry to a preferred orientation (compare Figure 6D, third row). Because of this artifact introduced by Gabor matching, a globular RF with positive and negative parts cannot lie near the origin in a plot like Figure 8. The five RFs that presumably were most affected by this artifact have been marked in Figure 8A. They are the five most globular RFs, that is, the five RFs best fitted by a circular-symmetric Mexican-hat function. In contrast to the RFs emerging in our model, RFs measured in vivo have values near the origin in Figure 8 because they are not as severely affected by the artifacts of Gabor matching. Many of the simple-cell RFs measured by Ringach (2002) are essentially matched by a gaussian, which represents a degenerated wavelet with infinite wavelength.
The absence of RFs distant from the origin can be explained by the chosen data representation, which restricts RFs to a rectangular patch size. As can be seen in Figure 6D (second row), the Gabor match results in a gaussian envelope artificially restricted in ny-direction. Because of the range of preferred frequencies (see Figure 7), this prevents values of ny from being larger than ny ≈ 0.8. The cluster of data points near (0.5, 0.7) in Figure 8A can therefore be interpreted as a consequence of rectangular patch sizes (compare RF 4). Less artificial or larger patches would move some of these points (e.g., RF 4) farther away from the origin and presumably closer to the measured data. Another consequence of the finite rectangular patch size is that stretched-out RFs like RF 4 have a higher tendency to occupy the central regions of the patch. This might push smaller RFs closer to the patch's edges.
Comparison of the distribution obtained by our model and distributions predicted by sparse coding (Olshausen & Field, 1996; Olshausen, 2002) and ICA (Bell & Sejnowski, 1997; van Hateren & van der Schaaf, 1998) are shown in Figure 8B. Sparse coding seems to partly match the measured ny/nx-correlation but shows an incorrect preference for RFs elongated in nx-direction near (0.5, 0.5). The distribution of RFs obtained using ICA is concentrated on a small region on the bisecting line near (0.7, 0.7). ICA predicts neither the variety of gaussian envelopes nor any preference for RF elongation in any direction.
In contrast to sparse coding and ICA, the distribution obtained by our model is in good agreement with in vivo measurements (see Figure 8). Note, however, that computationally and experimentally obtained distributions can depend on various choices made in experiment and analysis. For our model, we have discussed the effects of rectangular patches that restrict RFs and the effect of Gabor matching, which can artificially remove globular RFs. Likewise, different preprocessing and analysis techniques can affect the RF properties of other computational systems (see Ringach, 2002, for a discussion). Nevertheless, the specific variability of gaussian envelopes seems to be a distinguishing feature of RFs obtained in our model as opposed to RFs obtained by different versions of sparse coding and ICA. This variability of RFs takes the form of a broad distribution in ny/nx-space with RFs elongated in ny-direction if they are distant from the origin (see Figure 8). The same holds true for in vivo measurements. For our RFs, this property is already recognizable by considering the raw RF data as displayed in Figure 6A.
3.3. Other Types of Input.
Other than to the inputs discussed, the system can be applied to various other types of input. To demonstrate the robustness of the self-organization process, we show, without further analysis, the RFs obtained after training on two further types. Again we use the same set of parameters as for the previous experiments (see the appendix). Figure 9A shows RFs if we use raw (i.e., unfiltered), images as input to the column. In this case, we also obtain RFs sensitive to different orientations and spatial frequencies. The RFs tend to be less localized, however. In Figure 9B, we use handwritten digits as input and show one randomly selected image of a digit per ν-cycle. In this case, the system encodes the input on the basis of more global features rather than using local components as for the other types of input.
In this letter, we have analyzed possible functional implications of recent results on the cortical interconnectivity in primary visual cortex. We found that the anatomical fine-scale structure of V1 cortical columns can result in a robust self-organization of afferents and in the emergence of Gabor-like RFs that show a higher similarity to physiological in vivo measurements than classical computational models.
4.1. Columnar Processing.
A main assumption of our column model is the existence of functionally disjoint subpopulations of excitatory neurons. Neurons within a population are taken to be connected to other neurons of their population but not to neurons of the other populations. Instead of modeling the populations explicitly (compare Lücke & von der Malsburg, 2004), we have used an abstract modeling approach to capture the essential dynamic properties that we can expect from such self-excitatory units. Our earlier modeling work (Lücke et al., 2002; Lücke & von der Malsburg, 2004) was based on single spiking neuron models and has been motivated by anatomical structures termed pyramidal cell modules (Peters & Yilmaz, 1993; Peters & Sethares, 1996) or minicolumns (see, e.g., DeFelipe, Hendry, Hashikawa, Molinari, & Jones, 1990; Mountcastle, 1997; Buxhoeveden and Casanova, 2002). In combined neuroanatomical and neurophysiological measurements, Yoshimura et al. (2005) found the most direct evidence for self-excitatory populations so far, although the relation of these populations to the cortical minicolumn has yet to be clarified. The main potential difference is that the concept of a cortical minicolumn requires neurons in a population to be spatially adjacent, whereas for neurons in the functional networks described by Yoshimura et al. (2005), this is not necessarily the case. Regardless of their exact spatial arrangement, populations of neurons as functional units have a number of advantages over single cells: they are robust to lesions such as cell death, they can easily establish a large number of connections with other populations, and they can implement a very fast population rate code. All of these aspects can better be discussed on the basis of a single-neuron-based model. However, for our abstraction, the velocity of population responses is important to determine the potential length of a ν-cycle. Simulations on the basis of populations of 100 relatively simple spiking neurons (Lücke & von der Malsburg, 2004) show that population activation and deactivation is possible in a few tens of milliseconds or faster. Other studies show comparably fast responses. For example, transitions of randomly connected populations (1000 neurons) from low- to high-activity states have been measured for different types of more detailed neuron models (Muresan & Savin, 2007). Transitions were found to occur within about 10 to 20 ms, which suggests that population deactivations with increased inhibition can occur on the same timescale. For populations of realistic neurons, this indicates that the period length of a ν-cycle can correspond to time intervals well below 100 ms.
The layer 2/3 populations of pyramidal neurons were reported (Yoshimura et al., 2005) to receive most of their excitatory input from within their own population and only a relatively small amount from layer 4 (see also Thomson, West, Wang, & Bannister, 2002). In our model, this observation is reflected by a small coupling (small κ in equation 2.4) of the layer 2/3 dynamics to layer 4 activities. The analyzed properties of the dynamics show that a weak coupling does not contradict high sensitivity to external stimuli if lateral inhibition is modulated around a critical value. Such an inhibitory modulation allows the system to reset itself to the attractor states required for a high sensitivity to external input. Instead of resetting all layer 2/3 populations for each new input, the system can just activate populations with significant input (see Figure 3). Such a partial reset does not influence the self-organization of RFs, except for a short initial period, and represents a neurophysiologically more realistic dynamic regime. Recruiting the relevant groups of neurons and selecting the strongest ones of them thereafter is reminiscent of Braitenberg's pump-of-thought (Braitenberg, 1978). He conjectured a phase of cortical processing in which Hebbian cell assemblies (Hebb, 1949) can easily be activated and a second phase in which the strongest assemblies are selected. Note, however, that Hebbian cell assemblies are thought of as being distributed across the cortex and would therefore correspond to a network consisting of a multitude of columns (compare, e.g., Palm, 1982; Fransen & Lansner, 1998)
In our system of columnar processing, the population activities in layer 4 are modeled as directly resulting from a linear integration of afferent input and feedforward inhibition (see equations 2.5 and 2.6). No explicit dynamics is assumed for layer 4. This approximation is certainly an oversimplification of the computational properties of the input layer. A number of more detailed models (see, e.g., Somers et al., 1995, or Miller et al., 2001, for an overview) describe the richness of layer 4 responses better than our model. However, we find that the assumption of linear integration with two types of feedforward inhibition is sufficient for extracting local image components. No simple-cell-like RFs are obtained if the integration stage is omitted, that is, if afferents are taken to be directly coupled to the layer 2/3 populations without feedforward inhibition.
Note that our model makes a clear prediction about columnar processing. It predicts that the layer 2/3 populations of pyramidal neurons carry crucial stimulus information. If we identified a functional population, as in Yoshimura et al. (2005), for example, and measured the RFs of at least some of the neurons, their RF average is predicted to be much less noisy than the individual RFs. This average is, in contrast to an average over an arbitrary set of layer 2/3 neurons, predicted to maintain the orientation and spatial frequency tuning of the individual neurons. Similarly, if we identified a disjoint subpopulation of simple cells in layer 4 that is associated with a layer 2/3 subpopulation, we could also compute an average RF of this population. As for a layer 2/3 population, the average RF of such a population is predicted to maintain the orientation and spatial frequency tuning of the individual neurons. Furthermore, our model suggests that the phases of all simple cells belonging to the same population vary much less than the phases across a population of arbitrary neighboring simple cells (compare DeAngelis, Ghose, Ohzawa, & Freeman, 1999). Finally, our model predicts that the spatial phase distinguishes populations in layer 2/3 as well as in layer 4. For layer 2/3, this behavior is likely to change, however, if a larger-scale model, which includes lateral excitatory connections targeting layer 2/3 cells, was considered (see the discussion below).
For the neural activity, the model predicts that input to neurons of an activated population is very noisy and that most of this noise originates from neurons within the population. Nevertheless, very few spikes from external neurons are predicted to be able to result in large differences in subsequent population rates. In particular, the downward slope of population activities relative to an increase in inhibition is predicted to carry important information for learning. These predictions are within the range of current experimental methods. Population RFs can, for instance, be computed by combining the methods of Yoshimura et al. (2005) with subsequent stimulus-evoked response recordings.
With respect to laminar stages of columnar processing, our model assumes a linear stage of stimulus integration in layer 4 and a nonlinear stage of stimulus evaluation in layer 2/3. Anatomically, and physiologically, the two stages of processing are consistent with recent measurements. In Martinez et al. (2005), cells with relatively linear push-pull responses are found to be almost exclusively located in layer 4 (with some in layer 6), and layer 2/3 is found to be populated with nonlinearly responding cells. Furthermore, our model supports the view that simple cells represent a separate class of cells (Martinez et al., 2005) in contrast to being one end of a continuous distribution of cells as suggested by Mechler and Ringach (2002).
A candidate for implementing the first stage of feedforward processing in our model is layer 6, the only other layer that receives significant input from LGN. Layer 6 neurons excite inhibitory neurons in layer 4 as reported, for example, by Ahmed, Anderson, Martin, and Nelson (1997) and as modeled by Raizada and Grossberg (2003). A direct excitation of inhibitory neurons in layer 4 by afferents to the column is, however, equally plausible (compare, e.g., Swadlow, 2002, for evidence on such cells in rat barrel cortex).
4.1.1. Columnar Processing and Learning.
The synaptic modification rule used in our model describes learning on the population level. That is, pα, yj, and Y in equation 2.7 all describe mean-field-like activities of populations of neurons and Rαj or R RFαj model the RFs of such populations. While the activities yj represent the input, activities pα represent the percept, that is, the meaning of the presented input. For instance, a high-activity pα during a ν-cycle signals the presence of a particular bar in an input of the bars test or the presence of an orientation, location, and spatial frequency in a natural image patch. From a functional perspective, the activities of these components representing units are the natural signal to be used for learning.4 If layer 4 cells use the layer 2/3 signal for learning, as would be desirable from the functional perspective, appropriate feedback mechanisms have to exist.
One possible such mechanism, which has been especially popular in the context of training perceptrons, would be backpropagating signals in axons of layer 4 cell that project to layer 2/3. Such signals have been shown to depend on postsynaptic cell activities (see, e.g., Tao, Zhang, Bi, & Poo, 2000, and Harris, 2008, for a review) and could therefore provide a suitable signal that is locally available in layer 4.
Another mechanism would be direct projections from layer 2/3 to layer 4. Such connections would also be desirable from the standpoint of models, e.g., by Harpur and Prager (1996) and Hinton et al. (1995), at least if these models are directly mapped onto the architecture of cortical layers. The number of direct connections from layer 2/3 to layer 4 seems comparably small, but for the purposes of our model, a small number of connections with axonal targets at specific locations on layer 4 cells would be sufficient.
Other than direct feedback, indirect feedback from layer 2/3 to layer 4 is a well-established feature of cortical processing (see, e.g., Douglas & Martin, 2004) and presumably represents a more plausible alternative than direct feedback. Via the deep layers 5 and 6, the activities of layer 2/3 cell populations could be conveyed back to layer 4 such that they are locally available in the input layer. Because of population rate codes as assumed in our model, the propagation of such signals would be very fast. In parallel to subpopulation-specific signals, cells in the deep layers that receive population-unspecific input from layer 2/3 pyramidal cells could conveniently convey the signal about the global layer 2/3 activity (P in equation 2.7 back to layer 4. In ART-like systems (e.g., Raizada and Grossberg, 2003), the pathway from layer 2/3 through the deep layers back to layer 4 is considered to play a crucial role for synaptic plasticity and the integration of top-down information (also compare Douglas & Martin, 2004). It therefore seems reasonable that this pathway is also used for learning in the absence of top-down input. To elaborate on these aspects and for the incorporation of top-down information, the modeling of the deep cortical layers represents a natural starting point for the extension of the system presented here.
Another future extension would be the modeling of longer-range lateral excitatory connections. The measurements by Yoshimura et al. (2005) indicate that disjointness of the excitatory layer 2/3 populations represents a good approximation for fine-scale columnar connectivity. However, without modeling an intercolumnar circuitry of lateral connections (e.g., within layer 2/3), the emergent RFs can capture the statistics of only local image patches. More global aspects of input to V1 are modeled by a number of computational models of the cortex's longer-range lateral circuitry (e.g., Somers et al., 1995; Ben-Yishai, Bar-Or, & Sompolinsky, 1995; Erwin, Obermayer, & Schulten, 1995; Fransen & Lansner, 1998; Li, 2002; Seriès, Lorenceau, & Frégnac, 2003, to name a few). These models could presumably be combined with our model of the first feedforward stages of processing in a straightforward manner. Lateral information could be integrated by feeding the activities of populations in adjacent columns into the nonlinear dynamics of layer 2/3. In such a combined model, response properties of layer 2/3 populations would differ more significantly from properties of layer 4 neurons, and learning in such a system could be expected to capture higher-order structures such as reported from complex cells. Lateral connectivity presumably also plays a crucial role in structuring activity at stimulus onset or in the absence of sensory stimuli. Even for no input, spontaneous excitation can spread through a network of later connections. If this network is structured, structured activity patterns can emerge. On the basis of our model, we can speculate that a patterned spontaneous activity (as, e.g., reported in Kenet, Bibitchkov, Tsodyks, Grinvald, & Arieli, 2003) corresponds to states of partially reset individual columns. The activity patterns of two neighboring columns would be correlated in that case because of a nonrandom lateral connectivity.
4.2. Emergence of Receptive Fields.
Our neural dynamics of self-excitation and inhibition (see equation 2.4) couples into the dynamics of synaptic modification of afferent fibers to the column (see equation 2.7). RF self-organization induced by equation 2.7 has been proven to be capable and flexible. The same system, always with the same set of parameters, was applied to different types of input. The noise robustness of the presented system by far exceeds the capabilities of the single-neuron-based systems (Lücke & von der Malsburg, 2004; Lücke, 2004). In quantitative measurements in the bars test, it is found to be significantly higher than noise robustness reported by other systems in the literature. This high robustness also enables the application to natural image stimuli that the systems discussed by Lücke and von der Malsburg (2004) and Lücke (2004) could not successfully be applied to. Before relating the model column to dynamic and functional systems described in the literature, we will discuss some assumptions used and the model's current limitations.
4.2.1. Static and Moving Visual Stimuli.
For RF learning, we assume stimuli to be static during a ν-cycle. This is certainly an approximation of real visual stimuli. The retinal representation of light intensities and thus the neural signal that enters the primary visual cortex through LGN is constantly changing. Even if we inspect a static scene, eye movements result in this scene's retinal representation to constantly shift across the arrays of cones and rods. The shift of the retinal image due to saccades is of subordinate importance here, as visual input during a saccade is usually considered negligible for visual processing (Carpenter, 1988). In the context of our model, we expect the relevant signals for RF formation to come from phases of fixation. During fixation, the retinal image changes due to fixational eye movements, which are subdivided into tremor, drifts, and microsaccades (see, e.g., Carpenter, 1988). The tremor is a very small movement with a mean amplitude of about 5 to 20 seconds of an arc in humans. Due to drifts, the retinal image shifts with a velocity of on average 4 to 6 minutes of arc per second in humans and monkeys (with mean amplitudes of a few minutes; note that measurements have large variances; see Carpenter, 1988, or Martinez-Conde, Macknik, & Hubel, 2004, for reviews). Microsaccades have roughly the same amplitude as drifts but are much faster, such that the dominant states in time are tremor and drift (more than 90% of the time). To estimate the relevance of tremor and drift, consider the extent of a typical simple cell RF. As discussed by Ringach (2002) a simple cell with parafovea RF typically responds optimally to sinusoidal gratings of about four cycles per degree and thus has a distance between inhibitory and excitatory subfields of about 7.5 min of arc. Displacements of the retinal image therefore start to have a significant effect on a simple cell's response from about 1 to 2 min onward. The tremor has an amplitude of about 5 to 20 seconds of an arc and is thus too small to have a relevant effect. For drifts that have amplitudes of a few minutes, the situation is different. We can, however, ask how short a time period has to be such that a stimulus can still be considered approximately static. If we assume a drift velocity of 5 min/sec, times below 200 ms would result in displacements of less than 1 minute of arc. Thus, during times of about 100 ms and below, the retinal image can be assumed to be sufficiently static relative to a typical RF.
This is, of course, not saying that visual motion can be neglected in general. Representations of correlations in time across different timescales and the representation of objects that move in relation to each other are at least as important as the representation of static object properties. Furthermore, simple cell RFs themselves have a temporal component that can be expected to capture temporal aspects of the visual input statistics. The incorporation of temporal properties of visual stimuli therefore represents a potential future extension of our model. For the current model, it is, however, interesting to note that for the inspection of relatively static scenes, we can assume input stimuli to be approximately static if processing intervals of less than or equal to 100 ms are considered.
4.2.2. Comparison to Other Dynamic Approaches.
In all simulations discussed in this work, RFs are equally and homogeneously initialized, and the small noise term in equation 2.4 is the only source of asymmetry that finally leads to RF diversity. However, regarding the development of the visual system, RFs that at the time of eye opening already show some sensitivity to different orientations would certainly be of great advantage. For our model, convergence to RFs that reflect the statistics of visual input can be expected to be much faster in this case. Furthermore, the energetically expensive reset of all populations in the beginning of the first ν-cycles could be avoided. A number of developmental models show how an interplay of excitation and inhibition can result in the emergence of orientation tuning without exposure to natural images (see, e.g., Grossberg & Seitz, 2003). They are consistent with the observation that at least some orientation tuning already exists at the time of eye opening (see, e.g., Chapman & Stryker, 1993), but they do not model the influence of the input statistics thereafter (see, e.g., White, Coppola, & Fitzpatrick, 2001, for experimental data). RFs of these developmental models are localized by definition as opposed to emerging from training on images as in many other models (see below). They can therefore not be quantitatively analyzed as in Figure 8, for example.
Among systems that study cortical learning, models based on adaptive resonance theory (ART; see Carpenter & Grossberg, 2003, for on overview) form an important class. Our model has some aspects in common with ART models: the threshold of synaptic plasticity dynamics, equation 2.7, is reminiscent of the vigilance parameter of ART models, and, as in ART, some choices of parameters result in hard WTA mechanism. In LAMINART (Raizada & Grossberg, 2003), ART is related to the laminar integration of upstream and downstream information and to the integration of lateral information through layer 2/3. In this respect, the scope of LAMINART exceeds the scope of our model. However, while ART models how the cortex can learn different input categories, it has not been used to extract input components and to explain the input-driven emergence of simple-cell RFs.
Another related model is the recently suggested CBA learning rule (Jerez & Atencia, 2005) which is related to BCM (see below for a discussion). The CBA rule has in common with our model that it also operates on DoG-filtered natural images and, more interesting, that learning also involves a polynomial of order 3 (although in the CBA model, the synaptic plasticity rule rather than the neural dynamics is of order 3). In contrast to our model, CBA describes learning on the single neuron level and requires negative weights and feedback connections. Learning using the CBA rule has been well investigated analytically. The resulting RFs are, however, not localized enough to allow an analysis as used in this letter.
A comparatively early system similar to ART and CBA was suggested by Barrow (1987). Like CBA and our system, it uses DoG-filtered natural images for learning. Competition among hidden units is implemented using a hard WTA selection. The resulting RFs have a Gabor-like shape, but only spatial frequency, phase, and orientation are learned. The localization is set by hand using a modulating gaussian function.
A number of other dynamical models (e.g., Harpur & Prager, 1996; O'Reilly, 2001; Spratling & Johnson, 2002; Charles et al., 2002; Spratling, 2006; Falconbridge, Stamps, & Badcock, 2006) have been suggested as models for cortical learning. Some of these models (e.g., Charles et al., 2002; Spratling & Johnson, 2002, and many more) require plasticity of inhibitory synapses of afferents alongside plasticity of excitatory ones (e.g., Charles et al., 2002; Spratling & Johnson, 2002). Furthermore, an explicit weight normalization is sometimes used (e.g., Spratling & Johnson, 2002; see Abbott & Nelson, 2000, for a critical discussion). In contrast, only plasticity of excitatory afferents from LGN is used in our model, and infinite weight growth is prevented by a synaptic decay term (also compare Falconbridge et al., 2006). The presented system is one of the few that is robust enough to be applicable to both types of benchmarks (compare Harpur & Prager, 1996; Charles et al., 2002; and the system of Földiák, 1990, which has recently been shown by Falconbridge et al., 2006, to be applicable to natural images if some modifications are made). Our dynamic model furthermore relates its functional properties to the fine structure of cortical columns and is the only one of the systems mentioned that so far has been shown to compare favorably with in vivo measurements of simple-cell RFs as depicted in Figure 8.
4.2.3. Comparison with Functional Approaches.
Alongside dynamical bottom-up models, functional models represent another class of models for cortical learning. They focus on the functional capabilities of the neural circuitries, while dynamical models are usually motivated by its anatomical and physiological properties. Classical examples are sparse coding (Olshausen & Field, 1996; Olshausen, 2002) and ICA (Bell & Sejnowski, 1997; van Hateren & van der Schaaf, 1998), which are often regarded as standard models for the emergence of simple cell RFs. Our dynamics is quantitatively compared to these models on the basis of the RFs that emerge if the different systems are trained on natural images. Similar to RFs emerging in ICA and sparse coding and similar to properties of simple-cell RFs (Jones & Palmer, 1987; Ringach, 2002), the RFs obtained by our model show sensitivity to different spatial orientations, frequencies, and locations. However, as analyzed by matching Gabor wavelets, the RFs obtained in the column model show a specific variety in the extents of their gaussian envelopes relative to their spatial frequencies. This feature is consistent with in vivo measurements of simple-cell-like RFs (Jones & Palmer, 1987; Ringach, 2002; compare Figure 8A). The presented dynamics represents presumably the first computational system with this specific variety; The resulting distributions shown by Osindero, Welling, and Hinton (2006) are in general broader than in vivo measurements, however. In the model suggested by Rehn and Sommer (2007), which implements a form of sparse coding, RF distributions are obtained that contain globular RFs. These RFs can be matched by degenerated Gabor wavelets and correspond to points near the origin of an nx/ny-plot. Otherwise the RFs are reminiscent of those obtained by sparse coding (compare Figure 8B). That is, they partly match the measured distributions well but show, distant from the origin, numerous RFs that are incorrectly elongated in nx-instead of ny-direction. Also note that the model of Rehn and Sommer (2007) was tuned to fit the data, while we have used the bars test for tuning. The classical models for the emergence of simple cell RFs (Olshausen & Field, 1996; van Hateren & van der Schaaf, 1998; Bell & Sejnowski, 1997), in particular ICA, do not accurately reproduce the experimental data (see Figure 8B).
The reason for the differences between functional approaches (e.g., Olshausen & Field, 1996; van Hateren & van der Schaaf, 1998; Bell & Sejnowski, 1997; and Rehn & Sommer, 2007) and the model presented here is presumably the type of competition between hidden units. In recognition models for overcomplete ICA and sparse coding, hidden units exhibit a particular type of anticorrelation5 that is less competitive than the soft WTA mechanism implemented by our model. If sparse coding and ICA-like systems are viewed as generative models (see, e.g., Dayan & Abbott, 2001), it can be shown that they assume the input components (the hidden causes) to combine linearly to generate the data. ICA and sparse coding share this linearity assumption with the vast majority of functional approaches. Importantly, the BCM learning rule (Bienenstock, Cooper, & Munro, 1982) can also be regarded as such a linear model. BCM is a computational model that showed more than two decades ago that Gabor-like RFs can emerge from training on natural images. An objective function for BCM was derived by Intrator and Cooper (1992). This function was subsequently shown to be in large parts equivalent to a particular (skewness-based) sparse prior of a linear generative model (Körding, Kayser, & König, 2003).
Linear approaches are plausible for many types of data, but the assumption of linear superposition of components seems difficult to justify for images (see, e.g., Spratling, 2006; Lücke & Sahani, 2008, for critical discussions). For example, if we regard objects or edges as hidden causes, they are subject to occlusion rather than linear superposition. More competitive generative models such as discussed by Saund (1995) and Dayan and Zemel (1995) or more recently by Lücke and Sahani (2008) represent better approximations for such nonlinear combinations. For learning, they all result in a competition between hidden units that is more strongly dominated by the unit that best explains the data. One of the learning algorithms discussed by Lücke and Sahani (2008) can explicitly be shown to have the form of a specific type of soft-WTA competition. Functionally, the presented dynamics is therefore more closely related to competitive generative models than to linear systems such as sparse coding or ICA-like approaches. However, none of the competitive models has been shown to develop Gabor-like RFs if applied to natural images. This might be because of the extensive computational costs that are often required for learning in generative model-based systems. However, in the computationally less demanding bars test, the competitive schemes (Saund, 1995; Dayan & Zemel, 1995; Lücke & Sahani, 2008) all successfully extract the true hidden causes. In contrast, the linear approaches of ICA and PCA are shown by Hochreiter and Schmidhuber (1999) to fail in this nonlinear task. That the system here discussed successfully extracts the true causes in the bars test is further evidence for its functional similarity to more competitive generative models.
For natural images, the presented model shows that Gabor-like RFs can be obtained using a soft-WTA competition between hidden units as opposed to types of competition related to a linear superposition assumption. Interestingly, the high similarity of the RF distribution emerging in our model to those measured in vivo might indicate that V1 processing is indeed based on a better approximation for occlusion-like scenarios than implicit in linear approaches—although this has to be said with all caution, for example, regarding the influences of different preprocessing and approximation schemes on RF shapes.
We have studied the neurodynamic implications of recent results on the fine structure of cortical columns. The emergent dynamics of neural activity balances excitation and inhibition to allow for sensitive decisions with respect to external stimuli. If the dynamics is coupled to synaptic plasticity, we find that a specific choice of lateral coupling results in self-organization of RFs with competitive component extraction capabilities. Most notably, the model shows the highest robustness in the noisy bars benchmark test so far, and if applied to natural images, it matches the variability of V1 simple-cell RFs better than classical methods do. The system models important functions associated with cortical columns and combines cortical anatomy and the emergence of simple-cell-like RFs in a computational model of the first stages of local columnar processing.
A.1. Simulation Details.
For the simulations, we chose parameter values based on the performance of the system in the standard bars test. The ratio a:κ:σ in equation 2.4 is set to 200:1:0.01 with κ = 25. We numerically simulated the dynamics using 1250 time steps per ν-cycle. A ν-cycle is taken to be of unit length (t in equation 2.3 runs from 0 to 1). In equation 2.3, νmin is equal to 0.4 (the critical value is νc = 0.5). The learning rate in equation 2.7 is ϵ = 0.02. Parameters of equation 2.8 are λχ = 5 × 10−5, aχ = 1.2, λν = 10−3, aν = 0.7. χ and νmax are initialized to values χ = 0.6 k and νmax = 0.45. During learning, νmax increases quickly to values near νc, and χ decreases slowly. Most of the parameters are chosen such that RF self-organization converges and the competition between the hidden units has soft-WTA characteristics. Small variations of the parameters usually result in small quantitative changes on benchmarks such as the bars test. Qualitative features such as the particular variability in the distribution of gaussian envelopes for natural images remain stable for a wide range of parameter values. We observed this variability in all our trials on DoG preprocessed image patches.
Note that in equation 2.3, instead of increasing ν for each stimulus anew, we can also use an oscillating ν while synchronizing the input stimuli with such an oscillation (compare Lücke & von der Malsburg, 2004). For the simulations, both procedures are equivalent. However, with respect to biological interpretability, the first could be seen as an increase of inhibition triggered by a stimulus, whereas the second would be associated with neural background oscillations.
A.2. Synaptic Plasticity.
For the neural populations of layer 4, we have assumed a linear dependency (see equation 2.6) between their activities I1 to Ik and input unit activities y1, …, yN. We can therefore use the afferent weights (Rαj) to define the RFs (R RFαj). Because we assumed linear processing in layer 4 (see equation 2.5), the afferents (Rαj) also determine the inputs to the layer 2/3 populations: . The vector (R effα1, …, R effαN) with could therefore be termed effective RF. For dynamics 2.7, it can be shown that ∑jRαj converges to one during learning, which implies that ∑jR RFαj converges to zero. Furthermore, the effective RFs converge to the RFs .
A.3. Analysis of Simulation Results.
To determine the learning time in the bars test, we say the system has found all bars if each bar is assigned to a population, for any pair of bars there is a pair of populations such that one bar is exclusively assigned to the one population and the other bars is exclusively assigned to the other, and the set of assignments remains the same for 10 × 103 ν-cycles. For b = k, the first two conditions simply amount to testing for a one-to-one mapping. A bar is assigned to a population if the population's probability of being active at the end of a ν-cycle is above average in the case that the bar is shown exclusively and without noise.
I thank Richard Turner, Pietro Berkes, Peggy Seriès, Caroline Möller, Maneesh Sahani, Christoph von der Malsburg, and Peter Dayan for their comments and suggestions and for reading versions of the manuscript for this letter. Furthermore, I thank Jan D. Bouecke for his help in acquiring preliminary simulation results and gratefully acknowledge funding by the Gatsby Charitable Foundation, the Hertie Foundation, and the BMBF project 01GQ0840 (Bernstein Focus Neurotechnology Frankfurt).
Note that the term hidden unit is taken from the literature on unsupervised learning. In the terminology of perceptrons, the layer 2/3 populations would be referred to as output units.
To reduce undesirable boundary effects, image patches are large (20 × 20 pixels) compared to patch sizes used by other methods—for example, 12 × 12 for ICA (Bell & Sejnowski, 1997) and sparse coding (Olshausen & Field, 1996; Olshausen, 2002). Larger patch sizes exceed the already extensive computational resources required to simulate dynamics 2.3 to 2.8. Note that the range of preferred spatial frequencies is determined by the DoG preprocessing (see Figure 7B). Increasing this frequency with a smaller DoG kernel is not possible because the standard deviation of its positive part is already as small as 1 pixel.
In probabilistic functional models, the so-called posterior probabilities of hidden units describe the presence or absence of input components. In the learning rules of such models, these probabilities emerge as the crucial entities.
Given an input, their joint probability distribution has high values in the vicinity of a hyperplane in the space of hidden unit activities.