## Abstract

Recently multineuronal recording has allowed us to observe patterned firings, synchronization, oscillation, and global state transitions in the recurrent networks of central nervous systems. We propose a learning algorithm based on the process of information maximization in a recurrent network, which we call *recurrent infomax* (RI). RI maximizes information retention and thereby minimizes information loss through time in a network. We find that feeding in external inputs consisting of information obtained from photographs of natural scenes into an RI-based model of a recurrent network results in the appearance of Gabor-like selectivity quite similar to that existing in simple cells of the primary visual cortex. We find that without external input, this network exhibits cell assembly–like and synfire chain–like spontaneous activity as well as a critical neuronal avalanche. In addition, we find that RI embeds externally input temporal firing patterns to the network so that it spontaneously reproduces these patterns after learning. RI provides a simple framework to explain a wide range of phenomena observed in in vivo and in vitro neuronal networks, and it will provide a novel understanding of experimental results for multineuronal activity and plasticity from an information-theoretic point of view.

## 1. Introduction

Recent advances in multineuronal recording have allowed us to observe phenomena in the networks of the central nervous system (CNS) that are much more complex than previously thought. The existence of interesting types of neuronal activity, such as patterned firings, synchronization, oscillation, and global state transitions, has been revealed by multielectrode recording and calcium imaging (Nadasdy, Hirase, Czurko, Csicsvari, & Buzsaki, 1999; Cossart, Aronov, & Yuste, 2003; Ikegaya et al., 2004; Fujisawa, Matsuki, & Ikegaya, 2006; Sakurai & Takahashi, 2006). However, in contrast to the rapidly accumulating experimental data, theoretical work attempting to account for this wide range of data has been slower to materialize. These new data are partly explained by the classical hypotheses proposed purely on theoretical grounds, such as the “cell assembly” of Hebb (1949). However, to explain a wider range of data, we have to extend the classical hypotheses on the basis of mathematics and information sciences.

Infomax is formulated in terms of the probability *P*(*x*) that a system takes state *x*. For example, *P*([1, 0, 0, 1]) = 0.01 means that the relative frequency with which the first and fourth neurons fire while the second and third remain silent is 1% over the duration of a long trial. The mutual information *I*(*X*; *Y*) of two discrete random variables *X* and *Y* with joint probability distribution *P*(*x*, *y*) and marginal probability distributions *P*(*x*) and *P*(*y*) is defined by

$$I(X; Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} P(x, y) \log_2 \frac{P(x, y)}{P(x)\,P(y)},$$

where $\mathcal{X}$ and $\mathcal{Y}$ are the sets of states. Taking logarithms to base 2, we measure the mutual information in bits. The mutual information *I*(*X*; *Y*) is the information shared by input *X* and output *Y*; in other words, it measures the reduction in the uncertainty of *X* due to the knowledge of *Y*, and vice versa. Thus, maximizing the mutual information of the input and output improves information transmission in a feedforward network. It has been proposed that infomax in feedforward networks may provide an explanation of the stimulus selectivity of neurons in CNSs (Tsukada, Ishii, & Sato, 1975; Atick, 1992; Bell & Sejnowski, 1995, 1997; Olshausen & Field, 1996; Lewicki, 2002). However, CNSs contain not only feedforward but also recurrent synaptic connections (see Figure 1B), which endow networks with many interesting phenomena, some of which have been reported recently and which several researchers have attempted to model (Diesmann, Gewaltig, & Aertsen, 1999; Maass, Natschläger, & Markram, 2002; Buonomano, 2005; Vogels & Abbott, 2005; Teramae & Fukai, 2007). Therefore, we attempt to extend infomax to the case of recurrent networks, in which the input to the neurons at time *t* consists of their own output at time *t* − 1 (see Figure 1C).
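The definition above is easy to check numerically. The following is a small illustrative sketch (not code from the letter; `mutual_information_bits` is our own name) that computes *I*(*X*; *Y*) in bits from a joint probability table:

```python
import numpy as np

def mutual_information_bits(joint):
    """Mutual information I(X;Y) in bits from a joint probability table.

    joint[i, j] = P(X = i, Y = j); rows and columns index discrete states.
    """
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)   # marginal P(x)
    py = joint.sum(axis=0, keepdims=True)   # marginal P(y)
    mask = joint > 0                        # 0 * log 0 is taken as 0
    return float(np.sum(joint[mask] * np.log2(joint[mask] / (px * py)[mask])))

# A channel that copies a fair binary X to Y perfectly shares H(X) = 1 bit.
perfect_copy = np.array([[0.5, 0.0],
                         [0.0, 0.5]])
print(mutual_information_bits(perfect_copy))   # -> 1.0

# Independent X and Y share no information.
independent = np.array([[0.25, 0.25],
                        [0.25, 0.25]])
print(mutual_information_bits(independent))    # -> 0.0
```

The two extreme cases bracket the behavior described in the text: perfect transmission preserves the full entropy of the input, while independence yields zero shared information.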

More specifically, a learning algorithm based on infomax in feedforward networks generates an information-efficient representation of the input in the output neurons of the feedforward network (see Figures 2A1 and 2A2). This algorithm adjusts the connection weights to realize the most efficient information transfer from the input to the output. In this way, a network with small mutual information of input and output, that is, large information loss (see Figure 2A1), evolves into a network that preserves a larger percentage of information (see Figure 2A2). If the optimization based on infomax is applied to a recurrent network in which the input to the neurons at time *t* consists of only their own output at time *t* − 1, the mutual information of two successive states is maximized; that is, the information loss through time is minimized. We call this form of infomax *recurrent infomax* (RI). An algorithm based on RI readjusts the connection weights of the recurrent network to change a random network with large information loss (see Figure 2B1) into an information-efficient network (see Figure 2B2). The role of RI is to allow a recurrent network to optimize its synaptic connection weights so as to maximize information retention, and thereby minimize information loss, by maximizing the mutual information of temporally successive states of the network.

In this letter, proposing a learning algorithm based on RI, we find that feeding in external inputs consisting of information obtained from photographs of natural scenes into an RI-based model of a recurrent network results in the appearance of Gabor-like selectivity quite similar to that existing in simple cells of the primary visual cortex (V1). More important, we find that without external input, this network exhibits cell assembly–like and synfire chain–like stereotyped spontaneous activity (Hebb, 1949; Abeles, 1991; Diesmann et al., 1999) and a critical neuronal avalanche (Beggs & Plenz, 2003; Teramae & Fukai, 2007; Abbott & Rohrkemper, 2007). RI provides a simple framework to explain a wide range of phenomena observed in in vivo and in vitro neuronal networks, and it should provide a novel understanding of experimental results for multineuronal activity and plasticity from an information-theoretic point of view.

## 2. Methods

*N* neurons are connected according to the weight matrix *W _{ij}*, and their firing states [*x _{i}*(*t*) = 1 (fire) and 0 (quiescent)] at time step *t* are synchronously updated to time step *t* + 1. The firing state *x _{i}*(*t* + 1) of neuron *i* at time step *t* + 1 is determined stochastically with the firing probability

$$P\big(x_i(t+1) = 1\big) = \frac{p_{\max}}{1 + \exp\!\big(-\sum_j W_{ij} x_j(t) + h_i(t)\big)}, \tag{2.1}$$

where *h _{i}*(*t*) is the threshold of neuron *i* and *p*_{max} is the maximal firing probability. When the maximal firing probability *p*_{max} = 0.5, a neuron fires on average once every two time steps, even if the neuron receives a sufficiently strong excitatory input at every time step. A small value of *p*_{max} thus makes the firing of the neurons quite unreliable. In contrast, if *p*_{max} is close to 1, it is highly probable that a strong input makes a neuron fire. Thus, *p*_{max} determines the reliability with which a model neuron fires in response to an input.

To fix the mean firing probability of neuron *i* to a target value $\bar{p}$, we update the threshold of neuron *i*, *h _{i}*(*t*), at each step according to

$$h_i(t+1) = h_i(t) + \epsilon \big( x_i(t+1) - \bar{p} \big), \tag{2.2}$$

where the learning rate ϵ for the threshold is set to 0.01 in all simulations. Equation 2.2 fixes the mean firing probability of neuron *i* in such a manner that the threshold rises when the neuron fires and falls when the neuron remains silent. When the firing states and the thresholds are updated by equations 2.1 and 2.2 for a sufficiently long sequence of time steps, *h _{i}*(*t*) stops increasing or decreasing and starts fluctuating around a certain value. Then the time average of the second term on the right-hand side of equation 2.2 vanishes, and thereby the time average of *x _{i}*(*t*), that is, the firing rate of neuron *i*, becomes equal to $\bar{p}$. Thus, the mean firing probability is fixed to $\bar{p}$.

Input *x _{i}*(0) to the neurons at the first step *t* = 1 of the simulation was set to 0, and in the following steps, *x _{i}*(*t*) was determined stochastically with equation 2.1. Unless otherwise stated, the neurons in the model network receive no inputs other than their own outputs at the previous step, and thereby the dynamics of the network are completely determined by equations 2.1 and 2.2 (see Figure 3A).
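As a concrete illustration, the dynamics described by equations 2.1 and 2.2 can be sketched in a few lines of Python. This is a hypothetical re-implementation, not the authors' C++ program; the sigmoidal form of equation 2.1 follows the reconstruction in the text, and the parameter values ($\bar{p}$ = 0.05, *p*_{max} = 0.5, ϵ = 0.01) are taken from the ranges quoted in this section:

```python
import numpy as np

rng = np.random.default_rng(0)

N, p_bar, p_max, eps = 50, 0.05, 0.5, 0.01
W = rng.uniform(-0.5, 0.5, size=(N, N))   # initial random weights, as in the text
h = np.zeros(N)                           # thresholds start at 0
x = np.zeros(N)                           # x_i(0) = 0

rates = np.zeros(N)
steps = 20000
for t in range(steps):
    # Eq. 2.1 (as reconstructed): sigmoidal firing probability capped at p_max.
    p_fire = p_max / (1.0 + np.exp(-(W @ x - h)))
    x = (rng.random(N) < p_fire).astype(float)
    # Eq. 2.2: the threshold rises when the neuron fires and falls when it is
    # silent, pinning the mean firing rate to p_bar.
    h += eps * (x - p_bar)
    rates += x
rates /= steps
print(rates.mean())  # ~ p_bar once the thresholds have settled
```

After an initial transient in which the thresholds climb, the homeostatic rule drives each neuron's time-averaged firing rate toward $\bar{p}$, as argued in the text.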

We performed simulations in blocks consisting of 20,000 to 100,000 time steps, updated *W _{ij}* at the end of each block, and then started the calculation for the next block (see Figure 3B). Outputs of the neurons at the last step of block *b* − 1 were given as inputs to the neurons at the first step of block *b*. A simulation consists of 500 to 15,000 blocks.

The weight matrix starts from a random matrix *W*^{initial}_{ij} and develops toward an optimized matrix *W*^{optimized}_{ij}. The evolution of the weight matrix is determined by the gradient ascent algorithm

$$W_{ij}(b+1) = W_{ij}(b) + \eta \, \frac{\partial I(b)}{\partial W_{ij}},$$

where *W _{ij}*(*b*) is the connection weight *W _{ij}* in block *b*, *I*(*b*) is the mutual information of two successive states of the network in block *b*, and η is the learning rate. To avoid *W _{ij}* increasing without bound, it is bounded above and below by *w*_{limit} and −*w*_{limit}, respectively.
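The block-wise update with the bound on *W _{ij}* can be sketched as follows. This is a minimal illustration: `update_weights` is our own name, and `grad_I` merely stands in for the gradient of the approximate mutual information derived in appendix A (any array of matching shape works here):

```python
import numpy as np

def update_weights(W, grad_I, eta=1.0, w_limit=100.0):
    """One block of gradient ascent, W_ij(b+1) = W_ij(b) + eta * dI/dW_ij,
    with every weight clipped to [-w_limit, w_limit] to keep it bounded."""
    return np.clip(W + eta * grad_I, -w_limit, w_limit)

# grad_I stands in for the gradient of the approximate mutual information.
W = np.zeros((3, 3))
g = np.full((3, 3), 60.0)
W = update_weights(W, g)   # all entries become 60
W = update_weights(W, g)   # 120 would exceed w_limit, so entries clip to 100
print(W[0, 0])             # -> 100.0
```

Clipping rather than renormalizing keeps the update local to each weight, matching the simple bound described in the text.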

We estimate the mutual information *I*^{(n)}(*b*) in block *b* of two states separated by *n* − 1 steps from the correlations of the firing states over $\mathcal{T}_b$, the set of steps in the latter half of block *b*, where *T*, half of the number of steps contained in a block, is the number of steps in $\mathcal{T}_b$. The connection weights *W _{ij}* are updated using the correlations in the latter half of the steps in a block, so that *h _{i}*(*t*) can converge during the earlier half of the steps of the block after *W _{ij}* has been updated. *I*^{(1)} is an approximation of the mutual information of two successive states, to be maximized (see appendix A for the derivation).

All models in this letter can be fully characterized by the parameters *N* (50–432), $\bar{p}$ (0.002–0.05), *p*_{max} (0.25–0.95), η (0.2–20), ϵ (0.01), and *w*_{limit} (100–1000). Parameter values used in simulations are included in figure captions. At the beginning of the simulation, *W _{ij}* was drawn from a uniform distribution on [−0.5, 0.5], and *h _{i}* was set to 0.

The natural images were preprocessed with a high-pass filter with cutoff frequency *f _{c}* = 256 (Olshausen & Field, 1997). The processing in the early visual system, such as the retina and the lateral geniculate nucleus, can be regarded as high-pass filtering, and the output neurons correspond to the neurons in V1. The filtered image data were used to generate firing patterns of input neurons by taking 12 × 12 randomly selected image patches and then converting these to 288 binary inputs. The on-input and off-input neurons fired only when the intensities of the corresponding pixels had positive and negative signs, respectively. For each pixel *i* of the input, α|*d _{i}*| was compared to a random value *u _{i}* drawn from a uniform distribution on [0, 1], where *d _{i}* is the intensity of pixel *i* = 1, …, 144, and α is a constant parameter. If α|*d _{i}*| > *u _{i}*, the state of the corresponding input neuron was set to 1, and if α|*d _{i}*| ⩽ *u _{i}*, it was set to 0. We set the parameter α to fix the mean firing probability of the input neurons at around 0.15. Under this condition, a pixel caused the firing of its input neuron with a probability proportional to the pixel's intensity, except for the 5% of pixels whose intensities were larger than 1/α. The simulation program was written in C++.
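The stochastic on/off encoding of a filtered patch can be sketched as follows. `encode_patch` is a hypothetical helper (the layout of on- versus off-neurons in the output vector is our assumption); the firing probability is implicitly capped at 1 by the comparison with *u* ∈ [0, 1]:

```python
import numpy as np

rng = np.random.default_rng(1)

def encode_patch(patch, alpha):
    """Convert a 12x12 filtered image patch into 288 binary inputs:
    144 on-neurons (positive pixels) followed by 144 off-neurons
    (negative pixels). Pixel i drives its neuron when alpha*|d_i| > u_i."""
    d = patch.ravel()                  # pixel intensities d_i, i = 1..144
    u = rng.random(d.size)             # u_i uniform on [0, 1]
    fires = alpha * np.abs(d) > u      # stochastic firing decision
    on = (fires & (d > 0)).astype(int)
    off = (fires & (d < 0)).astype(int)
    return np.concatenate([on, off])   # 288 binary inputs

patch = rng.normal(size=(12, 12))
spikes = encode_patch(patch, alpha=0.2)
assert spikes.shape == (288,)
# An on-neuron and its off-partner can never fire for the same pixel,
# since a pixel's intensity has only one sign.
assert not np.any(spikes[:144] & spikes[144:])
```

Scaling α trades off sparsity against fidelity; the text tunes it so that the input neurons fire with mean probability of about 0.15.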

## 3. Results

We first observed the behavior of this model network under external input. Image patches from a photograph preprocessed by a high-pass filter were used as the external input (see Figure 5A). The neurons in this network were divided into three groups of 144: on-input, off-input, and output neurons, assigned randomly within the network (see Figure 5B1). Pixels with positive and negative values in a randomly selected 12 × 12 image patch excited the corresponding on-input and off-input neurons, respectively. The states of the input neurons were stochastically set to 1 or 0 with firing probabilities proportional to the intensities of the corresponding pixels, whereas the states of the output neurons were not set by the external input (see Section 2 for details). Instead, the firings of these neurons were determined by equation 2.1 with *p*_{max} = 0.95. Initially, the connection weight *W _{ij}* was a random matrix (see Figure 5C1), and averaging the image patches that evoked firings in each output neuron, we found that the output neurons did not exhibit clear selectivity with respect to the external input from the input neurons (see Figure 5D1). After learning, however, the network self-organized a feedforward structure from the on-input and off-input neurons to the output neurons (see Figures 5B2 and 5C2). The output neurons became highly selective to Gabor function–like stimuli (see Figure 5D2), exhibiting behavior quite similar to the selectivity of simple cells in V1 (Hubel & Wiesel, 1959). Our optimization algorithm based on RI hence caused the model network to become organized into a feedforward network containing simple cell–like output neurons. It has been shown that infomax accounts for the selectivity of simple cells (Bell & Sejnowski, 1995, 1997).
Bell and Sejnowski (1997) argued that the natural image patches are composed of independent localized edges such as Gabor functions and that these components can be recovered by maximizing the mutual information of the input and the output. We thus see that this result is consistent with the previous studies based on information theory.

In the simulation described above, the external input was fed into a network with high response reliability (*p*_{max} = 0.95). Next, we examined the evolution of the spontaneous activity in a neuronal network without external input. In this network, the approximate mutual information *I*^{(1)} of two successive states was maximized, and the approximate mutual information *I*^{(n)} of two states separated by *n* − 1 steps became larger after learning than before learning (see Figure 6A). We supposed that this improvement in information retention was a result of the emergence of repeated activity in the network. To identify repeated activity in the model network, we defined a repeated pattern as a spatial pattern of neuronal firings that occurs at least twice in the latter half of a test block (see Figure 6B). Coloring repeated patterns consisting of three or more firing neurons in raster plots of the network (see Figures 6D1 and 6D2), we found that the number of repeated patterns increased after learning. Several patterns were repeated in a sample of 250 steps, as seen in Figure 6D2, where the repeated patterns are indicated by consistently colored circles and connected by lines. Moreover, some patterns appeared to constitute repeated sequences. For example, sequence A, composed of the magenta, orange, and purple patterns, appears three times in Figure 6D2. To quantify the increase in repetition, we tabulated the numbers of occurrences of repeated patterns and sequences and compared these numbers before and after learning (see Figure 6C). We found that both repeated patterns and repeated sequences increased significantly after learning. This indicates that the algorithm embeds not only repeated patterns but also repeated sequences of firings into the network structure as a result of the optimization.
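The definition of a repeated pattern lends itself to a short sketch. `repeated_patterns` is an illustrative function of our own, and the raster orientation (steps × neurons) is our assumption:

```python
from collections import Counter
import numpy as np

def repeated_patterns(raster, min_neurons=3):
    """Find spatial firing patterns occurring at least twice in a raster.

    raster: (steps, N) binary array; a pattern is the set of neurons firing
    at one step. Patterns with fewer than min_neurons firing are ignored,
    mirroring the three-or-more-neurons criterion in the text."""
    counts = Counter(
        tuple(int(i) for i in np.flatnonzero(step))
        for step in raster
        if step.sum() >= min_neurons
    )
    return {p: c for p, c in counts.items() if c >= 2}

raster = np.array([
    [1, 1, 1, 0],   # pattern {0, 1, 2}
    [0, 0, 0, 1],   # too few firing neurons; ignored
    [1, 1, 1, 0],   # {0, 1, 2} again -> a repeated pattern
    [1, 0, 1, 1],   # {0, 2, 3}, occurs only once
])
print(repeated_patterns(raster))   # -> {(0, 1, 2): 2}
```

Applied to the latter half of a test block, such a count gives the kind of before/after comparison tabulated in Figure 6C.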

When a pattern in a sequence is activated at one step, it is highly probable that the next pattern in that sequence will be activated at the next step. This predictability means that the state of the network at one time step shares much information with the state at the next time step. In contrast, when the dynamics of a network are highly stochastic and repeated patterns are therefore rare, we cannot predict which pattern follows a given pattern or reduce the uncertainty of the next pattern by using knowledge of the current one. In this case, the mutual information of two successive states is low. To retain information efficiently in a recurrent network, then, sequences must be repeatedly activated and the network must behave deterministically. Hence, we conclude that the repeated activation of an embedded sequence is an efficient way to maximize information retention in a recurrent network. Such repeated patterns and sequences have been experimentally observed in vivo (Skaggs & McNaughton, 1996; Sakurai & Takahashi, 2006; Yao, Shi, Han, Gao, & Dan, 2007) and in vitro (Cossart et al., 2003; Ikegaya et al., 2004), and their existence is suggested by the theory of cell assemblies proposed by Hebb (1949) and the theory of synfire chains proposed by Abeles (1991). We thus see that RI accounts for the appearance of cell assemblies, sequences, and synfire chains in neuronal networks.

In the simulations shown above, a small fraction of connections grew especially strong in the network after learning (see Figure 6E2). So we ask, Is the existence of a small number of strong connections a sufficient condition for efficient information transfer? To answer this, we randomly shuffled the components of the weight matrix of the network after learning shown in Figure 6, and we found that shuffled networks exhibited lower mutual information and a smaller number of occurrences of repeated sequences (see Figures 7A and 7B). Thus, the existence of strong connections does not necessarily imply that the network is efficient in retaining information. RI improves information retention in recurrent networks, while randomly introducing strong connections does not.

We next examined the behavior of the same spontaneous model in the case that the maximal firing probability was small (*p*_{max} = 0.5). For small *p*_{max}, the number of identically repeated sequences is small, and the network seems to lose structured activity. However, we found characteristic network activity consisting of firing in bursts (see Figure 8A2), which are defined as consecutive firing steps that are immediately preceded and followed by “silent” steps with no firing. We found that after learning, the distribution *P*(*s*) of the burst size *s*, which is the total number of firings in a burst, obeys a power law *P*(*s*) ∝ *s*^{γ} with γ ≈ −1.5, whereas before learning, we have *P*(*s*) ∝ exp(−α*s*) (see Figure 8C). This result is consistent with experimental results. Recently, Beggs and Plenz (2003) recorded the spontaneous activity of an organotypic culture from the cortex using multielectrode arrays. Defining an avalanche similarly to our bursts, as activity following a period of inactivity, they found that the size distribution of avalanches is accurately fit by a power law with exponent −1.5. To explain this, they argued that a neuronal network is tuned to minimize information loss and that this is realized when one firing induces an average of one firing at the next step. They showed that this condition yields the universal exponent −3/2, using the self-organized criticality of the sandpile model (Bak, Tang, & Wiesenfeld, 1987; Harris, 1989). This condition also holds for the present network because, after learning, each neuron with *p*_{max} = 0.5 had two strong input connections and two strong output connections on average (see Figure 8B2). The universal exponent −3/2 was observed in the network for small *p*_{max} (see Figure 8C) but not for *p*_{max} = 0.95. Indeed, for *p*_{max} = 0.95, the size distribution of bursts *P*(*s*) did not exhibit a power law and instead displayed several peaks, reflecting the existence of stereotyped sequences (data not shown). We thus conclude that RI embeds information-efficient structures, in which one firing induces on average one firing at the next step, in a network with small *p*_{max}.
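The burst definition above (consecutive firing steps bounded by silent steps, with size equal to the total number of firings) can be sketched as follows; `burst_sizes` is our own illustrative helper:

```python
import numpy as np

def burst_sizes(raster):
    """Sizes of bursts in a (steps, N) binary raster: a burst is a maximal
    run of consecutive steps containing at least one spike, bounded by
    silent steps; its size is the total number of firings in the run."""
    active = raster.sum(axis=1)        # number of firings per step
    sizes, current = [], 0
    for a in active:
        if a > 0:
            current += a               # extend the current burst
        elif current > 0:
            sizes.append(current)      # a silent step closes the burst
            current = 0
    if current > 0:                    # burst running at the end of the record
        sizes.append(current)
    return sizes

raster = np.array([
    [0, 0], [1, 0], [1, 1], [0, 0], [0, 1], [0, 0],
])
print(burst_sizes(raster))  # -> [3, 1]
```

A histogram of such sizes on log-log axes is what distinguishes the exponential regime before learning from the power-law regime after learning.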

To reveal the essential mechanism responsible for the behavior described above, we returned to the recurrent network with an external input (see Figure 9). It has been observed that the hippocampal firing sequences in the awake state are repeated during sleep (Skaggs & McNaughton, 1996; Louie & Wilson, 2001) and that the spontaneous spiking activity in the visual cortex mimics the movie-evoked response after repeated exposure to a movie (Yao et al., 2007). We investigated whether the firings presented during the learning period are replayed by the model after learning. In the learning blocks, we repeatedly stimulated neurons 1, 3, and 2 in sequence (see Figures 9A1 and 9B1): the state of neuron 1 was set to 1 (fire) at random intervals ranging from 50 to 99 steps (call this time step *t*), the state of neuron 3 was set to 1 at *t* + 2, and the state of neuron 2 was set to 1 at *t* + 6. In the subsequent test block, in which only neuron 1 was stimulated externally (see Figure 9A2), the firing of neuron 1 was followed by spontaneous firings of neurons 3 and 2 (see Figure 9B2, arrows). In addition, the spontaneous firing of neuron 1 triggered the sequence containing the firings of neurons 3 and 2 (see Figure 9B2, double arrows). The form of the weight matrix after learning reveals that a feedforward structure starting from neuron 1 (1 → 7, 34 → 3, 5 → 49 → 18 → 11, 28 → 2) was embedded in the network (see Figure 9C). This structure self-organizes in the network because, as we saw above, embedding a sequence of firings into the network structure is an efficient way to retain information. It is thus seen that RI embeds externally input temporal firing patterns into the network by producing feedforward structures, and, as a result, the network can spontaneously reproduce the patterns.

## 4. Discussion

In this study, we have found that infomax in recurrent networks acts to optimize the network structure by maximizing the information retained in the recurrent network. Much previous work concerning infomax in feedforward networks (Linsker, 1988; Atick, 1992; Bell & Sejnowski, 1995, 1997; Lewicki, 2002) has suggested that the stimulus selectivity of neurons in CNSs is accounted for by infomax in feedforward networks. In contrast, although infomax in recurrent networks has been studied, it has been applied only to small recurrent networks that can be explored by random search (Ay, 2002). This is because the analysis of recurrent networks is complicated by history-dependent dynamics due to the recurrent connections. In the model presented here, by approximating the mutual information of two successive states with second-order correlations of neuronal firings, we succeeded in deriving an algorithm that maximizes information retention in recurrent networks. The model reproduced the self-organization of simple cell–like selectivity shown in previous models, and we successfully extended these previous results to the spontaneous activity characteristic of recurrent networks. In the context of a simple maze task, for example, these repeated patterns can be regarded as memory traces representing spatial cues and relationships between successive items, and they have been supposed to help an animal in solving the maze task (Dragoi & Buzsaki, 2006). An internal representation of the external input is essential for adaptation to environments, and this internal representation is constructed by RI in the form of feedforward structures.

We have found that infomax in recurrent networks reproduces the self-organization of cell assemblies and neuronal avalanches. In contrast, most previous theoretical studies on cell assemblies, synfire chains, and neuronal avalanches investigated the dynamics of neuronal firings on a network in which a feedforward structure underlying this characteristic type of activity had already been embedded (Diesmann et al., 1999; Beggs & Plenz, 2003; Teramae & Fukai, 2007). Although these models successfully reproduced experimental results, they could not explain how the embedded network structure emerges. A recent theoretical study suggested that neuronal avalanches are accounted for by a simple model for the growth of dendritic and axonal processes (Abbott & Rohrkemper, 2007). It seems that this model self-organizes a network structure that maximizes retained information, as in our model.

In our model, the network structure self-organized by the optimization algorithm resulted in simple cell–like activity, repeated sequences, and neuronal avalanches. Through evolution, animals have acquired CNSs, which are extremely efficient information processing devices that improve an animal's adaptability to various environments. It is thus quite natural that these phenomena can be regarded as a result of the optimization of information retention. In this letter and our model, we have therefore focused on information retention in a recurrent network, although CNSs should be optimized not only for information retention but also for categorization and generalization. Previous studies have shown that experimentally observed and theoretically proposed synaptic plasticity rules optimize the information transmission of individual synapses (Toyoizumi, Pfister, Aihara, & Gerstner, 2005; Pfister, Toyoizumi, Barber, & Gerstner, 2006). Thus, neuronal networks with local plasticity rules optimized to retain information could reproduce the experimental results of repeated activity patterns and avalanches. However, the learning rule of our model is not local and requires global information. We can optimize the activity of, for example, half of the neurons in the network if we approximate the mutual information of these *N*/2 neurons using the *N*/2 × *N*/2 correlation matrix and update the connection weights among these neurons, leaving the other connection weights unchanged. We then observe that the number of occurrences of repeated sequences increases after this learning, but not as much as in the simulation shown in Figure 6 (data not shown). Although this learning rule requires information on only half of the neurons in the network, it is still not local: it requires global information on the activity of these *N*/2 neurons.
To overcome this problem, our next goal is to derive a biologically plausible plasticity rule in a bottom-up way employing RI and to compare this rule with experimentally obtained plasticity rules. We believe that RI will help us understand the meaning of in vivo and in vitro experimental results, particularly to characterize the spontaneous activity of neurons in the context of information theory.

## Appendix A: Algorithm

Here we describe the algorithm to maximize the mutual information of the present state, *X*, and the next state, *X*′, of the network.

The *N* neurons receive as input an output **x** = [*x _{i}*(*t*)] at time *t* and generate an output **x**′ = [*x _{i}*(*t* + 1)] at time *t* + 1. Neuron *i* takes two states: a firing state, *x _{i}* = 1, and a nonfiring state, *x _{i}* = 0. The firing probability of neuron *i* at time *t* + 1 is given by equation 2.1. We assume that *W _{ij}* can take positive and negative values, with positive and negative *W _{ij}* corresponding to excitatory and inhibitory connections, respectively. The threshold *h _{i}*(*t*) evolves according to equation 2.2 and fixes the mean firing probability of neuron *i* to $\bar{p}$.

We first consider the entropy of the state, *H*(*X*), and the entropy of the joint distribution of two successive states, *H*(*X*, *X*′). Let *P*(**x**) be the probability that the state of the network is **x** = [*x _{i}*], and *P*(**x**, **x**′) be the probability that the states of the network at consecutive steps are **x** and **x**′, respectively. Then these entropies are defined by

$$H(X) = -\sum_{\mathbf{x}} P(\mathbf{x}) \log P(\mathbf{x}), \qquad H(X, X') = -\sum_{\mathbf{x}, \mathbf{x}'} P(\mathbf{x}, \mathbf{x}') \log P(\mathbf{x}, \mathbf{x}').$$

If the distribution of the state **x** is given by a gaussian distribution with the correlation matrix *C*, where $C_{ij} = \langle x_i x_j \rangle - \langle x_i \rangle \langle x_j \rangle$, the entropy of the state is $H(X) = \frac{1}{2} \log \big( (2\pi e)^N |C| \big)$ (Cover & Thomas, 2006), and the entropy of the joint distribution of two successive states **x** and **x**′ is given by $H(X, X') = \frac{1}{2} \log \big( (2\pi e)^{2N} |D| \big)$ if this joint distribution is gaussian with correlation matrix

$$D = \begin{pmatrix} C & E \\ E^{\mathsf{T}} & C' \end{pmatrix},$$

where $E_{ij} = \langle x_i x'_j \rangle - \langle x_i \rangle \langle x'_j \rangle$, $C'_{ij} = \langle x'_i x'_j \rangle - \langle x'_i \rangle \langle x'_j \rangle$, and $\langle \cdot \rangle$ denotes the average over time steps. Therefore, the mutual information of two successive states **x** and **x**′ is given by

$$I(X; X') = H(X) + H(X') - H(X, X') = \log |C| - \frac{1}{2} \log |D|. \tag{A.1}$$

Here we have used *C*′ = *C*, that is, the fact that the correlation matrix of **x**′ is identical to the correlation matrix *C* of **x**. We assume that recurrent infomax is realized by maximizing the value of the function in equation A.1.

Although the distributions of **x** and **x**′ are not gaussian because of the discreteness of the neuronal states, this approximation gives a good estimate of the mutual information. We compared the mutual information of two consecutive steps with this approximation, and Figure 10 shows that the mutual information is fit quite well by the form in equation A.1. Because this approximation requires only correlation matrices, it enables us to estimate the mutual information of *N* neurons, whose calculation in its original form requires the joint probability distribution over the 2^{2N} realizations of the firing states.
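Equation A.1 can be evaluated from covariance matrices alone. The sketch below assumes the reconstructed form *I* ≈ log |*C*| − (1/2) log |*D*| with *C*′ = *C*; `gaussian_mi` is our own illustrative name:

```python
import numpy as np

def gaussian_mi(C, E):
    """Gaussian approximation I ~ log|C| - (1/2) log|D| of the mutual
    information of two successive states, where C is the equal-time
    covariance matrix and E the cross-covariance between x and x'.
    Assumes both states share the same covariance (C' = C)."""
    D = np.block([[C, E], [E.T, C]])       # joint covariance of (x, x')
    _, logdet_c = np.linalg.slogdet(C)     # log-determinants, numerically stable
    _, logdet_d = np.linalg.slogdet(D)
    return logdet_c - 0.5 * logdet_d

N = 4
C = np.eye(N)
# Uncorrelated successive states carry no information about each other.
print(gaussian_mi(C, np.zeros((N, N))))    # ~ 0.0
# A positive cross-covariance makes the approximation strictly positive.
print(gaussian_mi(C, 0.5 * np.eye(N)) > 0) # -> True
```

Using `slogdet` rather than `det` avoids overflow for large *N*, which matters since the whole point of the approximation is to scale to many neurons.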

In addition, the quantity in equation A.1 is a good index of the information retained in a recurrent network even when it deviates significantly from the value of the mutual information. Maximizing equation A.1 results in the decorrelation of the state **x**, due to the term log |*C*|, as well as in the increase of the correlation between the state **x** and the next state **x**′, owing to the term −(1/2) log |*D*|. A strong correlation between the states of the network at two successive steps increases the amount of information transmitted over time, and strong decorrelation among the neurons at a step increases the information capacity of the network. Thus, equation A.1 is an effective measure of the information retained in the recurrent network. Another advantage of using equation A.1 as the value function is that this function can be calculated by using only the second-order correlations. Although higher-order correlations are useful in estimating the mutual information, calculating higher-order correlations is time-consuming in numerical simulations and complicates the theoretical analysis. In the following derivation of the algorithm, we use equation A.1 and thus employ an approximation of the mutual information in which the contribution of the higher-order correlations is not taken into account.

To derive the gradient of equation A.1, we consider the probability distributions of **x** and **x**′. We assume that the probability of a state **x** is given by *P*(**x**) = *z*_{1}(**x**)/*Z*_{1}, where *Z*_{1} = ∑_{**x**} *z*_{1}(**x**) is the partition function, in which **x** runs over all realizations of the firing states. (Each summation for which no range is expressed is assumed to run from 1 to *N*.) The variable *J _{ij}* is determined by the second-order correlation matrix *C* according to equation A.2, and thereby it is dependent on *W _{kl}*. We assume that *J* is a symmetric matrix, that is, *J _{ij}* = *J _{ji}*, without loss of generality. *J _{ij}* does not represent a real connection strength between neurons *i* and *j* but rather the firing correlation between them: a positive and a negative *J _{ij}* imply that the firings of neurons *i* and *j* are positively and negatively correlated, respectively. In other words, we assume that the state **x** is generated by a Boltzmann machine with connection strengths *J _{ij}* that has been trained to produce the correlation *C* (Hinton & Sejnowski, 1983). We do not have to solve equation A.2 to obtain the value of *J _{ij}*; as we see in the following, calculating the derivative of *J _{ij}* with respect to *W _{kl}* suffices to maximize the value of the function in equation A.1. Next, we assume that the conditional probability *P*(**x**′ | **x**) is given by *z*_{2}(**x**, **x**′)/*Z*_{2}(**x**), as in equation A.3. Although we assume this in the general case, it is exactly correct in the case *p*_{max} = 1, because in that case, from equation A.3, we recover the firing probability of neuron *i* at time *t* + 1 given by equation 2.1, where the state of neuron *i* at step *t* is set to *x _{i}*(*t*) = *x _{i}*. Hence, by formulating the partition function *Z* of the system accordingly, we can write the joint probability *P*(**x**, **x**′) = *z*(**x**, **x**′)/*Z*.

Differentiating *E _{ij}* with respect to *J _{kl}*, we obtain a relation in which the superscript *Z* indicates that *E _{ij}* is regarded as a function of the independent variables *J _{kl}*, *W _{kl}*, and *h _{k}*, although *J _{ij}* and *h _{i}* are in fact dependent on *W _{kl}*. Proceeding in the same way, we find the corresponding relations for the derivatives of the other correlations with respect to these variables.

_{kl}*C*and

*E*are determined by

_{ij}*J*according to equation A.2, and,

_{mn}*J*is dependent on

_{ij}*W*. States

_{kl}**x**and obey the same distribution, and thereby the dependency of

*J*on

_{ij}*W*is determined by Thus, is given by the solution of the following: Rearranging terms, we obtain Because we have assumed that there exist no higher-order correlations, we substitute the higher-order correlations in these equations with the second-order correlations as follows: Here we have assumed that terms containing correlations among three or more variables are small and can be set to zero, except in the last approximation. In the last approximation, we have assumed , which holds when the joint distribution of

_{kl}**x**and is gaussian. Thus, equation A.5 is approximated by and therefore Δ

*J*= 0 for all

_{ii}*i*. Hence, equation A.4 can be approximated by where δ

_{ij}is the Kronecker delta. For

*i*≠

*j*, we have Thus, Δ

*J*is given by Hence, we have Assuming that only

_{ij}*W*affects the correlation and that is independent of

_{ij}*J*and

_{kl}*h*, we obtain Therefore, we find and Combining the above forms of and , we find that the steepest gradient

_{k}*V*of the approximate mutual information is given by To test the approximation of the above gradient, we compared the difference between Δlog |

_{kl}*C*| = log |

*C*′| − log |

*C*| and ∑

_{ij}

*V*Δ

^{C}_{ij}*W*, where

_{ij}*C*and

*C*′ are the correlation matrices of

**x**in the systems with the connection matrices

*W*and

_{ij}*W*+ Δ

_{ij}*W*, respectively. Figure 11A shows that this approximation is quite good. The approximations Δlog |

_{ij}*D*| = log |

*D*′| − log |

*D*| ≈ ∑

_{ij}

*V*Δ

^{D}_{ij}*W*and also give good estimates (see Figures 11B and 11C). Although we set

_{ij}*p*

_{max}= 1 for the system depicted in Figure 11, ∑

_{ij}

*V*Δ

_{ij}*W*is a good index for the difference of the mutual information , in the case

_{ij}*p*

_{max}< 1.
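The finite-difference comparison of Figure 11 can be illustrated in miniature with the matrix-calculus identity underlying the *V*^{C} term, namely ∂ log |*C*|/∂*C*_{ij} = (*C*^{−1})_{ji}. The sketch below checks only this identity by perturbing a random positive-definite matrix; it does not reproduce the full chain rule through *W*:

```python
import numpy as np

rng = np.random.default_rng(1)

# Random symmetric positive-definite matrix playing the role of C.
A = rng.normal(size=(6, 6))
C = A @ A.T + 6 * np.eye(6)

# Small symmetric perturbation standing in for the change in C induced
# by a weight update Delta W.
M = rng.normal(size=(6, 6))
dC = 1e-5 * (M + M.T) / 2

# Exact change of log|C| versus its first-order prediction.
exact = np.linalg.slogdet(C + dC)[1] - np.linalg.slogdet(C)[1]
linear = np.sum(np.linalg.inv(C).T * dC)  # sum_ij (C^-1)_ji * dC_ij
```

Because the perturbation is of order 10⁻⁵, the second-order error is negligible and `exact` and `linear` agree to many digits, mirroring the close match reported in Figure 11A.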

## Appendix B: Method for Counting the Repeated Patterns and Sequences

In Figure 6, we presented the number of repeated patterns and sequences before and after learning. We defined a repeated pattern as a pattern that occurs multiple times as an exact match, excluding incompletely matched patterns; this keeps the definition simple and the result clear. We could, of course, treat patterns with small differences as instances of a single repeated pattern. For example, if two patterns differing in at most one neuron, such as patterns a and b in Figure 12A, are regarded as the same pattern, then patterns b and c would also be regarded as the same. Patterns a and c, however, cannot be regarded as the same pattern, because the states of two of their neurons differ. In general, then, this similarity relation is not transitive: even if patterns a and b are considered the same and patterns b and c are considered the same, patterns a and c may not be. Classifying slightly different patterns into one repeated pattern thus makes the definition of a repeated pattern less meaningful.
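With exact matching, counting repeated patterns reduces to a simple tally. A minimal sketch on hypothetical toy data (not the analysis code used for Figure 6):

```python
from collections import Counter

# Each tuple is the binary firing pattern of all neurons at one step.
block = [
    (1, 0, 1, 0),
    (0, 1, 1, 0),
    (1, 0, 1, 0),  # exact repeat of the first pattern
    (1, 0, 1, 1),  # one mismatch from the first pattern: not a repeat
    (0, 1, 1, 0),  # exact repeat of the second pattern
]

# A pattern is "repeated" iff its exact form occurs more than once.
counts = Counter(block)
repeated = {p: n for p, n in counts.items() if n > 1}
```

Here `repeated` contains two patterns, each occurring twice; the near-miss pattern is excluded, avoiding the non-transitivity problem described above.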

We defined a repeated sequence as an exact series of patterns that occurs more than once in a block. A repeated sequence is thus composed of repeated patterns. Moreover, a repeated sequence is composed of shorter repeated sequences. For example, each repeated sequence of length 4 contains three repeated sequences of length 2 (see Figure 12B). In general, a repeated sequence of length *l*_{1} contains *l*_{1} − *l*_{2} + 1 repeated sequences of length *l*_{2} < *l*_{1}. At first glance, it might seem that this way of counting repeated sequences overestimates the number of occurrences of repeated sequences and should be replaced by some more sophisticated method, such as a definition that does not count the short sequences contained in a longer repeated sequence as a repeated sequence. Such a method of counting, however, underestimates the number of repeated sequences. If a sequence B of length 2 occurs three times, twice in a repeated sequence D of length 4 (B2 in D1 and B3 in D2 of Figure 12C) and once outside longer sequences (B1 in Figure 12C), this modified way of counting fails to count the sequence B1 as an occurrence of the repeated sequence of length 2 even though this sequence is indeed repeated. To avoid this kind of failure, we counted sequences as repeated even when they were contained in longer repeated sequences. Thus, each of the sequences A, B, C, and D occurs twice in Figure 12B, and the sequences B and D occur three times and twice, respectively, in Figure 12C.
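This counting convention is easy to state in code: every length-*l* window is tallied, including windows that lie inside longer repeated sequences. A toy sketch, with single letters standing in for full binary patterns:

```python
from collections import Counter

def repeated_sequences(block, length):
    """Count every length-`length` window of the block that occurs more than once."""
    windows = [tuple(block[i:i + length]) for i in range(len(block) - length + 1)]
    return {w: n for w, n in Counter(windows).items() if n > 1}

# 'a b c d' repeats as a length-4 sequence; by the l1 - l2 + 1 rule its
# three length-2 windows ('a','b'), ('b','c'), ('c','d') repeat as well.
block = ['a', 'b', 'c', 'd', 'x', 'a', 'b', 'c', 'd']
len4 = repeated_sequences(block, 4)
len2 = repeated_sequences(block, 2)
```

The length-4 sequence occurs twice, and exactly 4 − 2 + 1 = 3 distinct length-2 sequences are counted as repeated, because windows inside the longer repeat are not discarded.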

## Acknowledgments

This work was supported by grants-in-aid from the Ministry of Education, Science, Sports, and Culture of Japan: Grant numbers 16200025, 17022020, 17650100, 18019019, 18047014, and 18300079.