Abstract
Free-running recurrent neural networks (RNNs), especially probabilistic models, generate an ongoing information flux that can be quantified with the mutual information between subsequent system states. Although previous studies have shown that this mutual information depends on the statistics of the network's connection weights, it is unclear how to maximize it systematically and how to quantify the flux in large systems where computing the mutual information becomes intractable. Here, we address these questions using Boltzmann machines as model systems. We find that in networks with moderately strong connections, the mutual information is approximately a monotonic transformation of the root-mean-square averaged Pearson correlations between neuron pairs, a quantity that can be efficiently computed even in large systems. Furthermore, evolutionary maximization of the mutual information reveals a general design principle for the weight matrices, enabling the systematic construction of systems with a high spontaneous information flux. Finally, we simultaneously maximize information flux and the mean period length of cyclic attractors in the state space of these dynamical networks. Our results are potentially useful for the construction of RNNs that serve as short-time memories or pattern generators.
1 Introduction
Artificial neural networks form the central part of many current machine learning methods; in particular, deep learning systems (LeCun et al., 2015) have found numerous industrial and scientific applications over the past decades (Alzubaidi et al., 2021). The neural networks in machine learning systems are typically structured as stacks of neural layers, and information usually passes unidirectionally from the input to the output layer.
By contrast, recurrent neural networks (RNNs) have feedback loops among their neuronal connections so that information can continuously circulate within the system (Maheswaranathan et al., 2019). RNNs are therefore autonomous dynamical systems in which the neurons show ongoing dynamical activity even without external input; moreover, they can be considered as universal approximators (Schäfer & Zimmermann, 2006). These and other intriguing properties have stimulated a recent boost in the research field of artificial RNNs, producing both new developments and interesting unsolved problems: Due to their recurrent connectivity, RNNs are ideally suited to process time series data (Jaeger, 2001), and to store sequential input over time (Büsing et al., 2010; Dambre et al., 2012; Gonon & Ortega, 2021; Schuecker et al., 2018; Wallace et al., 2013). For instance, it has been shown that RNNs learn robust representations by dynamically balancing compression and expansion (Farrell et al., 2022). In particular, a dynamical regime called the edge of chaos, at the transition from periodic to chaotic behavior (Kadmon & Sompolinsky, 2015), has been extensively studied and demonstrated to be important for computation (Bertschinger & Natschläger, 2004; Boedecker et al., 2012; Kaneko & Suzuki, 1994; Langton, 1990; Legenstein & Maass, 2007; Natschläger et al., 2005; Schrauwen et al., 2009; Solé & Miramontes, 1995; Toyoizumi & Abbott, 2011; Wang et al., 2011), and short-term memory (Haruna & Nakajima, 2019; Ichikawa & Kaneko, 2021). Furthermore, numerous studies address the issue of how to control the dynamics of RNNs (Haviv et al., 2019; Jaeger, 2014; Rajan et al., 2010), in particular with external or internal noise (Bönsel et al., 2021; Ikemoto et al., 2018; Krauss, Prebeck et al., 2019; Metzner & Krauss, 2022; Molgedey et al., 1992). Finally, RNNs have been proposed to be a versatile tool in neuroscience research (Barak, 2017). In particular, very sparse RNNs, as they occur in the human brain (Song et al., 2005), have some remarkable properties (Folli et al., 2018; Gerum et al., 2020; Narang et al., 2017), such as superior information storage capacities (Brunel, 2016).
In previous studies, we systematically analyzed the structural and dynamical properties of very small RNNs, that is, three-neuron motifs (Krauss, Zankl et al., 2019), as well as large RNNs (Krauss, Schuster et al., 2019). Furthermore, we investigated resonance phenomena in RNNs. For instance, we discovered recurrence resonance (Krauss, Prebeck et al., 2019), where a suitable amount of added external noise maximizes information flux in the network. In addition, we investigated coherent oscillations (Bönsel et al., 2021) and import resonance (Metzner & Krauss, 2021, 2022), where noise maximizes information uptake in RNNs.
Here, we focus on the Boltzmann machine as a simple model of probabilistic RNNs with spiking neurons, in the sense that each neuron is either off or on in any given time step. The momentary global state of such a network, assuming N neurons, can then be described by a state vector with N binary components.
In order to quantify the ongoing information flux in such a system, an important quantity is the mutual information (MI) between subsequent global system states. It takes the minimal value of zero if the neurons are mutually uncoupled and produce statistically independent and temporally uncorrelated random walks. By contrast, the MI takes on the maximal possible value of N bits, where N is the number of binary neurons, if the next global system state can be perfectly predicted from the present state (deterministic behavior) and if, moreover, all possible states occur equally often (maximal state entropy of N bits). An example of the latter extreme case would be a fully deterministic network that periodically counts through all 2^N possible binary system states in a fixed order—in other words, a 2^N-cycle.
In an actual RNN with N neurons, the MI will have some intermediate value between 0 and N bits, and much can be learned by studying how this state-to-state memory depends on various system parameters. For example, it has been investigated (Metzner & Krauss, 2022) in a simple deterministic RNN model how the information flux depends on the statistical properties of the weight matrix elements, which describe the coupling strength from the output of one neuron to the input of another. It was found that in the two-dimensional parameter space spanned by the density of nonzero connections and the balance between excitatory and inhibitory connections, RNNs reproducibly show distinct dynamical phases, such as periodic, chaotic, and fixed-point attractors. All these dynamical phases are linked to characteristic changes of the information flux and thus can be detected using the MI between subsequent states.
In order to compute this MI numerically, one first has to estimate the joint probability distribution for a pair of subsequent states. Unfortunately, the number of possible state pairs in an N-neuron system is 2^N × 2^N = 2^(2N), and this exponential growth of state space prevents the computation of the MI for all but rather small systems. One goal of our study presented here is therefore to test if the full MI can be approximately replaced by numerically more efficient measures of state-to-state memory, in particular measures based on pair-wise neuron-to-neuron correlations. For the future investigation of RNN phase diagrams, we are not primarily interested in the numerical value of the measure, but mainly in whether it rises or falls as a function of certain system control parameters. For this reason, we test in the following to what extent the alternative measures are monotonic transformations of the full MI.
Furthermore, we use evolutionary optimization to find out which kinds of RNN weight matrices lead to a maximum state-to-state memory, that is, a maximum spontaneous flux of information. Related to this question, we also test if those fundamentally probabilistic networks can actually produce quasi-deterministic n-cycles with a period length that is comparable to the total number of states 2^N.
2 Methods
2.1 Symmetrized Boltzmann Machine (SBM)
In the following, we consider a Boltzmann machine (BM) with probabilistic logistic neurons that are fully connected to each other. In order to symmetrize the system, we set all biases to zero and convert the binary neuron states to zero-mean states before each network update.
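For illustration, a single network update can be sketched as follows (a minimal sketch; we assume here the common zero-mean convention that maps the binary states 0/1 to -1/+1, and all names are illustrative):

```python
import numpy as np

def sbm_step(s, W, rng):
    """One update step of the symmetrized Boltzmann machine (sketch).
    s   : binary state vector (0/1) of length N
    W   : N x N weight matrix; W[i, j] couples the output of neuron j to the input of neuron i
    rng : numpy random generator
    """
    s_zero_mean = 2.0 * s - 1.0                   # convert 0/1 states to zero-mean states -1/+1
    activation = W @ s_zero_mean                  # weighted input sums (all biases are zero)
    p_on = 1.0 / (1.0 + np.exp(-activation))      # logistic on-probabilities of all neurons
    return (rng.random(len(s)) < p_on).astype(int)

# example: a free-running five-neuron SBM with random weights
rng = np.random.default_rng(0)
W = rng.normal(0.0, 1.0, size=(5, 5))
s = rng.integers(0, 2, size=5)
for _ in range(10):
    s = sbm_step(s, W, rng)
```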
2.2 Mutual Information and Pearson Correlations
This formula is applied in two different ways to compute the MI between successive states of an SBM. One way is to directly compute the MI between the vectorial global states at time t and at time t+1. The resulting quantity is the full state-to-state mutual information; in an SBM with N neurons, it can range between 0 and N bits.
In the other application of equation 2.5, we focus on a particular pair of neurons and first compute the MI between the successive binary output values of these two neurons, that is, between the output of one neuron at time t and the output of the other neuron at time t+1. The resulting quantity is the pairwise mutual information, and it can range between 0 and 1 bit.
Analogously, Pearson correlation coefficients are computed between the successive binary output values of all neuron pairs. The resulting pairwise correlation coefficients are again aggregated to a single number using the RMS average, as defined in formula 2.6.
Summing up, our three quantitative measures of state-to-state memory are the full mutual information between global states, the average pair-wise mutual information, and the average pair-wise Pearson correlations.
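For concreteness, the RMS-averaged pairwise Pearson correlation can be estimated from a simulated state sequence roughly as follows (a sketch; the exact pairing and normalization used in formula 2.6 may differ in detail, and all names are illustrative):

```python
import numpy as np

def rms_pairwise_correlation(states):
    """RMS average of the Pearson correlations between the output of neuron i at
    time t and the output of neuron j at time t+1, taken over all pairs (i, j).
    states: array of shape (T, N) with binary entries."""
    x = states[:-1].astype(float)     # states at time t
    y = states[1:].astype(float)      # states at time t+1
    N = states.shape[1]
    corrs = []
    for i in range(N):
        for j in range(N):
            xi, yj = x[:, i], y[:, j]
            if xi.std() == 0.0 or yj.std() == 0.0:    # frozen neurons carry no correlation
                corrs.append(0.0)
            else:
                corrs.append(np.corrcoef(xi, yj)[0, 1])
    return float(np.sqrt(np.mean(np.square(corrs))))
```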
2.3 Numerical Calculation of the MI
In a numerical calculation of the full mutual information, first a sufficiently long time series of SBM states is generated. From this simulated time series, the joint probability of subsequent state pairs is estimated by counting how often the different combinations of subsequent states occur. Next, from the joint probability, the two marginal probability distributions are computed. Finally, the MI can be calculated using formula 2.5.
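A minimal sketch of this counting procedure (illustrative names; each binary state vector is encoded as an integer for convenience):

```python
import numpy as np
from collections import Counter

def mutual_information_from_states(states):
    """Estimate the full MI (in bits) between subsequent global states by counting
    state pairs in a simulated time series. states: array of shape (T, N), binary."""
    codes = states.dot(1 << np.arange(states.shape[1]))    # encode each state vector as an integer
    pairs = Counter(zip(codes[:-1], codes[1:]))
    T = len(codes) - 1
    p_joint = {pair: c / T for pair, c in pairs.items()}
    p_x, p_y = Counter(), Counter()                        # marginals of the first and second state
    for (a, b), p in p_joint.items():
        p_x[a] += p
        p_y[b] += p
    return sum(p * np.log2(p / (p_x[a] * p_y[b])) for (a, b), p in p_joint.items())
```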
Unfortunately, even in systems with a moderate number of neurons, the time series of SBM states needs to be extremely long to obtain sufficient statistical accuracy; too short state histories result in large errors in the estimated mutual information.
It turns out that some MI-optimized networks considered in this article spend an extremely long time within specific periodic state sequences and only occasionally jump to another periodic cycle. Even though the system visits all states equally often in the long run, this ergodic behavior shows up only after a huge number of time steps. It is not practical to compute the MI for those systems numerically.
For this reason, we have applied a semianalytical method for most of our simulations. This method computes the joint probability matrix directly from the weight matrix, without any simulation of state sequences, so that the accuracy problem is resolved. However, as N is increased, the joint probability matrix eventually becomes too large to be held in memory.
2.4 Semianalytical Calculation of the MI
In a semianalytical calculation of the full mutual information, we first compute the conditional state transition probabilities between all pairs of global states. This can be done analytically, because for each given initial state, formula 2.3 gives the on-probabilities of each neuron in the next time step. Using these on-probabilities, it is straightforward to compute the probability of each possible successor state.
The conditional state transition probabilities define the transition matrix of a discrete Markov process. It can therefore be used to numerically compute the stationary probabilities of the global system states, which can also be written as a probability vector. For this purpose, we start with a uniform distribution and then iteratively multiply this probability vector by the Markov transition matrix, until the change of the vector from one iteration to the next becomes negligible.
Once we have the stationary state probabilities, we can compute the joint probability of successive states as the product of the stationary probability of the initial state and the conditional transition probability to the successor state. After this, the MI is computed using formula 2.5.
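A compact sketch of the complete semianalytical procedure (only feasible for small N; we again assume the zero-mean convention 0/1 to -1/+1, and all names are illustrative):

```python
import numpy as np

def semianalytic_mi(W, tol=1e-12, max_iter=100_000):
    """Semianalytical state-to-state MI (in bits) of an SBM with weight matrix W."""
    N = W.shape[0]
    states = (np.arange(2 ** N)[:, None] >> np.arange(N)) & 1      # row a = bits of state a
    p_on = 1.0 / (1.0 + np.exp(-((2.0 * states - 1.0) @ W.T)))     # on-probabilities given each initial state
    T = np.empty((2 ** N, 2 ** N))                                 # T[a, b] = p(next state b | current state a)
    for b, target in enumerate(states):
        T[:, b] = np.prod(np.where(target == 1, p_on, 1.0 - p_on), axis=1)
    p = np.full(2 ** N, 1.0 / 2 ** N)                              # stationary distribution by iteration
    for _ in range(max_iter):
        p_new = p @ T
        converged = np.abs(p_new - p).sum() < tol
        p = p_new
        if converged:
            break
    joint = p[:, None] * T                                         # p(a, b) = p(a) * p(b | a)
    px, py = joint.sum(axis=1), joint.sum(axis=0)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log2(joint[mask] / (px[:, None] * py[None, :])[mask])))
```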
2.5 Weight Matrix with Limited Connection Strength
In order to generate an N × N weight matrix where the modulus of the individual connection strengths is restricted to a maximum value, we first draw the matrix elements independently from a uniform distribution between zero and this maximum value. Then each of the matrix elements is flipped in sign with a probability of 1/2.
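A sketch of this construction (the sign-flip probability of 1/2 mirrors the balanced choice stated above):

```python
import numpy as np

def random_weight_matrix(N, w_max, rng):
    """Random N x N weight matrix with entries of modulus at most w_max."""
    W = rng.uniform(0.0, w_max, size=(N, N))                  # magnitudes uniform in [0, w_max]
    signs = np.where(rng.random((N, N)) < 0.5, -1.0, 1.0)     # flip each sign with probability 1/2
    return W * signs
```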
2.6 Comparing the Signum-of-Change (SOC)
2.7 Evolutionary Optimization of Weight Matrices
In order to maximize some objective function that characterizes the fitness of a weight matrix, we start with a matrix in which all elements are zero. (The neurons of the corresponding SBM will then produce independent and nonpersistent random walks, so that the MI between successive states is zero.) The fitness of this starting matrix is computed.
The algorithm is now initialized with this all-zero matrix as the current best solution and with its fitness as the current best value.
We then generate a mutation of the present weight matrix by adding independent random numbers to the matrix elements. In our case, we draw these random fluctuations from a normal distribution with zero mean and a standard deviation of 0.1. The fitness of the mutant is computed.
If the fitness of the mutant exceeds the current best value, the mutant is accepted as the new current matrix and its fitness becomes the new current best value. Otherwise the last matrix is retained. The algorithm then loops back to the mutation step.
We iterate the evolutionary loop until the fitness no longer increases significantly.
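In compact form, this hill-climbing scheme reads as follows (a sketch; only the mutation width of 0.1 is taken from the text, while the weight limit, the clipping, and the number of iterations are illustrative choices):

```python
import numpy as np

def evolve_weight_matrix(fitness, N, w_max=5.0, sigma=0.1, n_steps=20_000, rng=None):
    """Simple (1+1) evolutionary maximization of a weight-matrix objective.
    fitness: callable mapping a weight matrix to a scalar (e.g., the semianalytical MI)."""
    rng = rng or np.random.default_rng()
    W_best = np.zeros((N, N))                                   # start with an all-zero matrix
    f_best = fitness(W_best)
    for _ in range(n_steps):
        W_mut = W_best + rng.normal(0.0, sigma, size=(N, N))    # add independent Gaussian mutations
        W_mut = np.clip(W_mut, -w_max, w_max)                   # keep connection strengths bounded
        f_mut = fitness(W_mut)
        if f_mut > f_best:                                      # accept only improvements
            W_best, f_best = W_mut, f_mut
    return W_best, f_best
```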
2.8 Visualization with Multidimensional Scaling (MDS)
A frequently used method to generate low-dimensional embeddings of high-dimensional data is t-distributed stochastic neighbor embedding (t-SNE; Van der Maaten & Hinton, 2008). However, in t-SNE, the resulting low-dimensional projections can be highly dependent on the detailed parameter settings (Wattenberg et al., 2016), sensitive to noise, and may not preserve, but rather often scramble, the global structure in data (Moon et al., 2019; Vallejos, 2019). In contrast, multidimensional scaling (MDS; Cox & Cox, 2008; Kruskal, 1964, 1978; Torgerson, 1952) is an efficient embedding technique to visualize high-dimensional point clouds by projecting them onto a two-dimensional plane. Furthermore, MDS has the decisive advantage that it is essentially parameter free and that all mutual distances of the points are preserved, thereby conserving both the global and local structures of the underlying data.
When interpreting patterns as points in high-dimensional space and dissimilarities between patterns as distances between corresponding points, MDS is an elegant method to visualize high-dimensional data. By color-coding each projected data point of a data set according to its label, the representation of the data can be visualized as a set of point clusters. MDS has already been applied to visualize, for instance, word class distributions of different linguistic corpora (Schilling, Tomasello et al., 2021), hidden layer representations (embeddings) of artificial neural networks (Krauss et al., 2021; Schilling, Maier et al., 2021), structure and dynamics of recurrent neural networks (Krauss, Prebeck et al., 2019; Krauss, Schuster et al., 2019; Krauss, Zankl et al., 2019), or brain activity patterns assessed during, for example, pure tone or speech perception (Krauss, Metzner et al., 2018; Schilling, Tomasello et al., 2021), or even during sleep (Krauss, Schilling et al., 2018; Metzner et al., 2023; Traxdorf et al., 2019). In all these cases, the apparent compactness and mutual overlap of the point clusters permit a qualitative assessment of how well the different classes separate.
In this article, we use MDS to visualize sets of high-dimensional weight matrices. In particular, we apply the metric MDS (from the Python package sklearn.manifold), based on Euclidean distances, using the standard settings (n_init=4, max_iter=300, eps=1e-3).
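For reference, the corresponding embedding call is essentially the following (each weight matrix is flattened into a point in W-space):

```python
import numpy as np
from sklearn.manifold import MDS

def embed_weight_matrices(matrices):
    """Project a list of N x N weight matrices onto a 2D plane with metric MDS."""
    X = np.array([W.flatten() for W in matrices])            # points in N^2-dimensional W-space
    mds = MDS(n_components=2, metric=True, n_init=4, max_iter=300, eps=1e-3)
    return mds.fit_transform(X)                              # array of shape (n_matrices, 2)
```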
2.9 Finding the Cyclic Attractors of an SBM
Given a weight matrix, we first calculate analytically the conditional state transition probabilities. We then find for each possible initial state the subsequent state with the maximal transition probability. We thus obtain a map that yields the most probable successor for each given state.
This map now describes a finite, deterministic, discrete dynamical system. The dynamics of such a system can be described by a “state flux graph,” which has one node for each of the 2^N possible global states and links between those nodes that indicate the deterministic state-to-state transitions. Nodes can have self-links (corresponding to 1-cycles, that is, fixed points), but each node can have only one outgoing link in a deterministic system. All states (nodes) are either part of a periodic n-cycle, or they are transient states that lead into an n-cycle. Of course, the maximum possible period length is 2^N.
In order to find the n-cycles from the successor map, we start with the first system state and follow its path through the state flux graph, until we arrive at a state that was already in that path (indicating that a complete cycle was run through). We then cut out from this path only the states that are part of the cycle, discarding possible transient states. The cycle is stored in a list.
The same procedure is repeated for all initial system states, and all the resulting cycles are stored in the list. We finally remove from the list all extra copies of cycles that have been stored multiple times. This leads to the complete set of all n-cycles, which can then be further evaluated. In particular, we compute the individual period lengths for each cycle in the set, and from this, the mean cycle length (MCL) of the weight matrix.
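A sketch of this cycle-finding procedure (only feasible for small N; the -1/+1 zero-mean convention is assumed again, and the most probable successor is obtained by rounding each neuron's on-probability, which is valid because the per-neuron transition probabilities factorize):

```python
import numpy as np

def cyclic_attractors(W):
    """All n-cycles of the quasi-deterministic system defined by W, plus the MCL."""
    N = W.shape[0]
    states = (np.arange(2 ** N)[:, None] >> np.arange(N)) & 1
    p_on = 1.0 / (1.0 + np.exp(-((2.0 * states - 1.0) @ W.T)))
    successor = ((p_on > 0.5).astype(int) * (1 << np.arange(N))).sum(axis=1)   # most probable successor of each state
    cycles = set()
    for start in range(2 ** N):
        path, state = [], start
        while state not in path:                              # follow the path until a state repeats
            path.append(state)
            state = successor[state]
        cycle = [int(c) for c in path[path.index(state):]]    # cut away transient states
        k = cycle.index(min(cycle))                           # canonical rotation removes duplicate copies
        cycles.add(tuple(cycle[k:] + cycle[:k]))
    mcl = float(np.mean([len(c) for c in cycles]))            # mean cycle length
    return cycles, mcl
```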
We note that, strictly speaking, the cycle lengths and their mean value MCL are well defined only in a deterministic system where each state has only one definite successor. However, in a quasi-deterministic system (with sufficiently large matrix elements), each state has one strongly preferred successor. As a result, the number of time steps the system spends in each of its n-cycles is finite, but it can be much longer than the cycle’s period length, corresponding to several “round trips.” In this sense, cycle lengths are meaningful also in quasi-deterministic systems.
3 Results
3.1 State-to-State Memory in a Single Neuron System
In contrast to the mutual information, which is invariant under injective transformations of its variables, correlation coefficients are essentially averages of the product of the momentary variable values, and they therefore change when the values are replaced by injective functions of themselves. It is also well known that correlation coefficients can capture only linear relations.
Nevertheless, there are certain special cases where the MI and suitably modified correlation coefficients show a quite similar behavior. As a simple example, we consider an SBM that consists only of a single neuron with a tunable self-connection strength (see Figure 1b, sketch on top).
For a vanishing self-connection, the weighted sum of the neuron’s input is zero and the on-probability is 1/2. The neuron therefore produces a sequence of statistically independent binary states with balanced probabilities. This stochastic Markov process corresponds to an unbiased and uncorrelated random walk.
For a nonzero self-connection, the on- and off-probabilities are still balanced, but now subsequent neuron states are statistically dependent. In particular, a positive self-connection generates a persistent random walk in which the same binary states tend to follow each other (such as 00011100111000011). Analogously, a negative self-connection generates an antipersistent random walk in which the system tends to flip between the two binary states (such as 10101001011010101). This stochastic Markov process corresponds to an unbiased but correlated random walk.
Summing up, the one-neuron system produces a random walk of binary (0,1) neuron states that changes from temporally antipersistent to persistent behavior as the self-connection strength is tuned from negative to positive values (see the time series at the top of Figure 1).
As both antipersistence and persistence are predictable behaviors, the MI between subsequent states approaches the maximum possible value of 1 bit in the limit of both strongly negative and strongly positive connection weights (see Figure 1, orange curve). It reaches the minimum possible value of 0 bits exactly for a vanishing self-connection.
We next compute, again as a function of the control parameter, the Pearson correlation coefficient between subsequent states, as defined in section 2. For strongly negative self-connections, it approaches its lowest possible value of −1, passes through zero for a vanishing self-connection, and finally approaches its highest possible value of +1 for strongly positive self-connections (see Figure 1, blue curve).
It is possible to make the two functions more similar by taking the modulus of the Pearson correlation coefficient, which in this case is equivalent to the root-mean-square average. This RMS-averaged correlation indeed shares with the MI the minimum and the two asymptotic maxima (see Figure 1, black dashed curve).
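This single-neuron comparison is easy to reproduce numerically; the following sketch (illustrative names, -1/+1 zero-mean convention) estimates both measures for a given self-connection strength:

```python
import numpy as np

def one_neuron_measures(w, T=100_000, seed=0):
    """Return (MI in bits, |Pearson correlation|) between subsequent states of a
    single SBM neuron with self-connection strength w (cf. Figure 1)."""
    rng = np.random.default_rng(seed)
    s = np.empty(T, dtype=int)
    s[0] = rng.integers(0, 2)
    for t in range(1, T):
        p_on = 1.0 / (1.0 + np.exp(-w * (2 * s[t - 1] - 1)))    # logistic on-probability
        s[t] = int(rng.random() < p_on)
    x, y = s[:-1], s[1:]
    joint = np.array([[np.mean((x == a) & (y == b)) for b in (0, 1)] for a in (0, 1)])
    px, py = joint.sum(axis=1), joint.sum(axis=0)
    mask = joint > 0
    mi = float(np.sum(joint[mask] * np.log2(joint[mask] / np.outer(px, py)[mask])))
    return mi, abs(np.corrcoef(x, y)[0, 1])

for w in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(w, one_neuron_measures(w))
```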
Based on this simple example and several past observations (Krauss et al., 2017; Krauss, Schuster et al., 2019; Metzner & Krauss, 2021, 2022), there is legitimate hope that the computationally expensive mutual information can be replaced by the numerically efficient RMS-averaged correlation, at least in a certain subclass of systems. It is clear that when plotted as a function of some control parameter, the two functions will not have exactly the same shape, but they might at least be monotonic transformations of each other, so that local minima and maxima will appear at the same position on the parameter axis. If such a monotonic relation exists, the RMS-averaged correlation could be used, in particular, as an equivalent objective function for the optimization of state-to-state memory in RNNs.
3.2 Comparing Global MI with Pairwise Measures
The best we can expect is that these measures are monotonic transformations of each other and thus share the locations of local maxima and minima as functions of some control parameter. We have used a single matrix element as such a continuous parameter; in larger networks, all connection strengths can also be varied simultaneously in a gradual way (see Figure 1 of the supplemental material for an example).
Note that the subsequent weight matrices used for the comparison of the three measures need not be correlated with each other. Indeed, using a time series of statistically independent matrices is even beneficial, as a larger part of matrix space can be sampled this way. In this case, the subsequent values in each of the resulting time series of the three measures are also statistically independent, but we can still test if the different measures rise and fall synchronously.
For this purpose we count, for a given time series of matrices, how often the signum of change (SOC, introduced in section 2) agrees among the three measures. The fraction of matching SOC values is a quantitative measure for how well the three quantities are monotonically related. For two unrelated measures, one would expect a fraction of about 0.5. Larger values indicate that the two measures are monotonically related to each other (with a value of 1 corresponding to one being a perfect monotonic transform of the other). Values below 0.5 indicate that one measure tends to have minima where the other has maxima.
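A possible implementation of the SOC fraction for two measurement time series (how ties, i.e., zero changes, should be counted is not specified here; in this sketch they simply count as mismatches):

```python
import numpy as np

def soc_fraction(a, b):
    """Fraction of steps at which two time series change in the same direction.
    Values near 1: monotonically related; near 0.5: unrelated; near 0: anti-related."""
    da = np.sign(np.diff(np.asarray(a, dtype=float)))
    db = np.sign(np.diff(np.asarray(b, dtype=float)))
    return float(np.mean(da == db))
```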
Since the SOC fraction is a statistical quantity, it fluctuates with each considered time series of weight matrices. In the following numerical experiments, we therefore generate 100 different time series and then estimate the probability density distribution of the resulting 100 values of the SOC fraction, using gaussian kernel density approximation. Each time series consists of random weight matrices with uniformly distributed elements. For each weight matrix, we simulate the dynamics of the corresponding Boltzmann machine (SBM) for a fixed number of time steps and then compute the three measures of state-to-state memory from the resulting sequence of subsequent system states.
We first consider weight matrices where the magnitudes of the matrix elements are small. Since the neurons in such systems are nearly uncoupled, they produce random walks of binary states that are almost statistically independent and temporally uncorrelated (a random regime). As a result, the neurons' global distribution of on-probabilities, in the following also called fire probabilities, is sharply peaked around 0.5 (see Figure 2a, black dashed curve). The mutual information and pairwise correlation between subsequent states are very small here (and it therefore requires many time steps to estimate those measures precisely), but we find that the pairwise measures have a clear monotonic relation with the full mutual information (orange and blue solid curves), with the measure shown in orange working slightly better.
Next we consider the moderate coupling regime by choosing matrix elements of intermediate magnitude. The fire probabilities in this regime are distributed approximately uniformly between zero and one, and the system is still behaving in a rather stochastic way, yet with clear correlations between subsequent states. (Note that we have already analyzed this moderate coupling regime in some of our former papers on RNN dynamics; Krauss, Schuster et al., 2019; Metzner & Krauss, 2022.) Here we find, again, that the pairwise measures have a clear monotonic relation with the full mutual information (see the blue and orange curves in Figure 2b).
In the case of large matrix elements, we are in the deterministic regime: the neurons are driven far into the saturation of their logistic activation function and, correspondingly, the distribution of the fire probabilities now peaks at zero and one. Here we find that the pairwise measures cannot be used to approximate the full mutual information, not even in the sense of a monotonic transformation (see Figure 2c). This result was to be expected, as in a deterministic system (such as in a network that is counting up the binary numbers from 0 to 2^N − 1), the correlations between subsequent states are of higher (nonlinear) order and can only be captured by the full mutual information.
Finally, we consider a larger SBM with weight matrix elements drawn uniformly in the moderate coupling regime. It is, of course, completely out of the question to compute the full mutual information for a system with such a huge number of global states. However, in order to find out how the system's state-to-state memory changes as a function of some control parameter, we can consider small subgroups of neurons and compute the three measures within these neuron subgroups, assuming that their changes represent the global trend (we shall call this method “subgroup sampling”). The method is tested using several different, random, predefined subgroups of tractable size, applied to a single time series of large weight matrices with the above-stated properties. As before, the same number of time steps is simulated for each weight matrix. The flat distribution of fire probabilities indicates that the systems are in the moderate coupling regime (black dashed curve in Figure 2d). We find that in this subgroup sampling scenario, the pairwise measures, in particular the measure shown in orange, have a clear monotonic relation to the full mutual information.
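In code, subgroup sampling amounts to restricting the state history to a few fixed, randomly chosen neuron subgroups and evaluating the measures there (a sketch reusing the helper functions sketched in section 2; group number and size are illustrative):

```python
import numpy as np

def subgroup_sampling(states, n_groups=5, group_size=5, seed=0):
    """Evaluate state-to-state memory measures within small random neuron subgroups."""
    rng = np.random.default_rng(seed)
    N = states.shape[1]
    groups = [rng.choice(N, size=group_size, replace=False) for _ in range(n_groups)]
    results = []
    for g in groups:
        sub = states[:, g]                                      # state history restricted to the subgroup
        results.append((mutual_information_from_states(sub),    # full MI within the subgroup
                        rms_pairwise_correlation(sub)))         # RMS-averaged pairwise correlation
    return groups, results
```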
3.3 Preliminary Considerations on MI Optimization
Next, we turn to the optimization of the mutual information in RNNs. We start with some fundamental principles based on the well-known diagram in Figure 1a.
In the present context, the first variable corresponds to an initial RNN state at time t and the second to the successor state at time t+1 (note that due to the recurrence, the output is fed back as input in each update step). The central quantity is the mutual information between these successive states. The two state entropies are limited by the size of the RNN and can maximally reach the upper limit of N bits in an N-neuron binary network.
The conditional entropy of the initial state, given the successor state, describes the lossy compression that takes place whenever the same successor state can be reached from different initial states. Such a convergence of global system states happens necessarily in standard RNNs, where each neuron receives inputs from several supplier neurons. Since the output state of a neuron depends only on the weighted sum of its inputs, several input combinations can cause the same output.
Vice versa, the conditional entropy of the successor state, given the initial state, describes the random expansion that occurs when a given initial state can lead to different possible successor states. Obviously, such a divergence of global system states can happen only in probabilistic systems.
According to Figure 1a, the mutual state-to-state information is maximized under the following conditions:
1. The state entropies of both the initial and the successor state should approach the upper limit, so that ideally, all possible system states are visited with equal probability.
2. The lossy compression should be minimized, so that ideally, each successor state is reachable from only a single initial state.
3. The random expansion should be minimized, so that ideally, each initial state leads deterministically to a single successor state.
All the above conditions are perfectly fulfilled, and thus the mutual information reaches its upper limit of N bits, in an N-neuron system that is running deterministically through a 2^N-cycle. In the special case when the sequence of these global system states corresponds to the numerical order of the states' binary representations, we could call such a system a “cyclic counting network.”
However, a major goal of this article is to discover design principles for probabilistic Boltzmann machines in which the mutual information is suboptimal but close to the upper limit. The above preliminary considerations suggest that cyclic attractors are desirable dynamical features for this purpose, as they avoid both convergence and compression (condition 2) and divergence and expansion (condition 3) in state space. Moreover, all states within a cyclic attractor are visited with equal probability (condition 1), thus helping to maximize the state entropy.
3.4 Evolutionary Optimization of State-to-State Memory
We now turn to our second major research problem of systematically maximizing the spontaneous information flux in free-running SBMs, or, in other words, maximizing the state-to-state mutual information. Obviously, a system optimized for this goal must simultaneously have a rich variety of system states (large state entropy) and be highly predictable from one state to the next (quasi-deterministic behavior). It is, however, unclear how the connection strengths of the network must be chosen to achieve this goal.
We therefore perform an evolutionary optimization of the weight matrix in order to maximize the state-to-state MI, subsequently collect some of the emerging solutions, and finally try to reverse-engineer the resulting weight matrices to extract general design principles. We generate the evolutionary variants simply by adding small random numbers to each of the weight matrix elements (see section 2 for details), but we restrict the absolute values of the entries to below 5 in order to avoid extremely deterministic behavior.
We start with a weight matrix in which all elements are zero. This corresponds to a set of independent neurons without input and results in uncorrelated random walks (see Figure 3a, state-versus-time plot at the start of the evolution). The objective function, consequently, is zero at the first evolution time step (blue curve).
As the evolutionary optimization proceeds, first, several nonzero matrix elements emerge (inset of Figure 3a), but eventually only five elements survive, all with large absolute values close to the allowed limit but with different signs (color-coded in red and blue). During this development, the objective function rises monotonically to a final value of about 4.68 bits (blue curve), which is close to the theoretical maximum of 5 bits. At the final evolution time step, the five neurons show a complex and temporally heterogeneous behavior (state-versus-time plot).
Computing the state transition matrix from the final evolved weight matrix (see section 2 for how this is done analytically) reveals that each of the 32 system states has only one clearly dominating successor state, thus enabling quasi-deterministic behavior (see Figure 3c, left panel). However, a plot focusing on the small transition probabilities shows that each state also has a few alternative, low-probability successor states (see Figure 3c, right panel). The latter property makes sure that any overly deterministic behavior, such as being trapped in an n-cycle, is eventually broken. Note also that the dominant entries of the state transition matrix are arranged in a very regular way, which is surprising considering the random optimization process by which the system has been created. Finally, a closer inspection of the dominant successor states reveals that the dynamics of this specific SBM contains at least a few 2-cycles (entries marked by colors) as quasi-stable attractors.
We next compute for our evolved SBM the stationary probability of all 32 system states and find that they all occur about equally often (see Figure 3c). Besides the quasi-deterministic behavior, this large variety of states (entropy) represents the second expected property of a system with large state-to-state memory.
We finally repeat the evolutionary optimization with different seeds of the random number generator and thus obtain four additional solutions (see Figure 3d). Remarkably, they are all constructed according to the same design pattern. There are only N large elements in the weight matrix (one for each neuron), which can be of either sign, but which never share a common row or column. Due to the latter property, we call this general design pattern, in analogy to certain chess problems, the “N-rooks principle.”
Although all N-rooks networks follow the same simple rule regarding their weight matrix, the resulting connectivity structure of the neural units can be very different. This becomes apparent when the networks are represented in the form of “wiring diagrams”: For example, the network in Figure 3f corresponds to isolated, mutually independent neurons with self-connections, whereas the network in Figure 3g has all its neurons arranged in a loop. There are also more complex cases, like the network in Figure 3b, which consists of one isolated neuron with self-connection and two neuron pairs. As a general rule, each neuron of an N-rooks network is part of exactly one closed linear chain of neurons, or k-neuron loop, where k can range between 1 and N.
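Perfect N-rooks matrices of this kind are easy to construct directly, for example as follows (a sketch; the random column permutation determines into which loops the neurons are arranged):

```python
import numpy as np

def perfect_n_rooks_matrix(N, w_max, rng):
    """N-rooks weight matrix: N nonzero elements of magnitude w_max with random
    signs, sharing no row or column."""
    W = np.zeros((N, N))
    cols = rng.permutation(N)                      # one "rook" per row, all in different columns
    signs = rng.choice([-1.0, 1.0], size=N)
    W[np.arange(N), cols] = signs * w_max
    return W

rng = np.random.default_rng(1)
print(perfect_n_rooks_matrix(5, 5.0, rng))
```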
3.5 N-Rooks Matrices Maximize State-to-State Memory
In this section, we establish that N-rooks networks are local optima of state-to-state memory in the N^2-dimensional space of weight matrices.
At the end of the evolutionary optimization, the weight matrix contains N large entries, whereas the remaining elements are very small but not zero. In order to test the relevance of these small background matrix elements, we start with artificially constructed, perfect N-rooks matrices where all background matrix elements are zero (the ones shown in Figures 3f and 3g), and then gradually fill the background elements with normally distributed random numbers of increasing standard deviation. We find that the mutual information, on average, decreases under this perturbation (see Figures 4a and 4b).
Next we start from perfect N-rooks matrices and generate, around each of them, random imperfect matrices at a fixed Euclidean distance within 25-dimensional weight matrix space. We find that the state-to-state memory of these imperfect matrices is fluctuating strongly (in particular, at larger distances from the perfect center), but on average the mutual information is monotonically decreasing with the distance (see Figure 4c).
We now again consider perfect N-rooks networks, but systematically increase the magnitude of the nonzero matrix elements. We find that the state-to-state memory increases monotonically with this magnitude, from zero to 5 bits, the theoretical maximum in a five-neuron system (see Figure 4d). Note that the sigmoidal shape of this curve can be derived analytically as well (see the supplemental material, section 3).
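A brief sketch of why the curve is sigmoidal (assuming the zero-mean convention 0/1 to -1/+1; other conventions merely rescale the effective weight): with common weight magnitude w, each neuron copies or inverts the bit of its single supplier with probability q(w) = 1/(1 + e^{-w}). Since the stationary state distribution is uniform and the transition probabilities factorize over neurons, the state-to-state mutual information is

\begin{equation*}
I(w) \;=\; N\,\bigl[\,1 - H_b\bigl(q(w)\bigr)\bigr],
\qquad
H_b(q) \;=\; -q\log_2 q \;-\; (1-q)\log_2(1-q),
\end{equation*}

which rises sigmoidally from 0 bits at w = 0 toward N bits for large w. Under this assumed convention, N = 5 and w = 5 would give roughly 4.7 bits, in line with the values reported above.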
To demonstrate how strongly the properties of N-rooks networks differ from nonoptimized networks, we compute the probability density distribution of the mutual information (see Figure 4e) and of the state entropy (see Figure 4f) for a large ensemble of N-rooks matrices (orange curves), as well as for random matrices with uniformly distributed elements (blue curves). Remarkably, all N-rooks networks have the same large value of the mutual information to a very high precision (the visible width of the orange bars corresponds to the bin width of the histogram). Most of the random matrices also have considerable mutual information, but nevertheless a large gap separates them from the performance of the N-rooks networks. The entropy of the N-rooks networks is close to the theoretical maximum.
To further demonstrate that N-rooks matrices are local maxima of state-to-state memory, we perform a direct visualization of those matrices as points in N^2-dimensional weight matrix space (W-space), using the method of multidimensional scaling (MDS; see section 2 for details) and color-coding of their mutual information values. We first generate 10 perfect N-rooks matrices as initial points (the red dots in Figure 4g) and then iteratively add to their matrix elements independent, normally distributed random numbers with a fixed, small standard deviation. This corresponds to a random walk in W-space, and it leads with high probability away from the initial points. As expected for local maxima, we observe that the mutual information tends to decrease during this diffusion process (compare the color bar). At the same time, the matrices tend to move away from each other in this two-dimensional, distance-preserving MDS visualization. This is due to the fact that the initial perfect N-rooks matrices are all part of a low-dimensional submanifold of W-space. They are relatively close together because their nondiagonal matrix elements are zero and thus do not contribute to their Euclidean distance. As the matrices are diffusing out of the submanifold, their nondiagonal elements are increasing, so their mutual Euclidean distance is increasing as well.
We next produce a larger test set of weight matrices with high mutual information by generating a cluster of random variants around each perfect central N-rooks matrix, with a small distance to the cluster center (see the dark red spots in Figure 4h). If the positions of the large elements in the 10 N-rooks matrices are now randomly shuffled and variants are again generated from the resulting ‘non-N-rooks’ matrices, we obtain a control set of matrices with smaller mutual information (weak red and weak blue spots), but with the same statistical distribution of weights.
Taken together, our numerical results strongly indicate that N-rooks networks are indeed local maxima of state-to-state memory in W-space. We will further corroborate this statement by theoretical analysis of the N-rooks mechanism.
3.6 Maximizing the Cycle Length of Periodic Attractors
It is important to remember that the N-rooks matrices are optima of state-to-state memory under the constraint of a limited magnitude of connection weights. (In our case, the limit was set to 5.) Due to this constraint, the value of the mutual information achieved in five-neuron networks is large but still below the theoretical maximum of 5 bits.
As we pointed out in section 1, this theoretical maximum could be realized in an N-neuron system only if it ran deterministically through a 2^N-cycle. It is, however, not clear if this is actually possible with an SBM. We therefore now turn to the problem of maximizing the cycle length evolutionarily.
In order to quantify the cycle length for a given weight matrix, we first compute analytically the state transition matrix. By considering only the most probable successor for each of the 2^N states, we obtain a perfectly deterministic flux of states, which corresponds to a directed graph with only one out-link per node (see Figure 5d for some examples). It is straightforward to find all n-cycles in this graph, finally yielding the mean cycle length MCL of the given weight matrix (see section 2 for details).
When the mean cycle length is maximized evolutionarily, we obtain top values of up to MCL = 18 in a five-neuron system (corresponding to a single 18-cycle with many transient states leading into it). However, the resulting weight matrices are densely populated with relatively small matrix elements (data not shown), and thus each state has multiple successors with comparable transition probabilities. In other words, the optimization has only produced a quasi-stable 18-cycle with a rather short lifetime. It is therefore necessary not only to maximize the mean cycle length (MCL) but to simultaneously make sure the system behaves deterministically to a large degree.
We therefore choose as a new objective function the product of the state-to-state memory and the MCL. Here we find that, at least over sufficiently many evolution steps, it is indeed possible to maximize both quantities in parallel (see Figure 5a). For example, one evolutionary run produces a matrix with MCL = 24 (left figure column), and another run yields MCL = 20 (right figure column). Due to the now incomplete optimization of the mutual information, the resulting weight matrices deviate from the N-rooks principle (see the insets of Figure 5a and their network representations, Figure 5b). Correspondingly, the state transition matrices have more than one nonzero entry in each row; however, each state now has one clearly dominating successor, leading to a relatively long lifetime of the periodic attractor.
3.7 The Mechanism of N-Rooks Systems
In a perfect N-rooks weight matrix, there are only N nonzero matrix elements, either positive or negative, but all with the same large magnitude, and arranged so that no two of them share a row or column. We now theoretically analyze the implications of this design principle:
Each row of the weight matrix, whose elements describe the input strengths of one particular neuron, has only one nonzero entry. In other words, each neuron receives input from only one supplier neuron, and so these networks obey a single supplier rule.
This single supplier rule prevents the possibility that different combinations of supplier states can lead (via weighted summation) to the same activation of the receiving neuron. The rule ultimately prevents the convergence of multiple global system states into one and thus fulfills the second fundamental condition for a large MI.
The fact that the common weight magnitude is large means that the neurons behave almost deterministically: each neuron either transmits the state of its single supplier (in the case of excitatory coupling) or inverts it (in the case of inhibitory coupling), with a probability close to one. The resulting deterministic system dynamics prevents the random expansion of states and thus fulfills the third fundamental condition for a large MI.
Since the nonzero matrix elements of the weight matrix are all positioned in different columns, no two neurons receive input from the same supplier neuron. And because there are N neurons in total, it follows that each of them is feeding its output to only one unique consumer neuron, corresponding to a single consumer rule.
Together, the single-supplier and single-consumer rules imply that the system is structured as a set of linear strings of neurons. Since each neuron definitely has a consumer (that is, within the weight matrix it appears in the input row of some other neuron), these strings do not end, and so they must be closed loops. Consequently, the N-rooks weight matrix corresponds to a set of closed neuron loops.
In a closed loop of k neurons with only large excitatory connections, an initial bit pattern would simply circulate around, creating a trivial periodic sequence of states within the subspace of this k-neuron group. But also with mixed excitatory and inhibitory connections, closed neuron loops produce periodic state sequences within their subspace, with period lengths that depend on the loop size and on the pattern of excitatory and inhibitory connections.
On the level of the global state flux, the periodic behavior of each local k-neuron loop creates a set of cyclic attractors of different cycle lengths. Even though all N-rooks matrices look very similar, the combination of cycle lengths can vary greatly.
Because the single-supplier rule prevents convergence of global states, N-rooks systems do not have any transient states. Instead, each of the possible global system states belongs to one of the cyclic attractors. As transient states would be visited less often than states within cycles, this property helps to make state probabilities more uniform and thus increases the state entropy.
An N-rooks system will spend a very long time in one of its cyclic attractors, as the neurons work almost deterministically. However, eventually an “error” happens in one of the neurons, one bit of the state vector updates the “wrong” way, and consequently the system jumps to a global state that is out of the periodic order, belonging either to the same or another cyclic attractor. These infrequent random jumps are essential to maximize the state entropy.
Each state within an n-cycle is visited only during one nth of the time that the system spends in that cycle. One might think that for this reason, states in large n-cycles are visited less often, thus contradicting the desired uniform probability of each state (the first condition for achieving a large MI). However, it turns out that the occasional out-of-order jumps of the system end up proportionally more frequently in large n-cycles. We have actually verified that all global states in an N-rooks system are visited equally often, leading to the maximum possible state entropy.
4 Summary
This work focused on the measurement and enhancement of the spontaneous information flux in RNNs. We used the symmetrized Boltzmann machine (SBM) as our probabilistic RNN model system and quantified the information flux by the mutual information (MI) between subsequent global system states, a quantity that we call state-to-state memory.
As we have shown in previous work, it is possible to enhance and optimize the state-to-state memory of an RNN by various methods, for example, by tuning the statistics of the weight matrix elements (Krauss, Schuster et al., 2019), or by adding an appropriate level of noise to the neuron inputs (Krauss, Prebeck et al., 2019). Such optimization methods obviously rely on the measurement of the MI, a task that unfortunately becomes intractable for large systems.
We have therefore considered numerically more efficient measures of the state-to-state memory, such as the root-mean-square (RMS) averaged pair-wise correlations or the averaged pair-wise mutual information.
These alternative measures yield values that are numerically different from the full MI, but for optimization purposes, the only concern is that all quantities rise or fall synchronously as some system parameters are changed. In order to assess this degree of monotonic relatedness, we compute the fraction of matching signs of change, or SOC fraction, between the outputs of two given measures. This novel SOC method can be used to quantify the monotonic relatedness of two arbitrary time series, even in cases with nonstationary statistics, where linear correlation coefficients fail.
Using the SOC method, we have shown that the two pairwise measures are indeed monotonic transforms of the full MI in a wide range of practically relevant cases: as long as the magnitudes of the RNN's weight matrix elements are not too large, the system is operating in a predominantly probabilistic regime, so that the fire probabilities of the neurons are peaked around 1/2. We find that in this regime, for optimization purposes, the full MI can be replaced by the pairwise measures, in particular by the RMS-averaged pairwise correlation, which can be efficiently computed even in very large RNNs.
By contrast, too large matrix elements drive the neurons deeply into the nonlinear saturation regime of their activation functions, leading to a bimodal distribution of fire probabilities that has its peaks at zero and one. In this quasi-deterministic regime, subsequent RNN states are often related in complex ways that cannot be captured by the simpler measures.
While the theoretical maximum of state-to-state memory in an N-neuron network would correspond to a system that cycles deterministically through a single periodic attractor that comprises all 2^N possible states, this extreme case cannot be realized with a probabilistic SBM. Instead, evolutionary maximization of the state-to-state MI, while enforcing a limited magnitude for the neural connections, led to weight matrices that are built according to the N-rooks principle: there are only N nonzero matrix elements of large magnitude, arranged so that they do not have any rows or columns in common.
We have demonstrated that networks built according to this principle have peculiar dynamical properties: all of their possible states are parts of periodic cycles of different period lengths. If the magnitude of the nonzero weights is large, the network dwells for very long times in one of these cyclic attractors, until it randomly jumps to a new state that may be part of another cycle. In this way, the system visits all states equally often, thus maximizing the state entropy. As they also behave in a highly (although not perfectly) deterministic way, N-rooks networks maximize the state-to-state memory, asymptotically approaching the theoretical limit of N bits.
As N-rooks networks do not necessarily have very long cycles, we have also evolutionarily optimized the product of state-to-state memory and the MCL. This led to weight matrices that deviate from the strict N-rooks principle but combine large cycle lengths of up to 24 (in a five-neuron system with only 32 possible global system states in total) with a relatively high stability of the periodic attractor.
5 Discussion
5.1 Measuring and Controlling Information Flux
5.1.1 Spectral Properties of the Connectivity Matrix
The dynamical behavior of RNNs, in turn, is intricately linked to the eigenvalues of their connectivity matrix (Ganguli et al., 2008; Hennequin et al., 2014; Rivkind & Barak, 2017), a fundamental concept in the field of neural network analysis. These eigenvalues serve as critical indicators of the network’s stability, convergence properties, and overall functionality. By examining the spectral properties of the connectivity matrix, valuable insights are gained into how information flows and evolves within the network over time. Understanding this connection between RNN dynamics and eigenvalues is essential for optimizing network architectures, training procedures, and predicting their performance in various tasks, making it a pivotal aspect of both theoretical and practical research in the realm of recurrent neural networks.
5.1.2 Fast Control of Network Dynamics
However, there may be situations where an RNN initially has a proper weight matrix with suitable spectral properties but is nevertheless driven into a computationally unfavorable regime by unexpected input signals or, during learning, by gradual changes of its weights. It is then essential to quickly assess the dynamical state of the system in order to apply regulating control signals that bring it back into a healthy dynamical state.
5.1.3 Fast Measurement of State-to-State Memory
For this reason, we focus in this work on the measurement of state-to-state memory, a central indicator of a network’s dynamical regime. The full mutual information between subsequent states is a computationally demanding measure of state-to-state memory, and it requires a long history of states to yield sufficient statistical accuracy. By contrast, the average pair-wise correlation is much simpler and requires much shorter state histories. It may even be possible to train a multilayer perceptron, as a regression task, to estimate the momentary state-to-state memory of an RNN based on just a few subsequent network states. Such a perceptron could then be used as a fast probe for the dynamical state of the RNN, forming part of an automatic control loop that keeps the RNN continuously within the optimal computational regime.
5.2 Optimizing Information Flux
N-rooks networks turned out to achieve the optimal state-to-state memory. The intriguing dynamical properties of these networks, with or without cycle-length optimization, may offer a new perspective on several other neural systems, both artificial and biological.
5.2.1 Edge-of-Chaos Networks
Free-running N-rooks networks appear to share certain features with networks situated at the edge of chaos (EoC; Bertschinger & Natschläger, 2004; Langton, 1990; Legenstein & Maass, 2007). Both systems delicately balance between stability and unpredictability, a hallmark of optimal computation and adaptability. In N-rooks networks, while the deterministic cycles provide a semblance of order, the random transitions between states echo the spontaneous changes often seen at the EoC. This juxtaposition might offer new insights into the emergent properties of systems teetering on the brink of chaos and their potential for sophisticated information processing.
5.2.2 Amari-Hopfield Networks
When comparing N-rooks networks with the deterministic dynamics of Amari-Hopfield networks (Amari, 1977; Amari & Maginu, 1988), intriguing contrasts and parallels emerge. Both are founded on principles of energy landscapes and attractor states, though their approaches to memory retrieval and storage diverge. While Amari-Hopfield networks are renowned for their associative memory capabilities and stable attractor states, the k-cycle sequences in N-rooks networks suggest a more transient yet extended memory retention mechanism. Exploring the interplay between these models could yield novel hybrid architectures that combine the robustness of associative memory with the dynamic flexibility of transient state sequences.
5.2.3 Central Pattern Generators
The cyclic behaviors inherent in N-rooks networks show intriguing parallels with the rhythmic activities orchestrated by central pattern generators (CPGs) in biological organisms (Grillner, 2006; Harris-Warrick, 2010; Marder & Calabrese, 1996). At their core, both systems autonomously generate rhythmic patterns. CPGs, specialized neural circuits, underpin many essential rhythmic behaviors in animals—most notably, orchestrating locomotion patterns in walking or swimming and managing the rhythmic contractions and expansions of breathing. Similarly, N-rooks networks, through their structured connections, consistently produce deterministic k-cycles, emphasizing a form of artificial rhythmicity. The pronounced persistence within these cycles in N-rooks networks bears resemblance to the sustained rhythmic outputs manifested by CPGs, such as the continuous stride pattern in walking or the consistent tempo of breathing. However, an added layer of complexity in the N-rooks networks is their sporadic jumps between cycles. This feature can be likened to the adaptive and transitional behaviors seen in biological systems, where CPGs might adjust or shift rhythms in response to external stimuli or changing conditions. While the origins and implementations of their rhythmic patterns may differ, an in-depth exploration into the shared principles, transitional behaviors, and disparities of N-rooks networks and CPGs could yield valuable insights. Such an understanding may pave the way for the development of artificial neural systems that not only replicate but also enhance rhythmic behaviors, combining the robustness of biological systems with the adaptability and versatility inherent in artificial networks.
5.2.4 Place and Grid Cells
The spatiotemporal dynamics of N-rooks networks may have parallels with the functioning of place and grid cells (Eichenbaum, 2015; Moser et al., 2008; Rowland et al., 2016), which are vital for spatial navigation and representation in mammals. Just as place cells fire in specific locations and grid cells display a hexagonal firing pattern across spaces, N-rooks networks, with their distinct state transitions, might encode specific locations or patterns in their state space. Drawing inspiration from these biological counterparts, N-rooks networks could potentially be harnessed for tasks related to spatial representation and navigation in artificial systems, merging principles from both domains for enhanced performance.
5.2.5 Practical Applications and Limitations of N-Rooks Networks
N-rooks networks, with their unique dynamical properties, open up many practical applications. Their inherent high mutual information between successive states positions them as candidates for advanced memory storage systems, potentially offering more robust retention and recall mechanisms than traditional architectures. However, the property that, given unbounded connectivity strength, a network spends a disproportionate amount of time in cyclic attractors and seldom switches between them poses some intriguing challenges and insights from both an information processing and memory storage perspective. On the one hand, prolonged dwelling in cyclic attractors can be seen as an effective way to preserve information. In biological systems, certain memories or behaviors might need to be stable and resistant to noise or external perturbations, and such a mechanism could be beneficial. On the other hand, a network that gets trapped in specific cyclic attractors for an extended period can be less responsive to new inputs. This rigidity can be problematic for real-time information processing where adaptability and rapid response to changing conditions are crucial. Moreover, the long dwelling time in cyclic attractors means that the network takes practically infinite time to explore its entire state space. This is inefficient for some applications, as the richness of a network’s behavior often lies in its capacity to explore and traverse a vast range of states. If a network gets trapped in specific cycles, it becomes less versatile and may fail to recognize or generate novel patterns or responses. It is therefore advantageous that the cycle dwelling time in N-rooks networks can be adjusted by the magnitude of the nonzero matrix elements.
5.2.6 Large N-Rooks Networks Are Sparse
Very large systems (those with a large number of neurons N) constructed according to the N-rooks principle would have a vanishingly small density of nonzero connections, given by N/N^2 = 1/N. Interestingly, we found in one of our former studies on deterministic RNNs a highly unusual behavior at the very edge of the density-balance phase diagram, precisely in the regime that corresponds to these extremely sparsely connected networks (Metzner & Krauss, 2022). It might therefore be worthwhile to study this low-density regime more comprehensively in future work, considering that the brain also has a very low connection density (Miner & Triesch, 2016; Song et al., 2005; Sporns, 2011).
Acknowledgments
This work was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) with grants KR 5148/2-1 (project number 436456810), KR 5148/3-1 (project number 510395418), and GRK 2839 (project number 468527017) to P.K., as well as grant SCHI 1482/3-1 (project number 451810794) to A.S.
Author Contributions
The study was conceived, designed, and supervised by C.M. and P.K. Numerical experiments were performed by C.M. and D.V. Results were discussed and interpreted by C.M., M.Y., A.S., and P.K. The article was written by C.M. and P.K. All authors critically read the manuscript before submission.
Additional Information
The authors declare no competing interests. Data and analysis programs will be made available on reasonable request.