## Abstract

We propose a new self-organizing algorithm for a feedforward network, inspired by an electrostatic problem, that turns out to have intimate relations with information maximization.

## 1. Introduction

In this letter, we present a new self-organizing algorithm for a layer of *h* continuous perceptrons derived from the electrostatic problem of free electrical charges in a conductor. The algorithm is general and maximizes information.

The idea is simple: we use a layer of continuous perceptrons to map the inputs to point-like electrical charges that we imagine free to move within a hypercube in multidimensional space, and we let them evolve or, better, relax under Coulomb repulsion until they set in the minimal energy configuration. For this reason, we named this algorithm neural relax (NR).

We show that this is sufficient to obtain binary and statistically independent data as a natural consequence of the algorithm itself; in addition, by fixing the dimensions of the hypercube, one can freely adjust the rate of dimensional reduction. From a theoretical point of view, we show that in the simple one-dimensional case, this algorithm provides the maximum-information solution to the problem, and thus the learning rules turn out to be equal to those obtained by Bell and Sejnowski (1995) from their independent component analysis (ICA), providing a completely different interpretation of the ICA algorithm. In the general multidimensional case, we show that NR gives a pure Hebbian rule and is also well suited to inject some redundancy that can subsequently be used to perform error correction on the processed patterns.

## 2. A Layer of Perceptrons

We consider a layer of *h* perceptrons with *n* inputs and tanh() transfer function. Given an input **x**, each perceptron gives the output *y*_{i} = tanh(∑_{j} *w*_{ij}*x*_{j} + *w*_{i0}) (equation 2.1), and Figure 1 schematically illustrates the architecture of this network. We stretch the notation a bit, indicating the *h* equations, equation 2.1, with the weight matrix *W*: **y** = tanh(*W***x**) (equation 2.2).

This is a common, well-studied network that, among other things, can be used to approximate any continuous function, since the transfer function tanh(*x*) is bounded in (−1, 1), nonconstant, smooth, and monotone (Hornik, Stinchcombe, & White, 1989). We assume that the inputs follow a distribution *p*(**x**) and that there is no noise; usually we consider binary inputs *x*_{j} = ±1. We focus on the case of binary outputs *y*_{i} = ±1, that is, the limit of the continuous case, equation 2.2, when the argument is large.^{1}

The maximum information *C* that can be conveyed by *h* binary neurons is bounded by *h* bits, where *I*(**x**; **y**) is the mutual information between the input **x**, of distribution *p*(**x**), and the output **y**. The limitation comes essentially from the architecture, since *h* binary neurons with *n* inputs can possibly implement only *C*_{h,n} ⩽ 2^{h} of the theoretically possible 2^{h} output states. For *h* ⩽ *n*, we see that the architecture does not impose any limitation,^{2} and, for these binary neurons without noise, the upper bound *C* = *h* bits can be reached if, and only if, the distribution of the outputs is fully factorized (Nadal & Parga, 1993):

*q*(**y**) = ∏_{i} *q*(*y*_{i}).   (2.3)

With the help of this analysis, we can set up a list of desirable characteristics for the function of equation 2.2 implemented by our layer of *h* perceptrons:

- The output patterns should be (essentially) binary, that is, 1 − |*y*_{i}| < ε.
- The map should be injective and such that equation 2.3 holds.
- It should accomplish dimensionality reduction whenever possible, that is, *h* ≪ *n*.
- It should be learnable, that is, it should be possible to find it by gradient descent along an appropriate function of the weights.

The most demanding goal is satisfying equation 2.3, but it is not easy to find an algorithm that does it directly. Several authors followed the equivalent path of maximizing the mutual information *I*(**x**; **y**), as in, for example, the ICA algorithm (Bell & Sejnowski, 1995; see also Pham, 2001). Our algorithm starts from a physical problem that leads naturally toward fulfilling these requirements.

## 3. The Physical Problem

Consider *m* equal, point-like electric charges *Q*_{ν} within a cube of conductor. This is a problem very similar to the Thomson (1904) problem, where the charges are in a sphere. The problem is remarkably difficult to solve, and exact solutions are known only for a few values of *m* (see Schwartz, 2010). From now on, we always consider our cube centered at the origin and with a side of length 2, so the physical space available to the charges is the three-dimensional cube *H*_{3}, the extension to the *h*-dimensional hypercube *H*_{h} being obvious. In an ideal conductor, the *m* charges are free to move, and their stable rest positions minimize the Coulomb potential (Jackson, 1999). This potential^{3} is a harmonic function (Axler, Bourdon, & Ramey, 2001) and thus does not have minima in an open, convex set like *H*_{3}; the rest positions of the charges are therefore on the border, namely, on the surface of the cube. Moreover, we conjecture that if the charges are equal and their number is *m* ⩽ 2^{3} = 8, the only stable positions of the charges are on the cube vertices, as shown in Figure 2, which contains the minimum energy arrangements for two, three, four, and five charges.^{4}

All of this extends to the *h*-dimensional hypercube *H*_{h}, and we generalize our conjecture: the charges have stable rest positions on the vertices of *H*_{h} and, consequently, (almost) binary coordinates.

We take inspiration from this physical problem to propose a self-organizing algorithm for a layer of continuous perceptrons. We map our set of *m* inputs to point-like charges that are bound to remain in the *h*-dimensional hypercube. Subsequently, we let this system evolve under Coulomb repulsion, minimizing its energy until it reaches equilibrium. Provided that our conjecture is true and if *m* ⩽ 2^{h}, the charges at rest will occupy the vertices of *H*_{h} and thus have binary coordinates, which means that this approach allows us to get a binary representation of the input data as a natural consequence and without any further constraint. We will also show that this process maximizes information.
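The physical relaxation behind the conjecture can be simulated directly: place a few unit charges at random inside the cube, move each one along the net Coulomb force, and clip the positions to the cube. The following minimal sketch (ours, not from the paper; all parameters are arbitrary) shows the charges settling on cube vertices.

```python
import numpy as np

def relax_charges(m, h=3, steps=5000, lr=0.02, seed=0):
    """Relax m unit charges inside the hypercube [-1, 1]^h under
    mutual Coulomb repulsion (projected gradient descent)."""
    rng = np.random.default_rng(seed)
    y = rng.uniform(-0.5, 0.5, size=(m, h))
    for _ in range(steps):
        diff = y[:, None, :] - y[None, :, :]            # pairwise differences
        d = np.linalg.norm(diff, axis=-1)               # pairwise distances
        np.fill_diagonal(d, np.inf)                     # no self-interaction
        force = (diff / d[..., None] ** 3).sum(axis=1)  # Coulomb force on each charge
        y = np.clip(y + lr * force, -1.0, 1.0)          # move, but stay in the cube
    return y

y = relax_charges(m=4)
print(np.round(y, 2))  # each row ends up (close to) a vertex of the cube
```

For *m* = 4 the charges settle on four mutually distant vertices, in agreement with the minimum energy arrangements of Figure 2.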

Given *m* inputs of distribution *p*(**x**), by applying equation 2.2 we get *m* outputs that the hyperbolic tangent constrains within the *h*-dimensional hypercube *H*_{h}. To treat inputs of different probability, we postulate that the probability of an output is proportional to the energy of the corresponding charge *Q*_{ν} in the electric field (equation 3.2), and the total energy of the system is given by equation 3.3. For simplicity, most of the time we assume that all inputs are equiprobable and thus feel free to put *Q*_{ν} = 1 for all *m* charges, so that the function to minimize is the simplified Coulomb potential (equation 3.4). This "energy" is the function that the NR learning algorithm minimizes, modifying the elements of the weight matrix *W* by gradient descent, Δ*w*_{ij} = −ε ∂*U*/∂*w*_{ij} (equation 3.5), ε being a small positive constant.

Let us suppose that NR has been successfully applied and that the harmonic function *U* has been minimized (more on this later). All of the *m* charges have relaxed in the minimum energy configuration and necessarily lie on the *H*_{h} surface. If *m* ⩽ 2^{h} and our conjecture is true, they sit precisely on the vertices of the hypercube *H*_{h}. It follows that all coordinates of their positions are binary and satisfactorily represent the outputs of *h* binary neurons.

With the distance definition in equation 3.1, we know that *U* is harmonic and the Gauss theorem holds. We use these properties to show that the positions of our charges satisfy equation 2.3 in the limit *n*, *m*, *h* → ∞, when we can neglect the granularity of the charges and assume that the charge distribution becomes continuous. A similar approach is usually taken for idealized physical conductors, where one forgets the quantization of electron charges, since the single electron charge is considered negligible with respect to the total charge on the conductor.

We therefore consider the limit *n*, *m*, *h* → ∞. It follows, given the *H*_{h} structure, that every hyperplane through the origin that does not hit any vertex of *H*_{h} (to avoid complications) cuts *H*_{h} into two parts that contain the same number of vertices, since if vertex **y** belongs to one of the semispaces, vertex −**y** must belong to the other one.^{5} From the constancy of the spatial density of the charges, it follows that the two semispaces must also contain exactly the same charge, half of the total charge on *H*_{h}. Since this result is valid for any hyperplane through the origin, it is true also for the *h* hyperplanes *y*_{i} = 0. This means that there are exactly *m*/2 charges with *y*_{i} = 1 (remember that all coordinates are binary) and the same number with *y*_{i} = −1. In the language of our layer of perceptrons, and since *m* → ∞, this means that the output distribution is such that *q*(*y*_{i} = 1) = *q*(*y*_{i} = −1) = 1/2.

It is also easy to prove by induction that *q*(**y**) = ∏_{i} *q*(*y*_{i}). We begin by showing that *q*(*y*_{i}, *y*_{j}) = *q*(*y*_{i})*q*(*y*_{j}) for any couple of different coordinates *y*_{i} and *y*_{j}. We suppose we have cut our charge distribution into two equal parts by the hyperplane *y*_{i} = 0 and consider the orthogonal hyperplane *y*_{j} = 0. It is easy to use the previous argument to show that in all four subspaces so defined, the charges must be equal to *m*/4, and thus for any choice of the values of *y*_{i} and *y*_{j}, one gets *q*(*y*_{i}, *y*_{j}) = 1/4 and thus *q*(*y*_{i}, *y*_{j}) = *q*(*y*_{i})*q*(*y*_{j}). Now suppose that the factorization holds for any choice of *k* variables. It is easy to exploit the structure of *H*_{h} to show that if one adds a (*k* + 1)th coordinate, the corresponding hyperplane will cut all the previous charges into two halves, and thus that the factorization holds for *k* + 1 variables as well, completing the proof by induction. A technical point: only for *m* = 2^{h} can one continue the induction chain up to step *k* = *h*, giving *q*(**y**) = 2^{−h} for any **y** and complete factorization of the distribution. If *m* < 2^{h}, one can prove only that all the moments of order *k* of the distribution are zero up to *k* = ⌊log_{2} *m*⌋.

We have thus proved that if the *m* charges relax in the configuration of minimal energy (which, by the way, is far from unique, given the many symmetries of the system), the final positions of the charges satisfy all the requests set for a layer of perceptrons at the end of the previous section; in particular, the final distribution is fully factorized (see equation 2.3), which implies that the information produced at the output is maximal.
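The counting argument at the heart of the proof (any hyperplane through the origin that misses the vertices of *H*_{h} splits them into two equal halves) is easy to check numerically; here is a quick sketch of ours, not part of the original text:

```python
import itertools
import random

# Vertices of the hypercube H_h with h = 5: all 2^5 = 32 sign vectors.
h = 5
vertices = list(itertools.product([-1, 1], repeat=h))

# A random hyperplane through the origin; a gaussian normal vector will
# essentially never contain a vertex exactly.
random.seed(1)
normal = [random.gauss(0, 1) for _ in range(h)]

dots = [sum(a * v for a, v in zip(normal, vtx)) for vtx in vertices]
n_pos = sum(d > 0 for d in dots)
n_neg = sum(d < 0 for d in dots)

# If vertex y is on one side, -y is on the other, so the split is exactly even.
print(n_pos, n_neg)  # -> 16 16
```

The symmetry **y** ↔ −**y** makes the even split exact for every such hyperplane, which is precisely the lemma used above.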

In the physical problem, the energy is a function of the *mh* charge coordinates and is provably harmonic, but in our case, with equation 2.1, we can change the coordinates only through the (*n* + 1)*h* weights *w*_{ij}. It is simple to verify that *U*(*w*) is, in general, no longer harmonic. This means that the restrictions imposed on the positions of the charges by the fact that they are given by equation 2.2 (which also enforces the constraints |*y*_{i}| < 1) render the energy no longer harmonic in the "free" coordinates *w*_{ij}. This implies that we cannot formally prove that the function *U*(*w*) is without local minima and that gradient descent 3.5 will always bring the system to one of the solutions we just described, because we can move the charge positions only through the variation of the weights *w*_{ij}, not freely.

One could argue that it is reasonable to expect that the characteristics of the solution will not change dramatically, especially if *m* ≪ 2^{h} and the charges are very far from each other on *H*_{h}. But the strength of a formal proof is lost. This argument surely deserves further investigation and will be the subject of future work.

We conclude this section with a brief review of other appearances of Coulomb-like forces in the context of neural networks. The series started in 1987, when Bachmann, Cooper, Dembo, and Zeitouni (1987) proposed an associative memory that attached negative electrical charges to the stored patterns while the memory state played the role of a positive charge attracted by the patterns. In this fashion, they could store an unlimited number of patterns, and the memory had no spurious states. This idea resurfaced seven years later (Perrone & Cooper, 1995).

Some years later, Marques and Almeida (1999) proposed a feedforward network dedicated to the separation of nonlinear mixtures that minimized a function of three terms. The first term, *W*, was inspired by the idea of repulsion of equal charges and produced a repulsive force. This force was nonphysical, since the repulsion had a finite range and acted only in proximity to the patterns; the minimization of this term tended to keep the patterns far apart, producing an approximately uniform distribution of the patterns. To this term they had to add a term *B*, enforcing the constraint that the outputs stay in [−1, 1] so that the patterns do not fly off to infinity, and a regularizing term *R*. This work has subsequently been analyzed in a mathematical setting (Theis, Bauer, Puntonet, & Lang, 2001), where it has been shown that, within certain approximations, a repulsive force decreasing faster than the Coulomb force tends to produce a uniform probability density of the outputs, which in turn maximizes the output entropy, minimizes the mutual information among the outputs, and is thus amenable to ICA.

None of these works has the real, physical Coulomb energy that is central to our approach, since this energy allows us to define a positive-definite probability density (see equation 4.2) and, at least in the ideal case, is harmonic, which gives important properties to the function to be minimized. This kind of potential matches perfectly with the hypercube structure, since the charges tend to place themselves on the hypercube vertices, thus automatically satisfying the other request of having binary coordinates. This produces a distribution of the patterns that microscopically is highly nonuniform, being a discrete sum of point-like charges; from a larger distance, this distribution appears uniform thanks to the Gauss theorem (as in real conductors).

## 4. Analysis of the One-Dimensional Case

In the one-dimensional case, a single neuron with a single input, it makes no sense to restrict ourselves to binary inputs *x* = ±1. So here we suppose continuous inputs *x* with probability distribution *p*(*x*). Correspondingly, we have a continuous *y* with an electrical charge density ρ(*y*), and the energy of the system, equation 3.3, becomes an integral (equation 4.1). Calling φ(*y*) the total potential at point *y*, we have *q*(*y*) = ρ(*y*)φ(*y*) (equation 4.2), where *q*(*y*) is the linear energy density, which is by definition positive since it is proportional to the squared electric field (Jackson, 1999). It is thus possible to extend equation 3.2 and interpret *q*(*y*) (suitably normalized) as the probability density distribution of *y*. Our problem, given *x* and *p*(*x*), is to determine the parameters *w*, *w*_{0} that minimize *U*.

We can gain insight into the solution of this problem by first examining the corresponding physical problem. Since our charges in *y* are to be imagined as free charges in a conductor, this is the physical problem of the charge distribution on a finite (remember −1 < *y* < 1), infinitely thin conductive wire.

It is a typical electrostatic problem: one has to find the charge distribution ρ(*y*) that minimizes *U*. In this case, we are in a conductor, and when the energy is minimized, the potential is constant, φ(*y*) = φ_{0}. Mathematically, the problem is to find the charge distribution ρ(*y*) that realizes this condition. This is not an easy problem (it was the subject of James Clerk Maxwell's last scientific paper; see Jackson, 2002), but it is known (Jackson, 2000) that as the ratio of the wire's thickness to its length goes to zero, the distribution of the charges on the wire tends to a uniform distribution: ρ(*y*) → ρ_{0}. So we can conclude that the physical solution that minimizes equation 4.2 gives *q*(*y*) = ρ_{0}φ_{0}.

This is true for the physical problem where, since the charges in the wire are free to move, the distribution of charges ρ(*y*) can take any shape. It is also clear that in our case, where we can play only with the parameters *w*, *w*_{0} to modify ρ(*y*), in general it will be impossible to find values of *w*, *w*_{0} that realize the condition *q*(*y*) = ρ_{0}φ_{0}.

Let us recall what happens to a probability distribution *p*(*x*) when the variable *x* is transformed to *y* = *f*_{w}(*x*), where *w* represents the parameters of the function *f*(), which has to be invertible. In this case, the distribution *q*(*y*) of *y* is given by *q*(*y*) = *p*(*x*)/|d*y*/d*x*|, and this relation tells us that to get a constant *q*(*y*), necessarily d*y*/d*x* ∝ *p*(*x*), and thus the function *y* = *f*_{w}(*x*) needs to be proportional to the primitive of the probability distribution of *x* (equation 4.3); it is well known that this represents the maximum entropy solution for our one-neuron net (Atick, 1992). So, if by adjusting *w* and *w*_{0} we can obtain that equation 4.3 holds, our system minimizes the energy (see equation 4.2), and this solution also gives the maximum information. In our case (see equation 4.1), one obtains a condition on *p*(*x*), where we used the fact that tanh′(*x*) > 0. This relation can also be interpreted as giving the only possible *p*(*x*) for which we get the optimal solution. As one of the referees pointed out, this can be a severe limitation, one that could be remedied by adapting not just the weights but, as Nadal and Parga (1994) did, the transfer function *f*_{w}(*x*) itself. This would produce a more powerful neuron, but following Bell and Sejnowski's ICA, we decided purposely not to open this Pandora's box at this stage.
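The statement that an *f*_{w} proportional to the primitive of *p*(*x*) equalizes the output distribution is easy to verify numerically. The sketch below is our illustration (the gaussian input is an arbitrary choice): it maps gaussian samples through their own cumulative distribution, rescaled to (−1, 1) as for a tanh unit, and checks that the output histogram is flat.

```python
import math
import random

random.seed(0)
n = 100_000
xs = [random.gauss(0.0, 1.0) for _ in range(n)]

# Primitive (CDF) of the gaussian density, rescaled to (-1, 1):
# y = 2*Phi(x) - 1 = erf(x / sqrt(2)).
ys = [math.erf(x / math.sqrt(2.0)) for x in xs]

# Histogram of y over 10 equal bins of (-1, 1): each should hold ~10%.
bins = [0] * 10
for y in ys:
    k = min(int((y + 1.0) / 0.2), 9)
    bins[k] += 1
fractions = [b / n for b in bins]
print([round(f, 3) for f in fractions])
```

Any other invertible map (e.g., a mismatched tanh) would leave visible structure in the histogram, which is exactly the limitation on *p*(*x*) discussed above.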

We now look for the values of the parameters that minimize *U*. We therefore study the derivatives of *U* with respect to *w* and *w*_{0}, applying Leibniz's rule for differentiation under the integral, since we are dealing with continuous functions. We observe that the only term that depends on the parameters, and is thus affected by the derivative, is *y* = *f*_{w}(*x*).

We compute the derivatives of *y* with respect to *w* and *w*_{0} for our choice *y* = *f*_{w}(*x*) = tanh(*wx* + *w*_{0}): ∂*y*/∂*w* = (1 − *y*^{2})*x* and ∂*y*/∂*w*_{0} = 1 − *y*^{2}, which, substituted in the previous equations, give the gradient of *U* integrated over *x*. Hence, as anticipated, it is possible to interpret *q*(*y*) as a distribution over which the terms in square brackets can be considered averaged, so we can also write them as expectation values. Comparing these relations with ICA's learning rules (Bell & Sejnowski, 1995), and remembering that we use slightly different transfer functions, we see that they are equal. This shows that NR and ICA are intimately related and that, even if they start from completely different starting points, essentially both end up maximizing information.

## 5. The Multidimensional Case

We consider *m* binary inputs of *n* bits each (in our numerical simulations, binary images), fed to a layer of *h* neurons, thus producing, for each input **x**_{ν}, an output **y**_{ν}. The dimensionality of the output layer, *h*, is a quite arbitrary choice; it somehow represents the compression rate of the system.^{6} To each output produced, we attach an arbitrary unitary electric charge. Then we calculate the Coulomb potential (see equation 3.4) and apply gradient descent to it to obtain the learning rules. With the standard distance definition in *h*-dimensional space (see equation 3.1), we get the gradient of *U*, which gives the learning rule, equation 5.1, valid for *h* > 2, where we used the properties 4.5 of the hyperbolic tangent.
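Since the learning rule is just the gradient of the simplified Coulomb potential with respect to the weights, the whole learning loop can be sketched in a few lines. The code below is our reading of the procedure, with an explicit gradient obtained by differentiating *U* = ∑_{μ<ν} 1/*d*(**y**_{μ}, **y**_{ν}) through **y**_{ν} = tanh(*W***x**_{ν}); all sizes and constants are arbitrary choices for illustration.

```python
import numpy as np

def nr_step(W, X, eps=1e-3):
    """One gradient-descent step on the simplified Coulomb potential
    U = sum_{mu<nu} 1/d(y_mu, y_nu), with y_nu = tanh(W @ x_nu)."""
    Y = np.tanh(X @ W.T)                       # outputs, shape (m, h)
    diff = Y[:, None, :] - Y[None, :, :]       # y_mu - y_nu
    d = np.linalg.norm(diff, axis=-1)
    np.fill_diagonal(d, np.inf)                # exclude self-pairs
    U = (1.0 / d[np.triu_indices_from(d, 1)]).sum()
    # dU/dW: chain rule through d and the tanh saturation factor (1 - y^2).
    coef = diff / d[..., None] ** 3            # shape (m, m, h)
    g = (1.0 - Y ** 2) * coef.sum(axis=1)      # -dU/dY times saturation
    gradU = -g.T @ X                           # shape (h, n)
    return W - eps * gradU, U

rng = np.random.default_rng(0)
m, n, h = 7, 20, 8
X = rng.choice([-1.0, 1.0], size=(m, n))       # binary input patterns
W = 0.1 * rng.standard_normal((h, n))
energies = []
for _ in range(2000):
    W, U = nr_step(W, X)
    energies.append(U)
```

The (1 − *y*^{2}) factor visibly slows the updates as the outputs saturate toward ±1, in agreement with the remark below on the terms that "kill" the learning.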

We used the only possible definition of the distance that renders the energy *U* harmonic in the *mh* variables *y*_{νi}, but this is of little use for us since, in general, *U* is not harmonic with respect to our "free" variables *w*_{ij}.

An alternative is a distance inspired by the Hamming distance.^{7} With this new distance plugged into equation 3.4, we define a slightly different energy function *U*_{H} that still diverges when any two charges get too near each other. When minimizing *U*_{H}, the learning rule becomes equation 5.2, which is similar to the previous rule, equation 5.1, with the only difference that it contains only the "crossed" Hebbian terms *x*_{μj}*y*_{νi} and *x*_{νj}*y*_{μi}, without the subtraction of the "straight" terms *x*_{μj}*y*_{μi} and *x*_{νj}*y*_{νi}; in numerical simulations, it appears to be faster.

Learning rules 5.1 and 5.2 share two characteristics. The first is that they are Hebbian: they are perfectly local in the sense that the synapse *w*_{ij} connecting neuron *y*_{i} to input *x*_{j} is updated only with the values taken by these two neurons. At the same time, the value of the synapse is updated by products *x*_{μj}*y*_{νi} referring to different patterns; in other words, to update a synapse, one needs the "history" of the two neurons (one could say that the rule is local in space but nonlocal in time). The second interesting characteristic is that in both rules there appear terms (1 − *y*^{2}_{νi}) that tend to kill the learning when |*y*_{νi}| ≃ 1, that is, when the coordinates are substantially binary. This inhibits the weights from growing indefinitely.

We conclude this section by observing that the outputs produced by this network are suited to implement error detection and correction. In other words, the injective map of equation 2.2 implemented by our network de facto acts as an encoder that realizes a block (*m*, *h*) code (see, e.g., Cover & Thomas, 2006). Suppose that *m* < 2^{h}, so there are fewer patterns than hypercube vertices to park them, and that *U* has been minimized. Given the form of the energy minimized by learning (see equation 3.4), we know that each charge will be on a hypercube vertex and as far as possible from all other charges. Let us suppose that the minimum Hamming distance between different charges is *d*. It is well known that in this case, one can detect up to *d* − 1 errors on the patterns and correct up to ⌊(*d* − 1)/2⌋ errors. For example, in the numerical simulations of the next section, for *m* = 7 and *h* = 64, the minimum Hamming distance between different patterns is larger than *d* = 36. This means that if one is given a noisy version of a pattern (e.g., as returned by an associative memory), one can try to restore the original pattern. By the way, the restoration could be done by minimizing again the potential energy *U*, which is no longer minimal when the correct pattern is replaced by its noisy version, now "out of place."
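A toy version of this decoding, with made-up random codewords standing in for the network's outputs, shows the mechanism; nearest-vertex decoding is guaranteed to succeed as long as fewer than half the code's minimum Hamming distance of the bits are flipped.

```python
import random

random.seed(3)
m, h = 7, 64

# Stand-ins for the relaxed outputs: m random +/-1 codewords on vertices of H_h.
codewords = [[random.choice((-1, 1)) for _ in range(h)] for _ in range(m)]

def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))

d_min = min(hamming(codewords[i], codewords[j])
            for i in range(m) for j in range(i + 1, m))
t = (d_min - 1) // 2                     # number of correctable errors

# Corrupt one codeword by flipping t coordinates, then decode to the nearest one.
noisy = list(codewords[2])
for k in random.sample(range(h), t):
    noisy[k] = -noisy[k]
decoded = min(range(m), key=lambda i: hamming(codewords[i], noisy))
print(d_min, t, decoded)  # decoded is 2, the index of the original pattern
```

With NR-trained outputs, the codewords would additionally be pushed to (roughly) maximal mutual distance, improving *d* over a random code.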

## 6. Preliminary Numerical Results

We start by introducing the problem we tackled to test NR: preprocessing real-world data to build a binary, uncorrelated representation. We had in mind preprocessing binary images for an associative memory, but the approach is by no means limited to this particular task.

An associative memory is a network of *n* McCulloch and Pitts neurons, each of them updating its state *S*_{i} → *S*′_{i} with the standard rule *S*′_{i} = *t*(∑_{j} *w*_{ij}*S*_{j}) (equation 6.1), where the transfer function *t*(*x*) can be smooth (e.g., *t*(*x*) = tanh(*x*)) or binary (e.g., *t*(*x*) = sgn(*x*)). Different kinds of associative memories have different connection schemes and different rules for the synapses *w*_{ij}, but all models agree that the information is stored in the synapses. An associative memory storing *m* patterns should be able to find any of the stored patterns starting from a partial or noisy cue: if the network is initially in a state close to a stored pattern, the (repeated) application of equation 6.1 should bring the network to that stored pattern.
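For concreteness, here is a minimal sketch of such a memory; we chose a standard Hopfield-style network with Hebbian synapses purely for illustration, not as a construction from the text. Repeated application of the update rule pulls a noisy cue back to the stored pattern.

```python
import numpy as np

rng = np.random.default_rng(7)
n, m = 200, 3
patterns = rng.choice([-1.0, 1.0], size=(m, n))

# Hebbian synapses, no self-connections (one standard choice of learning rule).
W = patterns.T @ patterns / n
np.fill_diagonal(W, 0.0)

# Noisy cue: pattern 0 with 10% of its bits flipped.
S = patterns[0].copy()
flip = rng.choice(n, size=n // 10, replace=False)
S[flip] *= -1

# Repeated application of the update rule S'_i = sgn(sum_j w_ij S_j).
for _ in range(10):
    S = np.sign(W @ S)
overlap = float(S @ patterns[0]) / n
print(overlap)  # close to 1.0 when recall succeeds
```

With *m* well below the storage capacity (roughly 0.14 *n* for this model), the cue is restored essentially perfectly.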

So, to deal with real-world data, one needs first to transform them into data that fulfill these requirements. The simplest transformations are the linear ones, and if one is content with uncorrelated (rather than independent) data, then the linear transformation of principal component analysis can do the job. Unfortunately, the transformed patterns are no longer binary, and it is an open problem to find a linear transformation that produces uncorrelated and binary data (see, e.g., Tang & Tao, 2006, or Schein, Saul, & Ungar, 2003), an exact solution being in general impossible.^{8} So, to end up with binary data, one must give up one of the two constraints: uncorrelatedness or linearity of the transformation.
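This trade-off is easy to see in a couple of lines (our illustration): whitening binary data by principal component analysis decorrelates the coordinates, but the projected values are no longer ±1.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 200, 8
X = rng.choice([-1.0, 1.0], size=(m, n))      # binary data

# Principal component analysis: project onto the covariance eigenbasis.
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / (m - 1)
eigvals, eigvecs = np.linalg.eigh(cov)
Z = Xc @ eigvecs                               # uncorrelated coordinates

C = Z.T @ Z / (m - 1)                          # covariance of the projections
off_diag = C - np.diag(np.diag(C))
print(np.abs(off_diag).max())                  # ~0: coordinates are uncorrelated
print(np.abs(np.abs(Z) - 1).max())             # far from 0: values are not +/-1
```

The linear map wins uncorrelatedness at the price of binariness, which is exactly the constraint NR is designed to keep.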

Given the *m* *n*-dimensional binary images **x**_{ν}, we look for a map producing outputs **y**_{ν} that represent the preprocessed patterns, which should be statistically independent and thus ready to be stored in an associative memory of *h* neurons. At this point, it is clear that equation 2.2, obtained by NR, which satisfies equation 2.3, is tailored for the job.

Before presenting numerical results, we note an additional complication: associative memories usually do not exactly recall the stored patterns but return a pattern that differs from the stored one, typically by a few percent of the bits. If one wants to be able to get back the original image from this approximate recall, further requirements are imposed on the characteristics of the preprocessing; at the same time, this rules out standard algorithms for binary compression, which produce statistically fragile data. As explained previously, NR, providing data that are as far apart as possible in *R*^{h}, can fulfill this request.

We ran a preliminary numerical test on a set of *m* = 7 binary images of 33 × 33 pixels. We had a network of *h* = 64 neurons with *n* = 33 × 33 + 1 = 1,090 inputs, totaling 69,760 weights. We made two different learning runs with the two gradient descent rules, equations 5.1 and 5.2. The program stopped when max_{i,j}{Δ*w*_{ij}} ⩽ 10^{−5}, which required on the order of 10^{7} steps. Each simulation took several days on an Intel Core Duo 2.93 GHz processor, indicating that there is ample space for improvement (e.g., by taking advantage of standard electrostatic relaxation algorithms).

Figure 3 shows the decrease in energy during learning for both the Euclidean *U* and the Hamming-distance *U*_{H}. The first impression is that, as one would expect, the decrease is compatible with a typical electrostatic potential and also that, as foreseen, *U* ⩾ *U*_{H}. In this first run, the expected convergence of *U* → *U*_{H} was not observed, but there are indications that the *U* minimization was not terminated.

Our aim was to obtain data that are both statistically independent and binary. The second property is easier to check, since we only have to verify that the patterns rest on hypercube vertices. Figure 4 shows a histogram of the values of the coordinates *y*_{νi} (obtained minimizing *U*_{H}): the coordinates are indeed essentially binary, as expected.

Verifying the independence of the data (see equation 2.3) with the reduced statistics of this simulation is a challenging task. A necessary condition is that the marginal distributions are balanced (i.e., that each neuron cuts the input data set into exactly two equal parts). In our simulation, this is perfectly achieved: of the *mh* = 448 coordinates, we got 224 positive and 224 negative ones. Moreover, each of the *h* = 64 output neurons has, for the *m* = 7 inputs, exactly three positive and four negative coordinates (or vice versa), suggesting that with a larger (and even) number of initial examples, each neuron would have *m*/2 positive and *m*/2 negative coordinates.

To investigate the quality of the solution, we analyzed the relative distances of the output data: once equation 3.4 has been minimized, one expects all relative distances to be equal, indicating a roughly constant hypersurface charge distribution. We did this by calculating the *m* × *m* matrix of the scalar products between outputs, which, when the outputs sit on hypercube vertices, essentially represents the distance. To make the result easier to read, we converted these values to gray scale (−*h* → white, *h* → black), as shown in Figure 5. We can conclude that the *m* outputs are substantially equally spaced, particularly in the second case.

## 7. Conclusion

We presented a new approach to the problem of data preprocessing by a layer of perceptrons. We treat each data vector as a point-like electric charge confined in an *h*-dimensional hypercube, subject to simple Coulomb repulsive forces. We then let the system evolve as if it were a real physical system, that is, until it reaches the minimum of the electrostatic energy. At this point, we expect that the charges will occupy the hypercube's vertices and will be as far as possible from each other.

The potential energy function to minimize is continuous (since such is the transfer function tanh(*x*)), well shaped, and, as far as we know, free of the local minima that plague so many cases in neural networks. For these reasons, it is sensible here to implement a simple gradient descent, which produces a strictly local learning rule very similar to a Hebb rule, with the difference that to update a synapse, one needs all the data and not just the last seen datum.

In our tests, this learning algorithm does not shine for its speed, but one can speculate that for actual calculations, one could use more refined minimization of the potential *U*, exploiting the relaxation techniques used routinely for similar electrostatic problems.

Even with a continuous transfer function, at the end one obtains binary and statistically independent data, which guarantees that the entropy of the output is maximized.

Another characteristic of this network is that one can freely choose the number *h* of output neurons without any adjustment of the learning algorithm. For small values of *h*, the network implements compression of the incoming data; for larger *h*, just a dimensional reduction without any information loss; and for even larger values of *h*, one introduces redundancy in the data, which is useful for subsequent error correction.

Despite some encouraging results, we feel that there still is ample space for further theoretical and computational developments.

## References

## Notes

^{1}

Given that lim_{β→∞}tanh(β*x*) = sgn(*x*).

^{2}

This is different from the requirement that there be no information loss, which depends on the source entropy and would require that *h* be at least equal to the entropy of the input distribution.

^{3}

In gaussian units: *U* = ∑_{μ<ν} *Q*_{μ}*Q*_{ν}/|**y**_{μ} − **y**_{ν}|.

^{4}

Despite several attempts, we have not been able to prove this formally, but numerical simulations support the conjecture.

^{5}

One can observe that if the charges sit on hypercube vertices, they also lie on the hypersphere of radius √*h*, and one can continue the following proofs for the hypersphere.

^{6}

As proposed in Nadal and Parga (1993), one can distinguish three cases: *h* smaller than the input entropy, where the net must "compress" the data with some information loss; *h* matched to the input entropy, where the net is perfectly adapted to the incoming information; and *h* larger than the input entropy, where the net is redundant but, as explained later, with NR this redundancy can be used for error correction.

^{7}

In our notation, the Hamming distance between two binary vectors is the number of coordinates in which they differ.

^{8}

The covariance matrix has integer elements, but this is not true for its eigenvectors.

## Author notes

E. B. is now at Physics Department T35, Technische Universität München, 85747 Garching bei München, Germany.