## Abstract

The perceptron is a simple supervised algorithm to train a linear classifier that has been analyzed and used extensively. The classifier separates the data into two groups using a decision hyperplane, with the margin between the data and the hyperplane determining the classifier's ability to generalize and its robustness to input noise. Exact results for the maximal size of the separating margin are known for specific input distributions, and bounds exist for arbitrary distributions, but both rely on lengthy statistical mechanics calculations carried out in the limit of infinite input size. Here we present a short analysis of perceptron classification using singular value decomposition. We provide a simple derivation of a lower bound on the margin and an explicit formula for the perceptron weights that converges to the optimal result for large separating margins.

## 1. Introduction

The perceptron is a simple algorithm for training a linear classifier to separate a data set into two distinct classes (Rosenblatt, 1962). It works by iteratively updating a weight vector to define a decision hyperplane that separates the inputs into the two desired classes (Minsky & Papert, 1969). In addition to its simplicity, the perceptron algorithm has the appealing property of converging after a finite number of iterations if the data set is linearly separable (Novikoff, 1962).

More recent modifications of the original perceptron have led to algorithms that are guaranteed to converge to an optimal solution—one corresponding to a decision hyperplane that maximally separates the two data classes. This is obtained by maximizing the separating margin, defined as the distance between the input classes and the decision hyperplane (Krauth & Mézard, 1987; Freund & Schapire, 1999; Korzen & Klesk, 2008; Abbott & Kepler, 1989a). This strategy increases the classifier's robustness to input noise and its ability to generalize to untrained data.

In her seminal work, Gardner (1988) proved an exact relation between the number of patterns the perceptron has to classify and the maximal margin attainable. This result, however, holds only in the thermodynamical limit (where the number of neurons and input patterns goes to infinity) and for independent and identically distributed inputs. Bounds on the margin were later obtained by Tarkowski, Komarnicki, and Lewenstein (1991) and Tarkowski and Lewenstein (1992) for a general distribution of inputs through a replica method analysis (Mézard, Parisi, & Virasoro, 1987).

Our main result is an independent derivation of the bound obtained by Tarkowski et al. (1991) and Tarkowski and Lewenstein (1992) using elementary linear algebra methods, including singular value decomposition. Specifically, we show that the margin is bounded by the minimal singular value of the matrix whose columns are the input patterns. This result is valid for any set of input patterns and does not assume any particular correlation structure. Our analysis also provides a straightforward derivation of the pseudo-inverse solution to the perceptron (Personnaz, Guyon, & Dreyfus, 1985; Kanter & Sompolinsky, 1987), which provides a closed-form expression for the weights of the perceptron that converges to the optimal solution for large values of the separating margin.

The perceptron has been used as a tool in a variety of fields ranging from machine learning (Freund & Schapire, 1999), through modeling of specific brain regions (Brunel, Hakim, Isope, Nadal, & Barbour, 2004) to training methods for spiking and decision-making neural networks (Brader, Senn, & Fusi, 2007; Rigotti et al., 2010). A simple way of analyzing the perceptron can provide valuable insight into all these fields.

## 2. Framework

Consider a perceptron with *N* binary inputs and a single output. The perceptron has to separate *p* patterns ξ^{μ}_{i} = ±1 into two classes ζ^{μ} = ±1, where *i* = 1, …, *N* and μ = 1, …, *p*. The ratio between the number of patterns and the number of inputs defines the storage capacity α = *p*/*N*. The output of the perceptron is determined by its weights *w* and, for a given pattern ξ^{μ}_{i}, is defined as

$$ O^{\mu} = \operatorname{sign}\!\left( \sum_{i=1}^{N} w_i \xi_i^{\mu} \right). \tag{2.1} $$

Therefore, the conditions for correct classification are

$$ \zeta^{\mu} \sum_{i=1}^{N} w_i \xi_i^{\mu} \geqslant \kappa \tag{2.2} $$

with κ>0. To fix the scale of κ, we assume the weight vector *w* lies on the unit sphere, $\sum_i w_i^2 = 1$ (see Gardner, 1988). A solution to the full problem is therefore a weight vector satisfying conditions 2.2 for all μ and for a given κ>0. The optimal solution maximizes κ.

We define the *N* × *p* matrix *S* whose columns are the patterns multiplied by their desired outputs, $S_{i\mu} = \xi_i^{\mu} \zeta^{\mu}$, so that the classification conditions 2.2 take the matrix form

$$ h^{\mu} \equiv (S^{\top} w)_{\mu} \geqslant \kappa. \tag{2.3} $$

Applying singular value decomposition (with *r* as the rank of *S*), we write

$$ S = U \Sigma V^{\top}, \tag{2.4} $$

where *U* is an *N* × *r* matrix with orthonormal columns ($U^{\top}U = I$), Σ is an *r* × *r* diagonal matrix with positive real numbers on the diagonal (the singular values of *S*), and *V* is a *p* × *r* matrix with orthonormal columns ($V^{\top}V = I$).
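To make the notation concrete, here is a short numerical sketch of the construction of *S* and its SVD; the i.i.d. ±1 patterns and the sizes are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 50, 20                            # assumed sizes, p <= N

xi = rng.choice([-1, 1], size=(N, p))    # patterns xi^mu_i = +/-1 (columns)
zeta = rng.choice([-1, 1], size=p)       # desired outputs zeta^mu = +/-1

S = xi * zeta                            # S_{i,mu} = xi^mu_i * zeta^mu
U, s, Vt = np.linalg.svd(S, full_matrices=False)
r = int(np.sum(s > 1e-10))               # rank of S; r = p almost surely here

assert np.allclose(U.T @ U, np.eye(r))   # orthonormal columns: U^T U = I
assert np.allclose(Vt @ Vt.T, np.eye(r)) # orthonormal columns: V^T V = I
```

For independent random ±1 patterns with *p* ⩽ *N*, the rank is *r* = *p* with overwhelming probability, which is the case treated in sections 4 and 5.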

The SVD of the input matrix *S*, equation 2.4, suggests an equivalent perceptron problem obtained by absorbing the matrix *U* into the weight vector, $\tilde{w} \equiv U^{\top} w$. The original and equivalent formulations are:

- For κ>0, find *w* with $\|w\| = 1$ so that $h^{\mu} \geqslant \kappa$ for all μ, where $h = S^{\top} w$. (P1)
- For $\tilde{\kappa} > 0$, find $\tilde{w}$ with $\|\tilde{w}\| = 1$ so that $\tilde{h}^{\mu} \geqslant \tilde{\kappa}$ for all μ, where $\tilde{h} = V \Sigma \tilde{w}$. (P2)

*P*1 is a restatement of the original problem. We will show that the second form, *P*2, is equivalent to the original formulation, in that there exists a transformation from the optimal solution to *P*2 to the optimal solution to *P*1, and vice versa. Working with the formulation *P*2 will allow us to derive a lower bound on $\tilde{\kappa}$ and, in turn, on κ.

## 3. Equivalence of *P*1 and *P*2: Uncovering the True Dimensionality of the Problem

The problems *P*1 and *P*2 are equivalent. Given the optimal solution $\tilde{w}^{*}$ to *P*2, the vector $w^{*} = U\tilde{w}^{*}$ is normalized, $\|U\tilde{w}^{*}\| = \|\tilde{w}^{*}\| = 1$, and achieves the same margin, since $S^{\top}U\tilde{w}^{*} = V\Sigma U^{\top}U\tilde{w}^{*} = V\Sigma\tilde{w}^{*}$. It remains to show that $w^{*}$ is the optimal solution to *P*1 if $\tilde{w}^{*}$ is the optimal solution to *P*2. Suppose there exists a better solution than $w^{*}$ to *P*1, that is, there is a normalized weight vector $w'$ that satisfies

$$ (S^{\top} w')_{\mu} \geqslant \kappa' > \kappa^{*} \quad \text{for all } \mu. $$

If we now define the normalized *r*-dimensional vector $\tilde{w}' = U^{\top}w' / \|U^{\top}w'\|$ as a candidate *P*2 solution, we can see that it satisfies

$$ (V\Sigma\tilde{w}')_{\mu} = \frac{(S^{\top}w')_{\mu}}{\|U^{\top}w'\|} \geqslant \kappa' > \kappa^{*}, $$

contradicting the optimality of $\tilde{w}^{*}$. The inequality stems from the fact that $\|U^{\top}w'\| \leqslant \|w'\| = 1$, because multiplication by $U^{\top}$ is a projection onto an *r*-dimensional vector space.
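The identity used in this argument, $S^{\top}(U\tilde{w}) = V\Sigma\tilde{w}$, together with the norm preservation $\|U\tilde{w}\| = \|\tilde{w}\|$, can be checked numerically; this is a sanity check with assumed random patterns, not part of the proof:

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 30, 10                            # assumed sizes, p <= N so r = p
S = rng.choice([-1, 1], size=(N, p)) * rng.choice([-1, 1], size=p)
U, s, Vt = np.linalg.svd(S, full_matrices=False)

w_tilde = rng.standard_normal(p)
w_tilde /= np.linalg.norm(w_tilde)       # normalized r-dimensional weights
w = U @ w_tilde                          # candidate P1 weight vector

h_P1 = S.T @ w                           # stabilities in problem P1
h_P2 = Vt.T @ (s * w_tilde)              # stabilities in problem P2: V Sigma w_tilde
assert np.allclose(h_P1, h_P2)           # the two problems assign identical margins
assert np.isclose(np.linalg.norm(w), 1.0)  # U^T U = I preserves the norm
```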

Notice that, in general, the matrix $\Sigma V^{\top}$ of the perceptron problem *P*2 is an *r* × *p* matrix, where *r* is the rank of *S*. *P*1 and *P*2 are therefore equivalent to an *r*-dimensional perceptron classifying *p* patterns. In particular, the reformulation *P*2 uncovers the true dimensionality of the classification problem at hand.

## 4. Dual Formulation and Lower Bound on *κ*

Assuming that a solution to *P*2 exists for some $\tilde{\kappa} > 0$, we now reformulate the task of finding the optimal solution to the perceptron problem—the maximal $\tilde{\kappa}$ for which *P*2 holds. Defining $u \equiv \tilde{w}/\tilde{\kappa}$ and $h \equiv V\Sigma u$ and substituting these definitions in the equivalent formulation *P*2 allows us to reformulate the search for the optimal solution as the problem of finding a weight vector *u* satisfying $h \geqslant 1$ for the largest possible $\tilde{\kappa} = 1/\|u\|$, where the notation $h \geqslant 1$ stands for $h_{\mu} \geqslant 1$ for all μ. Because maximizing $\tilde{\kappa}$ is the same as minimizing $\|u\|$, our task can be formulated in the following equivalent dual form: instead of maximizing $\tilde{\kappa}$ subject to a constraining equality on $\|\tilde{w}\|$, we minimize $\|u\|$ subject to inequality constraints on *h*:

$$ \min_{u} \|u\|^{2} \quad \text{subject to} \quad (V\Sigma u)_{\mu} \geqslant 1 \ \text{for all } \mu. \tag{4.1} $$

The maximal margin κ* will then be given by $\kappa^{*} = 1/\|u^{*}\|$, and we define a *p* × *p* matrix $M = V\Sigma^{-2}V^{\top}$.

We now restrict ourselves to the case *r* = *p* (which implies that there are *p* ⩽ *N* linearly independent patterns). In this case, the formula $h = V\Sigma u$ defines a one-to-one relationship between *u* and *h*, namely $u = \Sigma^{-1}V^{\top}h$ with $\|u\|^{2} = h^{\top}Mh$. Thus, we can switch the minimization parameter from *u* to *h*:

$$ (\kappa^{*})^{-2} = \min_{h \geqslant 1} h^{\top} M h. \tag{4.2} $$

This expression is equivalent to equation 24 in Tarkowski et al. (1991). We now use this result to obtain a lower bound on the maximal margin κ*. Specifically, observe that the all-ones vector $e = (1, \ldots, 1)^{\top}$ satisfies the condition $e \geqslant 1$. Using this fact in equation 4.2, we get

$$ \kappa^{*} \geqslant \left( e^{\top} M e \right)^{-1/2} \geqslant \left( p\,\lambda_{\max} \right)^{-1/2} = \frac{\sigma_{\min}}{\sqrt{p}}, \tag{4.3} $$

where $\lambda_{\max} = \sigma_{\min}^{-2}$ is the maximal eigenvalue of *M* and $\sigma_{\min}$ is the minimal singular value of *S*. Notice that the first inequality in the previous expression is saturated in the case where the vector *e* is a minimum of the quadratic form $h^{\top}Mh$, while the second inequality is saturated when this vector is an eigenvector associated with the maximal eigenvalue.
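The chain of inequalities in equation 4.3 can be checked numerically; the i.i.d. ±1 patterns and sizes below are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 100, 40                           # p <= N, so r = p almost surely
S = rng.choice([-1, 1], size=(N, p)) * rng.choice([-1, 1], size=p)
U, s, Vt = np.linalg.svd(S, full_matrices=False)
M = Vt.T @ np.diag(s**-2.0) @ Vt         # M = V Sigma^{-2} V^T

e = np.ones(p)
kappa_e = (e @ M @ e) ** -0.5            # margin achieved by the h = e point
bound = s.min() / np.sqrt(p)             # sigma_min / sqrt(p), equation 4.3
assert kappa_e >= bound                  # kappa* >= kappa_e >= sigma_min/sqrt(p)

# The h = e point is feasible: its weight vector gives equal stabilities kappa_e
u = (Vt @ e) / s                         # u = Sigma^{-1} V^T e
w = U @ u
w /= np.linalg.norm(w)
assert np.allclose(S.T @ w, kappa_e)
```

Since $\kappa^{*}$ is the maximum over all feasible *h*, the margin of the *h* = *e* point is itself a certified lower bound on $\kappa^{*}$, and equation 4.3 bounds it from below in turn.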

The simple derivation of equation 4.3, which appears as equation 12 in Tarkowski and Lewenstein (1992), is the main result of our note. This derivation was made possible by the formulation *P*2, which factorizes out the matrix *U* and allows a one-to-one mapping between the weights $\tilde{w}$ and the stabilities *h*. The bound is useful in cases where exact results for the maximal margin are not known.

Knowing only the singular values of *S*, we cannot improve the bound, as we demonstrate by providing a simple example where the bound is tight. Consider the following two patterns in a two-dimensional space:

$$ \xi^{1} = (1, 1), \quad \zeta^{1} = +1; \qquad \xi^{2} = (1, -1), \quad \zeta^{2} = -1. $$

In this case the relevant matrices are

$$ S = \begin{pmatrix} 1 & -1 \\ 1 & 1 \end{pmatrix}, \qquad \Sigma = \sqrt{2}\, I, \qquad M = \tfrac{1}{2} I. $$

The optimal separating hyperplane for these patterns is the ξ_{2} = 0 line, and thus the maximal margin is 1. Since $\sigma_{\min}/\sqrt{p} = \sqrt{2}/\sqrt{2} = 1$, the bound 4.3 is tight.
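This tightness example can be checked directly. The explicit patterns below, ξ¹ = (1, 1) with ζ¹ = +1 and ξ² = (1, −1) with ζ² = −1, are an assumed concrete choice consistent with the ξ₂ = 0 optimal separator stated in the text:

```python
import numpy as np

xi = np.array([[1, 1],
               [1, -1]])                 # columns are xi^1 = (1,1), xi^2 = (1,-1)
zeta = np.array([1, -1])

S = xi * zeta                            # S_{i,mu} = xi^mu_i * zeta^mu
sigma = np.linalg.svd(S, compute_uv=False)

w = np.array([0.0, 1.0])                 # normal of the xi_2 = 0 hyperplane
kappa_star = np.min(zeta * (w @ xi))     # achieved (and here optimal) margin
bound = sigma.min() / np.sqrt(2)         # sigma_min / sqrt(p)

assert np.isclose(kappa_star, 1.0)
assert np.isclose(bound, 1.0)            # the bound of equation 4.3 is tight
```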

## 5. A Closed-Form Solution for the Weight Vector

We now return to the *N*-dimensional weight vector of *P*1. Using the relationships $w = U\tilde{w}$ and $u = \Sigma^{-1}V^{\top}h$ in the *r* = *p* case, with the candidate solution *h* = *e*, defines a weight vector,

$$ w = \frac{1}{z}\, U \Sigma^{-1} V^{\top} e, \tag{5.1} $$

where *z* is a normalizing factor enforcing $\sum_i w_i^2 = 1$. The weight vector given by equation 5.1 can also be derived by solving for *w* in equation 2.3 (taken with equality) by applying the pseudo-inverse of $S^{\top}$ to the vector κ*e*. The result simplifies, using equation 2.4, to equation 5.1 with *z* = κ. This pseudo-inverse solution (Personnaz et al., 1985; Kanter & Sompolinsky, 1987) is not the optimal solution to *P*1, but converges to it as κ → ∞. This is because for the optimal solution, the fraction of patterns having a margin strictly larger than κ is $\int_{-\infty}^{-\kappa} Dz$, where *Dz* denotes a gaussian integral (Abbott & Kepler, 1989b), meaning that as κ tends to infinity, all stabilities approach the margin and the pseudo-inverse solution, equation 5.1, will tend to the optimal solution. Notice that because the error function $\int_{-\infty}^{-\kappa} Dz$ goes exponentially to zero, the pseudo-inverse solution starts approaching the optimal solution already for margins κ of order 1. For instance, for κ = 1.40, less than 5% of the elements of **h** are above κ, meaning that equation 5.1 is already a good approximation.
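As a numerical sketch of this construction (i.i.d. ±1 patterns are an assumption for illustration), applying the pseudo-inverse of $S^{\top}$ to κ*e* yields weights that place every pattern exactly on the margin:

```python
import numpy as np

rng = np.random.default_rng(3)
N, p = 80, 30                            # assumed sizes, p <= N so that r = p
S = rng.choice([-1, 1], size=(N, p)) * rng.choice([-1, 1], size=p)

# Solve S^T w = kappa * e by the pseudo-inverse: w = kappa * U Sigma^{-1} V^T e
kappa = 1.0
w = np.linalg.pinv(S.T) @ (kappa * np.ones(p))
assert np.allclose(S.T @ w, kappa)       # every pattern sits exactly on the margin

# Rescaling w to the unit sphere gives equation 5.1 with z enforcing |w| = 1
w_unit = w / np.linalg.norm(w)
```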

## 6. Discussion

We have demonstrated the utility of applying singular value decomposition to the perceptron problem for a quick and simple derivation of several results.

The original problem has *N* unknowns. However, the patterns actually lie in an *r*-dimensional subspace spanned by the columns of *U*, and thus there are only *r* independent degrees of freedom. The formulation *P*2 uncovers the true dimensionality of the problem by absorbing *U* into the weight vectors. Another way to look at this result is by noting that the best weight vector should be a linear combination of the input patterns (Gerl & Krey, 1994). Indeed, the transformation $w = U\tilde{w}$ defines a one-to-one relation between the *r*-dimensional vectors $\tilde{w}$ and the *N*-dimensional vectors *w* in the *r*-dimensional subspace spanned by the patterns. Adding a component to *w* that is orthogonal to all patterns would increase the norm of *w* without contributing to the classification of the input patterns.

We can characterize the dependence of the weight vector on the input patterns by implicitly defining the vector of pattern contributions *x* through $w = Sx$. The components *x*_{μ} are known as the embedding strengths of the patterns and were used to relate the perceptron problem to nonlinear optimization (Anlauf & Biehl, 1989). In particular, for the optimal solution to the perceptron, a given pattern is either exactly on the margin and explicitly encoded by the weights, or it is further away from the margin and is automatically classified without being encoded in the weights. We can easily derive these conditions within our formalism by relating *x* and *h*. Specifically, inserting the definition of *x* and the SVD of equation 2.4 into equation 2.3 implies $h = S^{\top}Sx = V\Sigma^{2}V^{\top}x$, and hence $x = Mh$. We now note that if we perturb the optimal solution $h^{*}$ of equation 4.2 by a vector δ*h*, the result, to first order, is $\delta(h^{\top}Mh) = 2x^{\top}\delta h$. To ensure the optimality of $h^{*}$ in the domain $h \geqslant 1$, for each μ there are two options: either $h_{\mu} = 1$, and then only perturbations with δ*h*_{μ}>0 are admissible, which forces *x*_{μ}>0; or $h_{\mu} > 1$, and then δ*h*_{μ} can be either positive or negative, which forces *x*_{μ} = 0. These conditions are known as the Kuhn-Tucker conditions (Fletcher, 1988; Gerl & Krey, 1994).
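The relation $x = Mh$ between embedding strengths and stabilities is easy to verify numerically (random patterns assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
N, p = 60, 25                            # assumed sizes, p <= N so r = p
S = rng.choice([-1, 1], size=(N, p)) * rng.choice([-1, 1], size=p)
U, s, Vt = np.linalg.svd(S, full_matrices=False)
M = Vt.T @ np.diag(s**-2.0) @ Vt         # M = V Sigma^{-2} V^T

x = rng.standard_normal(p)               # arbitrary embedding strengths
w = S @ x                                # weights as a combination of patterns
h = S.T @ w                              # resulting stabilities

assert np.allclose(M @ h, x)             # x = M h recovers the embedding strengths
```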

Our main result is a simple derivation of a lower bound on the stability margin. This bound becomes tighter as the margin κ increases (Abbott & Kepler, 1989b) and is therefore useful in situations where a large margin is desirable, for instance, in cases where we are interested in increasing the size of the basin of attraction of the fixed points of autoassociative neural networks (Krauth & Mézard, 1987; Forrest, 1988; Gardner & Derrida, 1988; Kepler & Abbott, 1988).

Our analysis up to and including equation 4.1 did not depend on the assumption *r* = *p* and is valid also for the cases where *p* > *N* (which implies *r* < *p*). The derivation of equations 4.3 and 5.1, however, does depend on this assumption. In general, solutions to the perceptron may also exist in the case *r* < *p*. Specifically, in the uncorrelated input case, there exists a solution for *p* < 2*N*, even though the rank satisfies *r* = *p* only for *p* ⩽ *N*. Our methods, however, cannot provide any general results in this regime, since there is no one-to-one correspondence between *u* and *h*. Analysis of this regime has to rely on tools from statistical mechanics such as the replica and cavity methods (Gardner, 1988; Gerl & Krey, 1994; Tarkowski et al., 1991).

## Acknowledgments

It is our pleasure to thank Stefano Fusi for helpful discussions and comments on the manuscript. We also thank Misha Tsodyks, Ken Miller, Larry Abbott, Srdjan Ostojic, Michael Vidne, and Dan Rubin for useful suggestions on the manuscript. This research was supported by DARPA grant SyNAPSE HR0011-09-C-0002, the Swartz Foundation, and the Gatsby Foundation. O.B. is supported by the Rothschild Fellowship and the Brainpower for Israel foundation. M.R. is supported by Swiss National Science Foundation grant PBSKP3-133357.