The perceptron is a simple supervised algorithm for training a linear classifier, and it has been analyzed and used extensively. The classifier separates the data into two groups using a decision hyperplane, with the margin between the data and the hyperplane determining the classifier's ability to generalize and its robustness to input noise. Exact results for the maximal size of the separating margin are known for specific input distributions, and bounds exist for arbitrary distributions, but both rely on lengthy statistical mechanics calculations carried out in the limit of infinite input size. Here we present a short analysis of perceptron classification using singular value decomposition. We provide a simple derivation of a lower bound on the margin and an explicit formula for the perceptron weights that converges to the optimal result for large separating margins.
The perceptron is a simple algorithm for training a linear classifier to separate a data set into two distinct classes (Rosenblatt, 1962). It works by iteratively updating a weight vector to define a decision hyperplane that separates the inputs into the two desired classes (Minsky & Papert, 1969). In addition to its simplicity, the perceptron algorithm has the appealing property of converging after a finite number of iterations if the data set is linearly separable (Novikoff, 1962).
More recent modifications of the original perceptron have led to algorithms that are guaranteed to converge to an optimal solution—one corresponding to a decision hyperplane that maximally separates the two data classes. This is obtained by maximizing the separating margin, defined as the distance between the input classes and the decision hyperplane (Krauth & Mézard, 1987; Freund & Schapire, 1999; Korzen & Klesk, 2008; Abbott & Kepler, 1989a). This strategy increases the classifier's robustness to input noise and its ability to generalize to untrained data.
In her seminal work, Gardner (1988) proved an exact relation between the number of patterns the perceptron has to classify and the maximal margin attainable. This result, however, holds only in the thermodynamic limit (where the number of neurons and input patterns go to infinity) and for independent and identically distributed inputs. Bounds on the margin were later obtained by Tarkowski, Komarnicki, and Lewenstein (1991) and Tarkowski and Lewenstein (1992) for a general distribution of inputs through a replica method analysis (Mézard, Parisi, & Virasoro, 1987).
Our main result is an independent derivation of the bound obtained by Tarkowski et al. (1991) and Tarkowski and Lewenstein (1992) using elementary linear algebra methods, including singular value decomposition. Specifically, we show that the margin is bounded by the minimal singular value of the matrix whose columns are the input patterns. This result is valid for any set of input patterns and does not assume any particular correlation structure. Our analysis also yields a straightforward derivation of the pseudo-inverse solution to the perceptron (Personnaz, Guyon, & Dreyfus, 1985; Kanter & Sompolinsky, 1987), a closed-form expression for the perceptron weights that converges to the optimal solution for large values of the separating margin.
The perceptron has been used as a tool in a variety of fields, ranging from machine learning (Freund & Schapire, 1999), through modeling of specific brain regions (Brunel, Hakim, Isope, Nadal, & Barbour, 2004), to training methods for spiking and decision-making neural networks (Brader, Senn, & Fusi, 2007; Rigotti et al., 2010). A simple way of analyzing the perceptron can provide valuable insight into all these fields.
The singular value decomposition (SVD) of the input matrix S, S = UΣVT, suggests an equivalent perceptron problem obtained by absorbing the matrix U into the weight vector w. The original and equivalent formulations are:
For κ > 0, find w with ‖w‖ = 1 so that hμ ⩾ κ for all μ, where hμ = (STw)μ. (P1)
For κ > 0, find w̃ with ‖w̃‖ = 1 so that hμ ⩾ κ for all μ, where hμ = ((ΣVT)Tw̃)μ and w̃ = UTw. (P2)
3. Equivalence of P1 and P2: Uncovering the True Dimensionality of the Problem
Notice that in general, the matrix ΣVT of the perceptron problem, P2, is an r × p matrix, where r is the rank of S. P1 and P2 are therefore equivalent to an r-dimensional perceptron classifying p patterns. In particular, the reformulation, P2, uncovers the true dimensionality of the classification problem at hand.
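The equivalence of P1 and P2, and the reduction to r dimensions, can be checked numerically. The following sketch (all sizes and variable names are illustrative, not from the text) builds a rank-deficient pattern matrix, maps a P1 weight vector w to the P2 weights w̃ = UTw, and confirms that the stabilities agree:

```python
import numpy as np

rng = np.random.default_rng(0)

# p patterns in N dimensions, constructed to span only an r-dimensional
# subspace, so the pattern matrix S has rank r < min(N, p).
N, p, r = 50, 20, 5
S = rng.standard_normal((N, r)) @ rng.standard_normal((r, p))

# Compact SVD, truncated to the first r singular triplets: S = U Sigma V^T.
U, s, Vt = np.linalg.svd(S, full_matrices=False)
U, s, Vt = U[:, :r], s[:r], Vt[:r, :]

# A weight vector of problem P1 and its stabilities h^mu = (S^T w)_mu.
w = rng.standard_normal(N)
h_P1 = S.T @ w

# The corresponding P2 problem: r-dimensional weights w~ = U^T w acting on
# the effective r x p input matrix Sigma V^T.
w_tilde = U.T @ w
h_P2 = (np.diag(s) @ Vt).T @ w_tilde

assert np.allclose(h_P1, h_P2)   # P1 and P2 yield identical stabilities
```

Only r of the N weight directions matter: the rest lie in the null space of ST and cannot affect any hμ.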
4. Dual Formulation and Lower Bound on κ
The simple derivation of equation 4.3, which appears as equation 12 in Tarkowski and Lewenstein (1992), is the main result of our note. This derivation was made possible by the formulation P2, which factorizes out the matrix U and allows a one-to-one mapping between the weights w̃ and the stabilities hμ. The bound is useful in cases where exact results for the maximal margin are not known.
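As a concrete check, assume the bound takes the form κ ⩾ σmin/√p, with σmin the minimal singular value of the pattern matrix S (the precise normalization is our assumption here, inferred from the description above). The pseudo-inverse weights mentioned in the introduction provably attain at least this value, since ‖w‖ = ‖pinv(ST)·1‖ ⩽ √p/σmin:

```python
import numpy as np

rng = np.random.default_rng(1)

# p patterns in N > p dimensions, so S has full column rank (r = p).
N, p = 100, 30
S = rng.standard_normal((N, p))              # columns are the input patterns

sigma_min = np.linalg.svd(S, compute_uv=False).min()
bound = sigma_min / np.sqrt(p)               # assumed form of the lower bound

# Pseudo-inverse solution: minimum-norm w with equal stabilities S^T w = 1.
w = np.linalg.pinv(S.T) @ np.ones(p)
kappa = (S.T @ w).min() / np.linalg.norm(w)  # margin of the normalized weights

# ||w|| <= sqrt(p) / sigma_min, hence this margin already attains the bound;
# the maximal margin can only be larger.
assert kappa >= bound - 1e-12
```

The margin achieved by these closed-form weights is itself a certificate that the maximal margin satisfies the bound.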
5. A Closed-Form Solution for the Weight Vector
We have demonstrated the utility of applying singular value decomposition to the perceptron problem for a quick and simple derivation of several results.
The original problem has N unknowns. However, the patterns actually lie in an r-dimensional subspace spanned by the columns of U, and thus there are only r independent degrees of freedom. The formulation P2 uncovers the true dimensionality of the problem by absorbing U in the weight vector. Another way to look at this result is by noting that the best weight vector should be a linear combination of the input patterns (Gerl & Krey, 1994). Indeed, the transformation w = Uw̃ defines a one-to-one relation between the r-dimensional vectors w̃ and the N-dimensional vectors w in the r-dimensional subspace spanned by the patterns. Adding a component to w that is orthogonal to all patterns will increase the norm of w without contributing to the classification of the input patterns.
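This last point is easy to verify: adding to w any vector from the null space of ST leaves every stability unchanged while strictly increasing the norm. A small numpy sketch (sizes illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

N, p = 40, 10
S = rng.standard_normal((N, p))        # patterns span a p-dimensional subspace

# A weight vector lying in the span of the patterns (minimum-norm solution).
w = np.linalg.pinv(S.T) @ np.ones(p)

# A nonzero vector orthogonal to all patterns (null space of S^T), taken
# from the trailing columns of the full U matrix of the SVD of S.
U = np.linalg.svd(S, full_matrices=True)[0]
v_perp = U[:, p:] @ rng.standard_normal(N - p)

w2 = w + v_perp
assert np.allclose(S.T @ w, S.T @ w2)          # identical stabilities
assert np.linalg.norm(w2) > np.linalg.norm(w)  # strictly larger norm
```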
We can characterize the dependence of the weight vector on the input patterns by implicitly defining the vector x of pattern contributions as w = Sx. The components xμ are known as the embedding strengths of the patterns and were used to relate the perceptron problem to nonlinear optimization (Anlauf & Biehl, 1989). In particular, for the optimal solution to the perceptron, a given pattern is either exactly on the margin and explicitly encoded by the weights, or it is further away from the margin and is automatically classified without being encoded in the weights. We can easily derive these conditions within our formalism by relating x and h. Specifically, inserting the definitions of x and M into equation 2.3 implies h = Mx. We now note that if we perturb the optimal solution to equation 4.2 by a vector δw, the resulting change in the stabilities is δhμ = (STδw)μ. To ensure the optimality of w in the domain hμ ⩾ κ, for each μ there are two options: either hμ = κ and then δhμ > 0, which forces xμ > 0, or hμ > κ and then δhμ can be either positive or negative, which forces xμ = 0. These conditions are known as the Kuhn-Tucker conditions (Fletcher, 1988; Gerl & Krey, 1994).
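The Kuhn-Tucker structure of the optimum can be illustrated with a small numerical experiment. A standard equivalent of the max-margin problem (our choice here, not spelled out in the text) is to minimize ‖Sx‖ over the simplex {x ⩾ 0, Σμ xμ = 1}; then w = Sx/‖Sx‖ and κ = ‖Sx‖, and the embedding strengths xμ vanish for the patterns lying strictly above the margin:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)

# Patterns shifted away from the origin so that a positive margin exists
# (an illustrative setup).
N, p = 10, 25
S = (rng.standard_normal((N, p)) + 2.0) / np.sqrt(N)

# Minimize ||S x||^2 over the simplex {x >= 0, sum(x) = 1}; the problem is
# a convex QP, so SLSQP finds the global optimum.
res = minimize(lambda x: np.sum((S @ x) ** 2),
               np.full(p, 1.0 / p), method="SLSQP",
               bounds=[(0.0, None)] * p,
               constraints=[{"type": "eq", "fun": lambda x: x.sum() - 1.0}],
               options={"ftol": 1e-10, "maxiter": 1000})
x = res.x
kappa = np.linalg.norm(S @ x)   # maximal margin
w = S @ x / kappa               # optimal weight vector, ||w|| = 1
h = S.T @ w                     # stabilities

# Complementary slackness: encoded patterns (x^mu > 0) sit on the margin,
# and patterns strictly above the margin are not encoded (x^mu = 0).
assert res.success
assert h.min() >= kappa - 1e-3
assert np.allclose(h[x > 1e-2], kappa, atol=1e-2)
assert np.all(x[h > kappa + 0.1] < 1e-3)
```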
Our main result is a simple derivation of a lower bound on the stability margin. This bound becomes tighter as the margin κ increases (Abbott & Kepler, 1989b) and is therefore useful in situations where a large margin is desirable, for instance, in cases where we are interested in increasing the size of the basin of attraction of the fixed points of autoassociative neural networks (Krauth & Mézard, 1987; Forrest, 1988; Gardner & Derrida, 1988; Kepler & Abbott, 1988).
Our analysis up to and including equation 4.1 did not depend on the assumption r = p and is also valid for the cases where p > N (which implies r < p). The derivation of equations 4.3 and 5.1, however, does depend on this assumption. In general, solutions to the perceptron may also exist for the case r < p. Specifically, in the uncorrelated input case, there exists a solution for p < 2N even though the rank is full (r = p) only for p ⩽ N. Our methods, however, cannot provide any general results on this regime, since there is no one-to-one correspondence between the weights w̃ and the stabilities hμ. Analysis of this regime has to rely on tools from statistical mechanics such as the replica and cavity methods (Gardner, 1988; Gerl & Krey, 1994; Tarkowski et al., 1991).
It is our pleasure to thank Stefano Fusi for helpful discussions and comments on the manuscript. We also thank Misha Tsodyks, Ken Miller, Larry Abbott, Srdjan Ostojic, Michael Vidne, and Dan Rubin for useful suggestions on the manuscript. This research was supported by DARPA grant SyNAPSE HR0011-09-C-0002, the Swartz Foundation, and the Gatsby Foundation. O.B. is supported by the Rothschild Fellowship and the Brainpower for Israel foundation. M.R. is supported by Swiss National Science Foundation grant PBSKP3-133357.