We shed light on the discrimination between patterns belonging to two different classes by casting this decoding problem into a generalized prototype framework. The discrimination process is then separated into two stages: a projection stage that reduces the dimensionality of the data by projecting it on a line and a threshold stage where the distributions of the projected patterns of both classes are separated. For this, we extend the popular mean-of-class prototype classification using algorithms from machine learning that satisfy a set of invariance properties. We report a simple yet general approach to express different types of linear classification algorithms in an identical and easy-to-visualize formal framework using generalized prototypes where these prototypes are used to express the normal vector and offset of the hyperplane. We investigate non-margin classifiers such as the classical prototype classifier, the Fisher classifier, and the relevance vector machine. We then study hard and soft margin classifiers such as the support vector machine and a boosted version of the prototype classifier. Subsequently, we relate mean-of-class prototype classification to other classification algorithms by showing that the prototype classifier is a limit of any soft margin classifier and that boosting a prototype classifier yields the support vector machine. While giving novel insights into classification per se by presenting a common and unified formalism, our generalized prototype framework also provides an efficient visualization and a principled comparison of machine learning classification.
Discriminating between signals, or patterns, belonging to two different classes is a widespread decoding problem encountered, for instance, in psychophysics, electrophysiology, and computer vision. In detection experiments, a visual signal is embedded in noise, and a subject has to decide whether a signal is present or absent. The two-alternative forced-choice task is an example of a discrimination experiment where a subject classifies two visual stimuli according to some criterion. In neurophysiology, many decoding studies deal with the discrimination of two stimuli on the basis of the neural response they elicit, in either single neurons or populations of neurons. Furthermore, in many engineering applications such as computer vision, pattern recognition and classification (Duda, Hart, & Stork, 2001; Bishop, 2006) are among the most frequently encountered problems. Although these applications come from different fields, they intrinsically deal with a similar problem: the discrimination of high-dimensional patterns belonging to two possibly overlapping classes.
We address this problem by developing a framework—the prototype framework—that decomposes the discrimination task into a data projection, followed by a threshold operation. The projection stage reduces the dimensionality of the space occupied by the patterns to be discriminated by projecting these high-dimensional patterns on a line. The line on which the patterns are projected is unambiguously defined by any two of its points. We propose to find two particular points that have a set of interesting properties and call them prototypes by analogy to the mean-of-class prototypes widely used in cognitive modeling and psychology (Reed, 1972; Rosch, Mervis, Gray, Johnson, & Boyes-Braem, 1976). The projected patterns of both classes then define two possibly overlapping one-dimensional distributions. In the threshold stage, discrimination (or classification) simply amounts to setting a threshold between these distributions, similar to what is done in signal detection theory (Green & Swets, 1966; Wickens, 2002). Linear classifiers differ by their projection axis and their threshold, both of them being explicitly computed in our framework. While dimensionality reduction per se has been extensively studied, using, for instance, principal component analysis (Jolliffe, 2002), locally linear embedding (Roweis & Saul, 2000), non-negative matrix factorization (Lee & Seung, 1999), or neural networks (Hinton & Salakhutdinov, 2006), classification-specific dimensionality reduction as considered in this letter has surprisingly been ignored so far.
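The two-stage scheme can be sketched in a few lines. Here the two points defining the projection line are hypothetical mean-of-class prototypes and the threshold sits midway between their projections; the later sections replace both choices with the ones computed by other algorithms:

```python
import numpy as np

# Projection stage: reduce each pattern to its coordinate along the line
# through two prototypes. Threshold stage: compare that coordinate to a
# scalar threshold. Mean-of-class prototypes and a midpoint threshold are
# placeholder choices; any classifier discussed below supplies its own.
def project_and_threshold(X, y, x_new):
    p_plus = X[y == 1].mean(axis=0)    # prototype of the positive class
    p_minus = X[y == -1].mean(axis=0)  # prototype of the negative class
    w = p_plus - p_minus               # direction of the projection line
    theta = 0.5 * (w @ p_plus + w @ p_minus)  # threshold between the classes
    return 1 if w @ x_new > theta else -1
```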
As mentioned above, the data encountered in most applications are high-dimensional and abstract, and both classes of exemplars are not always well separable. Machine learning is ideally suited to deal with such classification problems by providing a range of sophisticated classification algorithms (Vapnik, 2000; Duda et al., 2001; Schölkopf & Smola, 2002; Bishop, 2006). However, these more complex algorithms are sometimes hard to interpret and visualize and do not provide good intuition as to the nature of the solution. Furthermore, in the absence of a rigorous framework, it is hard to compare and contrast these classification methods with one another. This letter introduces a framework that puts different machine learning classifiers on the same footing—namely, that of prototype classification. Although classification is still done according to the closest prototype, these prototypes are computed using more sophisticated and more principled algorithms than simply averaging the examples in each class as for the mean-of-class prototype classifier.
We first present properties that linear classifiers, also referred to as hyperplane classifiers, must satisfy in order to be invariant to a set of transformations. We show that a linear classifier with such invariance properties can be interpreted as a generalized prototype classifier where the prototypes define the normal vector and offset of the hyperplane. We then apply the generalized prototype framework to three classes of classifiers: non-margin classifiers (the classical mean-of-class prototype classifier, the Fisher classifier, and the relevance vector machine), hard margin classifiers (the support vector machine and a novel classifier—the boosted prototype classifier), and soft margin classifiers (obtained by applying a regularized preprocessing to the data, and then classifying these data using hard margin classifiers). Subsequently we show that the prototype classifier is a limit of any soft margin classifier and that boosting a prototype classifier yields the support vector machine. Numerical simulations on a two-dimensional toy data set allow us to visualize the prototypes for the different classifiers, and finally the responses of a population of artificial neurons to two stimuli are decoded using our prototype framework.
2. Invariant Linear Classifiers
In this section, we define several requirements that a general linear classifier—a hyperplane classifier—should satisfy in terms of invariances. For example, the algorithm should not depend on the choice of a coordinate system for the space in which the data are represented. These natural requirements yield nontrivial properties of the linear classifier that we present below.
Put less formally, an algorithm is invariant with respect to a transformation if the decision function it produces does not change when the transformation is applied to all data to be classified by that decision function. We conjecture that a “reasonable” classifier should be invariant to the following transformations:
Unitary transformation. This is a rotation or symmetry, that is, a transformation that leaves inner products unchanged. Indeed if U is a unitary matrix, (Ux)t(Uy) = xty. This transformation affects the coordinate representation of the data but should not affect the decision function.
Translation. This corresponds to a change of origin. Such a transformation u changes the inner products (x + u)t(y + u) = xty + (x + y)tu + utu but should not affect the decision function.
Permutation of the inputs. This is a reordering of the data. Any learning algorithm should in general be invariant to permutation of the inputs.
Label inversion. In the absence of information on the classes, it is reasonable to assume that the positive and negative classes have an equivalent role, so that changing the signs of the data should simply change the sign of the decision function.
Scaling. This corresponds to a dilation or a retraction of the space. It should also not affect the decision function since in general, the scale comes from an arbitrary choice of units in the measured quantities.
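For the simplest instance, the mean-of-class prototype classifier, these invariances can be checked numerically. This is a sanity check on arbitrary random data, not a proof:

```python
import numpy as np

# Decision function of the mean-of-class prototype classifier.
def prototype_decision(X, y, x):
    p_plus, p_minus = X[y == 1].mean(0), X[y == -1].mean(0)
    w = p_plus - p_minus
    b = 0.5 * (p_minus @ p_minus - p_plus @ p_plus)
    return np.sign(w @ x + b)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = np.array([1] * 10 + [-1] * 10)
x = rng.normal(size=2)

theta = 0.7
U = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])  # unitary: a rotation
v = np.array([3.0, -1.5])                         # translation
c = 2.5                                           # scaling

d0 = prototype_decision(X, y, x)
d_rot = prototype_decision(X @ U.T, y, U @ x)    # rotate data and test point
d_shift = prototype_decision(X + v, y, x + v)    # translate data and test point
d_scale = prototype_decision(c * X, y, c * x)    # rescale data and test point
```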
The normal vector of the separating hyperplane (SH) is then expressed as w = ∑iyiαixi. For a classifier satisfying the assumptions of proposition 1, we call the representation of equation 2.2 the canonical representation. In the next proposition (see appendix B for the proof), we fix the classification algorithm and vary the data, as, for example, when extending an algorithm from hard to soft margins (see section 6):
Consider a linear classifier that is invariant with regard to unitary transformations, translations, input permutations, label inversions, and scaling. Assume that the coefficients αi of the canonical representation in equation 2.2 are continuous at K = I (where K is the matrix of inner products between input patterns and I the identity matrix). If the constant δij/C is added to the inner products, then, as C → 0, for any data set, the decision function returned by the algorithm will converge to the one defined by αi = 1/n±.
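This limit can be observed numerically. As a stand-in for a classifier whose canonical coefficients are continuous at K = I, we use a dual least-squares classifier with an offset; this particular algorithm is our choice for illustration only, since the proposition covers any classifier with the stated invariances and continuity:

```python
import numpy as np

# Add delta_ij / C to the inner products (K -> K + I/C) and watch the
# per-class-normalized dual coefficients approach 1/n_+ and 1/n_- as C -> 0.
rng = np.random.default_rng(1)
X = rng.normal(size=(7, 3))
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0, -1.0])  # n+ = 3, n- = 4
n = len(y)

def dual_ls_alpha(K, y):
    # Least-squares fit in the dual with offset b:
    #   sum_i alpha_i = 0   and   b + (K alpha)_i = y_i for all i.
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = K
    return np.linalg.solve(A, np.concatenate([[0.0], y]))[1:]

C = 1e-6
alpha = dual_ls_alpha(X @ X.T + np.eye(n) / C, y)
# Canonical normalization: the weights of each class sum to one.
alpha_pos = alpha[y == 1] / alpha[y == 1].sum()
alpha_neg = alpha[y == -1] / alpha[y == -1].sum()
```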
3. On the Universality of Prototype Classification
In the previous section we showed that a linear classifier with invariance to a set of natural transformations has some interesting properties. We here show that linear classifiers satisfying these properties can be represented in a generic form, our so-called prototype framework.
From definition 2, we see that g(x) can be written as g(x) = sign(wtx + b) with w = p+ − p− and b = (‖p−‖² − ‖p+‖²)/2. Using proposition 1, we get the following proposition:
Clearly we have w = p+ − p− = ∑iyiαixi. In the next three sections, we explicitly compute the parameters αi and b of some of the most common hyperplane classifiers that are invariant with respect to the transformations mentioned in section 2 and can thus be cast into the generalized prototype framework. These algorithms belong to three distinct classes: non-margin, hard margin, and soft margin classifiers.
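In code, the passage from the coefficients αi to prototypes, normal vector, and offset reads as follows. The offset is written here for a zero shift S; each algorithm below supplies its own αi and shift:

```python
import numpy as np

# Given per-pattern weights alpha_i that sum to one within each class,
# build the generalized prototypes and the corresponding hyperplane.
def generalized_prototypes(X, y, alpha):
    p_plus = (alpha[y == 1, None] * X[y == 1]).sum(axis=0)
    p_minus = (alpha[y == -1, None] * X[y == -1]).sum(axis=0)
    w = p_plus - p_minus  # equals sum_i y_i alpha_i x_i
    b = 0.5 * (p_minus @ p_minus - p_plus @ p_plus)  # zero-shift offset
    return p_plus, p_minus, w, b
```

With αi = 1/n± this reduces to the classical mean-of-class prototype classifier.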
4. Non-Margin Classifiers
We consider in this section three common classification algorithms that do not allow a margin interpretation: the mean-of-class prototype classifier that inspired this study; the Fisher classifier, which is commonly used in statistical data analysis; and the relevance vector machine, which is a sparse probabilistic classifier. For convenience, we use the notation γi = yiαi throughout this section.
4.1. Classical Prototype Classifier.
4.2. Fisher Linear Discriminant.
4.3. Relevance Vector Machine.
5. Hard Margin Classifiers
In this section we consider classifiers that base their classification on the concept of a margin stripe between the classes. We consider the state-of-the-art support vector machine and also develop a novel algorithm based on boosting the classical mean-of-class prototype classifier. As presented here, these classifiers need a linearly separable data set (Duda et al., 2001).
5.1. Support Vector Machine.
5.2. Boosted Prototype Classifier.
Generally boosting methods aim at improving the performance of a simple classifier by combining several such classifiers trained on variants of the initial training sample. The principle is to iteratively give more weight to the training examples that are hard to classify, train simple classifiers so that they have a small error on those hard examples (i.e., small weighted error), and then take a weighted vote of the obtained classifiers (Freund & Schapire, 1997). We consider below how to boost the classical mean-of-class prototype classifiers in the context of hard margins. The boosted prototype algorithm that we will develop in this section cannot exactly be cast into our prototype framework since it is still an open problem to determine the invariance properties of the boosted prototype algorithm. However, the boosted prototype classifier is an important example of how the concept of prototype can be extended.
We can now state an iterative algorithm for our boosted prototype classifier. This algorithm is an adaptation of AdaBoost* (Rätsch & Warmuth, 2005), which includes a bias term. A convergence analysis of this algorithm can be found in Rätsch and Warmuth (2005). The patterns have to be normalized to lie in the unit ball (i.e., ∣wtxi ∣ ⩽1 ∀w, i with ‖w‖ = 1). The first iteration of our boosted prototype classifier is the classical mean-of-class prototype classifier. Then, during boosting, the boosted prototype classifier maintains a distribution of weights αi on the input patterns and at each step computes the corresponding weighted prototype. Then the patterns where the classifier makes mistakes have their weight increased, and the procedure is iterated until convergence. This algorithm maintains a set of weights that are separately normalized for each class, yielding the following pseudocode:
Determine the scale factor of the whole data set: s = maxi(‖xi‖).
Scale such that ‖xi‖ ⩽ 1 by applying xi ← xi/s.
Set the accuracy parameter ϵ (e.g., ϵ = 10−2).
Initialize the weights αi = 1/n± and the target margin ρ0 = 1.
Do k = 1, …, kmax; compute:
The weighted prototypes: p± = ∑i∈± αixi.
The normalized weight vector: w = (p+ − p−)/‖p+ − p−‖.
The target margin ρk, updated from the edges of the preceding iterations as in AdaBoost* (Rätsch & Warmuth, 2005).
In the final expression for p±, w, and b, the factor ∑kvk ensures that these quantities are in the convex hull of the data. Moreover, since the data are scaled by s, the bias and the prototypes have to be rescaled according to wtx + b ↔ wt(sx) + sb. In practice, it is important to note that the choice of ϵ must be coupled with the number of iterations of the algorithm.
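The loop can be sketched as follows. The exact update and stopping rules are those of AdaBoost* (Rätsch & Warmuth, 2005); the plain exponential reweighting and fixed iteration count below are our simplifications, kept only to show the scaling, the per-class weight normalization, and the weighted prototypes:

```python
import numpy as np

def boosted_prototype(X, y, n_iter=200, eta=0.5):
    s = np.max(np.linalg.norm(X, axis=1))  # scale factor of the data set
    Xs = X / s                             # now ||x_i|| <= 1
    # First iteration: alpha_i = 1/n_+- gives the mean-of-class prototypes.
    alpha = np.where(y == 1, 1.0 / (y == 1).sum(), 1.0 / (y == -1).sum())
    for _ in range(n_iter):
        p_plus = (alpha[y == 1, None] * Xs[y == 1]).sum(axis=0)
        p_minus = (alpha[y == -1, None] * Xs[y == -1]).sum(axis=0)
        w = p_plus - p_minus
        w = w / np.linalg.norm(w)               # ||w|| = 1
        b = -0.5 * w @ (p_plus + p_minus)       # hyperplane through midpoint
        margins = y * (Xs @ w + b)
        alpha = alpha * np.exp(-eta * margins)  # emphasize small margins
        for c in (1, -1):                       # per-class normalization
            alpha[y == c] /= alpha[y == c].sum()
    return w, s * b  # rescale: w^t x + b <-> w^t (s x) + s b
```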
6. Soft Margin Classifiers
The problem with the hard margin classifiers is that when the data are not linearly separable, these algorithms will not converge at all or will converge to a solution that is not meaningful (the non-margin classifiers are not affected by this problem). We deal with this problem by extending the hard margin classifiers to soft margin classifiers. For this, we apply a form of “regularized” preprocessing to the data, which then become linearly separable. The hard margin classifiers can subsequently be applied on these processed data. Alternatively, in the case of the SVM, we can also rewrite its formulation in order to allow nonlinearly separable data sets.
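The regularized preprocessing amounts to adding δij/C to the inner products, that is, replacing the Gram matrix K by K + I/C. Equivalently, each pattern xi receives its own extra dimension with value 1/√C, which makes the augmented patterns linearly independent and hence the classes linearly separable. A small demonstration (the 1-D patterns and C = 1 are arbitrary):

```python
import numpy as np

def augment(X, C):
    # Append one dedicated dimension of size 1/sqrt(C) per pattern, so that
    # the augmented Gram matrix equals X X^T + I / C.
    return np.hstack([X, np.eye(X.shape[0]) / np.sqrt(C)])

X = np.array([[0.0], [0.1], [0.05]])  # overlapping 1-D patterns: +, -, +
y = np.array([1, -1, 1])              # not linearly separable as given
Xa = augment(X, C=1.0)
K = Xa @ Xa.T

# Even the plain mean-of-class prototype classifier now separates them.
p_plus, p_minus = Xa[y == 1].mean(axis=0), Xa[y == -1].mean(axis=0)
w = p_plus - p_minus
b = 0.5 * (p_minus @ p_minus - p_plus @ p_plus)
margins = y * (Xa @ w + b)
```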
6.1. From Hard to Soft Margins.
6.2. Soft Margin SVM.
7. Relations Between Classifiers
In this section we outline two relations between the prototype classification algorithm and the other classifiers considered in this letter. First, in the limit where C → 0, we show that the soft margin algorithms converge to the classical mean-of-class prototype classifier. Second, we show that the boosted prototype algorithm converges to the SVM solution.
7.1. Prototype Classifier as a Limit of Soft Margin Classifiers.
We deduce the following proposition as a direct consequence of proposition 2:
All soft margin classifiers obtained from linear classifiers whose canonical form is continuous at K = I by the regularized preprocessing of equation 6.1 converge toward the mean-of-class prototype classifier in the limit where C → 0.
7.2. Boosted Prototype Classifier and SVM.
While the analogy between boosting and the SVM has been suggested previously (Skurichina & Duin, 2002), we here establish that the boosting procedure applied on the classical prototype classifier yields the hard margin SVM as a solution when appropriate update rules are chosen:
The solution of the problem in equation 5.5 when ϵ → 0 is the same as the solution of the hard margin SVM.
In other words, in the context of hard margins, boosting a mean-of-class prototype learner is equivalent to an SVM. It is then straightforward to extend this result to the soft margin case using the regularized preprocessing of equation 6.1. Thus, without restrictions, the SVM is the asymptotic solution of a boosting scheme applied on mean-of-class prototype classifiers. The above developments also allow us to state the following:
This is a consequence of the proof of proposition 5. Indeed, the vector w achieving the maximum in equation 7.1 is given by w = ∑iyiαixi/‖∑iyiαixi‖, which shows that w is proportional to p+ − p−. The choice of b is arbitrary since one has ∑iαiyi = 0, so that there exists a choice of b such that the corresponding function h is the same as the prototype function based on p+ and p−.
8. Numerical Experiments
In the numerical experiments of this section, we first illustrate and visualize our prototype framework on a linearly separable two-dimensional toy data set. Second, we apply the prototype framework to discriminate between two overlapping classes (nonlinearly separable data set) of responses from a population of artificial neurons.
8.1. Two-Dimensional Toy Data Set.
In order to visualize our findings, we consider in Figure 1 a two-dimensional linearly separable toy data set where the examples of each class were generated by the superposition of three gaussian distributions with different means and different covariance matrices. We compute the prototypes and the SHs for the classical mean-of-class prototype classifier, the Fisher linear discriminant (FLD), the relevance vector machine (RVM), and the hard margin support vector machine (SVM HM). We also study the trajectories taken by the “dynamic” prototypes when using our boosted prototype classifier and when varying the soft margin regularization parameter for the soft margin SVM (SVM SM). We can immediately see that the prototype framework introduced in this letter allows one to visualize and distinguish at a glance the different classification algorithms and strategies. While the RVM algorithm per se does not allow an intuitive geometric interpretation as, for instance, the SVM (whose margin SVs lie on the margin stripe) or the classical mean-of-class prototype classifier do, its prototypes provide an intuitive and visual interpretation of sparse Bayesian learning. The different classifiers yield different SHs and consequently also different sets of prototypes. As predicted by the theory, the classical prototype classifier and the SVM HM have no shift in the decision function (S = 0), indicating that the SH passes through the midpoint between the prototypes. This shift is largest for the RVM, reflecting the fact that one of the prototypes is close to the center of mass of the entire data set. This is due to the fact that the RVM algorithm usually yields a very sparse representation of the γi. In our example, a single γi, which corresponds to the prototype close to the center of one of the classes, strongly dominates this distribution, such that the other prototype is bound to be close to the mean across both classes (the center of the entire data set).
The prototypes of the SVM HM are close to the SH, which is due to the fact that they are computed using only the SVs corresponding to exemplars lying on the margin stripe. When considering the trajectories of the “dynamic” prototypes for the boosted prototype and the soft margin SVM classifiers, both algorithms start close to the classical mean-of-class prototype classifier and converge to the hard margin SVM classifier. We further study the dynamics associated with these trajectories in Figure 2. The prototypes and the corresponding SH have a similar behavior in all cases. As predicted theoretically, the first iteration of boosting is identical to the classical prototype classifier. However, while the iterations proceed, the boosted prototypes get farther apart from the classical ones and finally converge as expected toward the prototypes of the hard margin SVM solution. Similarly, when C → 0, the soft margin SVM converges to the solution of the classical prototype classifier, while for C → ∞, the soft margin SVM converges to the hard margin SVM.
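A toy data set in this spirit is easy to regenerate; the gaussian means and scales below are arbitrary stand-ins for those of Figure 1. The snippet also checks the identity behind the visualization: the hyperplane rule with w = p+ − p− and b = (‖p−‖² − ‖p+‖²)/2 labels each pattern by its nearest prototype:

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_class(means, n_per):
    # Superposition of gaussians with (here, identical) isotropic covariances.
    return np.vstack([np.array(m) + rng.normal(scale=0.3, size=(n_per, 2))
                      for m in means])

X_pos = sample_class([(2.0, 0.0), (2.5, 1.0), (3.0, -1.0)], 30)
X_neg = sample_class([(-2.0, 0.0), (-2.5, 1.0), (-3.0, -1.0)], 30)
p_plus, p_minus = X_pos.mean(axis=0), X_neg.mean(axis=0)
w = p_plus - p_minus
b = 0.5 * (p_minus @ p_minus - p_plus @ p_plus)

X = np.vstack([X_pos, X_neg])
by_hyperplane = np.sign(X @ w + b)
by_nearest = np.where(np.linalg.norm(X - p_plus, axis=1)
                      < np.linalg.norm(X - p_minus, axis=1), 1.0, -1.0)
```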
8.2. Population of Artificial Neurons.
To test our prototype framework on more realistic data, we decode the responses from a population of six independent artificial neurons. The responses of the neurons are assumed to have a gaussian noise distribution around their mean response, the variance being proportional to the mean. We use our prototype framework to discriminate between two stimuli using the population activity they elicit. This data set is not linearly separable, and the pattern distributions corresponding to both classes may overlap. We thus consider the soft margin preprocessing for the SVM and the boosted prototype classifier. We first find the value of C minimizing the training error of the SVM SM and then use this value to compute the soft margin SVM and the boosted prototype classifiers. As expected from the hard margin case, we find in Figure 3 that the boosted prototype algorithm starts as a classical mean-of-class prototype classifier, and converges toward the soft margin SVM. In order to visualize the discrimination process, we project the neural responses onto the axis defined by the prototypes (i.e., the normal vector w of the SH). Equivalently, we compute the distributions of the distances of the neural responses to the SH. Figure 4 shows these distance distributions for the classical prototype classifier, the FLD, the RVM, the soft margin SVM, and the boosted prototype classifier. The projected prototypes have locations similar to what we observed for the toy data set for the prototype classifier and the FLD. For the SVM, they can be even closer to the SH (δ = 0) since they depend only on the SVs, which may here also include exemplars inside the margin stripe (and not only on the margin stripe as for the hard margin SVM). For the RVM, however, the harder classification task (high-dimensional and nonlinearly separable data set) yields a less sparse distribution of the γi than for the toy data set.
This is reflected by the fact that none of its prototypes lies in the vicinity of the mean over the whole data set (δ = 0). As already suggested in Figure 3, we can clearly observe how the boosted prototypes evolve from the prototypes of the classical mean-of-class prototype classifier to converge toward the prototypes of the soft margin SVM. Most important, the distance distributions allow us to compare our prototype framework directly with signal detection theory (Green & Swets, 1966; Wickens, 2002). Although the neural response distributions were constructed using gaussian distributions, we see that the distance distributions are clearly not gaussian. This makes standard analyses such as the receiver operating characteristic not directly applicable in our case. However, the different algorithms from machine learning provide a family of thresholds that can be used for discrimination, independent of the shape of the distributions. Furthermore, the distance distributions depend on the classifier used to compute the SH. This example illustrates one of the novelties of our prototype framework: a classifier-specific dimensionality reduction. In other words, we here visualize the space the classifiers use to discriminate: the cut through the data space provided by the axis spanned by the prototypes. As a consequence, the amount of overlap between the distance distributions differs across classifiers. Furthermore, the shape of these distributions varies: the SVM tends to cut the data such that many exemplars lie close to the SH, while for the classical prototype classifier, the distance distributions of the same data are more centered around the means of each class. The boosted prototype classifier gives us insight into how the distance distribution of the mean-of-class prototype classifier evolves iteratively into the distance distribution of the soft margin SVM.
This illustrates how the different projection axes are nontrivially related to generate distinct class-specific distance distributions.
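The decoding setup can be reproduced in outline as follows. The tuning values and the proportionality constant are hypothetical; only the structure (six neurons, gaussian noise with variance proportional to the mean, projection onto the prototype axis) follows the text:

```python
import numpy as np

rng = np.random.default_rng(3)
mean_a = np.array([8.0, 3.0, 5.0, 1.0, 6.0, 2.0])  # mean rates, stimulus A
mean_b = np.array([2.0, 6.0, 4.0, 5.0, 1.0, 7.0])  # mean rates, stimulus B
k = 0.5                                            # variance = k * mean

def population_responses(mean, n_trials):
    return rng.normal(mean, np.sqrt(k * mean), size=(n_trials, mean.size))

R_a = population_responses(mean_a, 500)
R_b = population_responses(mean_b, 500)

# Mean-of-class prototype axis; any classifier of the text could supply w, b.
p_plus, p_minus = R_a.mean(axis=0), R_b.mean(axis=0)
w = p_plus - p_minus
b = 0.5 * (p_minus @ p_minus - p_plus @ p_plus)

# Signed distances to the separating hyperplane: the 1-D distributions
# whose overlap and shape are compared across classifiers in Figure 4.
wn = np.linalg.norm(w)
dist_a = (R_a @ w + b) / wn
dist_b = (R_b @ w + b) / wn
```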
We introduced a novel classification framework—the prototype framework—inspired by the mean-of-class prototype classifier. While the algorithm itself is left unchanged (up to a shift in the offset of the decision function), we computed the generalized prototypes using methods from machine learning. We showed that any linear classifier with invariances to unitary transformations, translations, input permutations, label inversions, and scaling can be interpreted as a generalized prototype classifier. We introduced a general method to cast such a linear algorithm into the prototype framework. We then illustrated our framework using some algorithms from machine learning such as the Fisher linear discriminant, the relevance vector machine (RVM), and the support vector machine (SVM). In particular, we obtained through the prototype framework a visualization and a geometrical interpretation for the hard-to-visualize RVM. While the vast majority of algorithms encountered in machine learning satisfy our invariance properties, the main class of algorithms ruled out is that of online algorithms, such as the perceptron, since they depend on the order of presentation of the input patterns.
We demonstrated that the SVM and the mean-of-class prototype classifier, despite their very different foundations, could be linked: the boosted prototype classifier converges asymptotically toward the SVM classifier. As a result, we also obtained a simple iterative algorithm for SVM classification. Also, we showed that boosting could be used to provide multiple optimized examples in the context of prototype learning according to the general principle of divide and conquer. The family of optimized prototypes was generated from an update rule refining the prototypes by iterative learning. Furthermore, we showed that the mean-of-class prototype classifier is a limit of the soft margin algorithms from learning theory when C → 0. In summary, both boosting and soft margin classification yield novel sets of “dynamic” prototype paths: through time (the boosting iterations) and through the soft margin trade-off parameter C, respectively. These prototype paths can be seen as an alternative to the “chorus of prototypes” approach (Edelman, 1999).
We considered classification of two classes of inputs, or equivalently, we discriminated between two classes given the responses corresponding to each one. However, when faced with an estimation problem, we need to choose one class among multiple classes. For this, we can readily extend our prototype framework by considering a one-versus-the-rest strategy (Duda et al., 2001; Vapnik, 2000). The prototype of each class is then computed by discriminating this class against all the remaining ones. Repeating this procedure for all the classes yields an ensemble of prototypes—one for each class. These prototypes can then be used for multiple class classification, or estimation, using again the nearest-neighbor rule.
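A minimal sketch of this one-versus-the-rest extension, using mean-of-class prototypes for brevity (the prototype of each class would in general come from discriminating that class against all the others with any of the classifiers above):

```python
import numpy as np

def one_vs_rest_prototypes(X, labels):
    # One prototype per class; here the class mean simply stands in for the
    # generalized prototype a binary classifier would return.
    return {c: X[labels == c].mean(axis=0) for c in np.unique(labels)}

def estimate(prototypes, x):
    # Estimation: pick the class of the nearest prototype.
    return min(prototypes, key=lambda c: np.linalg.norm(x - prototypes[c]))
```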
Our prototype framework can be interpreted as a two-stage learning scheme. First, from a learning perspective, it can be seen as a complicated and time-consuming training stage that computes the prototypes. This stage is followed by a very simple and fast nearest-prototype testing stage for classification of new patterns. Such a scheme can account for a slow training phase followed by a fast testing phase. Although it is beyond the scope of this letter, such a behavior may be argued to be biologically plausible. Once the prototypes are computed, the simplicity of the decision function is certainly one advantage of the prototype framework. This letter shows that it is possible to include sophisticated algorithms from machine learning such as the SVM or the RVM into the rather simple and easy-to-visualize prototype formalism. Our framework then provides an ideal method for directly comparing different classification algorithms and strategies, which could certainly be of interest in many psychophysical and neurophysiological decoding experiments.
Appendix A: Proof of Proposition 1
We work out the implications for a linear classifier to be invariant with respect to the transformations mentioned in section 2.
Invariance with regard to scaling means that the pairs (w1, b1) and (w2, b2) correspond to the same decision function, that is, sign(wt1x + b1) = sign(wt2x + b2) for all x, if and only if there exists some α > 0 such that w1 = αw2 and b1 = αb2.
We denote by (wX, bX) the parameters of the hyperplane obtained when trained on data X. We show below that invariance to unitary transformations implies that the normal vector to the decision surface wX lies in the span of the data. This is remarkable since it allows a dual representation and it is a general form of the representer theorem (see also Kivinen, Warmuth, & Auer, 1997).
If A is invariant by application of any unitary transform U, then there exists γ such that wX = Xγ is in the span of the input data, and bX = bUX depends only on the inner products between the patterns of X and on the labels.
Now we introduce two specific unitary transformations. The first, U, performs a rotation of angle π along an axis contained in the span of the data, and the second, U′, performs a symmetry with respect to a hyperplane containing this axis and v. Both transformations have the same effect on the data. However, they have the opposite effect on the vector v. This means that in order to guarantee invariance, we need to have v = 0, which shows that w is in the span of the data: wX = Xγ.
Next, we show that in addition to the unitary invariance, invariance with respect to translations (change of origin) implies that the coefficients of the dual expansion of wX sum to zero.
If A is invariant by unitary transforms U and by translations, then there exists u such that wX = Xu and uti = 0, where i denotes a column vector of size n whose entries are all 1. Moreover, for any translation v, we also have bX+vit = bX − wtXv, so that the decision function is unchanged.
Invariance with respect to label inversion means the γi are proportional to yi; but then the αi are not affected by an inversion of labels, which means that they depend only on the products yiyj (which indicate the differences in label).
Invariance with respect to input permutation means that in the case where xtixj = δij, since the patterns are indistinguishable, so are the αi. Hence, the αi corresponding to training examples that have the same label should have the same value, and from the other constraints, we immediately deduce that αi = 1/n±. This finally proves proposition 1.
Appendix B: Proof of Proposition 2
Notice that adding δij/C to the inner products means replacing K by K + I/C. The result follows from the continuity and from the invariance by scaling, which means that we can as well use I + CK, which converges to I, when C → 0, and for I, the obtained αi were computed in proposition 1.
We thank E. Simoncelli, G. Cottrell, M. Jazayeri, and C. Rudin for helpful comments on the manuscript. A.B.A.G was supported by a grant from the European Union (IST 2000-29375 COGVIS) and by an NIH training grant in Computational Visual Neuroscience (EYO7158).
Olivier Bousquet is now at Google in Zürich, Switzerland. Gunnar Rätsch is now at the Friedrich Miescher Laboratory of the Max Planck Society in Tübingen, Germany.