## Abstract

We shed light on the discrimination between patterns belonging to two different classes by casting this decoding problem into a generalized prototype framework. The discrimination process is then separated into two stages: a projection stage that reduces the dimensionality of the data by projecting it on a line and a threshold stage where the distributions of the projected patterns of both classes are separated. For this, we extend the popular mean-of-class prototype classification using algorithms from machine learning that satisfy a set of invariance properties. We report a simple yet general approach to express different types of linear classification algorithms in an identical and easy-to-visualize formal framework using generalized prototypes where these prototypes are used to express the normal vector and offset of the hyperplane. We investigate non-margin classifiers such as the classical prototype classifier, the Fisher classifier, and the relevance vector machine. We then study hard and soft margin classifiers such as the support vector machine and a boosted version of the prototype classifier. Subsequently, we relate mean-of-class prototype classification to other classification algorithms by showing that the prototype classifier is a limit of any soft margin classifier and that boosting a prototype classifier yields the support vector machine. While giving novel insights into classification per se by presenting a common and unified formalism, our generalized prototype framework also provides an efficient visualization and a principled comparison of machine learning classification.

## 1. Introduction

Discriminating between signals, or patterns, belonging to two different classes is a widespread decoding problem encountered, for instance, in psychophysics, electrophysiology, and computer vision. In detection experiments, a visual signal is embedded in noise, and a subject has to decide whether a signal is present or absent. The two-alternative forced-choice task is an example of a discrimination experiment where a subject classifies two visual stimuli according to some criterion. In neurophysiology, many decoding studies deal with the discrimination of two stimuli on the basis of the neural response they elicit, in either single neurons or populations of neurons. Furthermore, in many engineering applications such as computer vision, pattern recognition and classification (Duda, Hart, & Stork, 2001; Bishop, 2006) are among the most frequently encountered problems. Although these applications come from different fields, they intrinsically deal with a similar problem: the discrimination of high-dimensional patterns belonging to two possibly overlapping classes.

We address this problem by developing a framework—the prototype framework—that decomposes the discrimination task into a data projection, followed by a threshold operation. The projection stage reduces the dimensionality of the space occupied by the patterns to be discriminated by projecting these high-dimensional patterns on a line. The line on which the patterns are projected is unambiguously defined by any two of its points. We propose to find two particular points that have a set of interesting properties and call them *prototypes* by analogy to the mean-of-class prototypes widely used in cognitive modeling and psychology (Reed, 1972; Rosch, Mervis, Gray, Johnson, & Boyes-Braem, 1976). The projected patterns of both classes then define two possibly overlapping one-dimensional distributions. In the threshold stage, discrimination (or classification) simply amounts to setting a threshold between these distributions, similar to what is done in signal detection theory (Green & Swets, 1966; Wickens, 2002). Linear classifiers differ by their projection axis and their threshold, both of them being explicitly computed in our framework. While dimensionality reduction per se has been extensively studied, using, for instance, principal component analysis (Jolliffe, 2002), locally linear embedding (Roweis & Saul, 2000), non-negative matrix factorization (Lee & Seung, 1999), or neural networks (Hinton & Salakhutdinov, 2006), classification-specific dimensionality reduction as considered in this letter has surprisingly been ignored so far.

As mentioned above, the data encountered in most applications are high-dimensional and abstract, and both classes of exemplars are not always well separable. Machine learning is ideally suited to deal with such classification problems by providing a range of sophisticated classification algorithms (Vapnik, 2000; Duda et al., 2001; Schölkopf & Smola, 2002; Bishop, 2006). However, these more complex algorithms are sometimes hard to interpret and visualize and do not provide good intuition as to the nature of the solution. Furthermore, in the absence of a rigorous framework, it is hard to compare and contrast these classification methods with one another. This letter introduces a framework that puts different machine learning classifiers on the same footing—namely, that of prototype classification. Although classification is still done according to the closest prototype, these prototypes are computed using more sophisticated and more principled algorithms than simply averaging the examples in each class as for the mean-of-class prototype classifier.

We first present properties that linear classifiers, also referred to as hyperplane classifiers, must satisfy in order to be invariant to a set of transformations. We show that a linear classifier with such invariance properties can be interpreted as a generalized prototype classifier where the prototypes define the normal vector and offset of the hyperplane. We then apply the generalized prototype framework to three classes of classifiers: non-margin classifiers (the classical mean-of-class prototype classifier, the Fisher classifier, and the relevance vector machine), hard margin classifiers (the support vector machine and a novel classifier—the boosted prototype classifier), and soft margin classifiers (obtained by applying a regularized preprocessing to the data, and then classifying these data using hard margin classifiers). Subsequently we show that the prototype classifier is a limit of any soft margin classifier and that boosting a prototype classifier yields the support vector machine. Numerical simulations on a two-dimensional toy data set allow us to visualize the prototypes for the different classifiers, and finally the responses of a population of artificial neurons to two stimuli are decoded using our prototype framework.

## 2. Invariant Linear Classifiers

In this section, we define several requirements that a general linear classifier—a hyperplane classifier—should satisfy in terms of invariances. For example, the algorithm should not depend on the choice of a coordinate system for the space in which the data are represented. These natural requirements yield nontrivial properties of the linear classifier that we present below.

We consider a data set of *n* examples. We denote by *x*_{1}, …, *x*_{n} the input patterns (in finite-dimensional spaces, these are represented as column vectors), elements of an inner product space, and by *y*_{1}, …, *y*_{n} their labels in {−1, 1}, where the labels define the two classes, denoted ±, of sizes *n*_{±}. Let *y* be the vector of labels and **X** denote the set of input vectors; in finite-dimensional spaces, **X** = {*x*_{i}}^{n}_{i=1} is represented as a matrix whose columns are the *x*_{i}. A classification algorithm *A* takes as input a data set and outputs a function whose sign is the predicted class. We are interested in specific algorithms, typically called *linear classifiers*, that produce a signed affine decision function *g*(*x*) = sign(*f*(*x*)) with *f*(*x*) = *w*^{t}*x* + *b*, where *w*^{t}*x* stands for the inner product and the sign function takes values sign(*z*) = −1, 0, 1 according to whether *z* < 0, *z* = 0, or *z* > 0, respectively. For such classifiers, the set of patterns *x* such that *g*(*x*) = 0 is a hyperplane called the *separating hyperplane* (SH), which is defined by its normal vector *w* (sometimes also referred to as the weight vector) and its offset *b*. A pattern *x* belongs to either side of the SH according to the sign *g*(*x*) (a pattern on the SH does not get assigned to any class). The function *f*(*x*) is proportional to the signed distance of the example to the separating hyperplane. Since the input space is a (subset of a) vector space, we can consider that the data set is composed of a matrix **X** and a vector *y*. We can now formulate the notion of invariance of the classifiers we consider.

Put less formally, an algorithm is invariant with respect to a transformation if the produced decision function does not change when the transformation is applied to all data to be classified. We conjecture that a “reasonable” classifier should be invariant to the following transformations:

- **Unitary transformation.** This is a rotation or symmetry, that is, a transformation that leaves inner products unchanged: if **U** is a unitary matrix, (**U***x*)^{t}(**U***y*) = *x*^{t}*y*. This transformation affects the coordinate representation of the data but should not affect the decision function.
- **Translation.** This corresponds to a change of origin. Such a transformation *u* changes the inner products, (*x* + *u*)^{t}(*y* + *u*) = *x*^{t}*y* + (*x* + *y*)^{t}*u* + *u*^{t}*u*, but should not affect the decision function.
- **Permutation of the inputs.** This is a reordering of the data. Any learning algorithm should in general be invariant to a permutation of the inputs.
- **Label inversion.** In the absence of information on the classes, it is reasonable to assume that the positive and negative classes play an equivalent role, so that changing the signs of the labels should simply change the sign of the decision function.
- **Scaling.** This corresponds to a dilation or a contraction of the space. It should not affect the decision function either since, in general, the scale comes from an arbitrary choice of units in the measured quantities.
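These invariances can be checked numerically on the simplest instance treated later, the mean-of-class prototype classifier. The following sketch (Python/NumPy; the toy data and all names are ours) applies a rotation, a translation, and a scaling to the training and test patterns and verifies that the predicted classes are unchanged, and that inverting the labels only flips the sign of the decision function:

```python
import numpy as np

def prototype_predict(X_train, y_train, X_test):
    """Mean-of-class prototype classifier: assign each test pattern to the
    class whose mean (prototype) is closest."""
    p_pos = X_train[y_train == 1].mean(axis=0)
    p_neg = X_train[y_train == -1].mean(axis=0)
    # sign of ||x - p_-||^2 - ||x - p_+||^2
    return np.sign(np.sum((X_test - p_neg) ** 2, axis=1)
                   - np.sum((X_test - p_pos) ** 2, axis=1))

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2, 1, (20, 2)), rng.normal(-2, 1, (20, 2))])
y = np.hstack([np.ones(20), -np.ones(20)])
X_new = rng.normal(0, 3, (10, 2))
base = prototype_predict(X, y, X_new)

theta = 0.7
U = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])      # unitary (rotation)
u = np.array([5.0, -3.0])                            # translation
s = 2.5                                              # scaling
transform = lambda Z: s * (Z @ U.T + u)

# unitary transformation + translation + scaling leave the decisions unchanged
assert np.array_equal(base, prototype_predict(transform(X), y, transform(X_new)))
# label inversion flips the sign of the decision function
assert np.array_equal(-base, prototype_predict(X, -y, X_new))
```

Permutation invariance can be checked the same way by shuffling the rows of the training set together with the labels.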

*A linear classifier that is invariant with regard to unitary transformations, translations, input permutations, label inversions, and scaling produces a decision function g that can be written as*

*g*(*x*) = sign(∑_{i} *y*_{i}α_{i}*x*^{t}_{i}*x* + *b*),   (2.2)

*where the α _{i} depend only on the relative values of the inner products and the differences in labels, and b depends only on the inner products and the labels. Furthermore, in the case where x^{t}_{i}x_{j} = λδ_{ij} for some λ > 0, we have α_{i} = 1/n_{±}.*

The normal vector of the SH is then expressed as *w* = ∑_{i} *y*_{i}α_{i}*x*_{i}. For a classifier satisfying the assumptions of proposition 1, we call the representation of equation 2.2 the *canonical* representation. In the next proposition (see appendix B for the proof), we fix the classification algorithm and vary the data, as, for example, when extending an algorithm from hard to soft margins (see section 6):

*Consider a linear classifier that is invariant with regard to unitary transformations, translations, input permutations, label inversions, and scaling. Assume that the coefficients α _{i} of the canonical representation in equation 2.2 are continuous at* **K** = **I** *(where **K** is the matrix of inner products between input patterns and **I** the identity matrix). If the constant δ_{ij}/C is added to the inner products, then, as C → 0, for any data set, the decision function returned by the algorithm will converge to the one defined by α_{i} = 1/n_{±}.*

Note that the normalization ∑_{i} ∣α_{i}∣ = 2 can be enforced by rescaling the coefficients α_{i}. Furthermore, most algorithms are usually rotation invariant. However, they depend on the choice of the origin and are thus not a priori translation invariant, and in the most general case, the dual algorithm may not satisfy the condition ∑_{i} *y*_{i}α_{i} = 0. One way to ensure that the coefficients returned by the algorithm do satisfy this condition directly is to center the data, the prime denoting a centered parameter: *x*′ = *x* − *c*, where *c* denotes the centering offset. Setting γ_{i} = α_{i}*y*_{i}, we can write the normal vector in terms of the centered patterns as *w*′ = ∑_{i} γ′_{i}*x*′_{i}, and we then clearly have ∑_{i} γ′_{i} = 0. The equation of the SH on the original data is then *w*^{t}*x* + *b*′ − *w*^{t}*c* = 0, since we have 0 = (*w*′)^{t}*x*′ + *b*′ = *w*^{t}(*x* − *c*) + *b*′ = *w*^{t}*x* + *b*′ − *w*^{t}*c*. Because of the translation invariance, centering the data does not change the decision function.

## 3. On the Universality of Prototype Classification

In the previous section we showed that a linear classifier with invariance to a set of natural transformations has some interesting properties. We here show that linear classifiers satisfying these properties can be represented in a generic form, our so-called *prototype framework*.

Denoting by *p*_{±} the prototypes, we can write the decision function of the classical prototype algorithm as

*g*(*x*) = sign(‖*x* − *p*_{−}‖^{2} − ‖*x* − *p*_{+}‖^{2}).

This is a linear classifier since it can be written as *g*(*x*) = sign(*w*^{t}*x* + *b*) with

*w* = *p*_{+} − *p*_{−} and *b* = (‖*p*_{−}‖^{2} − ‖*p*_{+}‖^{2})/2.

In other words, once the prototypes are known, the SH passes through their average (*p*_{+} + *p*_{−})/2 and is perpendicular to the line joining them. The prototype classification algorithm is arguably simple, and also intuitive since it has an easy geometric interpretation. We now introduce a generalized notion of prototype classifier, where a shift is allowed in the decision function.

From definition 2, we see that *g*(*x*) can be written as *g*(*x*) = sign(*w*^{t}*x* + *b*) with *w* = *p*_{+} − *p*_{−} and *b* = (‖*p*_{−}‖^{2} − ‖*p*_{+}‖^{2})/2 + *S*. Using proposition 1, we get the following proposition:

*Any linear classifier that is invariant with respect to unitary transformations, translations, input permutations, label inversions, and scaling is a generalized prototype classifier. Moreover, if the classifier is given in canonical form by α_{i} and b, then the prototypes are given by p_{±} = ∑_{i∈±} α_{i}x_{i}, and the shift is given by S = b − (‖p_{−}‖^{2} − ‖p_{+}‖^{2})/2.*

Clearly we have *w* = *p*_{+} − *p*_{−} = ∑_{i} *y*_{i}α_{i}*x*_{i}. In the next three sections, we explicitly compute the parameters α_{i} and *b* of some of the most common hyperplane classifiers that are invariant with respect to the transformations mentioned in section 2 and can thus be cast into the generalized prototype framework. These algorithms belong to three distinct classes: non-margin, hard margin, and soft margin classifiers.
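These relations can be verified numerically. The following sketch (NumPy; the coefficients α_{i} and the offset *b* are arbitrary illustrative values, and the shift convention *S* = *b* − (‖*p*_{−}‖² − ‖*p*_{+}‖²)/2 is the one stated above) checks that the generalized prototype rule and the hyperplane rule agree:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 12, 3
X = rng.normal(size=(n, d))
y = np.array([1] * 6 + [-1] * 6)
alpha = rng.uniform(0.1, 1.0, n)      # illustrative canonical coefficients
b = 0.3                               # illustrative offset

# Generalized prototypes: weighted combinations of each class's patterns
p_pos = (alpha[y == 1][:, None] * X[y == 1]).sum(axis=0)
p_neg = (alpha[y == -1][:, None] * X[y == -1]).sum(axis=0)

w = p_pos - p_neg                     # normal vector of the SH
assert np.allclose(w, ((y * alpha)[:, None] * X).sum(axis=0))

# Shift of the SH away from the midpoint of the prototypes
S = b - (p_neg @ p_neg - p_pos @ p_pos) / 2

# Hyperplane rule and generalized prototype rule agree on arbitrary patterns
for x in rng.normal(size=(5, d)):
    lhs = np.sign(w @ x + b)
    rhs = np.sign(np.sum((x - p_neg) ** 2) - np.sum((x - p_pos) ** 2) + 2 * S)
    assert lhs == rhs
```

The second assertion follows from expanding the squared distances: ‖*x* − *p*_{−}‖² − ‖*x* − *p*_{+}‖² + 2*S* = 2(*w*^{t}*x* + *b*).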

## 4. Non-Margin Classifiers

We consider in this section three common classification algorithms that do not allow a margin interpretation: the mean-of-class prototype classifier that inspired this study; the Fisher classifier, which is commonly used in statistical data analysis; and the relevance vector machine, which is a sparse probabilistic classifier. For convenience, we use the notation γ_{i} = *y*_{i}α_{i} throughout this section.

### 4.1. Classical Prototype Classifier.

The classical prototype classifier assigns a pattern *x* to the class whose mean, or prototype, is closest to it. The prototypes are here simply the average example of each class and can be seen as the center of mass of each class, assuming a homogeneous punctual mass distribution on each example. The parameters of the hyperplane in the dual space are then γ_{i} = *y*_{i}/*n*_{±} (equivalently, α_{i} = 1/*n*_{±}), where *n*_{±} is the size of the class containing *x*_{i}. In the above, we clearly have ∑_{i} γ_{i} = 0, implying that the data do not need centering. Moreover, the SH is centered (*S* = 0). One problem arising when using prototype learners is the absence of a way to refine the prototypes to reflect the actual structure (e.g., covariance) of the classes. In section 5.2, we remedy this situation by proposing a novel algorithm for boosted prototype learning.
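In code, the dual parameterization is a one-liner; the following sketch (NumPy; toy data are ours) checks that γ_{i} = *y*_{i}/*n*_{±} sums to zero and reproduces the mean-difference normal vector:

```python
import numpy as np

rng = np.random.default_rng(2)
n_pos, n_neg = 15, 25                 # deliberately unbalanced classes
X = np.vstack([rng.normal(1.5, 1, (n_pos, 2)),
               rng.normal(-1.5, 1, (n_neg, 2))])
y = np.hstack([np.ones(n_pos), -np.ones(n_neg)])

# Dual coefficients of the mean-of-class prototype classifier
gamma = np.where(y == 1, 1.0 / n_pos, -1.0 / n_neg)
assert np.isclose(gamma.sum(), 0.0)   # sum_i gamma_i = 0: no centering needed

w = (gamma[:, None] * X).sum(axis=0)  # w = sum_i gamma_i x_i = p_+ - p_-
assert np.allclose(w, X[y == 1].mean(axis=0) - X[y == -1].mean(axis=0))
```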

### 4.2. Fisher Linear Discriminant.

In the dual space, the Fisher linear discriminant is obtained from the leading eigenvector γ of the generalized eigenvalue problem **M**γ = λ**N**γ, where the between-class variance matrix is defined as **M** = (*m*_{−} − *m*_{+})(*m*_{−} − *m*_{+})^{t} and the within-class variance matrix as **N** = **KK**^{t} − ∑_{i=±} *n*_{i}*m*_{i}*m*^{t}_{i}. The Gram matrix of the data is computed as **K**_{ij} = *x*^{t}_{i}*x*_{j}, and the means of each class are defined as *m*_{±} = **K***u*_{±}/*n*_{±}, where *u*_{±} is a vector of size *n* with value 0 for *i* ∣ *y*_{i} = ∓1 and value 1 for *i* ∣ *y*_{i} = ±1. In most applications, in order to have a well-conditioned eigenvalue problem, it may be necessary to regularize the matrix **N** according to **N** → **N** + *C***I**, where **I** is the identity matrix.
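A minimal dual-space sketch can be written as follows (NumPy; the regularization constant and the rank-one shortcut are our choices). Since **M** has rank one, the leading generalized eigenvector is proportional to **N**^{−1}(*m*_{+} − *m*_{−}), so a single linear solve suffices; with a linear kernel the resulting direction should agree with the primal Fisher discriminant:

```python
import numpy as np

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(1, 1, (20, 3)), rng.normal(-1, 1, (20, 3))])
y = np.hstack([np.ones(20), -np.ones(20)])
n = len(y)

K = X @ X.T                                   # Gram matrix K_ij = x_i^t x_j
u_pos = (y == 1).astype(float)                # class indicator vectors u_+/-
u_neg = (y == -1).astype(float)
n_pos, n_neg = u_pos.sum(), u_neg.sum()
m_pos, m_neg = K @ u_pos / n_pos, K @ u_neg / n_neg

# Within-class variance matrix, regularized: N -> N + C*I
N = K @ K.T - n_pos * np.outer(m_pos, m_pos) - n_neg * np.outer(m_neg, m_neg)
N += 1e-6 * np.trace(N) / n * np.eye(n)

# M has rank one, so the leading generalized eigenvector of M g = lambda N g
# is proportional to N^{-1}(m_+ - m_-) (up to sign and scale)
gamma = np.linalg.solve(N, m_pos - m_neg)
w_dual = X.T @ gamma                          # w = sum_i gamma_i x_i

# Primal Fisher direction for comparison: S_w^{-1}(mu_+ - mu_-)
mu_pos, mu_neg = X[y == 1].mean(0), X[y == -1].mean(0)
Xc = np.vstack([X[y == 1] - mu_pos, X[y == -1] - mu_neg])
w_primal = np.linalg.solve(Xc.T @ Xc, mu_pos - mu_neg)

cos = abs(w_dual @ w_primal) / (np.linalg.norm(w_dual) * np.linalg.norm(w_primal))
assert cos > 0.95
```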

### 4.3. Relevance Vector Machine.

The weight vector of the relevance vector machine (RVM) is expressed in the dual space as *w* = ∑^{n}_{i=0} γ_{i}*x*_{i}, using the convention γ_{0} = *b* and extending the dimensionality of the data as *x*_{i}|_{0} = 1 ∀ *i* = 1, …, *n*. The two classes of inputs define two possible “states” that can be modeled by a Bernoulli distribution, where **C**_{ij} = [1 ∣ *x*^{t}_{i}*x*_{j}] is the “extended” Gram matrix of the data. An unknown gaussian hyperparameter β is introduced to ensure sparsity and smoothness of the dual-space variable γ. Learning of γ then amounts to maximizing the probability of the targets *y* given the patterns **X** with respect to β. The Laplace approximation is used to approximate the integrand locally using a gaussian function around its most probable mode. The variable γ is then determined from β using equation 4.6. In the update of β, some β_{i} → ∞, implying an infinite peak of *p*(γ_{i} ∣ β_{i}) around 0, or equivalently, γ_{i} = 0. This feature of the RVM ensures sparsity and defines the relevance vectors (RVs): β_{i} < ∞ ⇔ *x*_{i} ∈ *RV*.
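The extended dual representation can be illustrated as follows (NumPy; the coefficients γ are placeholder values rather than the output of the RVM's evidence maximization, and the zeroed entries only mimic the sparsity discussed above):

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 8, 2
X = rng.normal(size=(n, d))

# "Extended" Gram matrix C_ij = [1 | x_i^t x_j]: a leading column of ones
# plays the role of the bias dimension (convention gamma_0 = b)
C = np.hstack([np.ones((n, 1)), X @ X.T])        # shape (n, n + 1)

gamma = rng.normal(size=n + 1)                   # placeholder dual weights
# RVM-style sparsity: some gamma_i driven to exactly zero
gamma[rng.choice(n, size=n // 2, replace=False) + 1] = 0.0

f_dual = C @ gamma                               # f(x_i) via the extended Gram matrix
b, g = gamma[0], gamma[1:]
f_direct = b + X @ (X.T @ g)                     # b + sum_j gamma_j x_j^t x_i
assert np.allclose(f_dual, f_direct)
```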

## 5. Hard Margin Classifiers

In this section we consider classifiers that base their classification on the concept of margin stripe between the classes. We consider the state-of-the-art support vector machine and also develop a novel algorithm based on boosting the classical mean-of-class prototype classifier. As presented here, these classifiers need a linearly separable data set (Duda et al., 2001).

### 5.1. Support Vector Machine.

The hard margin support vector machine (SVM) maximizes the margin between the two classes; its KKT conditions read α_{i}[*y*_{i}(*w*^{t}*x*_{i} + *b*) − 1] = 0 ∀ *i*. We then define the support vectors (SVs) as *x*_{i} ∈ *SV* ⇔ α_{i} ≠ 0. The SVM algorithm can be cast into our prototype framework with *p*_{±} = ∑_{i∈±} α_{i}*x*_{i}, where *b* is computed using the KKT conditions by averaging over the SVs. The update rule for α is given by equation 5.2. Using one of the saddle points of the Lagrangian and applying ∑_{i}*y*_{i}· to the KKT conditions, we obtain the expression for *b*. Since ∑_{i} α_{i}*y*_{i} = 0, no centering of the data is required. Furthermore, the shift of the offset of the SH is zero (*S* = 0), using ∑_{i} α_{i} = ∑_{i} ∣α_{i}∣ = 2, since α_{i} > 0.
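As a worked example, consider four symmetric points for which the hard margin solution is known by inspection (the dual coefficients below follow by symmetry and are checked against the KKT conditions; they are not the output of a solver). The sketch verifies the balance condition ∑_{i} α_{i}*y*_{i} = 0, the prototype representation of *w*, and the zero shift *S* = 0:

```python
import numpy as np

# Four points whose hard margin SVM solution is known by symmetry:
# w = (1, 0), b = 0, and all four points are support vectors on the margin.
X = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha = np.full(4, 0.25)                          # dual coefficients (by symmetry)
w, b = ((y * alpha)[:, None] * X).sum(axis=0), 0.0

assert np.allclose(w, [1.0, 0.0])
assert np.isclose((alpha * y).sum(), 0.0)         # no centering required
# KKT: alpha_i * [y_i (w^t x_i + b) - 1] = 0 for every i
assert np.allclose(alpha * (y * (X @ w + b) - 1.0), 0.0)

# Prototypes p_+/- = sum over each class of alpha_i x_i, and zero shift S
p_pos = (alpha[y == 1][:, None] * X[y == 1]).sum(axis=0)
p_neg = (alpha[y == -1][:, None] * X[y == -1]).sum(axis=0)
assert np.allclose(p_pos - p_neg, w)
S = b - (p_neg @ p_neg - p_pos @ p_pos) / 2
assert np.isclose(S, 0.0)
```

Rescaling α so that ∑_{i} ∣α_{i}∣ = 2 rescales *w* and the prototypes consistently and leaves the SH unchanged.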

### 5.2. Boosted Prototype Classifier.

Generally, boosting methods aim at improving the performance of a simple classifier by combining several such classifiers trained on variants of the initial training sample. The principle is to iteratively give more weight to the training examples that are hard to classify, train simple classifiers so that they have a small error on those hard examples (i.e., a small weighted error), and then take a vote of the obtained classifiers (Schapire & Freund, 1997). We consider below how to boost the classical mean-of-class prototype classifier in the context of hard margins. The boosted prototype algorithm that we will develop in this section cannot exactly be cast into our prototype framework since it is still an open problem to determine the invariance properties of the boosted prototype algorithm. However, the boosted prototype classifier is an important example of how the concept of prototype can be extended.

For a decision function *f* and a training sample (*x*_{i}, *y*_{i}), we define the margin as *y*_{i}*f*(*x*_{i}). It is non-negative when the training sample is correctly classified and negative otherwise, and its absolute value gives a measure of the confidence with which *f* classifies the sample. The problem to be solved in the boosting scenario is the maximization of the smallest margin in the training sample (Schapire, Freund, Bartlett, & Lee, 1998; see equation 5.5). It is known that the solution of this problem is a convex combination of elements of the base class (Rätsch & Meir, 2003). Let us now consider the case where the base class is the set of linear functions corresponding to all possible prototype learners, in other words, the set of all affine functions that can be written as *h*(*x*) = *w*^{t}*x* + *b* with ‖*w*‖ = 1. It can easily be seen that equation 5.5, using hypotheses of the form *h*(*x*) = *w*^{t}*x* + *b*, is equivalent to using hypotheses of the form *h*(*x*) = *w*^{t}*x* (i.e., without bias). We therefore consider for simplicity the hypothesis set of functions *h*(*x*) = *w*^{t}*x* with ‖*w*‖ = 1.

Under the condition ∑_{i} α_{i}*y*_{i} = 0, the solution of equation 5.7 is given by the prototype classifier with weighted prototypes *p*_{±} ∝ ∑_{i∈±} α_{i}*x*_{i}. We can now state an iterative algorithm for our boosted prototype classifier. This algorithm is an adaptation of AdaBoost* (Rätsch & Warmuth, 2005), which includes a bias term. A convergence analysis of this algorithm can be found in Rätsch and Warmuth (2005). The patterns have to be normalized to lie in the unit ball (i.e., ∣*w*^{t}*x*_{i}∣ ⩽ 1 ∀ *w*, *i* with ‖*w*‖ = 1). The first iteration of our boosted prototype classifier is the classical mean-of-class prototype classifier. Then, during boosting, the boosted prototype classifier maintains a distribution of weights α_{i} on the input patterns and at each step computes the corresponding weighted prototypes. Then the patterns where the classifier makes mistakes have their weight increased, and the procedure is iterated until convergence. This algorithm maintains a set of weights that are separately normalized for each class, yielding the following pseudocode:

1. Determine the scale factor of the whole data set: *s* = max_{i}(‖*x*_{i}‖).
2. Scale the data such that ‖*x*_{i}‖ ⩽ 1 by applying *x*_{i} → *x*_{i}/*s*.
3. Set the accuracy parameter ϵ (e.g., ϵ = 10^{−2}).
4. Initialize the weights and the target margin ρ_{0} = 1.
5. Do *k* = 1, …, *k*_{max}; compute:
   - the weighted prototypes *p*^{k}_{±};
   - the normalized weight vector *w*^{k};
   - the target margin ρ_{k} and the updated weights α^{k}_{i}.

In the final expressions for *p*_{±}, *w*, and *b*, the factor ∑_{k}*v*^{k} ensures that these quantities lie in the convex hull of the data. Moreover, since the data are scaled by *s*, the bias and the prototypes have to be rescaled according to *w*^{t}*x* + *b* ↔ *w*^{t}(*sx*) + *sb*. In practice, it is important to note that the choice of ϵ must be coupled with the number of iterations of the algorithm.

The scale factor *s*, the weight update α^{k}_{i}, and the weight *v*_{k} are defined above. The weight vector *w*, however, is not a linear combination of the patterns since there is a normalization factor in the expression of *w*^{k}. The decision function at each iteration is implicitly given by *h*^{k}(*x*) = sign(‖*x* − *p*^{k}_{−}‖^{2} − ‖*x* − *p*^{k}_{+}‖^{2}), while at the last iteration of the algorithm, it reverts to the usual form: *f*(*x*) = *w*^{t}*x* + *b*.
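The loop structure above can be sketched as follows (NumPy). This is a simplified stand-in: a plain exponential reweighting replaces the exact AdaBoost*-style update with target margin ρ and accuracy ϵ, and the final weighted vote is omitted; only the scaling step, the per-class weight normalization, and the weighted prototypes follow the pseudocode. As stated in the text, the first iteration reproduces the classical mean-of-class prototypes:

```python
import numpy as np

def boosted_prototypes(X, y, n_iter=30, eta=2.0):
    """Simplified boosted prototype loop: scale to the unit ball, keep
    per-class-normalized weights, form weighted prototypes, and upweight
    badly classified examples (exponential reweighting stands in for the
    exact AdaBoost*-style update of the text)."""
    s = np.max(np.linalg.norm(X, axis=1))        # scale factor of the data set
    Xs = X / s                                   # now ||x_i|| <= 1
    pos, neg = y == 1, y == -1
    a = np.where(pos, 1.0 / pos.sum(), 1.0 / neg.sum())  # uniform per class
    history = []
    for _ in range(n_iter):
        p_pos = (a[pos][:, None] * Xs[pos]).sum(0) / a[pos].sum()
        p_neg = (a[neg][:, None] * Xs[neg]).sum(0) / a[neg].sum()
        w = p_pos - p_neg
        w /= np.linalg.norm(w)
        b = -w @ (p_pos + p_neg) / 2             # SH through the midpoint
        margins = y * (Xs @ w + b)
        a = a * np.exp(-eta * margins)           # increase weight of hard examples
        a[pos] = a[pos] / a[pos].sum()           # separate per-class normalization
        a[neg] = a[neg] / a[neg].sum()
        history.append((s * p_pos, s * p_neg))   # rescale back to data units
    return history

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(2, 1, (30, 2)), rng.normal(-2, 1, (30, 2))])
y = np.hstack([np.ones(30), -np.ones(30)])
hist = boosted_prototypes(X, y)

# The first iteration is the classical mean-of-class prototype classifier
assert np.allclose(hist[0][0], X[y == 1].mean(axis=0))
assert np.allclose(hist[0][1], X[y == -1].mean(axis=0))
```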

## 6. Soft Margin Classifiers

The problem with the hard margin classifiers is that when the data are not linearly separable, these algorithms will not converge at all or will converge to a solution that is not meaningful (the non-margin classifiers are not affected by this problem). We deal with this problem by extending the hard margin classifiers to soft margin classifiers. For this, we apply a form of “regularized” preprocessing to the data, which then become linearly separable. The hard margin classifiers can subsequently be applied on these processed data. Alternatively, in the case of the SVM, we can also rewrite its formulation in order to allow nonlinearly separable data sets.

### 6.1. From Hard to Soft Margins.

The soft margin preprocessing extends each pattern *x*_{i} in the data to *X*_{i} = (*x*^{t}_{i}, *e*^{t}_{i}/√*C*)^{t}, where *e*_{i}/√*C* appears at the *i*th row after *x*_{i} (*e*_{i} is the *i*th unit vector) and *C* is a regularization constant. The (hard margin) classifier then operates on the patterns *X*_{i} instead of the original patterns *x*_{i} using a new scalar product, *X*^{t}_{i}*X*_{j} = *x*^{t}_{i}*x*_{j} + δ_{ij}/*C*. The above corresponds to adding a diagonal matrix to the Gram matrix in order to make the classification problem linearly separable. The soft margin preprocessing thus allows us to extend hard margin classification to accommodate overlapping classes of patterns. Clearly, the hard margin case is obtained by setting *C* → ∞. Once the SH and prototypes are obtained in the space spanned by the *X*_{i}, their counterparts in the space of the *x*_{i} are computed by simply ignoring the components added by the preprocessing.
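The preprocessing is a one-liner to verify (NumPy; the values of *n*, *d*, and *C* are arbitrary): appending *e*_{i}/√*C* to each pattern adds exactly δ_{ij}/*C* to the Gram matrix.

```python
import numpy as np

rng = np.random.default_rng(6)
n, d, C = 10, 3, 4.0
X = rng.normal(size=(n, d))

# Extended patterns: append e_i / sqrt(C) to the i-th pattern (equation 6.1)
X_ext = np.hstack([X, np.eye(n) / np.sqrt(C)])   # shape (n, d + n)

# The new scalar product adds delta_ij / C to the Gram matrix (equation 6.2)
assert np.allclose(X_ext @ X_ext.T, X @ X.T + np.eye(n) / C)
```

Dropping the last *n* components of any vector expressed in the extended space recovers its counterpart in the original space, and letting *C* → ∞ removes the added diagonal, recovering the hard margin case.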

### 6.2. Soft Margin SVM.

In the 2-norm soft margin SVM, *C* is a regularization parameter and ξ is the slack variable vector accounting for outliers: examples that are misclassified or lie in the margin stripe. The saddle points of the corresponding Lagrangian yield the dual problem. In this formulation, the addition of the term δ_{ij}/*C* to the inner product *x*^{t}_{i}*x*_{j} corresponds to the preprocessing introduced in equation 6.2. The KKT conditions of the above problem allow the SVM algorithm to be cast into our prototype framework, where *b* is computed using the first constraint of the primal problem applied on the margin SVs given by 0 < α_{i} < *C* and ξ_{i} = 0. The update rule for α is given by equation 6.4. Using one of the saddle points of the Lagrangian, α = *C*ξ, and applying ∑_{i}*y*_{i}· to the KKT conditions, we obtain the expression for the bias. Setting *C* → ∞ in the above equations yields the expression for the hard margin SVM obtained in equation 5.4.

In the 1-norm soft margin SVM, *C* is a regularization parameter and ξ is the slack variable vector. The saddle points of the corresponding Lagrangian yield the dual problem. The KKT conditions are then written as α_{i}[*y*_{i}(*w*^{t}*x*_{i} + *b*) − 1 + ξ_{i}] = 0 ∀ *i*. The above allows us to cast the SVM algorithm into our prototype framework, where *b* is computed using the first constraint of the primal problem applied on the margin SVs given by 0 < α_{i} < *C* and ξ_{i} = 0. The update rule for α is given by equation 6.8. In the hard margin case, also obtained for *C* → ∞, the KKT conditions become α_{i}(*y*_{i}(*w*^{t}*x*_{i} + *b*) − 1) = 0 ∀ *i*. From this, we deduce by application of ∑_{i}*y*_{i}· to the KKT conditions the expression of the bias obtained for the hard margin case in equation 5.4. Finally, we notice that the 1-norm SVM does not naturally yield the scalar product substitution of equation 6.2 when going from hard to soft margins.

## 7. Relations Between Classifiers

In this section we outline two relations between the prototype classification algorithm and the other classifiers considered in this letter. First, in the limit where *C* → 0, we show that the soft margin algorithms converge to the classical mean-of-class prototype classifier. Second, we show that the boosted prototype algorithm converges to the SVM solution.

### 7.1. Prototype Classifier as a Limit of Soft Margin Classifiers.

We deduce the following proposition as a direct consequence of proposition 2:

*All soft margin classifiers obtained from linear classifiers whose canonical form is continuous at K = I by the regularized preprocessing of equation 6.1 converge toward the mean-of-class prototype classifier in the limit where C → 0.*
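The mechanism behind this proposition can be spelled out by combining propositions 1 and 2 (a sketch of the argument, not the full proof):

```latex
% Sketch: why soft margin classifiers tend to the prototype classifier as C -> 0.
The preprocessing of equation 6.2 replaces the inner products by
\[
  K_{ij} \;=\; x_i^{t} x_j + \frac{\delta_{ij}}{C}
  \;\longrightarrow\; \frac{1}{C}\,\delta_{ij}
  \qquad (C \to 0),
\]
so the Gram matrix approaches $\lambda \delta_{ij}$ with $\lambda = 1/C > 0$.
By proposition 1, an invariant linear classifier then has canonical
coefficients $\alpha_i = 1/n_{\pm}$, whose prototypes are the class means,
\[
  p_{\pm} \;=\; \sum_{i \in \pm} \alpha_i x_i
  \;=\; \frac{1}{n_{\pm}} \sum_{i \in \pm} x_i ,
\]
i.e., the mean-of-class prototype classifier. Continuity of the $\alpha_i$
at $\mathbf{K} = \mathbf{I}$ (proposition 2) guarantees that the decision
function converges to this limit.
```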

### 7.2. Boosted Prototype Classifier and SVM.

While the analogy between boosting and the SVM has been suggested previously (Skurichina & Duin, 2002), we here establish that the boosting procedure applied on the classical prototype classifier yields the hard margin SVM as a solution when appropriate update rules are chosen:

*The solution of the problem in equation 5.5, when the hypothesis class is the set of prototype classifiers considered above, is the same as the solution of the hard margin SVM.*

To prove this, we first rewrite the problem of equation 5.5 in an equivalent form in which the minimum over training examples is expressed as a minimum over convex coefficients α_{i}. Indeed, the minimization of a linear function of the α_{i} is achieved when one α_{i} (the one corresponding to the smallest term of the sum) is one and the others are zero. Now notice that the objective function is linear in the convex coefficients α_{i} and also in the convex coefficients representing *f*, so that by the minimax theorem, the minimum and maximum can be permuted to give an equivalent problem. Using the fact that we are maximizing a linear function on a convex set, we can rewrite the maximization as running over the base hypotheses instead of their convex hull. One now notices that when ∑_{i} α_{i}*y*_{i} ≠ 0, the maximization can be achieved by taking *b* to infinity, which would be suboptimal in terms of the minimization in the α's. This means that the constraint ∑_{i} α_{i}*y*_{i} = 0 will be satisfied by any nondegenerate solution. Using this, we finally obtain a problem that is equivalent to the hard margin SVM problem of equation 5.1.

In other words, in the context of hard margins, boosting a mean-of-class prototype learner is equivalent to an SVM. It is then straightforward to extend this result to the soft margin case using the regularized preprocessing of equation 6.1. Thus, without restrictions, the SVM is the asymptotic solution of a boosting scheme applied on mean-of-class prototype classifiers. The above developments also allow us to state the following:

*Under the condition ∑_{i} α_{i}y_{i} = 0, the solution of equation 5.7 is given by the prototype classifier defined by the weighted prototypes p_{±}.*

This is a consequence of the proof of proposition 5. Indeed, the vector *w* achieving the maximum in equation 7.1 is given by *w* = ∑_{i} *y*_{i}α_{i}*x*_{i}/‖∑_{i} *y*_{i}α_{i}*x*_{i}‖, which shows that *w* is proportional to *p*_{+} − *p*_{−}. The choice of *b* is arbitrary since one has ∑_{i} α_{i}*y*_{i} = 0, so that there exists a choice of *b* such that the corresponding function *h* is the same as the prototype function based on *p*_{+} and *p*_{−}.

## 8. Numerical Experiments

In the numerical experiments of this section, we first illustrate and visualize our prototype framework on a linearly separable two-dimensional toy data set. Second, we apply the prototype framework to discriminate between two overlapping classes (nonlinearly separable data set) of responses from a population of artificial neurons.

### 8.1. Two-Dimensional Toy Data Set.

In order to visualize our findings, we consider in Figure 1 a two-dimensional linearly separable toy data set where the examples of each class were generated by the superposition of three gaussian distributions with different means and different covariance matrices. We compute the prototypes and the SHs for the classical mean-of-class prototype classifier, the Fisher linear discriminant (FLD), the relevance vector machine (RVM), and the hard margin support vector machine (SVM HM). We also study the trajectories taken by the “dynamic” prototypes when using our boosted prototype classifier and when varying the soft margin regularization parameter for the soft margin SVM (SVM SM). We can immediately see that the prototype framework introduced in this letter allows one to visualize and distinguish at a glance the different classification algorithms and strategies. While the RVM algorithm per se does not allow an intuitive geometric interpretation, as do, for instance, the SVM (the margin SVs lie on the margin stripe) or the classical mean-of-class prototype classifier, its prototypes provide an intuitive and visual interpretation of sparse Bayesian learning. The different classifiers yield different SHs and consequently also a different set of prototypes. As predicted by the theory, the classical prototype and the SVM HM have no shift in the decision function (*S* = 0), indicating that the SH passes through the middle of the prototypes. This shift is largest for the RVM, reflecting the fact that one of the prototypes is close to the center of mass of the entire data set. This is due to the fact that the RVM algorithm usually yields a very sparse representation of the γ_{i}. In our example, a single γ_{i}, which corresponds to the prototype close to the center of one of the classes, strongly dominates this distribution, such that the other prototype is bound to be close to the mean across both classes (the center of the entire data set). 
The prototypes of the SVM HM are close to the SH, which is due to the fact that they are computed using only the SVs corresponding to exemplars lying on the margin stripe. When considering the trajectories of the “dynamic” prototypes for the boosted prototype and the soft margin SVM classifiers, both algorithms start close to the classical mean-of-class prototype classifier and converge to the hard margin SVM classifier. We further study the dynamics associated with these trajectories in Figure 2. The prototypes and the corresponding SH have a similar behavior in all cases. As predicted theoretically, the first iteration of boosting is identical to the classical prototype classifier. However, while the iterations proceed, the boosted prototypes get farther apart from the classical ones and finally converge as expected toward the prototypes of the hard margin SVM solution. Similarly, when *C* → 0, the soft margin SVM converges to the solution of the classical prototype classifier, while for *C* → ∞, the soft margin SVM converges to the hard margin SVM.

### 8.2. Population of Artificial Neurons.

To test our prototype framework on more realistic data, we decode the responses of a population of six independent artificial neurons. The responses of the neurons are assumed to have a gaussian noise distribution around their mean response, the variance being proportional to the mean. We use our prototype framework to discriminate between two stimuli using the population activity they elicit. This data set is not linearly separable, and the pattern distributions corresponding to the two classes may overlap. We thus consider the soft margin preprocessing for the SVM and the boosted prototype classifier. We first find the value of *C* minimizing the training error of the SVM SM and then use this value to compute the soft margin SVM and the boosted prototype classifiers. As expected from the hard margin case, we find in Figure 3 that the boosted prototype algorithm starts as a classical mean-of-class prototype classifier and converges toward the soft margin SVM. In order to visualize the discrimination process, we project the neural responses onto the axis defined by the prototypes (i.e., the normal vector *w* of the SH). Equivalently, we compute the distributions of the distances of the neural responses to the SH. Figure 4 shows these distance distributions for the classical prototype classifier, the FLD, the RVM, the soft margin SVM, and the boosted prototype classifier. For the prototype classifier and the FLD, the projected prototypes have locations similar to those observed for the toy data set. For the SVM, they can be even closer to the SH (δ = 0) since they depend only on the SVs, which may here also include exemplars inside the margin stripe (and not only exemplars on the margin stripe, as for the hard margin SVM). For the RVM, however, the harder classification task (a high-dimensional and nonlinearly separable data set) yields a less sparse distribution of the γ_{i} than for the toy data set.
This is reflected by the fact that none of its prototypes lies in the vicinity of the mean over the whole data set (δ = 0). As already suggested in Figure 3, we can clearly observe how the boosted prototypes evolve from the prototypes of the classical mean-of-class prototype classifier toward the prototypes of the soft margin SVM. Most importantly, the distance distributions allow us to compare our prototype framework directly with signal detection theory (Green & Swets, 1966; Wickens, 2002). Although the neural response distributions were constructed using gaussian distributions, the distance distributions are clearly not gaussian. This renders standard analyses such as the receiver operating characteristic not directly applicable in our case. However, the different algorithms from machine learning provide a family of thresholds that can be used for discrimination, independent of the shape of the distributions. Furthermore, the distance distributions depend on the classifier used to compute the SH. This example illustrates one of the novelties of our prototype framework: a classifier-specific dimensionality reduction. In other words, we here visualize the space the classifiers use to discriminate: the cut through the data space provided by the axis spanned by the prototypes. As a consequence, the amount of overlap between the distance distributions differs across classifiers. The shape of these distributions also varies: the SVM tends to cut the data such that many exemplars lie close to the SH, while for the classical prototype classifier, the distance distributions of the same data are more centered around the means of each class. The boosted prototype classifier gives us here an insight into how the distance distribution of the mean-of-class prototype classifier evolves iteratively into the distance distribution of the soft margin SVM.
This illustrates how the different projection axes are nontrivially related to generate distinct class-specific distance distributions.
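The projection step described above is a one-liner once the hyperplane is known: the signed distance of a pattern to the SH is (*w*^{t}*x* + *b*)/‖*w*‖. A minimal sketch (NumPy; the prototype values and patterns are hypothetical):

```python
import numpy as np

def signed_distances(X, w, b):
    """Signed distances of the patterns in X to the hyperplane w^t x + b = 0.

    Projecting onto the prototype axis w and computing signed distances to
    the SH are equivalent up to the constant offset b / ||w||; the sign of
    the distance indicates on which side of the SH a pattern falls.
    """
    return (X @ w + b) / np.linalg.norm(w)

# Hypothetical prototypes spanning the projection axis
p_plus = np.array([2.0, 0.0])
p_minus = np.array([0.0, 2.0])
w = p_plus - p_minus
b = -0.5 * (p_plus + p_minus) @ w  # SH through the midpoint (no shift, S = 0)

X = np.array([[2.0, 0.5],   # pattern near p_plus
              [0.5, 2.0]])  # pattern near p_minus
d = signed_distances(X, w, b)  # one signed distance per pattern
```

Histogramming such distances per class reproduces the distance distributions of Figure 4 for any classifier that supplies (*w*, *b*).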

## 9. Discussion

We introduced a novel classification framework—the prototype framework—inspired by the mean-of-class prototype classifier. While the algorithm itself is left unchanged (up to a shift in the offset of the decision function), we computed the generalized prototypes using methods from machine learning. We showed that any linear classifier with invariances to unitary transformations, translations, input permutations, label inversions, and scaling can be interpreted as a generalized prototype classifier, and we introduced a general method to cast such a linear algorithm into the prototype framework. We then illustrated our framework using algorithms from machine learning such as the Fisher linear discriminant, the relevance vector machine (RVM), and the support vector machine (SVM). In particular, we obtained through the prototype framework a visualization and a geometric interpretation for the hard-to-visualize RVM. While the vast majority of algorithms encountered in machine learning satisfy our invariance properties, the main class of algorithms ruled out is that of online algorithms, such as the perceptron, since they depend on the order of presentation of the input patterns.

We demonstrated that the SVM and the mean-of-class prototype classifier, despite their very different foundations, can be linked: the boosted prototype classifier converges asymptotically toward the SVM classifier. As a result, we also obtained a simple iterative algorithm for SVM classification. Also, we showed that boosting could be used to provide multiple optimized examples in the context of prototype learning, according to the general principle of divide and conquer. The family of optimized prototypes was generated from an update rule refining the prototypes by iterative learning. Furthermore, we showed that the mean-of-class prototype classifier is a limit of the soft margin algorithms from learning theory when *C* → 0. In summary, both boosting and soft margin classification yield novel sets of “dynamic” prototype paths: through time (the boosting iteration) and through the soft margin trade-off parameter *C*, respectively. These prototype paths can be seen as an alternative to the “chorus of prototypes” approach (Edelman, 1999).

We considered classification of two classes of inputs, or equivalently, we discriminated between two classes given the responses corresponding to each one. However, when faced with an estimation problem, we need to choose one class among multiple classes. For this, we can readily extend our prototype framework by considering a one-versus-the-rest strategy (Duda et al., 2001; Vapnik, 2000). The prototype of each class is then computed by discriminating this class against all the remaining ones. Repeating this procedure for all the classes yields an ensemble of prototypes—one for each class. These prototypes can then be used for multiple class classification, or estimation, using again the nearest-neighbor rule.
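This one-versus-the-rest construction can be sketched in a few lines (NumPy; the data and the choice of the mean-of-class classifier as the binary discriminator are illustrative):

```python
import numpy as np

def one_vs_rest_prototypes(X, labels, train_binary):
    """One prototype per class via one-versus-the-rest discrimination.

    train_binary(X, y) is any generalized prototype classifier returning
    (p_plus, p_minus); only the prototype of the "one" class is kept.
    """
    prototypes = {}
    for c in np.unique(labels):
        y = np.where(labels == c, 1, -1)  # class c against all the rest
        p_plus, _ = train_binary(X, y)
        prototypes[c] = p_plus
    return prototypes

def nearest_prototype(x, prototypes):
    """Classify x with the nearest-neighbor rule over the prototypes."""
    return min(prototypes, key=lambda c: np.linalg.norm(x - prototypes[c]))

# Mean-of-class classifier as the binary discriminator (illustrative)
def mean_of_class(X, y):
    return X[y == 1].mean(axis=0), X[y == -1].mean(axis=0)

X = np.array([[0.0, 0.0], [0.5, 0.0], [5.0, 5.0], [5.5, 5.0], [0.0, 5.0]])
labels = np.array([0, 0, 1, 1, 2])
protos = one_vs_rest_prototypes(X, labels, mean_of_class)
cls = nearest_prototype(np.array([0.2, 0.1]), protos)  # → 0
```

Substituting the SVM, RVM, or boosted prototype classifier for `mean_of_class` yields the corresponding multiclass prototype ensemble without changing the nearest-prototype decision rule.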

Our prototype framework can be interpreted as a two-stage learning scheme: a complicated and time-consuming training stage that computes the prototypes, followed by a very simple and fast nearest-prototype testing stage for the classification of new patterns. Such a scheme can account for a slow training phase followed by a fast testing phase. Although it is beyond the scope of this letter, such behavior may be argued to be biologically plausible. Once the prototypes are computed, the simplicity of the decision function is certainly one advantage of the prototype framework. This letter shows that it is possible to cast sophisticated algorithms from machine learning, such as the SVM or the RVM, into the rather simple and easy-to-visualize prototype formalism. Our framework then provides an ideal method for directly comparing different classification algorithms and strategies, which could certainly be of interest in many psychophysical and neurophysiological decoding experiments.

## Appendix A: Proof of Proposition 1

We work out the implications for a linear classifier to be invariant with respect to the transformations mentioned in section 2.

Invariance with regard to scaling means that the pairs (**w**_{1}, *b*_{1}) and (**w**_{2}, *b*_{2}) correspond to the same decision function, that is, sign(**w**_{1}^{t}*x* + *b*_{1}) = sign(**w**_{2}^{t}*x* + *b*_{2}) for all *x*, if and only if there exists some α ≠ 0 such that **w**_{1} = α**w**_{2} and *b*_{1} = α*b*_{2}.

We denote by (**w**_{X}, *b*_{X}) the parameters of the hyperplane obtained when trained on data **X**. We show below that invariance to unitary transformations implies that the normal vector to the decision surface **w**_{X} lies in the span of the data. This is remarkable since it allows a dual representation, and it is a general form of the representer theorem (see also Kivinen, Warmuth, & Auer, 1997).

*Lemma 1. If A is invariant by application of any unitary transform U, then there exists γ such that w_{X} = Xγ is in the span of the input data, and b_{X} = b_{UX} depends on the inner products between the patterns of X and on the labels.*

*Proof.* Taking *x* = 0 in the invariance condition gives *b*_{UX} = *b*_{X}, and thus *b*_{X} does not depend on **U**. This shows that *b*_{X} can depend only on the inner products between the input vectors (only the inner products are invariant under **U** since (**U***x*)^{t}(**U***y*) = *x*^{t}*y*) and on the labels. Furthermore, we have the condition

sign(**w**_{UX}^{t}(**U***x*) + *b*_{UX}) = sign(**w**_{X}^{t}*x* + *b*_{X}) for all *x*,

which implies (since **U** is unitary, **U**^{t}**U** = **I**) that **w**_{UX} = **U****w**_{X}, so that *w* is transformed according to **U**. We now decompose **w**_{X} as a linear combination of the patterns plus an orthogonal component, **w**_{X} = **X**γ + *v* with **X**^{t}*v* = 0, and similarly we decompose **w**_{UX} = **UX**γ_{U} + *v*_{U} with (**UX**)^{t}*v*_{U} = 0. Using **w**_{UX} = **U****w**_{X},

**UX**γ_{U} + *v*_{U} = **U**(**X**γ + *v*) = **UX**γ + **U***v*,

and since the decomposition into the span of the data and its orthogonal complement is unique, *v*_{U} = **U***v* and **X**γ = **X**γ_{U}.

Now we introduce two specific unitary transformations. The first, **U**, performs a rotation of angle π along an axis contained in the span of the data, and the second, **U**′, performs a symmetry with respect to a hyperplane containing this axis and *v*. Both transformations have the same effect on the data. However, they have the opposite effect on the vector *v*. This means that in order to guarantee invariance, we need to have *v* = 0, which shows that *w* is in the span of the data: **w**_{X} = **X**γ.

Next, we show that, in addition to the unitary invariance, invariance with respect to translations (change of origin) implies that the coefficients of the dual expansion of **w**_{X} sum to zero.

*Lemma 2. If A is invariant by unitary transforms U and by translations, then there exists u such that w_{X} = Xu and u^{t}i = 0, where i denotes a column vector of size n whose entries are all 1.*

*Proof.* Translating all the patterns by a vector *v* replaces **X** by **X** + *v*i^{t}. For all **X**, *v*, and *x*, we can write

sign(**w**_{X+vi^t}^{t}(*x* + *v*) + *b*_{X+vi^t}) = sign(**w**_{X}^{t}*x* + *b*_{X}),

which can be true only if **w**_{X+vi^t} = **w**_{X} and *b*_{X+vi^t} = *b*_{X} − **w**_{X}^{t}*v*. In particular, since we can write by the previous lemma **w**_{X} = **X**γ_{X} and **w**_{X+vi^t} = (**X** + *v*i^{t})γ_{X+vi^t}, we have for all *v*:

(**X** + *v*i^{t})γ_{X+vi^t} = **X**γ_{X}.

Taking the center of mass of the data, *v* = −**X**i/*n*, we obtain

**w**_{X} = (**X** − (**X**i/*n*)i^{t})γ_{X̃} = **X**((**I** − ii^{t}/*n*)γ_{X̃}),

where **X̃** = **X** − (**X**i/*n*)i^{t} denotes the centered data. Denoting by *u* the parenthetical factor of **X** on the right-hand side, *u* = (**I** − ii^{t}/*n*)γ_{X̃}, we can then compute

*u*^{t}i = γ_{X̃}^{t}(**I** − ii^{t}/*n*)i = γ_{X̃}^{t}(i − i) = 0,

which concludes the proof.

For simplicity of notation, we now write (**w**, *b*) instead of (**w**_{X}, *b*_{X}). As a consequence of the above lemmas, a linear classifier that is invariant with respect to unitary transformations and translations produces a decision function *g* that can be written as

*g*(*x*) = sign(∑_{i} γ_{i} *x*_{i}^{t}*x* + *b*) with ∑_{i} γ_{i} = 0.

Since the decision function is not modified by scaling, one can normalize the γ_{i} to ensure that the sum of their absolute values is equal to 2.

Invariance with respect to label inversion means that the γ_{i} are proportional to *y*_{i}; writing γ_{i} = α_{i}*y*_{i}, the α_{i} are then not affected by an inversion of the labels, which means that they can depend only on the products *y*_{i}*y*_{j} (which indicate differences in label). Invariance with respect to input permutation means that in the case where *x*_{i}^{t}*x*_{j} = δ_{ij}, since the patterns are indistinguishable, so are the α_{i}. Hence, the α_{i} corresponding to training examples that have the same label should take the same value, and from the other constraints (∑_{i} γ_{i} = 0 and ∑_{i} |γ_{i}| = 2), we immediately deduce that α_{i} = 1/*n*_{±}. This finally proves proposition 1.

## Appendix B: Proof of Proposition 2

Notice that adding δ_{ij}/*C* to the inner products amounts to replacing **K** by **K** + **I**/*C*. The result follows from continuity and from the invariance by scaling: we can equivalently use **I** + *C***K**, which converges to **I** when *C* → 0, and for the Gram matrix **I**, the corresponding α_{i} were computed in proposition 1.
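As a numerical illustration of this argument (the Gram matrix below is hypothetical):

```python
import numpy as np

# Soft margin as preprocessing: replace the Gram matrix K by K + I/C.
K = np.array([[1.0, 0.8, 0.1],
              [0.8, 1.0, 0.2],
              [0.1, 0.2, 1.0]])  # hypothetical Gram matrix of three patterns
C = 1e-6                          # strong regularization (the C -> 0 limit)
K_soft = K + np.eye(3) / C

# By invariance under scaling, training on K_soft is equivalent to training
# on C * K_soft = I + C K, which converges to the identity as C -> 0; for
# the identity Gram matrix, proposition 1 gives alpha_i = 1 / n_{+/-}.
K_scaled = C * K_soft
```

For small *C*, `K_scaled` is dominated by the identity, recovering the mean-of-class prototype solution.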

## Acknowledgments

We thank E. Simoncelli, G. Cottrell, M. Jazayeri, and C. Rudin for helpful comments on the manuscript. A.B.A.G. was supported by a grant from the European Union (IST 2000-29375 COGVIS) and by an NIH training grant in Computational Visual Neuroscience (EYO7158).

## References

## Author notes

Olivier Bousquet is now at Google in Zürich, Switzerland. Gunnar Rätsch is now at the Friedrich Miescher Laboratory of the Max Planck Society in Tübingen, Germany.