## Abstract

By minimizing the zero-norm of the separating hyperplane, the support feature machine (SFM) finds the smallest subspace (the least number of features) of a data set such that within this subspace, two classes are linearly separable without error. This way, the dimensionality of the data is more efficiently reduced than with support vector–based feature selection, which can be shown both theoretically and empirically. In this letter, we first provide a new formulation of the previously introduced concept of the SFM. With this new formulation, classification of unbalanced and nonseparable data is straightforward, which allows using the SFM for feature selection and classification in a large variety of different scenarios. To illustrate how the SFM can be used to identify both the smallest subset of discriminative features and the total number of informative features in biological data sets we apply repetitive feature selection based on the SFM to a functional magnetic resonance imaging data set. We suggest that these capabilities qualify the SFM as a universal method for feature selection, especially for high-dimensional small-sample-size data sets that often occur in biological and medical applications.

## 1. Introduction

Recent developments of massively parallel data acquisition systems call for sophisticated analysis methods that are capable of dealing with data sets that have many dimensions but comprise few samples. Such high-dimensional small-sample-size scenarios occur especially in biological and medical applications that rely on human data, such as tissue classification based on microarray gene data (Golub et al., 1999; Lockhart & Winzeler, 2000), identification of disease-specific genome mutations (Samani et al., 2007; McPherson et al., 2007; Raelson et al., 2007), or multivoxel pattern analysis (MVPA) in neuroimaging (Haynes, 2011). In these applications, theoretical and practical issues such as generalization bounds, run time, or memory footprint considerations require dedicated methods for the three basic machine learning tasks: classification, regression, and density estimation.

Maximum margin methods such as the support vector machine (Vapnik, 1999) have been shown to be a good choice for many classification problems in biological applications due to their excellent generalization capabilities. However, these methods may fail especially in high-dimensional small-sample-size scenarios (Weston et al., 2000). Moreover, in biological and medical applications, the primary goal is often not to achieve high prediction accuracy but to identify informative features. Thus, feature selection is not only needed to improve run time and achieve proper prediction results, but also to allow meaningful inferences about biologically significant features. Such feature selection methods have been proposed in a variety of flavors (for an overview, see Guyon & Elisseeff, 2003). However, these approaches have seldom addressed the question how the smallest subset of features that is required to solve a classification problem can be obtained.

Recently we proposed the support feature machine (SFM) as a novel method for feature selection that aims to identify the smallest subset of features that separates two classes (Klement & Martinetz, 2010a, 2010b, 2011). The basic idea of the SFM is to find the smallest subspace (the least number of features) in a data set such that within this subspace two classes are linearly separable without error. By minimizing the zero norm of the separating hyperplane, a classifier based on a minimal set of features is obtained within a single training run. In contrast to most feature selection methods that are conservative in the way they select features—they keep all features that might ever be relevant for classification and discard only those features that are irrelevant with high probability—the SFM is aggressive in discarding features and keeps only those that are definitely required to separate classes in the training data. Results on artificial data as well as real-world data show that the SFM identifies truly relevant features very effectively and in many cases more accurately than SVM-based feature selection (Klement & Martinetz, 2010a, 2010b).

We suggest that this feature makes the SFM a powerful tool not only to identify the most informative features within a data set but also to estimate the total amount of informative features of this data set. If the SFM is used for repetitive feature selection (RFS)—such that in each repetition, all features returned by the SFM are discarded from the data set and in the next repetition the SFM is trained on the remaining features only—then the percentage of features discarded from the data set before the test error reaches chance level might provide an estimate of the total number of informative features within this data set. Importantly, the accuracy of this estimate strongly depends on the selectivity of the applied feature selection method: if irrelevant (uninformative) features are falsely identified as being relevant, then the number of informative features in the data set is overestimated. In contrast, the estimate does not strongly rely on the sensitivity of feature selection: even if only one (truly relevant) feature is obtained in each repetition, the number of informative voxels will still be estimated correctly. Since the SFM is very conservative in the way it identifies relevant features, SFM-based RFS will likely provide an unbiased estimate of the true number of informative features of a data set.

The remainder of this letter is organized in three sections. In section 2, we review and reformulate the concept of the SFM. This new formulation now permits classification of unbalanced and nonseparable classes, which enables the SFM to be used for feature selection and classification in a large variety of different scenarios.^{1} This is shown on artificial data in section 3. Finally, in section 4, we apply a repetitive version of the SFM to an fMRI (functional magnetic resonance imaging) data set to illustrate how the SFM can be used to identify not only the most informative features but also the total number of informative features in biological data sets. The appendix contains mathematical considerations as well as remarks on implementation alternatives.

### Note on notations.

In this work, we use lowercase boldface letters (e.g., ** x**,

**) for vectors and uppercase boldface letters for matrices (e.g.,**

*y***). Sets are typeset in uppercase calligraphic letters (e.g., ).**

*A*We make use of the common notations used in classification and feature selection frameworks; a data set consists of feature vectors, samples, or patterns and corresponding class labels . The dimensionality of a vector is denoted by *d*, while *n* refers to the cardinality of the set. For simplicity, we define *z*_{i}=*y _{i}*

*x*_{i}and . Further, for each class, we define a separate set of indices, and . The vectors

**0**and

**1**are vectors with all their entries being zero or one, respectively. For readability, we omit the length of these vectors where possible. The identity matrix

*I*_{d}is a square matrix containing ones on the main diagonal and zeros elsewhere, and the zero matrix

**0**

_{n,d}has

*n*rows and

*d*columns all set to zero.

## 2. The Problem and the Approach

In the terminology of feature selection, the SFM is an embedded method that combines both feature selection and classification within a single framework. The principle of structural risk minimisation is implemented by limiting the family of classification functions to those with the fewest number of parameters—in this case, dimensions or features. In this section, we first review and reformulate the basic concept of the SFM originally proposed in Klement and Martinetz (2010a, 2010b). As we will see, this new formulation of the SFM is more compact and intuitive than the original formulation and can easily be extended to unbalanced and nonseparable data. We conclude this section with some considerations regarding the behavior of the SFM in the limit.

### 2.1. SVM-Based Feature Selection.

**and**

*w**b*are solutions to the inequalities, and solve them with . In general, there is no unique solution to equation 2.1. A solution with the least number of features, with . Note that again any solution of equation 2.2 can be multiplied by a positive factor and is still a solution. Weston et al. (2003) proposed solving equation 2.2 with a variant of the SVM by Indeed, as long as there exists a solution to equation 2.2 for which

*y*(

_{i}

*w*^{T}

*x*_{i}+

*b*)>0 for all

*i*=1, …,

*n*, solving equation 2.3 yields a solution to 2.2. Unfortunately, both equations are NP-hard and cannot be solved in polynomial time. Therefore, Weston et al. (2003) proposed to approximate equation 2.3 by with . If

*w*_{0}and optimize equations 2.3 and 2.4, respectively, then that is, both solutions coincide as . Thus, by minimizing equation 2.4, an approximate solution to equation 2.3 is found. However, equation 2.4 is not convex, has many local minima, and is still hard to solve. Weston et al. (2003) proposed the following iterative scheme, which finds a local minimum of equation 2.4 by solving a sequence of linear programs:

Initialize .

**repeat**Minimize |

| such that*w*Update

**until***convergence*

This modification of the SVM effectively reduces the feature space used for classification. However, this approach does not necessarily return the least number of features required for classification. The number of features may be further reduced by discarding any margin maximization induced by the constraints . This is the basic idea of the SFM.

### 2.2. The Support Feature Machine: Basic Algorithm.

**=**

*w***0**. As long as the input data are linearly separable with

*y*(

_{i}

*w*^{T}

*x*_{i}+

*b*)>0 for at least one , we have and the equality constraint can be satisfied by scaling

**and**

*w**b*appropriately. Hence, as long as the input data are linearly separable with

*y*(

_{i}

*w*^{T}

*x*_{i}+

*b*)>0 for at least one (not all

*i*as in Westons's approach), solving equation 2.6 yields a solution to the ultimate problem, equation 2.2. Compared to the first version proposed in Klement and Martinetz (2010a, 2010b), the equality constraint is now more compact since the bias

*b*no longer occurs. In this formulation, an extension to unbalanced and nonseparable classes is straightforward.

Initialize

**repeat**Minimize |

| such that and*w*Update

**until***convergence*

Thus, by successively minimizing the one-norm, we aim to approximate the zero-norm minimizing solution as accurately as possible. By that, the SFM is a combinatorial feature selection method working on the original set of input features and taking class information into account. Importantly, in contrast to dimension-reduction methods like principal component analysis (PCA), which reduce the dimensionality of a transformed space, SFM reduces the dimension of the original feature space.

### 2.3. Extending the Support Feature Machine to Soft Separability.

In general, if , then the data will be separable and the SFM has a solution. In the following, we introduce slack variables similar to a soft-margin SVM to allow for misclassifications during training. This is done for two reasons. First, if the input data are not separable in the intrinsic feature space (i.e., if the classes overlap), irrelevant features will be added to achieve separation of the training data. This might lead to an overestimation of the number of truly relevant features and diminish generalization performance. Second, even if the classes are in principle separable in the intrinsic feature space, the true separating hyperplane might not be identified correctly due to outliers. To address these problems, a mechanism is needed that allows for misclassifications and thereby provides a better estimate of the true dimensionality. (Note that we do not address the problem of intrinsically nonlinear decision borders here.)

*C*(Klement & Martinetz, 2010a) in the same way this is done for soft-margin SVMs; we As classification errors are allowed,

*y*(

_{i}

*w*^{T}

*x*_{i}+

*b*) may become negative and the pathological case where equation 2.7 is smaller or equal to zero may occur. Therefore, the optimizer needs to fulfill the latter constraint with +1 or −1. In practice, one needs to optimize for both variants and finally choose the solution with the lower objective function. To solve equation 2.9, we use the iterative approximation scheme described above. Importantly, one property of this approach is that the objective function explicitely trades off the number of features and the number of misclassified training samples .

#### 2.3.1. Unbalanced Data Sets.

An important problem of classifiers with soft margins is their sensitivity to unbalanced data sets. If one class contains more samples than the other, many classifiers tend to behave like a majority classifier and ignore the smaller class. Several solutions to this problem have been proposed, such as rebalancing the data artificially, adjusting the output threshold of the classifier according to the class ratio, one-class classifiers, and cost-sensitive methods (Provost, 2000; Japkowicz, 2000; Chawla, Japkowicz, & Kotcz, 2004; He & Garcia, 2009).

*C*

^{+}and

*C*

^{−}appropriately, one can avoid the situation that a much smaller class tends to be ignored. In practice, choosing

*C*

^{+}and

*C*

^{−}such that

*C*

^{+}

*n*

^{+}=

*C*

^{−}

*n*

^{−}enforces the proportion of misclassified samples to be equal for both classes.

#### 2.3.2. Behavior in the Limit.

**and**

*w**b*. In the limit, the optimization problem, equation 2.10, simplifies to The minimum, of course, is . The SFM approximates the zero-norm by the one-norm. We prove that the SFM finds the optimum (i.e., exactly one feature) with using the proposed approximation. Assume that

*k*>1 features are obtained by minimizing . Without loss of generality, we assume that these are the first

*k*features. Further, without loss of generality we assume for all (note that equality has probability 0). We may rewrite the equality constraint as Thus, the equality can be fullfilled using only one feature but with a modified weight . Since we would obtain a smaller objective function using exclusively feature 1, in contradiction to our initial assumption. Thus, in the limit for , no more than one feature is obtained.

*j*be the index of this nonzero entry. Then the equality constraint is solved with respect to

*w*by where and are the projection of the class-specific means onto the axes of feature

_{j}*j*. As the SFM approximates the zero-norm by minimizing the one-norm, it will select the feature that minimizes |

*w*| and therefore maximizes the distance of the class-specific means . This is closely related to correlation-based feature selection methods such as the Fisher score and statistical distance measures such as Student's

_{j}*t*-test. Consequently, we expect soft SFM to favor those features that maximize the correlation between feature value and class label, or the distance between the feature values of the two classes. In other words, the soft SFM is a trade-off between a hard SFM and correlation-based feature ranking.

## 3. Experiments on Artificial Data

Having reviewed and reformulated the basic principles of the SFM, we continue to show that the SFM outperforms Weston's SVM-based feature selection method in many simulations, particularly in high-dimensional small-sample-size scenarios. We first compare the capability of the SFM and Weston's method to identify the truly relevant features in a simple scenario with linearly separable but slightly unbalanced classes. For this, we use an experiment that is very similar to that described in Klement and Martinetz (2011) but now includes unbalanced classes. We then show how both methods perform if an increasing number of irrelevant features is added to the data, which is probably the most challenging setting for any feature selection method. Finally, we test the performance of the soft SFM on nonseparable data.

### 3.1. Basic Experiment.

For the basic experiment we generated artificial data sets with class ratio 60% versus 40%. The first *k* dimensions were drawn as . The remaining features were noise drawn as . The parameter determines the distance between the means of the two classes. We ensured that both classes were linearly separable within the first *k* dimensions. This was achieved by removing samples that did not fulfill , that is, we assumed all entries of the real weight vector in the first k dimensions to be one. Then we resampled all invalid data points and repeated both steps until convergence. (Note, however, that by chance, both classes might be separable with even fewer than *k* features.) The number of dimensions was set to *d*=100. In the first experiment, we used a fixed sample size of *n*=100 and varied the number of relevant features from *k*=1 to *k*=20. In the second experiment, we set the number of relevant features to *k*=5 and varied the sample size from *n*=10 to *n*=1000.

Figure 1 shows the average results for 1000 runs with Weston's method and a hard SFM. The SFM returned both a smaller total number of features and a higher percentage of correctly identified features for almost all scenarios. In scenarios with very low intrinsic dimensionality—for —the SFM identified all relevant features correctly (see Figure 1, bottom left). In scenarios with *k*=5, the SFM identified all relevant features correctly in every run if the number of data points exceeded 200. Weston's method failed to converge to the correct number of features even if the number of data points was further increased (to 1000), the percentage of correctly identified features stayed clearly below 80% (see Figure 1, bottom right). Thus, the SFM converged to the correct set of features, while the SVM-based approach got stuck in a local minimum even for large data sets.

Also note that in contrast to Weston's method, the SFM was close to the final solution in the first iteration.

### 3.2. Experiment with Increasing Dimensionality.

For many biological applications, a feature selection method should not only be able to deal with high-dimensional data sets but should also scale well; adding irrelevant features should not significantly degrade the performance. To assess how the performance of the hard SFM and Weston's method degrades when irrelevant features are added to the data, we used an artificial dataset that has been used in a variety of publications but was originally proposed by Weston et al. (2003). The data set contains two equally sized classes where the first six dimensions are informative but redundant and the remaining dimensions contain gaussian noise. With probability 0.7, the first three features are drawn as , and the second three features are drawn as . With probability 0.3, the setting is inverted: the first three features are drawn as and the second three as . The remaining features are drawn as with . Additionally, we ensured the training set to be linearly separable within the six informative dimensions. We sampled *n* training points (*n*=20, 50, 100, 200, 500) and 5000 test data points. Note that this setting is slightly different from the one we originally evaluated in Klement and Martinetz (2010a). Here, the largest number of irrelevant features is 10,000 (instead of 4096), we use a more convient spacing for , and, to obtain results with higher confidence, the results are averaged across 1000 (instead of 100) runs).

Table 1 compares the capability of both feature selection methods to identify relevant features. Compared to Weston's approach, the SFM returns a smaller number of features and more likely the truly relevant features. Even in very high-dimensional small-sample-size scenarios, the SFM can identify the relevant dimensions very effectively: as the number of data points increases, the number of features found to be relevant increases but does not exceed 6—the number of truly relevant features. The percentage of correctly identified features decreases when the number of noise dimensions increases. However, only in extremely high-dimensional small-sample-size scenarios does the percentage of correctly identified features drop below 90%.

n
. | = 10 . | = 100 . | = 1000 . | = 10,000 . |
---|---|---|---|---|

SFM, number of obtained features | ||||

20 | 2.0 (0.6) | 2.0 (0.6) | 2.0 (0.6) | 1.9 (0.6) |

50 | 2.3 (0.6) | 2.4 (0.6) | 2.4 (0.6) | 2.5 (0.7) |

100 | 2.7 (0.7) | 2.6 (0.6) | 2.6 (0.7) | 2.6 (0.7) |

200 | 3.1 (0.7) | 3.2 (0.8) | 3.2 (0.8) | 3.1 (0.7) |

500 | 4.1 (0.8) | 4.2 (0.8) | 4.2 (0.8) | 4.2 (0.8) |

Weston, number of obtained features | ||||

20 | 2.6 (0.8) | 2.8 (0.9) | 3.0 (1.1) | 3.2 (1.1) |

50 | 3.2 (1.0) | 3.4 (1.1) | 3.3 (1.1) | 3.4 (1.1) |

100 | 3.8 (1.1) | 4.0 (1.2) | 4.0 (1.2) | 3.9 (1.2) |

200 | 4.6 (1.2) | 4.9 (1.4) | 4.9 (1.4) | 4.8 (1.2) |

500 | 5.9 (1.3) | 6.2 (1.5) | 6.4 (1.5) | 6.2 (1.4) |

SFM, percentage of correctly identified features | ||||

20 | 98.5% (8.0%) | 88.9% (20.2%) | 66.5% (29.1%) | 46.9% (32.9%) |

50 | 99.6% (3.5%) | 98.8% (6.5%) | 96.7% (10.9%) | 85.9% (22.4%) |

100 | 99.7% (3.1%) | 99.1% (5.3%) | 97.3% (8.8%) | 96.9% (9.4%) |

200 | 99.4% (3.9%) | 98.5% (6.0%) | 96.6% (8.9%) | 95.1% (11.0%) |

500 | 98.8% (5.0%) | 96.4% (8.7%) | 94.3% (10.8%) | 92.2% (11.7%) |

Weston, percentage of correctly identified features | ||||

20 | 94.8% (12.8%) | 81.2% (23.7%) | 58.0% (29.6%) | 32.5% (26.1%) |

50 | 94.0% (12.1%) | 87.6% (16.0%) | 85.0% (18.0%) | 79.7% (20.3%) |

100 | 93.1% (12.0%) | 87.0% (16.0%) | 83.4% (17.5%) | 83.1% (17.1%) |

200 | 89.0% (13.6%) | 82.9% (15.9%) | 80.3% (16.0%) | 80.6% (16.0%) |

500 | 82.2% (12.9%) | 76.2% (14.5%) | 73.5% (14.7%) | 74.1% (14.6%) |

n
. | = 10 . | = 100 . | = 1000 . | = 10,000 . |
---|---|---|---|---|

SFM, number of obtained features | ||||

20 | 2.0 (0.6) | 2.0 (0.6) | 2.0 (0.6) | 1.9 (0.6) |

50 | 2.3 (0.6) | 2.4 (0.6) | 2.4 (0.6) | 2.5 (0.7) |

100 | 2.7 (0.7) | 2.6 (0.6) | 2.6 (0.7) | 2.6 (0.7) |

200 | 3.1 (0.7) | 3.2 (0.8) | 3.2 (0.8) | 3.1 (0.7) |

500 | 4.1 (0.8) | 4.2 (0.8) | 4.2 (0.8) | 4.2 (0.8) |

Weston, number of obtained features | ||||

20 | 2.6 (0.8) | 2.8 (0.9) | 3.0 (1.1) | 3.2 (1.1) |

50 | 3.2 (1.0) | 3.4 (1.1) | 3.3 (1.1) | 3.4 (1.1) |

100 | 3.8 (1.1) | 4.0 (1.2) | 4.0 (1.2) | 3.9 (1.2) |

200 | 4.6 (1.2) | 4.9 (1.4) | 4.9 (1.4) | 4.8 (1.2) |

500 | 5.9 (1.3) | 6.2 (1.5) | 6.4 (1.5) | 6.2 (1.4) |

SFM, percentage of correctly identified features | ||||

20 | 98.5% (8.0%) | 88.9% (20.2%) | 66.5% (29.1%) | 46.9% (32.9%) |

50 | 99.6% (3.5%) | 98.8% (6.5%) | 96.7% (10.9%) | 85.9% (22.4%) |

100 | 99.7% (3.1%) | 99.1% (5.3%) | 97.3% (8.8%) | 96.9% (9.4%) |

200 | 99.4% (3.9%) | 98.5% (6.0%) | 96.6% (8.9%) | 95.1% (11.0%) |

500 | 98.8% (5.0%) | 96.4% (8.7%) | 94.3% (10.8%) | 92.2% (11.7%) |

Weston, percentage of correctly identified features | ||||

20 | 94.8% (12.8%) | 81.2% (23.7%) | 58.0% (29.6%) | 32.5% (26.1%) |

50 | 94.0% (12.1%) | 87.6% (16.0%) | 85.0% (18.0%) | 79.7% (20.3%) |

100 | 93.1% (12.0%) | 87.0% (16.0%) | 83.4% (17.5%) | 83.1% (17.1%) |

200 | 89.0% (13.6%) | 82.9% (15.9%) | 80.3% (16.0%) | 80.6% (16.0%) |

500 | 82.2% (12.9%) | 76.2% (14.5%) | 73.5% (14.7%) | 74.1% (14.6%) |

Note: Shown are the average (over 1000 runs) number of features found to be relevant (std).

The sampling scheme causes the number of correctly identified features not to converge to 100% for large *n*. Due to the experimental design, some features provide better separability than others. Then, by chance, one of the irrelevant features may be favored by the SFM (and also Weston's method) instead of the weakest relevant feature. This effect gets amplified for large sample sizes as the solution space becomes smaller and smaller.

### 3.3. Experiment with Nonseparable Classes.

Finally, we constructed an artificial problem where the two classes are not linearly separable. The probabilities of the classes *y*=1 and *y*=−1 were equal in both the training and the test sets. The first *k* dimensions were drawn normally distributed as . The remaining features were noise drawn as . The parameter was used to adjust the distance between the class centers. Both the training and the test sets were sampled according to the above procedure, each containing *n* data points. The softness parameter *C* was varied in 100 steps logarithmically spaced between 0.01 and 100. This basic setting is the same as in Klement and Martinetz (2010a). Here, we extend this basic setting to examine the soft SFM in four different scenarios with variable numbers of relevant and irrelevant features. In the first scenario, the data contain a high percentage of relevant features. In the second scenario, we reduced the number of relevant features and increased the number of irrelevant features. Finally, in the third and fourth scenarios, we further increased the number of irrelevant features and reduced the number of data points.

Figure 2 shows the averaged results of 100 runs. In all four scenarios, increasing softness—allowing more training errors—results in a smaller number of obtained features. This is in line with the theoretical consideration that the soft SFM directly trades off the number of obtained features and the training error and that a very soft SFM will return a single feature. As the number of obtained features decreases, the percentage of correctly identified features increases. However, this does not necessarily result in increased prediction performance. In the second scenario, the test error reaches a minimum just after the point where the percentage of correctly identified features sharply increases. In the first and third scenarios, however, the test error increases with increasing softness, and in the fourth scenario, the test error remains almost constant throughout all softness values. Below, we discuss the four data scenarios in detail.

#### 3.3.1. Many Relevant Features, Few Irrelevant Features, Large Sample Size.

In this case the data contain a large percentage of relevant features and the number of data points exceeds the number of dimensions (see Figure 2, top left). Note, however, that features might be correlated and not all features might actually be required to separate the classes. The SFM identified almost no irrelevant feature as being relevant, independent of the softness. However, only a small fraction of the truly relevant features was identified (at most 7 out of 20). The test error increased slightly with increasing softness.

#### 3.3.2. Few Relevant Features, Many Irrelevant Features, Large Sample Size.

In this scenario, the data contain few relevant and many irrelevant features, but the number of data points still exceeds the number of dimensions (see Figure 2, top right). Increasing the softness resulted in a smaller set of obtained features, while the percentage of correctly identified features increased. For the parameters chosen here, the optimal test error (in this case 27.6%) is achieved for medium softness (*C*=0.34), approximately at that point where almost all truly relevant features and almost no irrelevant features were obtained. In particular, 5.42 features were obtained, of which 86.7% were correctly identified features on average. Thus, in such scenarios, optimizing for the test error is a valid approach to identify the true set of relevant features.

#### 3.3.3. Few Relevant Features, Many Irrelevant Features, Small Sample Size.

The third and fourth scenarios both represent high-dimensional small-sample-size data with few relevant and many irrelevant features (see Figure 2, bottom). Note that this is very challenging because the information content is very small. As before, increasing softness resulted in a smaller number of obtained features while the percentage of correctly identified features increased. In the third scenario (see Figure 2, bottom left), we chose a larger class distance than in the previous scenarios ( was increased from 0.30 to 0.65). The test error increased slightly with increasing softness but stayed well below chance. In the fourth scenario, we used the same class distance as in the first two scenarios. Although the percentage of correctly identified features still increased with increasing softness, the test error remained almost constant and never fell well below chance (see Figure 2, bottom right). Thus, in this scenario, the class overlap seems to be too large to obtain meaningful results.

Some authors have recommended using a soft- rather than a hard-margin SVM even if the training data are linearly separable to improve the prediction accuracy (see, Hastie, Rosset, Tibshirani, & Zhu, 2004). The third and fourth scenarios are both scenarios with . In these scenarios, the softness had little impact on the test error but largely influenced the number of obtained features.

In sum, applying a soft SFM to the four exemplary data sets shows that the soft SFM performs well on data sets with nonseparable classes even in scenarios with few relevant and many irrelevant features, as long the class overlap is not too large relative to the number of data points. In high-dimensional small-sample-size scenarios with few relevant features and strongly overlapping classes, the performance of the SFM drops to chance, possibly because the information content is no longer sufficient to allow valid feature selection. However, we expect that in such scenarios, performance of any feature selection method will be severely hampered.

## 4. Application to Neuroimaging Data

So far, we have demonstrated that the SFM identifies truly relevant features in artificial data sets very effectively, particularly if the data contain few relevant and many irrelevant dimensions. In this section, we illustrate how the SFM might be used to identify not only the most informative features but also the total number of informative features in biological data sets. For this purpose, we chose an fMRI (functional magnetic resonance imaging) data set.

In fMRI, researchers are almost always faced with high-dimensional small-sample-size scenarios (100,000 dimensions versus 100 samples). To reduce the feature space, feature selection and classification are often performed either independently (e.g., principal component analysis; Carlson, Schrater, & He, 2003; Strother et al., 2002) or recursively (e.g., recursive feature elimination, Martino et al., 2008), but there are also approaches that, like the SFM, combine feature selection and classification within a single framework (e.g., Elastic Net, Carroll, Cecchi, Rish, Garg, & Rao, 2009; sparse penalized discriminant analysis (SPDA), Grosenick, Greer, & Knutson, 2008; and sparse logistic regression, Yamashita, Sato, Yoshioka, Tong, & Kamitani, 2008; Ryali, Supekar, Abrams, & Menon, 2010). In addition to representing extreme high-dimensional small-sample-size scenarios, fMRI data—like many high-dimensional real-world data sets—often contain several informative feature subsets that all permit linear separation (i.e., the data are highly redundant). In such scenarios, one might not only be interested in finding the most informative features, but also in identifying all informative features (Rasmussen, Hansen, Madsen, Churchill, & Strother, 2012). In this case, repetitive feature selection (RFS) might provide an estimate of the total number of informative features even if the number of features that carry information alone or in combination with others cannot be determined exactly (the sample size is usually too small to capture all sources of variance and to accurately describe the decision border).

As noted in section 1, the accuracy of this estimate will depend strongly on the selectivity of the applied feature selection method. If irrelevant (uninformative) features are falsely identified as being relevant, then the proportion of relevant features in the data set is overestimated. In contrast, the estimate does not strongly rely on the sensitivity of the feature selection; even if only one (truly relevant) feature is returned in each repetition, the proportion of informative voxels will still be estimated correctly. Since the SFM is very restrictive in the way it selects relevant features and returns a high percentage of truly relevant features (see section 3 and appendix A), the estimate of informative features obtained with a repetitive SFM (rSFM) likely represents an unbiased estimate of the true number of informative features. This makes the SFM a potentially powerful tool to identify not only the most informative features but also the total number of informative features in biological data sets.

### 4.1. FMRI Data Set.

FMRI data acquired from 12 healthy female participants (mean age 21.6 years, range 19–26 years) on a 3 Tesla scanner (Philips Medical Systems) were used for analysis. Functional imaging was divided into 4 runs per subject; during each run 67 functional whole-brain images were acquired ( weighted echoplanar images, 42 horizontal interleaved slices, tilt angle −30^{o}, 3 mm slice thickness, in plane resolution 3 × 3 mm^{2}, FOV 240 × 240 mm^{2}, TE 35 ms, TR 3000 ms). Participants were shown short text messages (either *happy* or *sad*) through fMRI-compatible video goggles and asked to decide whether they wanted to press a button in their left or right hand immediately whenever a text message appeared on the screen, but to hold their decision in mind and to execute their decision only when a go signal (two arrow heads, one pointing to the left and one pointing to the right) appeared on the screen. Participants were instructed to respond as quickly as possible when the go signal appeared by pressing the selected button with their left or right thumb, respectively. During each run, 12 trials (mean duration 5 scans) were presented in pseudo-randomized order, using the following timing parameters: stimulus presentation time, 1000 ms; delay, 2000 or 3500 ms; go signal, 300 ms; inter trial interval, 8700 to 13,200 ms (steps of 1500 ms). The study was approved by the Ethics Committee of the University of Lübeck.

Image preprocessing and BOLD (blood oxygen level dependent) activity estimation were conducted with SPM5 (Wellcome Department of Imaging Neuroscience, London, UK), and results were visualized with the BrainNet Viewer (National Key Laboratory of Cognitive Neuroscience and Learning, Beijing Normal University, China) and MRicron (www.mccauslandcenter.sc.edu/mricro/mricron/). Preprocessing inclu- ded removal of the first two functional scans of each run, slice acquisition time correction, concurrent spatial realignment and correction of image distortions, normalization into standard MNI (Montreal Neurological Institute) space, and spatial smoothing with an 8 ms gaussian kernel. Individual activity maps for left-hand and right-hand button presses were estimated for each participant and run using a standard GLM procedure. This procedure returned eight activation maps for each participant (four for left-hand button presses and four for right-hand button presses). Thus, the overall data set, in the following referred to as the button press data set consisted of 48 maps (12 participants × 4 runs) per class, each of which contained 50,989 in-brain voxels identified with the brain mask published by Tzourio-Mazoyer et al. (2002).

### 4.2. Mass Univariate Analysis.

The traditional approach to identify voxels in fMRI data that show different levels of activity during two conditions (i.e., that are discriminative) are voxel-wise univariate analyses. Here, we used such a mass univariate approach as baseline measure against which the performance of the rSFM was compared. For this, univariate activation maps of each participant were averaged for each class and fed into a group-level voxel-wise paired *t*-test. The overlap between voxels with high absolute *t*-values and voxels identified as being relevant with the rSFM was assessed at different thresholds. Thresholds were defined in terms of the fraction of above-threshold voxels (i.e., percentage of voxels) rather than in terms of probability of error to keep the number of above-threshold voxels constant across the univariate and rSFM-based approach. Six thresholds were used: 1%, 2.5%, 5%, 20%, 34%, and 50%. The second threshold (2.5%) corresponded to a (uncorrected) voxel-wise probability of error of *p*=0.001, which is a commonly used (although debated; see Nichols, 2012) threshold in voxel-wise analysis of fMRI data. Voxels that were identified as discriminative at this threshold are listed in Table 2. The second last threshold (34%) corresponded to the rSFM-based estimate of the upper limit of informative voxels in the button press data set (see below). The other thresholds were chosen to cover a wide range of different thresholds. Corresponding (uncorrected) voxel-wise probabilities of error are given in Table 3 (column 4).

. | Significant Voxels . | |
---|---|---|

Anatomical Region Name . | Left . | Right . |

Postcentral gyrus | 25.0% (319) | 16.0% (204) |

Precentral gyrus | 7.4% (94) | 7.5% (96) |

Cerebellum VI | 4.1% (52) | 6.1% (78) |

Inferior parietal lobe | 5.3% (67) | 1.3% (16) |

Cerebellum IV/V | 2.5% (32) | 3.1% (39) |

Putamen | 1.5% (19) | 0.0% (0) |

Supplementary motor area | 1.3% (17) | 0.0% (0) |

Superior parietal lobe | 1.3% (16) | 1.3% (16) |

Pallidum | 1.0% (13) | 0.0% (0) |

Supramarginal gyrus | 0.9% (11) | 0.4% (5) |

Unassigned | 9.0% (115) | |

Other regions (<1%) | 5.1% (65) |

. | Significant Voxels . | |
---|---|---|

Anatomical Region Name . | Left . | Right . |

Postcentral gyrus | 25.0% (319) | 16.0% (204) |

Precentral gyrus | 7.4% (94) | 7.5% (96) |

Cerebellum VI | 4.1% (52) | 6.1% (78) |

Inferior parietal lobe | 5.3% (67) | 1.3% (16) |

Cerebellum IV/V | 2.5% (32) | 3.1% (39) |

Putamen | 1.5% (19) | 0.0% (0) |

Supplementary motor area | 1.3% (17) | 0.0% (0) |

Superior parietal lobe | 1.3% (16) | 1.3% (16) |

Pallidum | 1.0% (13) | 0.0% (0) |

Supramarginal gyrus | 0.9% (11) | 0.4% (5) |

Unassigned | 9.0% (115) | |

Other regions (<1%) | 5.1% (65) |

Notes: Shown are the 2.5% most significant voxels (uncorrected voxel-wise ). Anatomical regions were identified by an automatic labeling procedure (Tzourio-Mazoyer et al., 2002; Schmahmann et al., 1998). Only regions that contain at least 1% of all significant voxels across hemispheres are listed. Numbers in brackets are numbers of discriminative voxels in each region. The majority of discriminative voxels form large clusters in each hemisphere, including part of the precentral and postcentral gyri (motor and somatosensory cortex) and part of the cerebellum, respectively.

Percent Voxels Identified . | . | . | Voxel-Wise . |
---|---|---|---|

as Relevant/Discriminative . | Percent Overlap . | Percent Overlap . | Probability of Error . |

(threshold) . | (observed) . | (expected) . | (univariate analysis) . |

1% | 53% | 1% | 0.0001 |

2.5% | 64.3% | 2.5% | 0.001 |

5% | 61.6% | 5% | 0.006 |

20% | 67.7% | 20% | 0.093 |

34% | 72.7% | 34% | 0.23 |

50% | 77.6% | 50% | 0.41 |

Percent Voxels Identified . | . | . | Voxel-Wise . |
---|---|---|---|

as Relevant/Discriminative . | Percent Overlap . | Percent Overlap . | Probability of Error . |

(threshold) . | (observed) . | (expected) . | (univariate analysis) . |

1% | 53% | 1% | 0.0001 |

2.5% | 64.3% | 2.5% | 0.001 |

5% | 61.6% | 5% | 0.006 |

20% | 67.7% | 20% | 0.093 |

34% | 72.7% | 34% | 0.23 |

50% | 77.6% | 50% | 0.41 |

Notes: The overlap expected by chance (third column) for two subsets covering the same percentage *q* is *q*^{2} with respect to the whole set and *q* with respect to one subset. Thus, percentages in the first and third columns are the same. Additionally, the fourth column shows the voxel-wise *p*-value of the least discriminative among all relevant voxels at this threshold.

### 4.3. Support Vector Machine with Random Feature Selection.

To obtain a baseline estimate of the number of informative voxels in the button press data set, we trained a hard-margin SVM on randomly chosen *d*-dimensional feature subsets () in a leave-one-participant-out cross-validation scheme with 44 samples in each class (11 participants × 4 runs). This procedure was repeated 1000 times for each subset size. The test error was below chance for the large majority of repetitions even if only a single dimension was selected at random, and close to zero if more than 1000 features were included, indicating that a large fraction of voxels carried relevant information (see Figure 3).

### 4.4. Repetitive Feature Selection (RFS) with the SFM.

The basic idea of SFM-based RFS is to train an SFM on the complete data set, remove all features from the data set that are found to be relevant by the SFM, retrain on the reduced data set, discard the obtained relevant features, retrain again, and so on, until the data set is no longer separable within the remaining features (Klement & Martinetz, 2010a).

Initialize the set of active features

Set

**repeat**Train a support feature machine using the feature set

**if***a solution was found***then**Store the results, i.e.,

*w*_{t}and*b*_{t}Store the set of relevant features, i.e.,

Update the set of active features, i.e.,

Reduce all feature vectors to

Set

**end****until***until the data set is no longer separable within the remaining features*

If the SFM correctly identifies the smallest informative feature subset in each run, the size of the returned feature subsets will monotonously increase as more and more features are discarded. However, in practice, this might not always be the case because the optimization might terminate in a local optimum due to the data set configuration or because of numerical issues of the technical implementation of the SFM. To correct for such inaccuracies, we sorted the obtained feature subsets according to their size, starting with the smallest feature subset. This way, we obtained a sequence of monotonously increasing feature subsets that, according to our definition, represents a sequence of feature subsets that are less and less informative.

To assess both the information content of each feature subset and the information content in the remaining features in each repetition, we used a leave-one-participant-out cross-validation scheme similar to that described above. In each repetition, information content was accessed as the test error of the SFM and the test error of a soft SVM, respectively. The soft SVM was trained on all features that remained in the data set after all features that had been identified as being relevant by the SFM had been removed. The optimal softness of the SVM () for each cross-validation was estimated by a nested cross-validation scheme in which the SVM was trained on 10 × 4 = 40 samples in each class and tested on the 11th subject. Once the optimal softness parameter was determined, the SVM was retrained on all 11 subjects and tested on the 12th subject. This way, a function representing the test error of an optimized soft SVM over runs was derived for each of the 12 participants.

Note that this method requires training 188 SVMs for a single SFM (11 participants × 17 softness values + the final run). To keep the run time in a reasonable range, the SVM was trained and tested only on every 10th SFM repetition.

Training and testing the repetitive SFM on the complete data set in a leave-one-participant-out scheme required 31,647 training runs (i.e., for each of the 12 cross-validations approximately 2500 repetitions before all voxels were discarded) and took 83 hours on an Intel Core 2 Quad 2.4GHz machine with 4GB RAM (using the Mosek optimization toolbox for solving the SFM). Training and testing SVMs on the remaining features for every 10th SFM repetition took another 27 hours.

Because we used a leave-one-participant-out scheme for cross-validation, the feature set size and error functions obtained during each validation did not have the same length (i.e., the number of repetitions until all features are consumed differed across validation runs). Thus, these functions needed to be resampled before averaging. We chose a resampling procedure in which feature subsets were first sorted according to their size, and each was then assigned the performance value of the last run in which fewer than *x* features were removed. These piece-wise constant curves were then averaged across all leave-one-participant-out cross-validations.

The smallest feature subset obtained by the rSFM contained 2.4 voxels on average. The largest feature subset contained an average of 77.7 voxels, which is below the upper bound (i.e., number of data points, *n* = 88 samples, minus 1—VC-dimension of a linear classifier: see Figure 4a). Critically, the comparison of voxels obtained by the rSFM and voxels identified as being discriminative by univariate *t*-statistics revealed a large overlap across different thresholds that was by far larger than the overlap expected by chance (Table 3 and Figure 5). Furthermore, this overlap was largest for feature sets removed early in the repetitive feature selection scheme (i.e., small feature subsets) and decreased continuously toward the end of the RFS (see Figure 4b). This indicates that the SFM very quickly consumes significant features before other features are included.

Taken together, the results indicate that the SFM identifies informative features effectively even if the data set contains several informative feature subsets that all permit linear separation (such as the fMRI data set used here). Furthermore, the SFM seems to be a promising tool to estimate the total number of informative voxels (or, more generally, features) in highly redundant high-dimensional small-sample-size data sets. Further experiments and validations in different directions will be a next step.

## 5. Summary and Conclusion

The goal of this letter was threefold. First, we reformulated the concept of the SFM to make it applicable to a wide range of different scenarios. We then showed that the SFM allows identifying relevant features very effectively and, in many cases, more accurately than SVM-based feature selection, particularly in high-dimensional small-sample-size scenarios. Finally, we applied a repetitive version of the SFM to an fMRI data set to illustrate how the SFM can be used to identify not only the most informative features of a data set but also to estimate the total number of informative features in this data set.

In appendix A we derive a condition that is necessary for both the SFM and Weston's approach to converge to the zero-norm minimizing solution. With this condition, we provide additional plausibility considerations as to why the SFM finds the least number of features with higher probability than Weston's approach.

In sum, we think we have shown that the SFM is both an effective and, as shown in appendix B, efficient method for feature selection that will open new avenues for data analysis in many functional biological applications, including the rapidly growing field of information-based neuroimaging. Open issues for further research include possible extensions of the SFM to nonlinear classification problems and multiple classes, as well as a more comprehensive comparison with existing approaches in neuroimaging applications.

Finally, we note that we see the SFM as a method that might be used in different ways for repetitive feature selection. For example, if one is not primarily interested in identifying the total number of informative features but to find those features that permit most accurate classification, a slightly different rSFM might be useful: as before, the SFM will be trained on the complete data set in the first repetition, but this time an SVM will be trained on all features found to be relevant. In the next repetition, features found to be relevant will be added to the feature set only if the classification of the SVM trained on all relevant features improves relative to the previous repetition. This heuristic aims at finding the smallest subset of features that contains all information, which is recommended for optimizing classification performance. However, the evaluation of this (and other) approaches requires comprehensive analysis and experiments, which are beyond the scope of this letter. They will be the focus of future studies.

## Appendix: Mathematical Considerations and Implementation

We provide some mathematical considerations that give hints of the superiority of the SFM for feature selection compared to SVM-based methods such as the one proposed by Weston et al. (2003). Further, we provide technical details on how to transform the SFM into a standard linear program to be solved with conventional optimization packages.

### A.1. Mathematical Considerations

The SFM enforces linear separation with an additional constraint on the mean decision value to avoid the trivial solution ** w**=

**0**. However, it is not obvious why this method should be better suited to identify the minimum number of relevant features than, say, the closely related method by Weston et al. (2003). Both methods converge to a local minimum of the target problem, equation 2.2. On artificial and real-world data sets, we observe the SFM to identify relevant features more effectively than other SVM-based feature selection methods.

As we have seen in our simulations, in contrast to the approach of Weston et al., the SFM finds the relevant features basically with the first step. Weston's approach can hardly catch up with the following iterations. It seems that the first step is important and decides whether we will converge to a good minimum. The first step is equivalent to minimizing the one-norm instead of the zero-norm. In the following, we derive necessary conditions for finding a zero-norm solution by minimizing the one-norm (see also Klement & Martinetz, 2011)—for both the SFM and the related SVM-based method by Weston et al. Based on this condition, we explain why the SFM approach finds a zero-norm minimizing solution more frequently by comparing their behavior in a simple illustrative scenario.

*z*_{i}=

*y*

_{i}

*x*_{i}, , and . Using this notation, Weston et al. aim to while in the SFM setting, we aim to Both, equations A.1 and A.2, solve 2.2, but which setting does it more effectively in the first iteration when we replace the zero-norm by the one-norm, or, in the case of the SFM,

In the following, we assume *w*_{0} and *w*_{1} to be unique. This is only a minor restriction, as nonuniqueness will occur only in degenerate cases. Since ** Z** is drawn from a probability distribution, the probability of these cases is of measure zero. The probabilistic nature of the input data also ensures that all quadratic submatrices of

**have full rank. Among all solutions of equation A.1,**

*Z*

*w*_{0}is the solution with the lowest one-norm. Note that if

*w*_{1}is in , then

*w*_{1}=

*w*_{0}. Since in practice, equation A.1 cannot be solved directly, is in general unknown, as is

*w*_{0}. However, both are well defined. In contrast,

*w*_{1}is the solution on the entire and can efficiently be found by linear programming. If

*w*_{0}=

*w*_{1}for a specific data set, then the optimal feature set can be obtained by optimizing for the one-norm. Without loss of generality, for the following considerations, we assume:

All entries of the weight vector are positive: . Otherwise, we invert the corresponding input dimension.

The training data are ordered such that the design matrix with and . So only the first columns of

correspond to active constraints; the constraints are fulfilled with equality. Let be the number of active constraints.*Z*

*If w_{0} contains k nonzero entries, exactly k equations in are active:*.

In linear programming theory, a basic feasible solution is defined to be a solution located in one of the corners of the solution space defined by the constraints. The fundamental theorem of linear programming (found in many textbooks on linear programming, e.g., in Dantzig & Thapa, 2003; Vanderbei, 2008) states that if an optimal solution exists, then a basic optimal solution also exists. In other words, optimal solutions are located in the corners of the solution space, which is exploited by the simplex method for solving linear programming problems.

*k*features of a particular solution to equation A.1. Then , and is valid. Thus,

*w*_{0}can be obtained by a sequence of linear programs. All of them are feasible and nondegenerate. Therefore, an optimal solution exists, and it is a basic optimal solution of a linear program. The weight vector

**contains**

*w**k*nonzero entries, so , that is, the initial

*d*-dimensional problem is equivalent to a

*k*-dimensional one. In a basic solution,

*k*constraints are active and, hence, follows.

We proceed with the main theorem, which provides a necessary condition for finding *w*_{0} with a linear program (i.e., by solving equation A.6).

*For w_{1}=w_{0}, it is necessary that*.

*w*_{0}=

*w*_{1}, for each infinitesimal disparity vector with and , we have Next, we make use of the particular structure of the matrix and split the disparity vector into an upper and a lower part: with . A closed formulation for is derived by rearrangement and using : Finally, equation A.7 can be expressed using and : Equation A.9 has to hold for any infinitesimal , which is the case if and only if holds for all , that is, if and only if ( is excluded according to equation A.8 since then .)

So far, all considerations apply for equation A.1. However, with the following minor changes, a similar condition can be derived for the SFM problem, equation A.2:

The design matrix

is extended by an additional column, the vector.*Z*If

*w*_{0}contains*k*nonzero entries, exactly*k*constraints are active. The last of these constraints is the equality constraint .The design matrix

and the weight vector*Z**w*_{0}are ordered in the same way as before: the first*k*entries of*w*_{0}are nonzero, and the first*k*columns ofcorrespond to active constraints. The*Z**k*th column is .

Theorem 1 and its proof now apply exactly in the same way to the SFM.

Both approaches are very closely connected; however, they are not identical. The slight difference leads to a significantly lower number of features, as we have seen in the experiments. Due to the complexity of both approaches, it is not possible to give a rigorous mathematical proof for the superior performance of the SFM, equation A.2, compared to Weston's approach, equation A.1. However, within a simplified scenario and with approximate arguments, we can use the result of theorem 1 to make superior performance plausible.

*z*_{i}to be drawn from a normal distribution with the expected value: Thus, only the first

*k*features are relevant; all others are irrelevant. For Weston's approach, equation A.1, we have and obtain such that The entries of the vector are distributed as . In contrast, for the SFM, equation A.2, where the last column of is the mean of all

*z*_{i}, we have and obtain and Here, the entries of are distributed as . Obviously, for , the probability that all elements of stay below 1 and, hence, that the condition in theorem 1 is fulfilled, is much larger for the SFM. As expected, the larger

*c*is, the easier it is for both approaches to be successful. Note that we assumed that the elements of and are independent stochastic variables. Of course, since and are selected by the respective algorithm according to certain criteria, this is not really the case.

To summarize, the above considerations are not a proof for a superior performance of the SFM on every data set; however, it provides some insight into why we observe it to identify relevant features more effectively.

### A.2. Implementation

Next, we show how the mathematical formulations are transformed to be solved by conventional optimization frameworks. Given a particular optimization problem, it is in general impossible to say beforehand which of the numerous commercially or freely available solvers will perform best. Therefore, we chose four major optimization packages and performed an empirical analysis of their performance in solving the SFM. Of course, this list is only a small excerpt of the confusingly vast world of linear programming toolboxes; however, they cover the main concepts and algorithms, including the simplex algorithm, interior-point methods, presolvers, and others.

*w*=

_{i}*w*

^{+}

_{i}−

*w*

^{−}

_{i}with and . As either

*w*

^{+}

_{i}or

*w*

^{−}

_{i}or both are 0, |

*w*|=

_{i}*w*

^{+}

_{i}+

*w*

^{−}

_{i}holds. Thus, equation A.11 is transformed into a linear program, where we seek to Note that the constraint is not required. Assume the optimal solution is found and both variables take positive values. Then one could reduce each of them by min(

*w*

^{+}

_{i},

*w*

^{−}

_{i}) without affecting

*w*; the overall weight vector stays the same. However, the objective function is reduced by , which is a contradiction to the initial assumption of

_{i}*w*

^{+}

_{i}and

*w*

^{−}

_{i}being optimal. The input matrices and vectors for equation A.12 take the following values: Here, two conflicting variable naming schemes are mixed to avoid uncommon notations. So,

**, the variable to be optimized, should not be confused with the input data points , and the bias**

*x**b*is to be distinguished from the equality constraint vector

**. We use and to indicate that the variables have no upper or lower bound, respectively. The above formulation is memory inefficient, as it requires the inequality constraint matrix to be stored twice: once with a positive and once with a negative sign. The second minor issue is related to the number of nonzero entries in the constraint matrices. In general, linear programming solvers are most efficient on sparse matrices, and problem formulations should minimize redundancy. In our case, we should seek an alternative formulation involving the training data only once. The key idea is the substitution**

*b**s*=

_{i}*w*

^{+}

_{i}+

*w*

^{−}

_{i}. Thus, we get and, vice versa, The transformed optimization problem, has the same optimum but is memory efficient and much sparser. The input matrices and vectors now take the following values: In the first approach, the constraint matrix

**has**

*A**n*(2

*d*+1) nonzero entries, while in the reformulated version,

*n*(

*d*+1)+4

*d*entries are nonzero. Thus, for

*n*>4, the reformulated version has fewer entries. However, the complexity of linear programming solvers due to numerous processing steps—presolving, scaling, solving—makes an a priori run-time prediction impossible. So other less obvious aspects than the number of nonzero entries might get important in practice and require an empirical run-time evaluation.

*w*=

_{i}*w*

^{+}

_{i}−

*w*

^{−}

_{i}with leads to Again, the size and the structure of the inequality constraint matrix

**are the crucial factors: Here,**

*A**n*(2

*d*+2) entries are nonzero, and as in the hard-margin case, the input data need to be stored twice. This is again avoided by substituting

*s*=

_{i}*w*

^{+}

_{i}+

*w*

^{−}

_{i}to get the linear program: Now the inequality constraint matrix is with 4

*d*+

*dn*+2

*n*nonzero entries.

Technical issues. As with many other machine learning algorithms, normalization is an essential preprocessing step for any of the proposed SFM variants. For all experiments, we normalized the training data sets to zero mean and unit variance and finally scaled all vectors to have a mean norm of one. This last step is sometimes beneficial in high-dimensional scenarios to keep the outcome of scalar products in a reasonable range. The test sets were normalized according to the factors obtained from the corresponding training sets.

For hard SFMs, either no solution exists or a solution where all data points are correctly classified. Since the optimizer uses numerical approximation methods with certain accuracy thresholds, some constraints may be marginally violated. Thus, some data points may be located on the wrong side of the hyperplane, but very close to it, producing a nonzero training error even in the hard case.

To avoid numerical issues, numbers that differed by no more than a specific implementation-dependent number—the machine epsilon—were considered to be equal.

## Acknowledgments

We thank Jana Heussen for help with fMRI scanning. This work was partly funded by the Bundesministerium für Bildung und Forschung (Federal Ministry of Education and Research, grant 01GQ1105 to S.A.).

## References

*Vol. 6353*.

*Artificial neural networks*(pp.

*Proceedings of the 21st International Conference on Artificial Neural Networks*(pp.

## Notes

^{1}

A toolbox with source code and demo applications can be retrieved online from http://www.inb.uni-luebeck.de/tools-demos/support-feature-machine.