## Abstract

Least squares regression (LSR) is a fundamental statistical technique that has been widely applied to feature learning. However, owing to its simplicity, LSR easily neglects the local structure of data, and many methods have adopted an orthogonal constraint to preserve more local information. Another major drawback of LSR is that the loss between soft regression results and hard target values cannot precisely reflect classification ability; to address this, the large margin constraint has been put forward. We therefore combine the large margin and orthogonal constraints to propose a novel algorithm, orthogonal least squares regression with large margin (OLSLM), for multiclass classification in this letter. The core task of this algorithm is to learn the regression targets and an orthogonal transformation matrix simultaneously, so that the proposed model not only ensures that every data point is correctly classified with a larger margin than conventional least squares regression but also preserves more local structure information in the subspace. Our efficient optimization method, which handles the large margin and orthogonal constraints iteratively, is proved convergent in both theory and practice. We also apply the large margin constraint to a sparse learning model for feature selection via joint $\ell_{2,1}$-norm minimization on both the loss function and the regularization term. Experimental results validate that our method performs better than state-of-the-art methods on various real-world data sets.

## 1 Introduction

Numerous real-world applications, such as text mining (Forman, 2003), bioinformatics (Ding & Peng, 2005; Wang et al., 2012), and medical image analysis (Tang, Cui, & Jiang, 2017), contain enormous amounts of high-dimensional data that can be difficult to process because they contain millions of features. In addition, the process of handling the high-dimensional data is complex, and the redundant features may deteriorate the performance. Feature extraction and feature selection are two main techniques for dimensionality reduction. The major difference between them is that feature extraction finds a meaningful low-dimensional representation of high-dimensional data by learning a linear or nonlinear transformation matrix, while feature selection aims to extract the most representative features from the original features and the property of each feature is unchanged.

During the past several decades, an abundance of feature extraction algorithms have been proposed; principal component analysis (PCA; Jolliffe, 2005) and linear discriminant analysis (LDA; Duda, Hart, & Stork, 2001) are the best-known methods for linear dimensionality reduction. PCA is an unsupervised algorithm that maximizes the global variance of the data to obtain a projection matrix whose columns are the eigenvectors corresponding to the largest eigenvalues of the covariance matrix. However, it does not take the local structure of the data samples (Yu, 2012) into account. Manifold learning is another approach to subspace learning, whose most representative work is locally linear embedding (LLE; Roweis & Saul, 2000; Saul & Roweis, 2003), which constructs a reconstruction weight matrix to reflect the intrinsic geometric properties of the data. However, LLE has a serious problem: it is sensitive to parameters and noise. This instability can be attributed to factorizing an ill-conditioned matrix whose smallest eigenvalues are very small while its largest are very large. Therefore, stable LLE (SLLE; Hou, Zhang, Wu, & Jiao, 2009) focuses on enhancing the stability of LLE and employs a kernel transformation to avoid the eigenproblem. The prime limitation of unsupervised methods is their neglect of label information, which plays an important role in classification. Hence, in some classification tasks, supervised learning usually performs better. LDA is the most typical supervised algorithm; its core task is to find an optimal projection by maximizing the ratio of the between-class scatter to the within-class scatter (Russell, Chiang, & Braatz, 2000). Some studies have concluded that LDA in the two-class case is equivalent to least squares regression.

Least squares regression (LSR) has attracted a great deal of attention and has been used for many practical situations. Many popular models have been derived from it, including ridge regression (Hoerl & Kennard, 2000), LASSO (Tibshirani, 1996), and support vector machines (SVM; Cortes & Vapnik, 1995; Chang & Lin, 2011; Hou, Nie, Zhang, Yi, & Wu, 2014). Furthermore, LSR has been applied in many machine learning fields, such as multilabel learning, semisupervised learning, and clustering. Recently, many researchers have attempted to use LSR to deal with dimensionality-reduction problems. However, LSR is constrained by the simplicity of the model: the local structural information cannot be maintained. In order to overcome this drawback, Zhao, Wang, and Nie (2016) have proposed a novel orthogonal least squares regression (OLSR) for feature extraction, and the method is confirmed to be a special case of the quadratic problem on the Stiefel manifold (QPSM) (Nie, Zhang, & Li, 2017).

The main work of this letter is similar to orthogonal least squares discriminant analysis (OLSDA; Nie, Xiang, Liu, Hou, & Zhang, 2012), which exploits orthogonal projection to constrain the transformation matrix, avoiding trivial solutions and preserving more local structure information. The orthogonal projection is desirable and often demonstrates good performance empirically; much previous work has used this property, including orthogonal neighborhood preserving projections (ONPP; Kokiopoulou & Saad, 2007), orthogonal locality preserving projections (OLPP; Cai, He, Han, & Zhang, 2006), and orthogonal locality minimizing globality maximizing projections (OLMGMP; Nie, Xiang, Song, & Zhang, 2009). In addition, the least squares loss between soft regression results and hard labels makes conventional LSR less suitable for classification (Bishop, 2006; Hastie, Tibshirani, & Friedman, 2001). To address this issue, many new LSR variants have been proposed; they fall into two categories.

One category uses a surrogate loss function in place of the least squares loss, such as hinge loss (Cortes & Vapnik, 1995), squared hinge loss (Chang, Hsieh, & Lin, 2008), and logistic loss (Hosmer & Lemeshow, 2004). The other replaces the hard target values with soft labels obtained in different ways, which ensures that the newly learned labels retain more discriminative information. For example, Xiang, Nie, Meng, Pan, and Zhang (2012) propose a framework called discriminative least squares regression (DLSR) that enlarges the distance between the true and false classes by using an $\varepsilon$-dragging technique to push the regression targets of different categories in opposite directions. Retargeted least squares regression (ReLSR; Zhang, Wang, Xiang, & Liu, 2015) learns a target matrix from the input data that guarantees each sample a margin larger than one; it is much more accurate in measuring the classification error of a regression model.

Although OLSR can retain local information, it still uses hard labels, which are not well suited to classification. Thus, we propose a novel method that incorporates the large margin constraint into classical least squares regression under an orthogonal subspace for multiclass classification; we call it orthogonal least squares regression with large margin (OLSLM).

Beyond feature extraction, feature selection is another crucial technique for dimensionality reduction. It can roughly be divided into three categories: filter, wrapper, and embedded methods. In filter models, features are preselected based on certain intrinsic properties of the data, such as variance and other statistical indicators of the features; the selection is independent of the subsequent learning algorithm. Popular filter methods include the Fisher score (FS; Duda et al., 2001), mRMR (Peng, Long, & Ding, 2005), the T-test (Montgomery, Runger, & Hubele, 2007), and information gain (IG; Raileanu & Stoffel, 2004). Wrapper methods (Kohavi & John, 1997) depend on a specific learning algorithm whose learned results are used to select distinctive subsets of features; however, their computational costs are higher than those of filter methods. Embedded methods integrate feature selection and the classification model into a single optimization problem.

The scope of research on sparsity regularization in feature selection has grown rapidly due to its robustness and efficiency. For example, $\ell_1$-SVM (Bradley & Mangasarian, 1998) adopts $\ell_1$-norm regularization, which typically yields a sparse solution, to perform feature selection. Due to the $\ell_1$-norm penalty in $\ell_1$-SVM, the number of selected variables is upper-bounded by the sample size. To remedy this drawback, Wang, Zhu, and Zou (2007) propose a hybrid huberized SVM combining the $\ell_1$-norm and the $\ell_2$-norm. Recently, structural sparsity has become extremely important for selecting grouped features. FS20 (Cai, Nie, & Huang, 2013) incorporates an explicit $\ell_{2,0}$-norm equality constraint into an $\ell_{2,1}$-norm loss term. Because the $\ell_{2,1}$-norm is convex, unlike the $\ell_{2,0}$-norm, researchers prefer it as the regularization term. Nie, Huang, Cai, and Ding (2010) propose a robust and efficient feature selection method via joint $\ell_{2,1}$-norm minimization on both the least squares loss function and the regularization. It is well known that an $\ell_{2,1}$-norm-based loss function is robust to outliers and that $\ell_{2,1}$-norm regularization selects features across all data samples with joint sparsity. Many newly proposed feature selection algorithms are based on $\ell_{2,1}$-norm regularization, both supervised (Xiang et al., 2012; Wang et al., 2011; Lan, Hou, & Yi, 2016) and unsupervised (Yang, Hou, Nie, & Wu, 2012; Hou, Nie, Li, Yi, & Wu, 2014). In addition, correntropy can enhance robustness; He, Tan, Wang, and Zheng (2012) apply correntropy and $\ell_{2,1}$-norm regularization in a unified framework, the correntropy regularization algorithm (CRFS). In this letter, we propose a novel feature selection model, least squares regression with large margin for feature selection (LSLM-FS), which applies the joint $\ell_{2,1}$-norm to both the loss function and the regularization. The large margin constraint is incorporated into our model to improve classification.

The main contributions of our letter are summarized as follows:

- 1.
Our OLSLM model combines the orthogonal constraint and the large margin constraint, which preserves more local information in the subspace and ensures that the margin between the true and false classes is larger than one.

- 2.
We propose another feature selection algorithm with a large margin constraint, which achieves strong performance on several real-world data sets.

- 3.
We propose two efficient iterative optimization algorithms for the two models and prove their convergence.

The experiments are divided into two parts: OLSLM and LSLM-FS. We use classification accuracy as the evaluation criterion. Experiments conducted on several public benchmark data sets, including UCI and high-dimensional data sets, indicate that the OLSLM algorithm outperforms the comparison algorithms, including an SVM classifier under four different settings, conventional LSR, and its variants. The validity of the extended feature selection algorithm, LSLM-FS, is tested on nine data sets from different fields, again using recognition accuracy to evaluate performance.

In section 2, we give some notations and definitions. In section 3, we describe the OLSLM model for multiclass classification in detail. In section 4, we present the LSLM-FS model. In section 5, experiments are conducted to evaluate the OLSLM and LSLM-FS methods. The conclusions are drawn in section 6.

## 2 Notations and Definitions

We summarize the notations and definitions of norms used in this letter. We use bold uppercase for matrices and bold lowercase for vectors; scalars are in regular fonts. For a matrix $W = (w_{ij})$, its $i$th row and $j$th column are denoted $w^i$ and $w_j$, respectively. The $\ell_p$-norm of a vector $v \in \mathbb{R}^n$ is defined as $\|v\|_p = \left(\sum_{i=1}^n |v_i|^p\right)^{1/p}$. The Frobenius norm of a matrix $W \in \mathbb{R}^{n \times m}$ is defined as $\|W\|_F = \sqrt{\sum_{i=1}^n \sum_{j=1}^m w_{ij}^2} = \sqrt{\sum_{i=1}^n \|w^i\|_2^2}$, and the $\ell_{2,1}$-norm of $W$ is defined as $\|W\|_{2,1} = \sum_{i=1}^n \sqrt{\sum_{j=1}^m w_{ij}^2} = \sum_{i=1}^n \|w^i\|_2$.
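As a quick illustration, the Frobenius and $\ell_{2,1}$ norms above can be computed directly from their row-wise definitions. This is a minimal NumPy sketch of the definitions, not code from the letter:

```python
import numpy as np

def frobenius_norm(W):
    # ||W||_F = sqrt(sum_ij w_ij^2) = sqrt(sum_i ||w^i||_2^2)
    return np.sqrt((W ** 2).sum())

def l21_norm(W):
    # ||W||_{2,1} = sum_i ||w^i||_2  (sum of row-wise l2 norms)
    return np.sqrt((W ** 2).sum(axis=1)).sum()

W = np.array([[3.0, 4.0], [0.0, 0.0]])
print(frobenius_norm(W))  # 5.0
print(l21_norm(W))        # 5.0
```

Note that the two norms coincide here only because a single row is nonzero; in general $\|W\|_{2,1} \ge \|W\|_F$.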

## 3 Orthogonal Least Squares Regression with Large Margin

### 3.1 Least Squares Regression Revisited

### 3.2 OLSLM for Multiclass Classification

Motivated by Xiang et al. (2012) and Zhao et al. (2016), ordinary least squares regression can be extended with an orthogonal constraint to maintain more local discriminant information. Even so, the target values are still manually assigned the hard 0/1 values of a class label vector, even under the orthogonal projection. Therefore, in this letter, we develop a novel method for multiclass classification, orthogonal least squares regression with large margin (OLSLM), which adds the target matrix constraint on top of OLSR so that it not only preserves local structure but also creates a large margin for the targets, ensuring the requirement of correct classification in theory. In this section, we describe the OLSLM model and present algorithm 1 for solving the optimization problem. Finally, we analyze the computational complexity of the algorithm and provide a convergence analysis.

#### 3.2.1 Problem Formulation

#### 3.2.2 An Iterative Algorithm to Solve Problem 3.3

Problem 3.3 involves three variables and two constraints, which makes it challenging to optimize directly. We present an iterative optimization algorithm that tackles this problem by updating the variables alternately. We now introduce the steps of our optimization algorithm:

*Step 1.* We fix $T$ to solve for $W$ and $b$. Then the optimization problem, equation 3.3, becomes a regression problem:

*Step 2.* Given $W$ and $b$, equation 3.3 reduces to solving the retargeting problem:

In summary, we have described the optimization algorithm for solving problem 3.3 and list the detailed steps in algorithm 1.
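The retargeting idea in step 2 can be illustrated with a minimal sketch. This is our own simplified heuristic, not the exact closed-form solution of Zhang et al. (2015): given the current regression output for one sample, it raises only the true-class target just enough to enforce the unit margin.

```python
import numpy as np

def retarget(r, y):
    """Simplified retargeting heuristic (an assumption, not the paper's
    closed form). r: (c,) regression scores; y: true class index."""
    t = r.copy()
    runner_up = np.max(np.delete(r, y))
    # Raise the true-class target just enough to keep a margin of at least 1.
    t[y] = max(r[y], runner_up + 1.0)
    return t

r = np.array([0.2, 0.9, 0.4])
t = retarget(r, y=0)
print(t)  # [1.9 0.9 0.4]; margin t[0] - max of the rest is exactly 1.0
```

The exact per-sample subproblem is a small quadratic program; this sketch only shows why the learned targets satisfy the margin constraint by construction.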

#### 3.2.3 Convergence Analysis

#### 3.2.4 Computational Complexity Analysis

Given a data matrix $X\u2208Rn\xd7d$, the main computational costs of algorithm 1 are concentrated in steps 6 to 12:

- •
The complexity of step 7 is $O(mnd^2+md^3)$.

- •
The complexity of step 8 is $O(ndc)$.

- •
We need $O(nc)$ to calculate steps 9 to 11.

Here, $m$ and $k$ are the numbers of iterations in OLSR and OLSLM, respectively. Considering that in real-world applications all data sets are resized to a proper size, the number of samples $n$ is much larger than $m$ and $d$. The computational complexity of OLSLM is therefore $O(kmnd^2)$, which is linear with respect to $n$, so our algorithm scales well and is easy to implement.

## 4 Least Squares Regression with Large Margin for Feature Selection

### 4.1 Robust Feature Selection Based on the $\ell_{2,1}$-Norm

### 4.2 LSLM for Feature Selection

In the RFS (Nie et al., 2010) algorithm, the least squares loss between soft regression results and hard zero-one labels makes LSR less suitable for classification tasks. In ReLSR, Zhang et al. (2015) showed that regression targets can instead be learned from the data, focusing on relative values so that each sample is correctly classified with a large margin. In this section, we use the idea of the large margin to derive a novel method for feature selection, least squares regression with large margin for feature selection (LSLM-FS), and present an efficient algorithm to solve it. Finally, we analyze the optimization algorithm and its computational complexity.

#### 4.2.1 Problem Formulation

#### 4.2.2 An Iterative Algorithm to Solve Problem 4.7

We present an efficient iterative optimization algorithm to address the proposed problem, which involves three variables:

*Step 1.* We fix $T$ to optimize $W$ and $b$. Then problem 4.7 reduces to the following subproblem:

$U \in \mathbb{R}^{(d+1)\times(d+1)}$ is a diagonal matrix whose $i$th diagonal entry is $U_{ii} = 1/\|\tilde{w}^i\|_2$, where $\tilde{w}^i$ is the $i$th row of $\tilde{W}$. $D \in \mathbb{R}^{n\times n}$ is a diagonal matrix whose $i$th diagonal entry is $D_{ii} = 1/\|\tilde{x}_i^T\tilde{W} - t_i^T\|_2$, where $t_i$ is the $i$th column of the matrix $T$.
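The reweighting matrices $U$ and $D$ defined above can be formed directly from the current iterates. The sketch below follows the standard iteratively reweighted treatment of $\ell_{2,1}$-norms; the small `eps` guard against division by zero and the row-wise $n \times c$ orientation of the target matrix are our assumptions, not stated in the text.

```python
import numpy as np

def build_U(W_tilde, eps=1e-8):
    # U_ii = 1 / ||w~^i||_2, one entry per row of the augmented W~.
    row_norms = np.sqrt((W_tilde ** 2).sum(axis=1))
    return np.diag(1.0 / (row_norms + eps))

def build_D(X_tilde, W_tilde, T, eps=1e-8):
    # D_ii = 1 / ||x~_i^T W~ - t_i^T||_2, one entry per sample residual.
    # T is stored n x c here (row i holds the targets of sample i).
    R = X_tilde @ W_tilde - T
    res_norms = np.sqrt((R ** 2).sum(axis=1))
    return np.diag(1.0 / (res_norms + eps))
```

These matrices are recomputed from the previous iterate at each pass of step 1, which is exactly how joint $\ell_{2,1}$-minimization problems are typically reduced to weighted least squares subproblems.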

*Step 2.* Given $W$ and $b$, the problem can be rewritten as an optimization over $T$ alone. Thus, problem 4.13 can be decomposed into $n$ subproblems with the general form

Based on this analysis, the optimization problem in equation 4.7 has been solved. The steps of the optimization procedure are briefly given in algorithm 2.

#### 4.2.3 Algorithm Analysis

We provide an iterative algorithm that solves problem 4.7 in algorithm 2 and obtains the optimal $W$. We can then select $z$ features from the $d$ original features. First, we calculate a score for each feature, $\|w^i\|_2\;(i=1,2,\ldots,d)$. Then we sort these scores and select the top $z$ ranked features as the final result. The convergence of algorithm 2 can be proved similarly to that of algorithm 1.
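The selection rule described above, scoring each feature by the $\ell_2$-norm of its row of $W$ and keeping the top $z$, can be sketched as follows; `W` here is a placeholder for the learned matrix, not output from algorithm 2.

```python
import numpy as np

def select_features(W, z):
    # Score feature i by ||w^i||_2, the l2-norm of row i of W.
    scores = np.sqrt((W ** 2).sum(axis=1))
    # Indices of the top-z features, highest score first.
    return np.argsort(scores)[::-1][:z]

W = np.array([[0.1, 0.0],   # feature 0: weak
              [2.0, 1.0],   # feature 1: strongest
              [0.0, 0.5]])  # feature 2: moderate
print(select_features(W, 2))  # [1 2]
```

Because the $\ell_{2,1}$ regularizer drives whole rows of $W$ toward zero, most scores are near zero and this ranking cleanly separates the selected features from the discarded ones.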

#### 4.2.4 Computational Complexity Analysis

In this part, we analyze the computational complexity of the LSLM-FS model in algorithm 2. The main computational costs are concentrated in step 10 and steps 15 to 18. The complexity of step 10 is $O((d+1)^2(n+c)+(d+1)nc)$. We need $O(ndc)$ to calculate step 15. Another computationally demanding operation is performed in steps 16 to 18, with complexity $O(nc)$. Supposing the number of iterations is $k$, the overall computational complexity of algorithm 2 is about $O(k(d+1)^2(n+c))$.

## 5 Experiments

### 5.1 Experimental Results of OLSLM

In order to verify the performance of the OLSLM model, we compare our approach with traditional LSR, DLSR, ReLSR, OLSR, L1-SVM with hinge loss, L2-SVM with squared hinge loss, logistic regression (LR), and the multiclass SVM (MC-SVM) with multiclass hinge loss, on a range of real-world data sets. We give a brief introduction to all of the data sets used in our experiments. Then we introduce the setting of parameters and analyze the experimental results.

#### 5.1.1 Data Sets

In our experiments, we use 12 public data sets to evaluate the performance of the proposed method. Table 1 summarizes these data sets. The first six are taken from the UCI Machine Learning Repository.^{1} The following six high-dimensional data sets are downloaded from the web.^{2} They comprise a palm print data set and five face data sets (AR, Georgia Tech, CMU PIE, Yale, and YaleB).

The AR (Martinez, 1998) data set contains over 4000 color face images of 126 people; we use a subset of 100 persons with 26 color images each. Georgia Tech (Chen, Man, & Nefian, 2005) contains images of 50 people, with 15 color images per individual. Most of the images were taken in two sessions to capture variations in illumination, facial expression, and appearance. The CMU PIE (Sim, Baker, & Bsat, 2002) data set contains 41,368 images of 68 people under 13 poses, 4 expressions, and 43 illumination conditions; POSE07 is a subset of this data set. The Yale (Georghiades, Belhumeur, & Kriegman, 2001) data set consists of 165 grayscale images of 15 individuals, with 11 images per subject, one per facial expression or configuration: center-light, with glasses, happy, left-light, without glasses, normal, right-light, sad, sleepy, surprised, and wink. The YaleB (Georghiades et al., 2001) data set is an extension of Yale consisting of 16,128 face images of 38 people; in this experiment, we chose the 2414 frontal images. The Palm (Nie, Wang, & Huang, 2014) data set contains 2000 palm images in 100 classes. Color images were converted to grayscale, and, to accelerate computation, images in all data sets were proportionally resized (see Table 1).

| Data Sets | Classes | Features | Total Number | Train Number |
|---|---|---|---|---|
| cancer | 2 | 9 | 683 | 274 |
| cars | 3 | 8 | 392 | 157 |
| glass | 6 | 9 | 214 | 86 |
| heart | 2 | 13 | 270 | 108 |
| iris | 3 | 4 | 150 | 60 |
| vowel | 11 | 13 | 990 | 396 |
| AR | 100 | 165 $\times$ 120 (17 $\times$ 12) | 2600 | 800 |
| GT | 50 | 480 $\times$ 640 (15 $\times$ 20) | 750 | 150 |
| POSE07 | 68 | 64 $\times$ 64 (16 $\times$ 16) | 1629 | 339 |
| Yale | 15 | 243 $\times$ 320 (15 $\times$ 20) | 165 | 30 |
| YaleB | 38 | 32 $\times$ 32 (16 $\times$ 16) | 2414 | 490 |
| Palm | 100 | 16 $\times$ 16 (16 $\times$ 16) | 2000 | 200 |


#### 5.1.2 Parameter Settings

$\mathrm{tr}(\cdot)$ denotes the trace of a matrix. We use 10-fold cross-validation to select the optimal hyperparameter $\beta$ from the interval [0:0.1:1] when the number of training samples is more than 200; three-fold cross-validation is performed on the remaining data sets. For SVMs, there is a regularization parameter $C$ in LIBLINEAR;^{3} we select it by the same cross-validation from the candidate set $\{10^{-2}, 10^{-1}, 10^{0}, 10^{1}, 10^{2}\}$.

In each experiment, we randomly select a small set of samples from each class for training; Table 1 lists the numbers of training samples. The classifier is a one-nearest-neighbor classifier. We use recognition accuracy to evaluate our method against the different models; the average classification accuracy and standard deviation are obtained over 10 random splits.
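The evaluation protocol above can be sketched as a plain one-nearest-neighbor classifier scored by accuracy. This is a minimal illustration with toy two-dimensional data, not the experimental pipeline; the projected training and test features would come from the learned transformation.

```python
import numpy as np

def nn1_accuracy(Z_train, y_train, Z_test, y_test):
    """Classify each test point by its nearest training point (squared
    Euclidean distance) and return the fraction classified correctly."""
    correct = 0
    for z, y in zip(Z_test, y_test):
        d = ((Z_train - z) ** 2).sum(axis=1)
        correct += int(y_train[np.argmin(d)] == y)
    return correct / len(y_test)

Z_tr = np.array([[0.0, 0.0], [1.0, 1.0]])
y_tr = np.array([0, 1])
Z_te = np.array([[0.1, 0.0], [0.9, 1.2]])
y_te = np.array([0, 1])
print(nn1_accuracy(Z_tr, y_tr, Z_te, y_te))  # 1.0
```

Averaging this accuracy over the 10 random splits yields the mean and standard deviation reported in the tables.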

#### 5.1.3 Experimental Results Analysis

We previously proposed an efficient optimization algorithm to minimize the objective function value and proved its convergence at the theoretical level. In addition, we evaluated the convergence of our model on all the data sets used in our experiments. As Figure 2 shows, the objective function values decrease monotonically during the iterative process, consistent with the theoretical analysis. The algorithm converges to a stable value quickly, usually within 30 iterations and always within 50.

In addition, we report the average training time (per run) of our method, LSR, and its variant methods on the real data sets; the results are shown in Table 2. As analyzed in section 3.2.2, computing the transformation matrix with the OLSR algorithm is unavoidable and is performed in each iteration; thus, OLSLM takes longer.

| Data Sets | LSR | DLSR | ReLSR | OLSR | OLSLM |
|---|---|---|---|---|---|
| cancer | 0.031 | 0.047 | 0.047 | 0.031 | 0.125 |
| cars | 0.031 | 0.047 | 0.047 | 0.078 | 0.094 |
| glass | 0 | 0 | 0.031 | 0.063 | 0.109 |
| heart | 0 | 0.016 | 0.047 | 0.047 | 0.078 |
| iris | 0.031 | 0.063 | 0.047 | 0.047 | 0.078 |
| vowel | 0.047 | 0.078 | 0.063 | 0.047 | 0.219 |
| AR | 0.156 | 0.218 | 0.578 | 0.203 | 3.766 |
| GT | 0.031 | 0.031 | 0.109 | 0.141 | 1.484 |
| POSE07 | 0 | 0.063 | 0.141 | 0.141 | 2.625 |
| Yale | 0 | 0.031 | 0 | 0.171 | 2.125 |
| YaleB | 0.063 | 0.141 | 0.094 | 0.203 | 3.125 |
| Palm | 0.016 | 0.125 | 0.156 | 0.172 | 1.844 |


### 5.2 Experimental Results of LSLM-FS

We evaluated the performance of the LSLM-FS model on a series of real-world high-dimensional data sets. We compared our method with six benchmark feature selection algorithms: T-test, RFS, CRFS, FS20, FS, and mRMR. We give a brief description of all data sets in Table 3 and then introduce the parameter settings and analyze the experimental results.

| Data Sets | Classes | Features | Total Number | Train Number |
|---|---|---|---|---|
| Faces95 | 72 | 200 $\times$ 180 (20 $\times$ 18) | 1440 | 576 |
| PIE10P | 10 | 55 $\times$ 44 | 210 | 80 |
| Yale | 15 | 243 $\times$ 320 (27 $\times$ 36) | 165 | 60 |
| YaleB | 38 | 32 $\times$ 32 (16 $\times$ 16) | 2414 | 978 |
| Coil20 | 20 | 128 $\times$ 128 (16 $\times$ 16) | 1440 | 580 |
| CLL-SUB-111 | 3 | 11340 | 111 | 44 |
| LUNG | 5 | 3312 | 203 | 81 |
| TOX | 4 | 5748 | 171 | 69 |
| WebKB-WC | 7 | 4189 | 1210 | 484 |


#### 5.2.1 Data Sets

In order to test the performance of our proposed approach, we use nine public data sets from different fields; two of them (Yale and YaleB) also appear in Table 1.

The Coil20 (Nene, Nater, & Murase, 1996) data set includes 20 objects, each with 72 gray images taken from different view directions. The LUNG (Bhattacharjee et al., 2001) data set contains 203 samples in five classes; genes with standard deviations smaller than 50 expression units were removed, producing a data set with 203 samples and 3312 genes. Details about the Faces95,^{4} PIE10P, CLL-SUB-111, TOX,^{5} and WebKB-WC^{6} data sets are given in Table 3.

#### 5.2.2 Parameter Settings

We compared our algorithm with several typical feature selection algorithms: the Fisher score (FS), minimum redundancy maximum relevance (mRMR), Student's T-test, RFS, CRFS, and FS20. The LSLM-FS model has a parameter $\beta$ that needs to be tuned. Feature selection performance is evaluated by average classification accuracy using the LibSVM classifier. We use 10-fold cross-validation to select a proper $\beta$ and the regularization parameter $C$ of LibSVM when there are more than 200 training samples; three-fold cross-validation is performed on the remaining data sets. We tune them by grid search over $\beta \in \{10^{-2}, 10^{-1}, 10^{0}, 10^{1}, 10^{2}\}$ and $C \in \{10^{-2}, 10^{-1}, 10^{0}, 10^{1}, 10^{2}\}$. For all competitors, we also tune the parameters and report the results under the best settings. In each experiment, we randomly select a percentage of samples from each class for training; Table 3 lists the numbers of training samples. We use recognition accuracy to evaluate performance; the average classification accuracy and standard deviation are obtained over 10 random splits.
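The grid search described above can be sketched generically. Here, the `evaluate` callable is a placeholder standing in for training LSLM-FS and scoring with LibSVM on one train/validation split; the fold construction and best-pair bookkeeping are the parts the sketch actually shows.

```python
import itertools
import numpy as np

def grid_search(evaluate, betas, Cs, n, k=3, seed=0):
    """Pick the (beta, C) pair with the best mean k-fold accuracy.
    evaluate(beta, C, train_idx, val_idx) -> accuracy is a placeholder
    for fitting the model on one split."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), k)  # shuffled index folds
    best_pair, best_acc = None, -1.0
    for beta, C in itertools.product(betas, Cs):
        accs = []
        for i in range(k):
            val_idx = folds[i]
            train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
            accs.append(evaluate(beta, C, train_idx, val_idx))
        if np.mean(accs) > best_acc:
            best_acc, best_pair = np.mean(accs), (beta, C)
    return best_pair
```

With the logarithmic grid $\{10^{-2},\ldots,10^{2}\}$ for both parameters, this is a 25-point search repeated $k$ times per point, which matches the protocol described in the text.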

#### 5.2.3 Experimental Results Analysis

Compared with the traditional feature selection algorithms, including FS, the T-test, and mRMR, our algorithm achieves higher mean accuracies on almost all data sets across different numbers of selected features. In most cases, our method clearly improves on RFS, which indicates that the large margin idea is effective for improving classification. The CRFS algorithm also achieves very promising results: for some data sets, in particular TOX and WebKB-WC, CRFS outperforms LSLM-FS when the number of selected features is small. However, as shown in Figure 3, as the number of selected features increases, our algorithm achieves higher mean accuracies than CRFS. On the PIE10P data set, our algorithm yields slightly lower accuracy than FS20 at low dimensions.

Table 4 reports the average training time of running one split with the seven algorithms. As the results show, compared with the three classic algorithms (T-test, FS, and mRMR), our method costs less time on most data sets, particularly the high-dimensional ones. The remaining algorithms take more time because they are implemented iteratively, especially FS20, which needs many iterations to converge.

| Data Sets | T-test | FS | mRMR | RFS | CRFS | FS20 | LSLM-FS |
|---|---|---|---|---|---|---|---|
| Faces95 | 13.219 | 0.672 | 0.703 | 0.031 | 0.078 | 0.094 | 1.641 |
| PIE10P | 1.611 | 1.156 | 2.297 | 0.031 | 0.688 | 0.016 | 0.141 |
| Yale | 1.578 | 0.656 | 2.297 | 0.031 | 0.094 | 0.016 | 0.031 |
| YaleB | 4.406 | 0.406 | 0.941 | 0.063 | 0.031 | 0.266 | 3.281 |
| Coil20 | 0.984 | 0.234 | 0.794 | 0.047 | 0.062 | 0.218 | 0.984 |
| CLL-SUB-111 | 0.531 | 1.859 | 2.766 | 0.031 | 39.734 | 0.094 | 0.281 |
| LUNG | 0.500 | 0.641 | 2.313 | 0.031 | 1.391 | 0.047 | 0.219 |
| TOX | 0.531 | 1.406 | 2.453 | 0.047 | 6.672 | 0.031 | 0.703 |
| WebKB-WC | 0.578 | 1.109 | 3.406 | 0.031 | 6.750 | 0.031 | 0.218 |


## 6 Conclusion

In this letter, we propose two novel algorithms: orthogonal least squares regression with large margin (OLSLM) for multiclass classification and least squares regression with large margin for feature selection (LSLM-FS). OLSLM differs from most traditional LSR and its extensions: it makes use of the orthogonal constraint and soft label information to boost classification ability. The core idea is to impose the large margin constraint on an OLSR model, replacing the hard labels with relative values. Thus, our method simultaneously preserves local discriminant information and guarantees that each sample is correctly classified with a large margin. An efficient iterative algorithm is proposed to solve the optimization problem. In addition, the large margin idea is applied to a novel feature selection method: LSLM-FS adopts the $\ell_{2,1}$-norm on both the loss function and the regularization to produce a sparse learning model under the large margin constraint, and an efficient iterative algorithm is proposed for it as well.

The OLSLM model, however, is sensitive to noise; that is, it cannot suppress noise interference. In the future, we can replace the $F$-norm with other norms to make the model more robust. In addition, many graph-based methods have been widely used recently, so one part of our future work is to incorporate graph theory into our framework.

## Notes

^{4}

http://cswww.essex.ac.uk/mv/allfaces/index.html.

## Acknowledgments

This research is supported in part by the National Natural Science Foundation of China (61402002, 61502002, 61300057); the Project Sponsored by the Scientific Research Foundation for the Returned Overseas Chinese Scholars, State Education Ministry (48,2014-1685); the Natural Science Foundation of Anhui Province (1408085QF120, 1408085MKL94); the Key Natural Science Project of Anhui Provincial Education Department (KJ2016A040); and Open Project of IAT Collaborative Innovation Center of Anhui University (ADXXBZ201511).

## References

Montgomery, D. C., Runger, G. C., & Hubele, N. F. (2007). *Engineering statistics*. Hoboken, NJ: Wiley.

Nie, F., Huang, H., Cai, X., & Ding, C. (2010). Efficient and robust feature selection via joint $\ell_{2,1}$-norms minimization. In *Advances in neural information processing systems*.