Abstract

This work proposes a robust regression framework with a nonconvex loss function. Two regression formulations are presented based on the Laplace kernel-induced loss (LK-loss). Moreover, we show that the LK-loss function is a good approximation of the zero-norm. However, the nonconvexity of the LK-loss makes it difficult to optimize. A continuous optimization method is developed to solve the proposed framework: the problems are formulated as DC (difference of convex functions) programs, and the corresponding DC algorithms (DCAs) converge linearly. Furthermore, the proposed algorithms are applied directly to determining the hardness of licorice seeds from near-infrared spectral data with noisy input. Experiments in eight spectral regions show that the proposed methods improve generalization compared with traditional support vector regression (SVR), especially in high-frequency regions. Experiments on several benchmark data sets demonstrate that the proposed methods achieve better results than traditional regression methods on most of the data sets considered.

1  Introduction

Given a training set $T=\{(x_i,y_i)\}_{i=1}^{n}$, where $x_i\in\mathbb{R}^{d}$ represents an input sample, $y_i\in\mathbb{R}$ defines the output of $x_i$, and n represents the number of samples, the goal of the regression problem is to learn the relationship between x and y so as to minimize the expected risk. Generally the regularized regression framework can be formulated as
$$\min_{w,b}\ \frac{1}{2}\|w\|_{p}^{p}+C\sum_{i=1}^{n}L\big(y_i-w^{\top}x_i-b\big),$$
1.1
where C > 0 is a penalty parameter and the regularizer $\|w\|_{p}^{p}$, $p\ge 1$, controls model complexity. Two popular choices in machine learning are p = 1 and 2. The second term in the objective function defines the empirical risk, expressed through a loss function $L(\cdot)$; different loss functions correspond to different mathematical programs. For example, the ε-insensitive loss function, $L_{\varepsilon}(u)=\max(|u|-\varepsilon,0)$, is applied in the standard SVR (Vapnik, 1998); the quadratic loss function (called the ℓ2-loss), $L(u)=u^{2}$, is employed in least-squares SVR (LSSVR; Suykens & Vandewalle, 1999; Suykens, Brabanter, Lukas, & Vandewalle, 2002). Compared with traditional SVR, the main advantages of the LSSVR are that it runs fast and is easy to implement, with comparable generalization.

The quadratic loss is a popular choice in regression problems; it corresponds to a zero-mean gaussian noise density proportional to $\exp(-u^{2}/(2\sigma^{2}))$. The gradient of the quadratic loss, $2u$, is an unbounded function. Thus, the ℓ2-loss is sensitive to noise and outliers. In many applications, samples may be contaminated by outliers or noise. These characteristics motivated us to construct robust regression models by replacing the least-squares loss with other robust losses such as the Huber loss (Huber, 1981), the pinball loss (Steinwart & Christmann, 2011), entropy-based losses (Principe, 2010; Singh, Pokharel, & Principe, 2014; Feng, Huang, Shi, Yang, & Suykens, 2015), and the rescaled hinge loss function (Xu, Cao, Hu, & Principe, 2017).

The ℓ1-loss assumes a zero-mean Laplace prior distribution with density proportional to $\exp(-|u|/\sigma)$. Its derivatives are bounded when $u>0$ and $u<0$. Thus, the ℓ1-loss is less sensitive to noise and outliers than the ℓ2-loss. The ε-insensitive loss, $\max(|u|-\varepsilon,0)$, is a generalized ℓ1-loss. The zero-norm-based loss corresponds to a zero-mean impulse-like prior; it leads to a discontinuous and nonconvex problem. Therefore, most work on the zero-norm has focused on effective approximations of it. The ℓ1-norm is only a convex approximation of the zero-norm.

Different nonconvex loss functions have been proposed to develop robust regressions. Typical examples include the C-loss (Singh et al., 2014), the rescaled hinge loss (Xu et al., 2017), the M-estimate cost (Chan & Zou, 2004; Liu, Pokharel, & Principe, 2006), and the ramp loss (Huang, Shi, & Suykens, 2014). The rescaled hinge loss is based on the hinge loss function and has been used successfully in SVM classification (Xu et al., 2017). The ramp loss is based on a truncated hinge loss, which is a bounded nonconvex loss function. Based on kernel learning, the C-loss function is induced by a gaussian kernel and has been used successfully in correntropy learning (Liu et al., 2007; Principe, 2010; Feng et al., 2015). Correntropy, a nonlinear and local similarity measure, has a close relation to M-estimation. It is insensitive to noise and thus provides a robust adaptation cost in the presence of outliers and noise. The kernel function in correntropy learning is usually a gaussian kernel, which is desirable due to its smoothness and strict positive definiteness. However, a gaussian kernel is not always the best choice.
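For concreteness, the following minimal Python sketch implements several of the classic losses mentioned above; the default values of ε and δ are illustrative placeholders, not values used in this work.

```python
import numpy as np

def eps_insensitive_loss(u, eps=0.1):
    """Vapnik's epsilon-insensitive loss: zero inside the tube, linear outside."""
    return np.maximum(np.abs(u) - eps, 0.0)

def squared_loss(u):
    """L2-loss used by LSSVR; its gradient 2u is unbounded, hence sensitive to outliers."""
    return u ** 2

def absolute_loss(u):
    """L1-loss; its (sub)gradient is bounded, so it is less sensitive to large residuals."""
    return np.abs(u)

def huber_loss(u, delta=1.0):
    """Huber loss: quadratic near zero, linear in the tails."""
    return np.where(np.abs(u) <= delta,
                    0.5 * u ** 2,
                    delta * (np.abs(u) - 0.5 * delta))

# A residual vector with one outlier: the squared loss explodes, the robust losses do not.
u = np.array([0.05, -0.2, 0.5, 10.0])
print(squared_loss(u), absolute_loss(u), huber_loss(u), eps_insensitive_loss(u))
```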

In this work, we propose to use the Laplace kernel function in correntropy. The main contributions of this work are summarized as follows:

  1. A robust regression framework is proposed based on the Laplace kernel. Two regression formulations are presented with the Laplace kernel–induced loss (LK-loss): one with the LK-loss alone (LKRE) and one with a mixed loss function (MLKRE).

  2. The nonconvexity of the LK-loss makes it difficult to optimize. A continuous optimization method in this work is developed to solve the proposed problems. By an appropriate decomposition of the LK-loss, the proposed models are formulated as DC (difference of convex functions) programming (Tao & An, 1997; Thi, Le, Nguyen, & Dinh, 2008; Yang & Wang, 2013; Thi, Tao, Minh, & Thanh, 2014). The corresponding DC algorithms converge linearly and have low computational burden, solving only a few simple quadratic programming problems.

  3. We demonstrate that the LK-loss provides a good approximation of the zero-norm: given a vector e, its zero-norm can be approximated by a sum of LK-loss terms evaluated at its components.

  4. The LK-loss is bounded, and its derivative is also bounded. From the viewpoint of M-estimation (Huber, 1981; Chan & Zou, 2004), the LK-loss is therefore a robust adaptation cost in the presence of outliers and noise.

  5. The proposed models are simulated on various data sets, and we evaluate the proposed methods under different noise settings. Furthermore, the proposed methods are applied directly to a practical problem: analyzing the hardness rate of licorice seeds from near-infrared (NIR) spectroscopy data (Yang & Sun, 2012). Experimental results in different spectral regions show that the proposed methods obtain better generalization than traditional regression methods, especially in high-frequency regions.

Throughout the work, we adopt the following notation. The scalar product of two vectors x and y in the n-dimensional real space $\mathbb{R}^{n}$ is denoted by $x^{\top}y$ or $\langle x,y\rangle$. For an n-dimensional vector x, $\|x\|_{1}$ denotes the ℓ1-norm of x, $\|x\|_{1}=\sum_{i}|x_i|$, where $|\cdot|$ denotes the absolute value operator, and $\|x\|$ denotes the ℓ2-norm of x, $\|x\|=\big(\sum_{i}x_i^{2}\big)^{1/2}$. The base of the natural logarithm is denoted by e. A vector of zeros of arbitrary dimension is denoted by 0, and a vector of ones of arbitrary dimension is denoted by 1.

The rest of this letter is organized as follows. Section 2 gives a short summary of traditional regression methods and DC programming. In section 3, we propose a new robust regression framework based on Laplace kernel–induced loss function, and we present two regression formulations with LK-loss. Experiments are carried out in section 4. Section 5 summarizes the main contributions.

2  Background

2.1  Regression with Least Square Loss Function

Let p = 2, and take the ℓ2-loss in equation 1.1; then the regularized linear regression formulation has the form
$$\min_{w,b}\ \frac{1}{2}\|w\|^{2}+C\sum_{i=1}^{n}e_i^{2},$$
2.1

where $e_i=y_i-w^{\top}x_i-b$. This is similar to the least-squares SVR (LSSVR), the goal of which is to minimize both the norm of the weight vector and the regression errors on the training set, where C is the penalty parameter. In contrast to the standard SVR, the LSSVR uses equality constraints for the errors and the ℓ2-loss. As discussed above, the quadratic loss is sensitive to noise and outliers. To tackle this drawback, the robustness of the LSSVR has been investigated recently (Suykens et al., 2002; Zhao & Sun, 2008; Wen, Hao, & Yang, 2010; Yang, Tan, & He, 2014), for example, through a weighted LSSVR (Suykens et al., 2002), least-squares SVR with noisy data (Zhao & Sun, 2008; Wen et al., 2010), and least-squares SVR with a nonconvex loss function (Yang et al., 2014).

2.2  DC Programming and DC Algorithm

DC programming and DCA, introduced by Pham Dinh Tao in 1985, constitute the backbone of nonconvex continuous programming. Generally a DC program takes the form
$$\alpha=\inf\big\{f(x):=g(x)-h(x)\ :\ x\in\mathbb{R}^{n}\big\},$$
2.2
where g and h are lower semicontinuous proper convex functions on $\mathbb{R}^{n}$. Such a function f is called a DC function, and g and h are the DC components of f. A convex function g is said to be polyhedral convex if
$$g(x)=\max_{i=1,\ldots,m}\big\{\langle a_i,x\rangle+b_i\big\}+\chi_{K}(x),$$
2.3
where $a_i\in\mathbb{R}^{n}$ and $b_i\in\mathbb{R}$. Here $\chi_{K}$ is the indicator function of the nonempty convex set K, defined as $\chi_{K}(x)=0$ if $x\in K$ and $+\infty$ otherwise.
A point $x^{*}$ that satisfies the following generalized Kuhn-Tucker condition is called a critical point of $g-h$:
$$\partial g(x^{*})\cap\partial h(x^{*})\neq\emptyset,$$
2.4
where $\partial g(x^{*})$ is the subdifferential of the convex function g at $x^{*}$.
The necessary local optimality condition for problem 2.2 is
$$\emptyset\neq\partial h(x^{*})\subset\partial g(x^{*}),$$
2.5
which is also sufficient for many important classes of DC programs—for example, polyhedral DC programs or when f is locally convex at $x^{*}$. We use $g^{*}$ to denote the conjugate function of g. The Fenchel-Rockafellar dual of problem 2.2 is defined as
$$\alpha_{D}=\inf\big\{h^{*}(y)-g^{*}(y)\ :\ y\in\mathbb{R}^{n}\big\}.$$
2.6
DCA is an iterative algorithm based on local optimality conditions and duality. The idea of DCA is simple: at each iteration k, one replaces the second component h in the primal DC problem by its affine minorization $h(x^{k})+\langle x-x^{k},y^{k}\rangle$, with $y^{k}\in\partial h(x^{k})$, to generate the convex program
$$x^{k+1}\in\arg\min_{x}\big\{g(x)-\langle x,y^{k}\rangle\big\},$$
2.7
which is equivalent to determining $x^{k+1}\in\partial g^{*}(y^{k})$. The second DC component $g^{*}$ of the dual DC program is likewise replaced by its affine minorization at $x^{k+1}$, to obtain a convex program that is equivalent to determining $y^{k+1}\in\partial h(x^{k+1})$.

In practice, a simplified form of the DCA is used. Two sequences $\{x^{k}\}$ and $\{y^{k}\}$ satisfying $y^{k}\in\partial h(x^{k})$ are constructed, and $x^{k+1}$ is a solution to the convex program, equation 2.7. The DCA scheme is summarized as follows.

Initialization: Choose an initial point $x^{0}$ and set k = 0.

Repeat

Calculate $y^{k}\in\partial h(x^{k})$.

Solve convex program equation 2.7 to obtain $x^{k+1}$.

Let k := k + 1.

Until some stopping criterion is satisfied.

DCA is a descent algorithm without line search. These properties are used in the following sections (for simplicity, we omit the dual part of these properties):

  1. If $x^{k+1}=x^{k}$, then $x^{k}$ is a critical point of $g-h$. In this case, DCA terminates at the kth iteration.

  2. Let $y^{*}$ be a local solution to the dual of problem 2.2 and $x^{*}\in\partial g^{*}(y^{*})$. If $g^{*}$ is differentiable at $y^{*}$, then $x^{*}$ is a local solution to problem 2.2.

  3. If the optimal value α of problem 2.2 is finite and the infinite sequence $\{x^{k}\}$ is bounded, then every limit point of $\{x^{k}\}$ is a critical point of $g-h$.

  4. DCA converges linearly for general DC programs. In particular, for polyhedral DC programs the sequence of objective values takes only finitely many values, and the algorithm converges in a finite number of iterations to a critical point satisfying the necessary local optimality condition.

Moreover, if the second DC component h in problem 2.2 is differentiable, the subdifferential of h at the point $x^{k}$ reduces to the singleton $\{\nabla h(x^{k})\}$. In this case, $x^{k+1}$ is a solution to the following convex program:
$$\min_{x}\ \big\{g(x)-\langle x,\nabla h(x^{k})\rangle\big\}.$$
2.8

DCA is an efficient and robust algorithm for solving nonconvex problems, especially in a large-scale setting, and it has been successfully applied to many nonconvex optimizations.
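The simplified DCA scheme above can be written as a short generic loop. The sketch below is illustrative Python (the experiments in this letter use MATLAB's quadprog); the caller supplies a subgradient oracle for h and a solver for the convex subproblem, equation 2.7.

```python
import numpy as np

def dca(x0, solve_convex_subproblem, grad_h, objective, tol=1e-6, max_iter=200):
    """Generic simplified DCA loop for min f(x) = g(x) - h(x) (illustrative sketch).

    solve_convex_subproblem(y): returns argmin_x { g(x) - <x, y> }   (equation 2.7)
    grad_h(x):                  returns a (sub)gradient of h at x
    objective(x):               returns g(x) - h(x), used only for the stopping test
    """
    x = np.asarray(x0, dtype=float)
    f_old = objective(x)
    for _ in range(max_iter):
        y = grad_h(x)                       # linearize the second DC component at x^k
        x_new = solve_convex_subproblem(y)  # solve the convex program 2.7
        f_new = objective(x_new)
        # stop when the iterates or the objective values stabilize
        if np.linalg.norm(x_new - x) <= tol or abs(f_old - f_new) <= tol:
            return x_new
        x, f_old = x_new, f_new
    return x
```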

3  A Regression Framework with Laplace Kernel–Induced Loss

In this section, we consider a new regression framework with Laplace kernel–induced loss function (LK-loss), and two regression formulations are presented based on the LK-loss.

3.1  Regression with LK-Loss (LKRE)

Let $e=(e_1,\ldots,e_n)^{\top}$ define the regression error vector with components $e_i=y_i-w^{\top}x_i-b$. In a linear setting, the regression function has the form $f(x)=w^{\top}x+b$. Let p = 2 in equation 1.1; then the regularized regression framework in the linear setting can be expressed as
$$\min_{w,b}\ \frac{1}{2}\|w\|^{2}+C\sum_{i=1}^{n}L(e_i),$$
3.1
where the penalty parameter C controls a trade-off between empirical errors and model complexity. Considering the LK-loss,
$$L_{\sigma}(u)=1-\exp\!\Big(-\frac{|u|}{\sigma}\Big),\qquad \sigma>0,$$
3.2
and replacing $L(\cdot)$ in equation 3.1 by the LK-loss, we obtain a new regression formulation with LK-loss (LKRE):
$$\min_{w,b}\ \frac{1}{2}\|w\|^{2}+C\sum_{i=1}^{n}\Big(1-\exp\!\Big(-\frac{|e_i|}{\sigma}\Big)\Big),$$
3.3
where $e_i=y_i-w^{\top}x_i-b$. Note that problem 3.3 involves nonconvex optimization, which makes it difficult to find its global solution.

The LK-loss is based on the Laplace kernel function. It has the following obvious advantages:

  1. The LK-loss function (as illustrated in Figure 1) is a positive, symmetric, and bounded function. It approaches its supremum 1 only as $|u|\to\infty$. The LK-loss satisfies
    $$0\le L_{\sigma}(u)<1,\qquad L_{\sigma}(0)=0,\qquad \lim_{|u|\to\infty}L_{\sigma}(u)=1.$$
    We have, for $u\neq 0$,
    $$L_{\sigma}'(u)=\frac{\operatorname{sign}(u)}{\sigma}\exp\!\Big(-\frac{|u|}{\sigma}\Big),\qquad |L_{\sigma}'(u)|\le\frac{1}{\sigma},\qquad \lim_{|u|\to\infty}L_{\sigma}'(u)=0.$$
    Thus, the LK-loss function is insensitive to noise and outliers.
  2. For $\sigma\to 0^{+}$, we have
    $$\lim_{\sigma\to 0^{+}}\sum_{i=1}^{n}L_{\sigma}(e_i)=\|e\|_{0}.$$
    Proof: The empirical risk obtained using the LK-loss function can be written as
    $$\sum_{i=1}^{n}\Big(1-\exp\!\Big(-\frac{|e_i|}{\sigma}\Big)\Big).$$
    Computing the limit as $\sigma\to 0^{+}$, we have
    $$\lim_{\sigma\to 0^{+}}\Big(1-\exp\!\Big(-\frac{|e_i|}{\sigma}\Big)\Big)=\begin{cases}0, & e_i=0,\\ 1, & e_i\neq 0,\end{cases}$$
    so the sum tends to the number of nonzero components of e, that is, $\|e\|_{0}$. (A numerical check of this limit is given after this list.)
  3. Compared with other approximations of the zero-norm such as the gaussian kernel-induced loss (C-loss),
    $$L^{C}_{\sigma}(u)=1-\exp\!\Big(-\frac{u^{2}}{2\sigma^{2}}\Big),$$
    3.12
    the approximation accuracy of the LK-loss is higher than that of the C-loss. A comparison of these two loss functions, $L_{\sigma}$ and $L^{C}_{\sigma}$, with the same value of σ, is illustrated in Figure 2.
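The zero-norm limit in item 2 can be checked numerically. In the sketch below, the LK-loss is taken as 1 − exp(−|u|/σ) and the C-loss as 1 − exp(−u²/(2σ²)); these closed forms are assumptions consistent with the Laplace and gaussian kernels as used above.

```python
import numpy as np

def lk_loss(u, sigma):
    """Laplace-kernel-induced loss, assumed here as 1 - exp(-|u| / sigma)."""
    return 1.0 - np.exp(-np.abs(u) / sigma)

def c_loss(u, sigma):
    """Gaussian-kernel-induced (C-) loss, assumed here as 1 - exp(-u^2 / (2 sigma^2))."""
    return 1.0 - np.exp(-u ** 2 / (2.0 * sigma ** 2))

e = np.array([0.0, 0.0, 0.3, -1.5, 4.0])       # error vector with three nonzero entries
print("zero-norm of e:", np.count_nonzero(e))   # 3
for sigma in (1.0, 0.1, 0.01):
    # both kernel-induced sums tend to ||e||_0 as sigma -> 0+
    print(sigma, lk_loss(e, sigma).sum(), c_loss(e, sigma).sum())
```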

Figure 1: LK-loss function with different values of σ.

Figure 2: Two different loss functions.

Note that the LK-loss is a nonconvex function, so constructing a regression hyperplane by directly solving problem 3.3 is generally difficult.

It is worth noting that the LK-loss can be expressed as a DC function,
$$L_{\sigma}(u)=g_{\sigma}(u)-h_{\sigma}(u),$$
3.13
with
$$g_{\sigma}(u)=\frac{|u|}{\sigma},\qquad h_{\sigma}(u)=\frac{|u|}{\sigma}-1+\exp\!\Big(-\frac{|u|}{\sigma}\Big).$$
3.14
Both $g_{\sigma}$ and $h_{\sigma}$ are convex.
Using the decomposition formula 3.14, the LKRE, equation 3.3, is rewritten as
$$\min_{w,b}\ \frac{1}{2}\|w\|^{2}+C\sum_{i=1}^{n}\big(g_{\sigma}(e_i)-h_{\sigma}(e_i)\big).$$
3.15
Finally, the LKRE, equation 3.3, can be transformed into the DC program
$$\min_{w,b}\ \big\{G(w,b)-H(w,b)\big\},$$
3.16
with
$$G(w,b)=\frac{1}{2}\|w\|^{2}+\frac{C}{\sigma}\sum_{i=1}^{n}|e_i|$$
3.17
and
$$H(w,b)=C\sum_{i=1}^{n}\Big(\frac{|e_i|}{\sigma}-1+\exp\!\Big(-\frac{|e_i|}{\sigma}\Big)\Big).$$
3.18
According to the analysis in section 2, performing DCA for problem 3.16 amounts to computing the sequence $\{(w^{k},b^{k})\}$, where $(w^{k+1},b^{k+1})$ is the solution to the convex program
$$\min_{w,b}\ \Big\{G(w,b)-\big\langle(w,b),v^{k}\big\rangle\Big\},\qquad v^{k}\in\partial H(w^{k},b^{k}).$$
3.19
Meanwhile, we introduce a variable t with components $t_i\ge|e_i|$. Then problem 3.19 is reformulated as the quadratic program
$$\min_{w,b,t}\ \frac{1}{2}\|w\|^{2}+\frac{C}{\sigma}\sum_{i=1}^{n}t_i-\big\langle(w,b),v^{k}\big\rangle\quad\text{s.t.}\quad -t_i\le y_i-w^{\top}x_i-b\le t_i,\ \ i=1,\ldots,n.$$
3.20
The differential of H at the point $(w^{k},b^{k})$ is given by
$$v^{k}=-C\sum_{i=1}^{n}\rho_i^{k}\,(x_i^{\top},1)^{\top},$$
3.21
with components
$$\rho_i^{k}=\frac{\operatorname{sign}(e_i^{k})}{\sigma}\Big(1-\exp\!\Big(-\frac{|e_i^{k}|}{\sigma}\Big)\Big),$$
where $e_i^{k}=y_i-(w^{k})^{\top}x_i-b^{k}$, and k is the current number of completed iterations.

Let $z=(w,b)$. According to the DCA scheme, the main steps in solving the LKRE, equation 3.16, are shown next.

Algorithm 1 (DC algorithm for solving LKRE, equation 3.16):

  1. Let ε > 0 be sufficiently small, and set k = 0. Choose an initial point $(w^{0},b^{0},t^{0})\in\Omega$, where Ω is the feasible region of equation 3.20.

  2. Solve the quadratic program, equation 3.20, to obtain $(w^{k+1},b^{k+1},t^{k+1})$.

  3. If $\|(w^{k+1},b^{k+1})-(w^{k},b^{k})\|\le\varepsilon$ or the decrease of the objective of equation 3.16 is at most ε, then stop; $(w^{k+1},b^{k+1})$ is the computed solution. Otherwise, set k = k + 1 and go to step 2.

Theorem 1.

  1. Algorithm 1 generates a sequence $\{(w^{k},b^{k})\}$ such that the objective value of equation 3.16 decreases monotonically.

  2. The sequence $\{(w^{k},b^{k})\}$ converges linearly.

Proof.

These two conclusions are direct consequences of the convergence properties of general DC programming.
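The following Python sketch illustrates algorithm 1 for the linear LKRE. It assumes the DC split $g_{\sigma}(u)=|u|/\sigma$, $h_{\sigma}(u)=|u|/\sigma-1+\exp(-|u|/\sigma)$ used above, and it solves the convex subproblem with CVXPY for convenience instead of the quadprog routine used in the experiments; all parameter defaults are illustrative.

```python
import numpy as np
import cvxpy as cp

def lkre_dca(X, y, C=10.0, sigma=1.0, tol=1e-5, max_iter=50):
    """DCA sketch for the linear LKRE model (illustrative, not the authors' code)."""
    n, d = X.shape
    w_k, b_k = np.zeros(d), 0.0
    for _ in range(max_iter):
        e = y - X @ w_k - b_k
        # (sub)gradient of h_sigma at the current residuals
        rho = np.sign(e) * (1.0 - np.exp(-np.abs(e) / sigma)) / sigma
        # gradient of H(w, b) = C * sum_i h_sigma(e_i(w, b)) by the chain rule (e_i is affine)
        gw, gb = -C * (X.T @ rho), -C * np.sum(rho)
        # convex subproblem: 0.5*||w||^2 + (C/sigma) * sum_i |e_i| - <(w, b), grad H>
        w, b = cp.Variable(d), cp.Variable()
        residual = y - X @ w - b
        obj = (0.5 * cp.sum_squares(w)
               + (C / sigma) * cp.sum(cp.abs(residual))
               - (gw @ w + gb * b))
        cp.Problem(cp.Minimize(obj)).solve()
        w_new, b_new = np.asarray(w.value).ravel(), float(b.value)
        if np.linalg.norm(np.r_[w_new - w_k, b_new - b_k]) <= tol:
            return w_new, b_new
        w_k, b_k = w_new, b_new
    return w_k, b_k

# Usage: w, b = lkre_dca(X_train, y_train); predictions are X_test @ w + b.
```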

3.2  Regression with Double Loss

In this section, we develop a new regression formulation based on a combination of the squared loss (ℓ2-loss) and the LK-loss. Specifically, we incorporate the LK-loss function into the regression, equation 2.1, by weighting the LK-loss with a suitably chosen parameter, which leads to a new linear regression model with a mixed loss function, called MLKRE:
$$\min_{w,b}\ \frac{1}{2}\|w\|^{2}+C_{1}\sum_{i=1}^{n}e_i^{2}+C_{2}\sum_{i=1}^{n}\Big(1-\exp\!\Big(-\frac{|e_i|}{\sigma}\Big)\Big),$$
3.22
where $C_{1},C_{2}>0$ are two penalty parameters. This can be reformulated as
formula
3.23
Applying the decomposition, equation 3.14, the MLKRE, equation 3.23, is reformulated as
formula
3.24
This can be transformed into an unconstrained DC program:
formula
3.25
where
formula
3.26
is defined by equation 3.18:
formula
3.27
Its differential is defined as in equation 3.21.
As with equation 3.16, performing DCA for equation 3.25 amounts to computing the sequence $\{(w^{k},b^{k})\}$ at each iteration k, with $(w^{k+1},b^{k+1})$ being an optimal solution to the following convex problem:
formula
3.28

Let $z=(w,b)$. According to the DCA scheme, the main steps in solving the MLKRE, equation 3.25, are as follows.

Algorithm 2: DC algorithm for solving equation 3.25:

  1. Let ε > 0 be sufficiently small, and set k = 0. Choose an initial point $(w^{0},b^{0},t^{0})\in\Omega'$, where Ω′ is the feasible region of equation 3.28.

  2. Solve the quadratic program, equation 3.28, to obtain $(w^{k+1},b^{k+1},t^{k+1})$.

  3. If $\|(w^{k+1},b^{k+1})-(w^{k},b^{k})\|\le\varepsilon$ or the decrease of the objective of equation 3.25 is at most ε, then stop; $(w^{k+1},b^{k+1})$ is the computed solution. Otherwise, set k = k + 1 and go to step 2.

Theorem 2.

  1. Algorithm 2 generates a sequence $\{(w^{k},b^{k})\}$ such that the objective value of equation 3.25 decreases monotonically.

  2. The sequence $\{(w^{k},b^{k})\}$ converges linearly.
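The monotone decrease in theorem 2 can be monitored by evaluating the mixed objective after each iteration of algorithm 2. The helper below uses the MLKRE objective in the form written in equation 3.22 above, ½‖w‖² + C₁Σe_i² + C₂Σ(1 − exp(−|e_i|/σ)); this closed form is an assumption consistent with the description of the MLKRE rather than a quotation of the model.

```python
import numpy as np

def mlkre_objective(w, b, X, y, C1, C2, sigma):
    """Mixed-loss objective assumed for MLKRE; useful for checking monotone decrease."""
    e = y - X @ w - b
    return (0.5 * np.dot(w, w)
            + C1 * np.sum(e ** 2)
            + C2 * np.sum(1.0 - np.exp(-np.abs(e) / sigma)))
```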

Remarks.

The proposed LKRE and MLKRE models are based on the Laplace kernel–induced loss. They are different from the traditional regression models and have the following obvious advantages:

  1. As discussed above, the LK-loss is a good approximation of the zero-norm:
    $$\lim_{\sigma\to 0^{+}}\sum_{i=1}^{n}L_{\sigma}(e_i)=\|e\|_{0}.$$
    The zero-norm $\|e\|_{0}$ is the number of nonzero elements in the training error vector e.
  2. The LKRE, equation 3.3, is different from regression with the ℓ1-loss. It is based on the Laplace kernel, and the LK-loss is a good approximation of the zero-norm, from which we can conclude that the empirical risk obtained by the LK-loss behaves like the zero-norm for large training errors. Therefore, the proposed LKRE tends to minimize the number of nonzero elements in the training error vector e.

  3. Different from the ℓ2-norm loss, the LK-loss is a bounded function, and its derivative is also bounded, while the ℓ2-norm loss function and its gradient are unbounded. The weight function of the ℓ2-loss, in the sense of M-estimation, is constant, so large errors are never downweighted. Thus, the ℓ2-loss function is known to be highly sensitive to noise and outliers, and the LSSVR is difficult to use when the input data are noisy and contain outliers. The proposed regression framework uses the LK-loss and is therefore robust to noise and outliers.
  4. Compared to the C-loss function, equation 3.12 (Singh et al., 2014), which is induced by the gaussian kernel and approaches the zero-norm as σ → 0, Figure 2 shows that the approximation accuracy of the LK-loss is higher than that of the C-loss with the same parameter value.

  5. Compared with the ramp loss (Huang et al., 2014) based on a truncated hinge loss function, the proposed method is based on a Laplace kernel function. Both the proposed LK-loss and the ramp loss are bounded nonconvex loss functions.

  6. The classical SVR uses the ε-insensitive loss, a generalized ℓ1-loss, which is an unbounded function, although its weight function approaches zero as the regression error approaches infinity. Thus the proposed algorithms are more robust than the traditional regression models with the ℓ1-loss and ℓ2-loss functions.

4  Experiments

To evaluate the proposed methods, we implemented numerical simulations on some real-world data sets. We chose the traditional SVR and LSSVR as the baseline methods. All the experiments were carried out in Matlab 2014 on a PC with an Intel Core i5 processor and 2 GB of RAM. We used the quadprog function in Matlab (http://www.mathworks.com) to solve the related optimization problems.

The numerical experiments were implemented on various data sets and comprise two parts. In the first part, the experiments were carried out on a practical application data set, near-infrared (NIR) spectroscopy data. In the second part, we ran the proposed algorithms on 10 benchmark data sets from the UCI Machine Learning Repository (Blake & Merz, 1998). We performed 10-fold cross-validation on all the data sets: each data set was split randomly into 10 subsets, each subset was used once as the test set with the remaining subsets used for training, and the average of the 10 test results was used as the performance measure.

4.1  Experimental Design and Parameter Selection

This work focuses on the robustness of the proposed methods, so the training samples were contaminated to simulate noise. We added gaussian noise with zero mean and a given variance to all data.

To evaluate the algorithms' performance, we specify the evaluation criteria before presenting the experimental results. We adopt the following popular regression estimation criteria (Peng, 2010):

  • The root mean square error (RMSE) and mean absolute error (MAE):
    $$\mathrm{RMSE}=\sqrt{\frac{1}{m}\sum_{i=1}^{m}(y_i-\hat{y}_i)^{2}},\qquad \mathrm{MAE}=\frac{1}{m}\sum_{i=1}^{m}|y_i-\hat{y}_i|.$$
    4.1
    The RMSE is commonly used as the deviation measurement between the actual and predicted values. It represents the fitting precision: the smaller the RMSE, the better the fitting performance.
  • The mean relative error (MRE):
    $$\mathrm{MRE}=\frac{1}{m}\sum_{i=1}^{m}\frac{|y_i-\hat{y}_i|}{|y_i|}.$$
    4.2
    The MRE is also a popular deviation measurement between the actual and predicted values.
  • The ratio of the sum of squared errors (SSE) to the sum of squared deviations of the testing samples (SST), SSE/SST:
    $$\mathrm{SSE}/\mathrm{SST}=\frac{\sum_{i=1}^{m}(y_i-\hat{y}_i)^{2}}{\sum_{i=1}^{m}(y_i-\bar{y})^{2}}.$$
    4.3
  • The ratio between the interpretable sum of squared deviations (SSR) and SST, SSR/SST:
    $$\mathrm{SSR}/\mathrm{SST}=\frac{\sum_{i=1}^{m}(\hat{y}_i-\bar{y})^{2}}{\sum_{i=1}^{m}(y_i-\bar{y})^{2}},$$
    4.4
    where m denotes the number of test samples, $y_i$ is the target, $\hat{y}_i$ is the corresponding prediction, and $\bar{y}$ denotes the average value of $y_1,\ldots,y_m$.

In most cases, a small SSE/SST indicates good agreement between estimates and actual values. Obtaining a smaller SSE/SST usually accompanies an increase in SSR/SST. However, an extremely small SSE/SST value is not good since it probably means overfitting. Therefore, a good estimator should strike the balance between SSE/SST and SSR/SST.
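The criteria above can be computed directly from the test targets and predictions. The helper below follows the standard definitions consistent with equations 4.1 to 4.4; note that the MRE assumes nonzero targets.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """RMSE, MAE, MRE, SSE/SST, and SSR/SST for a set of test predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    m = y_true.size
    err = y_true - y_pred
    y_bar = y_true.mean()
    sse = np.sum(err ** 2)                  # sum of squared errors
    sst = np.sum((y_true - y_bar) ** 2)     # total sum of squared deviations
    ssr = np.sum((y_pred - y_bar) ** 2)     # interpretable (regression) sum of squares
    return {
        "RMSE": np.sqrt(sse / m),
        "MAE": np.mean(np.abs(err)),
        "MRE": np.mean(np.abs(err) / np.abs(y_true)),  # assumes y_true has no zeros
        "SSE/SST": sse / sst,
        "SSR/SST": ssr / sst,
    }
```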

4.1.1  Parameter Selection

The performance of the proposed models usually depends on the parameter choices. To save CPU time in parameter selection, we search for the optimal parameters as follows:

1. The penalty parameters (C, C1, C2) in the proposed LKRE and MLKRE models implement trade-offs between the empirical risk and model complexity. In general, when they are large, risk minimization is dominant, leading to smaller regression errors. These parameters are searched over a prescribed candidate set for each data set. For the spectral region 4000 to 10,000 cm⁻¹, the relationship between the RMSE and the penalty parameter C for the LKRE is illustrated in Figure 3, where the x-axis denotes the values of the parameter C and the y-axis denotes the test RMSE. We see from Figure 3 that the RMSE attains its minimum as C varies from 1 to 10,000.

Figure 3: RMSE versus C (with the other parameters fixed) on the spectral region 4000–10,000 cm⁻¹.

2. The performance of the proposed algorithms also depends on the choice of the Laplace kernel parameter σ. The optimal σ is chosen from a prescribed candidate set. The relationship between the RMSE and different values of σ for the LKRE in the spectral region 4000 to 10,000 cm⁻¹ is illustrated in Figure 4, where the x-axis denotes the values of σ and the y-axis denotes the RMSE. Figure 4 shows that the RMSE attains its minimum as σ varies from 0.1 to 5. Moreover, for SVR and LSSVR, the insensitive parameter ε and the penalty parameter C are tuned by 10-fold cross-validation over prescribed candidate sets to minimize the RMSE. The averaged results with the optimal parameters C and σ are reported.

Figure 4: RMSE versus σ (with the other parameters fixed) on the spectral region 4000–10,000 cm⁻¹.

3. The noise variance is chosen from a prescribed candidate set. For the LKRE and LSSVR, Figure 5 shows the relationship between the RMSE and different noise variances in the spectral region 4000 to 10,000 cm⁻¹. We see from Figure 5 that the RMSE of both the LSSVR and the LKRE increases gradually as the noise variance goes from 0.01 to 10.0. However, as the noise variance increases further, the RMSE of the LSSVR keeps increasing while the RMSE of the LKRE remains relatively stable. This suggests that the proposed LKRE is insensitive to the noise variance when it takes values greater than 10 in the spectral region 4000 to 10,000 cm⁻¹.

Figure 5: RMSE versus different noise variances on the spectral region 4000–10,000 cm⁻¹.

These findings help in selecting parameters in our experiments. The optimal values of these parameters are reported with the experimental results. Finally, the insensitive parameter ε is fixed to the same value in all cases.
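The parameter selection described above amounts to a grid search that minimizes the cross-validated RMSE. The sketch below shows the pattern; the candidate grids are illustrative placeholders, and cv_rmse is a user-supplied routine that trains and evaluates one (C, σ) pair by 10-fold cross-validation.

```python
from itertools import product

def grid_search(cv_rmse, C_grid=(1, 10, 100, 1000, 10000), sigma_grid=(0.1, 0.5, 1, 2, 5)):
    """Return (best RMSE, best C, best sigma) over the given candidate grids."""
    best = None
    for C, sigma in product(C_grid, sigma_grid):
        rmse = cv_rmse(C, sigma)          # 10-fold cross-validated RMSE for this pair
        if best is None or rmse < best[0]:
            best = (rmse, C, sigma)
    return best
```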

4.2  Experiments on the NIR Spectroscopy Data Set

Near-infrared (NIR) spectroscopy is based on the absorption of electromagnetic radiation in the region from 4000 to 10,000 cm⁻¹. NIR spectra have been successfully applied to analyze the chemical ingredients or quality parameters of compounds. Licorice is a traditional Chinese herbal medicine with a hard seed. Usually the hardness of the seed is determined by soaking the seeds, although this method is time-consuming and sometimes destroys the seeds. The licorice seeds used in this experiment were harvested between 2002 and 2007 from various locations in China. The hardness rate of the seeds varied across samples. A total of 112 licorice seed samples were used in the experiment.

The NIR spectra were acquired using a spectrometer fitted with a diffuse reflectance fiber probe. Spectra were recorded over the range of 4000 to 12,000 cm⁻¹ with a resolution of 8 cm⁻¹. Each spectrum was the average of 32 repeated scans. This procedure was repeated four times for each sample: twice from the front at different locations and twice from the rear at different locations. The final spectrum was taken as the mean of these four spectra. Consequently, the spectral data set contains 112 samples measured at 2100 wavelengths in the range of 4000 cm⁻¹ to 12,000 cm⁻¹. To evaluate the performance of the proposed models, numerical experiments were carried out on eight spectral regions, denoted regions A to H. Information on them is summarized in Table 1.

Table 1:
Near-Infrared Spectral Sample Regions of Licorice Seeds.
Region     Spectral Range (cm⁻¹)   Number of Samples   Number of Variables
Region A   4000–6000               112                 525
Region B   6000–8000               112                 525
Region C   8000–10,000             112                 525
Region D   10,000–12,000           112                 525
Region E   4000–8000               112                 1050
Region F   8000–12,000             112                 1050
Region G   4000–10,000             112                 1575
Region H   4000–12,000             112                 2100

4.2.1  Experimental Results

The average results of the four algorithms—LKRE, SVR, LSSVR, and MLKRE—are presented in Tables 2 and 3. The optimal parameters for these algorithms are also listed. We see from Tables 2 and 3 that the proposed LKRE and MLKRE outperform SVR in all eight regions, and compared with LSSVR, the proposed LKRE and MLKRE achieve slightly better results in all eight regions.

Table 2:
Comparisons of Four Algorithms—SVR, LSSVR, LKRE and MLKRE—on Regions A–D.
Data Set   Method   RMSE      MAE       MRE      SSE/SST   SSR/SST
Region A   SVR      29.8573   27.1146   3.6508   1.0842    0.2766
           LSSVR    29.8501   27.1098   3.6699   1.0886    0.2857
           LKRE     29.5984   26.7667   3.5283   1.0592    0.2621
           MLKRE    29.6174   26.8991   3.6080   1.0666    0.2801
Region B   SVR      31.2885   28.4391   3.7667   1.1963    0.3086
           LSSVR    31.0255   28.2304   3.7828   1.1728    0.2880
           LKRE     30.9603   28.1729   3.7581   1.1696    0.2901
           MLKRE    30.9524   28.1747   3.8397   1.1560    0.2699
Region C   SVR      32.7854   29.8530   3.9449   1.3175    0.3496
           LSSVR    32.5839   29.6470   3.9138   1.3007    0.3381
           LKRE     32.2285   29.4282   3.9858   1.2571    0.2900
           MLKRE    32.0589   29.1947   3.9337   1.2515    0.2873
Region D   SVR      32.2493   29.3110   3.8546   1.2716    0.3305
           LSSVR    31.7670   28.8923   3.8138   1.2281    0.2990
           LKRE     31.6435   28.6983   3.7629   1.2139    0.2854
           MLKRE    31.6003   28.7323   3.8999   1.2069    0.2748

Note: The numbers in bold are the best results.

Table 3:
Comparisons of Four Algorithms—SVR, LSSVR, LKRE and MLKRE—on Regions E–H.
Data Set   Method   RMSE     MAE      MRE      SSE/SST   SSR/SST
Region E   SVR      0.0309   0.0259   0.0208   1.8620    1.0538
           LSSVR    0.0296   0.0252   0.0201   1.8342    1.0314
           LKRE     0.0292   0.0248   0.0199   1.7668    0.9729
           MLKRE    0.0292   0.0247   0.0198   1.7605    0.9610
Region F   SVR      0.0206   0.0175   0.0385   1.7228    0.8158
           LSSVR    0.0187   0.0163   0.0352   1.3733    0.4662
           LKRE     0.0186   0.0162   0.0349   1.3444    0.4380
           MLKRE    0.0185   0.0162   0.0348   1.3346    0.4350
Region G   SVR      0.0312   0.0262   0.0211   1.9098    1.0642
           LSSVR    0.0298   0.0253   0.0202   1.8188    0.9800
           LKRE     0.0297   0.0251   0.0201   1.8108    0.9719
           MLKRE    0.0299   0.0254   0.0203   1.8175    0.9715
Region H   SVR      0.0337   0.0282   0.0226   2.0536    1.0894
           LSSVR    0.0302   0.0257   0.0205   1.8675    1.0048
           LKRE     0.0301   0.0255   0.0204   1.8411    0.9868
           MLKRE    0.0301   0.0256   0.0204   1.8723    1.0116

Note: The numbers in bold are the best results.

Figure 6 illustrates the reduction of the RMSE achieved by the LKRE relative to SVR in the eight regions A to H, and Figure 7 shows the reduction of the RMSE achieved by the LKRE relative to LSSVR in the same regions. Compared with SVR, Figure 6 shows that the proposed LKRE decreases the RMSE in all eight regions, especially in the high-frequency region F. Compared with the LSSVR, we see from Figure 7 that the LKRE decreases the RMSE distinctly in all regions, especially in region E. This suggests that the LKRE achieves better results than the SVR and LSSVR. In addition, the proposed methods are more resistant to outliers and noise than the original SVR and LSSVR.

Figure 6: The reduction of RMSE for LKRE relative to SVR on regions A–H.

Figure 7: The reduction of RMSE for LKRE relative to LSSVR on regions A–H.

4.3  Experiments on UCI Data Sets and Synthetic Data

In this section, we carry out two experiments on eight UCI data sets and on synthetic data generated by the sinc function, $y=\sin(x)/x$, as sketched below. The 500 points in the synthetic data set are perturbed by gaussian noise with zero mean and variance 1.
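A generation recipe consistent with this description is sketched below; the input interval [−10, 10] and the random seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-10.0, 10.0, size=500)              # 500 inputs (interval is an assumption)
y_clean = np.where(x != 0.0, np.sin(x) / x, 1.0)    # sinc(x) = sin(x)/x, with sinc(0) = 1
y = y_clean + rng.normal(0.0, 1.0, size=x.size)     # zero-mean gaussian noise, variance 1
X = x.reshape(-1, 1)                                # single-feature design matrix
```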

The machine and concrete data are normalized to the interval [−1, 1]. The proposed algorithms are first compared with the traditional SVR and LSSVR. The average experimental results of 10-fold cross-validation are reported in Table 4, together with the optimal parameters for these algorithms.

Table 4:
Comparisons of Four Algorithms—SVR, LSSVR, LKRE, and MLKRE—on UCI Data Sets.

Data Set           Method   RMSE     MAE      MRE      SSE/SST   SSR/SST
Diabetes (43 × 2)  SVR      0.5863   0.4779   0.1047   1.0580    0.7942
                   LSSVR    0.5837   0.4715   0.1045   1.0257    0.6818
                   LKRE     0.5772   0.4707   0.1029   0.9880    0.7713
                   MLKRE    0.5704   0.4591   0.0985   0.9669    0.7959
Pyrim              SVR      0.1150   0.0934   0.2123   2.4008    1.5071
                   LSSVR    0.1136   0.0903   0.2053   1.9577    1.0107
                   LKRE     0.1069   0.0847   0.1955   1.7330    1.0268
                   MLKRE    0.1075   0.0865   0.1960   1.4807    0.7875
Slumptest          SVR      2.9866   2.4242   0.0705   0.2492    0.9408
                   LSSVR    2.6696   2.1518   0.0618   0.1823    0.8903
                   LKRE     2.7986   2.2503   0.0649   0.2247    0.8920
                   MLKRE    2.6088   2.1193   0.0612   0.1729    0.8668
Triazines          SVR      0.1544   0.1141   0.3106   1.0563    0.1132
                   LSSVR    0.1538   0.1171   0.3074   1.0502    0.1126
                   LKRE     0.1511   0.1151   0.2995   1.0183    0.0894
                   MLKRE    0.1525   0.1149   0.3049   1.0329    0.1026
Machine            SVR      0.2119   0.1409   0.3296   1.6112    0.8621
                   LSSVR    0.2004   0.1393   0.2967   2.0238    1.5426
                   LKRE     0.2170   0.1344   0.3450   1.1392    0.3513
                   MLKRE    0.2184   0.1352   0.3467   1.1026    0.2386
Auto-mpg           SVR      3.4728   2.7628   0.1294   0.3592    0.9192
                   LSSVR    3.4739   2.7287   0.1217   0.3709    0.8690
                   LKRE     3.4759   2.7074   0.1173   0.3770    0.8152
                   MLKRE    3.4413   2.7245   0.1257   0.3521    0.8767
Housing            SVR      5.3824   3.8847   0.1733   0.6634    0.7263
                   LSSVR    5.1568   3.8086   0.1825   0.6630    0.9203
                   LKRE     5.3284   3.8249   0.1675   0.6333    0.6600
                   MLKRE    5.1373   3.8103   0.1834   0.6522    0.9068
Concrete           SVR      0.3841   0.3205   1.9925   1.2204    0.3635
                   LSSVR    0.3867   0.3234   1.8346   1.2407    0.3818
                   LKRE     0.3853   0.3218   2.0474   1.2268    0.3594
                   MLKRE    0.3848   0.3216   1.8648   1.2280    0.3751
Synthetic data     SVR      0.2939   0.1743   1.2551   1.0853    0.0850
                   LSSVR    0.2954   0.2014   2.9878   1.0300    0.0299
                   LKRE     0.2811   0.1703   1.4054   1.0928    0.0937
                   MLKRE    0.2816   0.1686   1.6725   1.0736    0.0740

Note: The numbers in bold are the best results.

According to the RMSE, Table 4 shows that the proposed MLKRE outperforms the traditional SVR and LSSVR on seven of the nine data sets. For the other two data sets, the MLKRE obtains results comparable to the SVR and LSSVR. Moreover, the LKRE achieves better results than the SVR and LSSVR on four of the nine data sets, and for the other five data sets, the performance of the LKRE shows no significant difference from the SVR and LSSVR.

4.4  Comparisons of the Proposed LKRE with Other Robust Regression Models

We compare the proposed LKRE with other robust SVR methods in six UCI data sets. These models include:

  • The weighted least-squares support vector regression, called WLS-SVR-H (Suykens et al., 2002)

  • The iteratively reweighted kernel-based regression of Brabanter et al. (2009), called WLS-SVR-L

  • The least-squares SVR with a robust nonconvex loss function for regression with outliers, called RLS-SVR (Wang & Zhong, 2014)

In these experiments, for each training data set, we randomly chose one-fifth of the samples and added large gaussian noise, with zero mean and a variance determined by the average value of the targets, to their target values. The average experimental results of these algorithms are reported in Table 5.
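A possible implementation of this corruption scheme is sketched below. The noise standard deviation is tied to the average target value, but the exact scaling is an illustrative assumption controlled by the scale argument.

```python
import numpy as np

def corrupt_targets(y, fraction=0.2, scale=1.0, rng=None):
    """Add large zero-mean gaussian noise to a random one-fifth of the targets."""
    rng = np.random.default_rng() if rng is None else rng
    y = np.asarray(y, dtype=float).copy()
    idx = rng.choice(y.size, size=int(fraction * y.size), replace=False)
    std = scale * abs(y.mean())           # assumption: noise level proportional to mean target
    y[idx] += rng.normal(0.0, std, size=idx.size)
    return y
```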

Table 5:
Comparisons of the Proposed LKRE with Other Robust SVR Algorithms.

Data Set           Method      RMSE      MAE       SSE/SST   SSR/SST
Diabetes (43 × 2)  WLS-SVR-H   0.5835    0.4808    0.9465    0.6179
                   WLS-SVR-L   0.5846    0.4796    0.9240    0.5042
                   RLS-SVR     0.5812    0.4711    0.8941    0.4490
                   LKRE        0.5337    0.4316    0.8933    0.8176
Pyrim              WLS-SVR-H   0.1026    0.0627    0.5590    0.4788
                   WLS-SVR-L   0.1056    0.0651    0.5947    0.5918
                   RLS-SVR     0.1048    0.0651    0.5836    0.6056
                   LKRE        0.0893    0.0678    1.3978    0.9288
Triazines          WLS-SVR-H   0.1474    0.1097    0.9080    0.2471
                   WLS-SVR-L   0.1464    0.1096    0.9033    0.3082
                   RLS-SVR     0.1463    0.1081    0.8910    0.2553
                   LKRE        0.1440    0.1019    0.9113    0.1354
Machine            WLS-SVR-H   52.8168   27.5543   0.1489    1.0060
                   WLS-SVR-L   52.1319   27.0399   0.1436    0.9335
                   RLS-SVR     51.6456   26.9077   0.1401    0.9454
                   LKRE        61.3626   38.1935   0.5151    0.5984
Auto-mpg           WLS-SVR-H   2.6334    1.9324    0.1080    0.8459
                   WLS-SVR-L   2.7141    1.9911    0.1148    0.8480
                   RLS-SVR     2.6333    1.9259    0.1077    0.8444
                   LKRE        3.7188    2.9140    0.4368    0.9194
Housing            WLS-SVR-H   3.6464    2.6132    0.1661    0.8808
                   WLS-SVR-L   3.7909    2.6898    0.1800    0.8676
                   RLS-SVR     3.5643    2.4787    0.1591    0.8663
                   LKRE        4.6173    3.4025    0.5850    0.8379

Note: The numbers in bold are the best results.

Compared with WLS-SVR-H, WLS-SVR-L, and RLS-SVR, Table 5 shows that the proposed LKRE achieves better results on three of the six data sets in terms of generalization performance.

4.5  Comparisons of the Proposed Algorithms with Robust Regression Based on Correntropy

In this section, we compare the proposed algorithms with robust regression based on the C-loss (called CSVR) (Singh et al., 2014):
formula
4.5
The C-loss is induced by a gaussian kernel. The CSVR is a nonconvex optimization, and half-quadratic programming (He, Zheng, Tan, & Sun, 2014) is used successfully to solve CSVR.

In this experiment, we ran CSVR on the UCI and NIR spectral data sets and chose the optimal parameters to minimize the RMSE by 10-fold cross-validation. For each training data set, we added gaussian noise with zero mean and a fixed variance. With the optimal parameters, the average experimental results of the three algorithms on the UCI data sets are reported in Table 6, which shows that the proposed LKRE and MLKRE perform better than CSVR on seven of the eight data sets.

Table 6:
Comparisons of LKRE, MLKRE, and CSVR on UCI Data Sets.
Data Set           Method   RMSE     MAE      SSE/SST   SSR/SST
Diabetes (43 × 2)  CSVR     0.5787   0.4597   0.1020    0.9872
                   LKRE     0.5812   0.4711   0.8941    0.4490
                   MLKRE    0.5337   0.4316   0.8933    0.8176
Pyrim              CSVR     0.1100   0.0868   1.9743    1.1914
                   LKRE     0.1048   0.0651   0.5836    0.6056
                   MLKRE    0.0893   0.0678   1.3978    0.9288
Triazines          CSVR     0.1543   0.1165   1.0550    0.1258
                   LKRE     0.1463   0.1081   0.8910    0.2553
                   MLKRE    0.1440   0.1019   0.9113    0.1354
Machine            CSVR     0.1961   0.1344   1.6238    1.1482
                   LKRE     0.2176   0.1352   1.1392    0.3513
                   MLKRE    0.2186   0.1355   1.1026    0.2386
Auto-mpg           CSVR     3.4495   2.7062   0.3617    0.8714
                   LKRE     2.6333   1.9259   0.1077    0.8444
                   MLKRE    3.7188   2.9140   0.4368    0.9194
Housing            CSVR     5.1372   3.7574   0.6380    0.8815
                   LKRE     3.5643   2.4787   0.1591    0.8663
                   MLKRE    4.6173   3.4025   0.5850    0.8379
Slumptest          CSVR     2.6848   2.1705   0.1846    0.8904
                   LKRE     2.7986   2.2503   0.0649    0.2247
                   MLKRE    2.6088   2.1193   0.0612    0.1729
Concrete           CSVR     0.3866   0.3232   1.2370    0.3780
                   LKRE     0.3853   0.3218   1.2268    0.3594
                   MLKRE    0.3848   0.3216   1.2280    0.3751

Note: The numbers in bold are best results.

In addition, the proposed LKRE is compared with CSVR on the NIR spectroscopy data. With the optimal parameters, the average experimental results on the eight spectral regions are illustrated in Figures 8 and 9. We see from Figure 8 that the performance of the LKRE is better than that of CSVR in regions A to D. For regions E to H, Figure 9 shows that the performance of the LKRE is competitive with that of CSVR in terms of generalization.

Figure 8: The RMSEs of LKRE and CSVR on regions A to D.

Figure 9: The RMSEs of LKRE and CSVR on regions E to H.

5  Conclusion

Motivated by kernel learning and correntropy learning, we present a new loss function (LK-loss) for regression problems. Two robust regression formulations are proposed with the LK-loss function. The proposed methods are simulated on UCI data sets, synthetic data, and a practical application data set. Moreover, we evaluate the proposed methods under different types of noise. The main work of this investigation is summarized as follows:

  • A robust regression framework is proposed based on the Laplace kernel–induced loss function (LK-loss). We show that the LK-loss is a symmetric, bounded, and nonconvex loss function. Two regression formulations are developed with the LK-loss. In addition, we show that the LK-loss is a good approximation of the zero-norm in a suitable limit of its kernel parameter.

  • By a proper decomposition of the LK-loss function, the proposed regression formulations are posed as DC programming. The resulting DC algorithms converge linearly and have low computational burden, solving only a few simple quadratic programming problems.

  • Compared with the traditional SVR and LSSVR, experimental results in eight different spectral regions show that the proposed LKRE and MLKRE obtain better generalization, especially in high-frequency regions. A possible reason is that the LK-loss is less sensitive to noise than the ℓ1-loss and ℓ2-loss. We also evaluate the proposed methods on eight UCI data sets and a synthetic data set with input noise. The experiments show that the proposed LKRE and MLKRE decrease regression errors in most cases.

  • Compared with other robust SVR algorithms—WLS-SVR-H, WLS-SVR-L, and RLS-SVR—experimental results on UCI data sets show that the proposed LKRE achieves better performance on three of the six data sets.

  • Compared to the CSVR with correntropy-based loss, experimental results on UCI data and NIR spectroscopy data illustrate that the proposed LKRE and MLKRE are superior to CSVR in most cases.

In addition, the proposed approach can be extended to designing a nonlinear regression model with LK-loss function, an investigation we will address in future work.

It is worth noting that the proposed DCA depends greatly on the two DC components (g and h). The question of finding a good DC decomposition for the zero-norm will be studied in future work.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (11471010) and the Chinese Universities Scientific Fund. We also thank the referees and the editor for their constructive comments; their suggestions improved the letter significantly.

References

Blake, C. L., & Merz, C. J. (1998). UCI repository for machine learning databases. Irvine: Department of Information and Computer Sciences, University of California, Irvine. http://www.ics.uci.edu/mlearn/MLRepository.html

Brabanter, K. D., Pelckmans, K., Brabanter, J., Debruyne, M., Suykens, J., Hubert, M., & DeMoor, B. (2009). Robustness of kernel based regression: A comparison of iterative weighting schemes. In Proceedings of the 19th International Conference on Artificial Neural Networks (pp. 100–110).

Chan, C. H., & Zou, X. (2004). A recursive least M-estimate algorithm for robust adaptive filtering in impulsive noise: Fast algorithm and convergence performance analysis. Applied and Computational Harmonic Analysis, 41, 164–189.

Feng, X., Huang, X., Shi, L., Yang, Y., & Suykens, J. (2015). Learning with the maximum correntropy criterion induced losses for regression. Journal of Machine Learning Research, 16, 993–1034.

He, R., Zheng, W. S., Tan, T., & Sun, Z. (2014). Half-quadratic based iterative minimization for robust sparse representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36, 261–275.

Huang, X. L., Shi, L., & Suykens, J. A. K. (2014). Ramp loss linear programming support vector machine. Journal of Machine Learning Research, 15, 2185–2211.

Huber, P. J. (1981). Robust statistics. New York: Wiley.

Liu, W., Pokharel, P. P., & Principe, J. C. (2006). Error entropy, correntropy and M-estimation. In Proceedings of the IEEE Signal Processing Society Workshop on Machine Learning for Signal Processing (pp. 179–184). Piscataway, NJ: IEEE.

Liu, W., Pokharel, P., & Principe, J. (2007). Correntropy: Properties and applications in non-gaussian signal processing. IEEE Transactions on Signal Processing, 55, 5286–5298.

Peng, X. (2010). TSVR: An efficient twin support vector machine for regression. Neural Networks, 23, 365–372.

Principe, J. C. (2010). Information theoretic learning: Renyi entropy and kernel perspectives. New York: Springer.

Singh, A., Pokharel, R., & Principe, J. (2014). The C-loss function for pattern classification. Pattern Recognition, 47, 441–453.

Steinwart, I. A., & Christmann, A. (2011). Estimating conditional quantiles with the help of the pinball loss. Bernoulli, 17, 211–225.

Suykens, J. A. K., Brabanter, J. D., Lukas, L., & Vandewalle, J. (2002). Weighted least squares support vector machines: Robustness and sparse approximation. Neurocomputing, 48(1–4), 85–105.

Suykens, J. A. K., & Vandewalle, J. (1999). Least squares support vector machine classifiers. Neural Processing Letters, 9, 293–300.

Tao, P. D., & An, L. T. (1997). Convex analysis approaches to DC programming: Theory, algorithms and applications. Acta Mathematica Scientia, 22, 287–367.

Thi, H. A. L., Tao, P. D., Minh, L. H., & Thanh, V. X. (2014). DC approximation approaches for sparse optimization. European Journal of Operational Research, 244, 26–46.

Thi, H. A., Le, W. M., Nguyen, V. V., & Dinh, T. P. (2008). A DC programming approach for feature selection in support vector machines learning. Advances in Data Analysis and Classification, 2, 259–278.

Vapnik, V. N. (1998). Statistical learning theory. New York: Wiley.

Wang, K. N., & Zhong, P. (2014). Robust non-convex least squares loss function for regression with outliers. Knowledge-Based Systems, 71, 290–302.

Wen, W., Hao, Z., & Yang, X. (2010). Robust least squares support vector machine based on recursive outlier elimination. Soft Computing, 14, 1241–1251.

Xu, G., Cao, Z. H., Hu, B., & Principe, J. (2017). Robust support vector machines based on the rescaled hinge loss function. Pattern Recognition, 63, 139–148.

Yang, L. M., & Sun, Q. (2012). Recognition of the hardness of licorice seeds using a semi-supervised learning method. Chemometrics and Intelligent Laboratory Systems, 114, 109–115.

Yang, L. M., & Wang, L. (2013). A class of smooth semi-supervised SVM by difference of convex functions programming and algorithm. Knowledge-Based Systems, 41, 1–7.

Yang, X. W., Tan, L. J., & He, L. F. (2014). A robust least squares support vector machine for regression and classification with noise. Neurocomputing, 140, 41–52.

Zhao, Y., & Sun, J. (2008). Robust support vector regression in the primal. Neural Networks, 21, 1548–1555.