Financial risk measures have recently been used in machine learning. For example, the ν-support vector machine (ν-SVM) minimizes the conditional value at risk (CVaR) of the margin distribution. The measure is popular in finance because of its subadditivity property, but it is very sensitive to a few outliers in the tail of the distribution. We propose a new classification method, extended robust SVM (ER-SVM), which minimizes an intermediate risk measure between the CVaR and the value at risk (VaR), with the expectation that the resulting model is less sensitive to outliers than ν-SVM. ER-SVM can be regarded as an extension of robust SVM, which uses a truncated hinge loss. Numerical experiments suggest that ER-SVM can achieve better prediction performance with proper parameter setting.
One important goal of classification methods is to construct classifiers with high prediction accuracy, that is, good generalization ability. The support vector machine (SVM) is designed to reduce the risk of misprediction for test data by combining a regularization term with a loss function that evaluates the fit to the training data. Risk minimization techniques are studied in the field of financial mathematics as well. Especially when long-term contracts are considered, various risks are present in practice, and an effective way of hedging those risks is needed. One of the most widely used risk measures in finance is value at risk (VaR), a quantile at a predefined probability level. A coherent risk measure such as conditional VaR (CVaR) (Rockafellar & Uryasev, 2002; Artzner, Delbaen, Eber, & Heath, 1999) became popular because of its subadditivity property, which basically means that “a merger does not create extra risk,” though the CVaR is very sensitive to a few outliers in the tail of the distribution.
Several works have studied financial risk measures from a machine learning perspective. For example, Xu, Caramanis, Mannor, and Yun (2009) proposed a comprehensive robust classification model that uses a discounted loss function depending on the data and investigated the relationship between comprehensive robustness and convex risk measures. Takeda and Sugiyama (2008) showed that ν-SVM (Schölkopf, Smola, Williamson, & Bartlett, 2000) and its extended model, E-SVM (Perez-Cruz, Weston, Hermann, & Schölkopf, 2003), minimize the CVaR of the margin distribution.
In this letter, we propose a new classification method, which we call ER-SVM, that minimizes a CVaR truncated by means of the VaR, with the expectation that the resulting model is less sensitive to outliers than E-SVM. Our model is closely related to robust SVM (Shen, Tseng, Zhang, & Wong, 2003; Xu, Crammer, & Schuurmans, 2006; Wu & Liu, 2007; Brooks, 2011), which minimizes a truncated hinge loss combined with a regularization term. ER-SVM can be seen as an extended variant of robust SVM (see Figure 1). More precisely, the permissible parameter range of robust SVM is included in that of ER-SVM, and therefore ER-SVM may achieve better prediction performance with proper parameter setting.
ER-SVM is formulated as a nonconvex problem and, like existing robust SVMs, is difficult to solve exactly. We propose a heuristic algorithm that solves ER-SVM approximately and finds a feasible solution of ER-SVM in every iteration. Furthermore, while the algorithm is running, a hyperparameter of ER-SVM is set to an appropriate value. If this hyperparameter is set to the ratio of outliers, ER-SVM achieves good prediction performance, but it is hard to predict the ratio in practical problem settings. Therefore, the proposed algorithm, in which the parameter is tuned automatically, is very practical. The algorithm repeats two steps: solving an optimization problem of ν-SVM or E-SVM and removing training samples with large losses. The solution method is easy to implement because existing tools and software for ν-SVM or E-SVM can be called several times in order to solve ER-SVM.
Numerical experiments show the superior performance of ER-SVM over robust SVM, C-SVM, and E-SVM in the presence of outliers. Indeed, Figure 2 shows that CVaR minimization, which is equivalent to E-SVM, is sensitive to outliers, whereas our model, ER-SVM, is not sensitive. The feature of our model that ignores samples with large losses contributes to the superior performance over E-SVM.
The superior performance over robust SVM implies the effectiveness of ER-SVM’s extended parameter range relative to the range of robust SVM. Both problems, ER-SVM and robust SVM, can be regarded as the minimum distance problem to a set that consists of training samples, and hyperparameters of those problems control the size of . Robust SVM limits the range of the hyperparameter so that does not contain but ER-SVM removes the limitation.
The letter is organized as follows. Section 2 reviews several related support vector machine classifiers and risk measure minimization models in financial engineering. Section 3 presents the formulation of our model, ER-SVM, and a heuristic algorithm for ER-SVM. Section 4 provides geometric interpretations for ER-SVM by showing the dual formulation of ER-SVM. Those interpretations help us to understand ER-SVM from the geometric viewpoint and to recognize the differences between ER-SVM and VaR- or CVaR-based methods. In section 5, our model is compared to several related models such as robust SVM (Xu et al., 2006; Wu & Liu, 2007), ν-SVM (Schölkopf et al., 2000), E-SVM (Perez-Cruz et al., 2003), and VaR-SVM (Tsyurmasto, Zabarankin, & Uryasev, 2014). Section 6 concludes the letter.
2 Background and Related Work
2.1 Support Vector Machine
The SVM has been widely used for classification in machine learning. Let us address the binary classification problem of learning a linear function based on training samples , . We assume that the training samples are independent and identically distributed following the unknown probability distribution on . For simplicity, we generally focus on linear functions , but the discussions in this letter can be directly applicable to nonlinear kernel classifiers (see Schölkopf & Smola, 2002).
2.1.1 Equivalence Between C-SVM and ν-SVM
ν-SVM and C-SVM have the same optimal solution if we appropriately set the two parameters ν and C by using the optimal solution of ν-SVM (see Schölkopf et al., 2000). Setting an appropriate ν for ν-SVM is said to be easier and more intuitive than setting C for C-SVM because of the ν-property of ν-SVM: ν is an upper bound on the fraction of margin errors and a lower bound on the fraction of support vectors.
However, the parameter ν of ν-SVM has a limited permissible range (see Crisp & Burges, 2000, for details). ν-SVM is unbounded when ν exceeds an upper threshold that depends on the numbers of samples with the positive and negative labels, and it produces a trivial solution (a zero weight vector and zero bias) when ν is smaller than some lower threshold.
2.1.2 Extension from ν-SVM to E-SVM
The classifiers of E-SVM and ν-SVM are the same up to a scaling factor when because the convex relaxation problem of E-SVM attains an optimal solution at . In other words, can be relaxed to without changing the optimal solution. On the other hand, in the case of , the relaxation problem gives a trivial solution ( and ), as ν-SVM does. In that case, of E-SVM can be relaxed to , but it is still nonconvex. Therefore, a nonconvex optimization method (Perez-Cruz et al., 2003; Takeda & Sugiyama, 2008) needs to be applied to E-SVM.
2.1.3 Robust SVM
2.2 Risk Measure Minimization
We define financial risk measures using the losses of samples; , and show classification methods that minimize risk measures with respect to and b.
Note that equation 2.7 reduces to E-SVM, equation 2.2, by changing a variable as . Let be an optimal solution for equation 2.7, with . As Figure 4 shows, is decreasing with respect to . We can compute the threshold of convexity, , such that by solving a linear programming problem as shown in Takeda, Mitsugi, and Kanamori (2013). For , equation 2.7 reduces to a convex problem with instead of W. As discussed before, the convex problem is equivalent to ν-SVM (Schölkopf et al., 2000). A classification model minimizing VaR with a convex constraint, , (i.e., the convex relaxation problem of equation 2.5 with respect to ) was studied in Tsyurmasto et al. (2014) as VaR-SVM.
3 Extended Robust SVM
The CVaR and VaR measures each have strong points and weak points when applied to classification problems. Note that the VaR measure ignores samples with large losses. Minimizing the VaR gives us a robust estimate of () that is insensitive to outliers in the data. However, the performance is sensitive to the input parameter , which controls the ratio of ignored samples. We can easily discard essential training samples by regarding them as outliers and thereby obtain worse prediction results. The CVaR minimization model, including ν-SVM, has a strong reputation for its performance in standard settings. However, the estimator is sensitive to outliers.
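The contrast between the two measures can be seen on a plain list of losses. The following is a minimal sketch, not the formulation used in this letter: it assumes one common empirical convention in which the VaR is the smallest loss inside the worst tail and the CVaR is the average loss over that tail. The function name `empirical_var_cvar` and the parameter `tail_fraction` are hypothetical names introduced only for this illustration.

```python
import math
from typing import Sequence, Tuple


def empirical_var_cvar(losses: Sequence[float], tail_fraction: float) -> Tuple[float, float]:
    """Return (VaR, CVaR) computed over the worst `tail_fraction` of the losses.

    Assumed convention (illustrative only): VaR is the smallest loss in the
    tail, CVaR is the mean loss over the tail.
    """
    if not losses:
        raise ValueError("losses must be non-empty")
    if not 0.0 < tail_fraction <= 1.0:
        raise ValueError("tail_fraction must lie in (0, 1]")
    ordered = sorted(losses, reverse=True)  # largest loss first
    # Number of samples in the tail; the small offset guards against float round-up.
    k = max(1, math.ceil(tail_fraction * len(ordered) - 1e-12))
    tail = ordered[:k]
    value_at_risk = tail[-1]         # smallest loss inside the tail
    conditional_var = sum(tail) / k  # average loss over the tail
    return value_at_risk, conditional_var


clean = [float(v) for v in range(1, 11)]  # losses 1, 2, ..., 10
contaminated = clean[:-1] + [1000.0]      # the largest loss becomes an outlier

var_clean, cvar_clean = empirical_var_cvar(clean, 0.2)
var_dirty, cvar_dirty = empirical_var_cvar(contaminated, 0.2)
```

In this toy example the VaR is unchanged by the single outlier, whereas the CVaR is dominated by it, which mirrors the sensitivity discussed above.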
3.1 ER-SVM Formulation
We can regard ER-SVM as a mixture model of CVaR and VaR minimization. The model has two input parameters, and , in the range . When becomes small enough so that , J becomes empty, and equation 3.1 is the same as CVaR minimization, equation 2.2, with . When becomes small enough so that , the objective function in equation 3.1 reduces to , and therefore equation 3.1 is the same as VaR minimization, equation 2.5, with (we can prove it easily from corollary 9 in Rockafellar & Uryasev, 2002).
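The interpolation between the two risk measures can be illustrated on a fixed list of losses. The sketch below is an assumption-laden illustration, not equation 3.1: it ignores the joint optimization over the classifier and the index set J and simply drops the largest losses before averaging the rest of the tail. The names `truncated_tail_mean`, `tail_fraction`, and `n_ignored` are hypothetical.

```python
import math
from typing import Sequence


def truncated_tail_mean(losses: Sequence[float], tail_fraction: float, n_ignored: int) -> float:
    """Average the tail losses after ignoring the `n_ignored` largest ones.

    With n_ignored = 0 this is the tail average (a CVaR-like value); with
    n_ignored = k - 1, where k is the tail size, only the k-th largest loss
    remains (a VaR-like value).
    """
    ordered = sorted(losses, reverse=True)  # largest loss first
    k = max(1, math.ceil(tail_fraction * len(ordered) - 1e-12))
    if not 0 <= n_ignored < k:
        raise ValueError("n_ignored must satisfy 0 <= n_ignored < tail size")
    kept = ordered[n_ignored:k]             # tail with the largest losses dropped
    return sum(kept) / len(kept)


losses = [float(v) for v in range(1, 10)] + [1000.0]  # one large outlier
cvar_like = truncated_tail_mean(losses, 0.3, 0)
intermediate = truncated_tail_mean(losses, 0.3, 1)
var_like = truncated_tail_mean(losses, 0.3, 2)
```

Ignoring a single sample is enough here to remove the influence of the outlier while still averaging over more than one tail loss.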
Now we modify ER-SVM, equation 3.1, by replacing the nonconvex set W by and call the resulting problem convex ER-SVM.
As far as is in the convex range, ER-SVM, equation 3.1 is equivalent to convex ER-SVM.
Lemma 1 implies that for , convex ER-SVM gives an optimal solution of ER-SVM, equation 3.1. But, when , convex ER-SVM will lead to a trivial solution ( and ).
The theorem shows that ER-SVM is equivalent to robust SVM when . If of ER-SVM is in the range, the nonconvex constraint can be relaxed as . When , convex ER-SVM will achieve a trivial solution ( and ). In such a case, the nonconvex constraint of ER-SVM is equivalent to , because for the inequality constraint, holds at the optimal solution, , as long as the optimal value of ER-SVM is positive, similar to the properties of E-SVM shown in section 2.1.2. By extending the permissible range of from to , the margin is allowed to be negative, as in E-SVM, equation 2.2, and as a result, ER-SVM can provide a nontrivial solution that robust SVM cannot attain.
3.2 ER-SVM Heuristic Algorithm
When we use ER-SVM, equation 3.1, for classification, it seems troublesome to handle two parameters of ER-SVM, and . If is set to the ratio of outliers, ER-SVM will achieve good prediction performance, but it is hard to predict the ratio in practical problem setting. In this section, we propose a heuristic algorithm for ER-SVM, which automatically tunes parameters of ER-SVM: and during execution. It makes parameter settings convenient if we cannot estimate the ratio of outliers in data sets.
The algorithm minimizes the objective value of ER-SVM with respect to J and other variables alternately, gradually increasing the cardinality of J from 0 to . When we use the idea of Larsen, Mausser, and Uryasev (2002), the algorithm observes the change of the solution with respect to increasing the cardinality of J and stops the execution when the solution does not change much.
Larsen et al. (2002) proposed an efficient heuristic algorithm for the VaR minimization problem, equation 2.5. We modify the algorithm to solve the trCVaR minimization problem. Algorithm A1 in Larsen et al. (2002) approximately solves the -VaR minimization problem. The general line of thought behind the heuristic algorithm is simple. The algorithm systematically reduces the VaR by repeating two phases (starting with , and ): (1) solve a -CVaR problem using data set , and (2) update so as to remove samples with large losses in the CVaR problem and reset so that (the number of samples with zero penalty, shown in the white area in Figure 3, is the same). These phases are repeated until samples are removed.
in the termination criterion shows how different the kth solution is from the previously obtained solution. When the solution does not change much even if samples with large losses are removed, this algorithm recognizes that all outliers were already removed and stops automatically. The solution in the first iteration is optimal to -CVaR minimization, equation 2.2. When the algorithm iterates sufficiently by setting , the resulting solution approximates an -VaR minimizer. Algorithm 1 gives an intermediate solution between the VaR minimizer and CVaR minimizer; it gets close to VaR minimizer from the CVaR minimizer as the iteration proceeds.
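The two-phase loop and its stopping rule can be sketched generically. The code below is not algorithm 1: it omits the resetting of the risk level between iterations, and in the usage example the inner solver is a toy stand-in (a one-dimensional mean) rather than a ν-SVM or E-SVM solver. The names `iterative_trimming`, `fit`, `loss`, `distance`, `max_removed`, and `tol` are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

Sample = TypeVar("Sample")
Model = TypeVar("Model")


def iterative_trimming(
    samples: Sequence[Sample],
    fit: Callable[[List[Sample]], Model],
    loss: Callable[[Model, Sample], float],
    distance: Callable[[Model, Model], float],
    max_removed: int,
    tol: float,
) -> Tuple[Model, List[int]]:
    """Alternate between fitting on the active set and removing the worst sample.

    Stops when `max_removed` samples are gone or when removing a sample
    changes the fitted model by less than `tol`.
    """
    active = list(range(len(samples)))
    removed: List[int] = []
    model = fit([samples[i] for i in active])
    while len(removed) < max_removed and len(active) > 1:
        # Phase 2 of the previous round: drop the sample with the largest loss.
        worst = max(active, key=lambda i: loss(model, samples[i]))
        active.remove(worst)
        removed.append(worst)
        # Phase 1: refit on the remaining samples.
        new_model = fit([samples[i] for i in active])
        change = distance(new_model, model)
        model = new_model
        if change < tol:  # solution barely moved, so stop
            break
    return model, removed


# Toy usage: estimate a location from data with one gross outlier.
data = [1.0, 1.1, 0.9, 1.0, 1.05, 0.95, 50.0]
center, dropped = iterative_trimming(
    data,
    fit=lambda xs: sum(xs) / len(xs),
    loss=lambda m, x: abs(x - m),
    distance=lambda a, b: abs(a - b),
    max_removed=3,
    tol=0.05,
)
```

In the toy run the outlier is removed first; the loop then removes one more sample, observes that the estimate hardly changes, and stops, which is the behavior the termination criterion is meant to capture.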
The algorithm automatically sets suitable values to and of ER-SVM. The following theorem implies which values are set to and for ER-SVM in every iteration of the algorithm.
A feasible solution of equation 3.9 in each iteration is also feasible for equation 3.1 with and .
In particular, if Ik of equation 3.9 equals for the optimal solution of equation 3.1, the optimal solution of equation 3.9 is also optimal for equation 3.1. Since we define Ik as the set of samples whose losses were small in equation 3.9, there is no guarantee that , as is also the case for Larsen et al.’s (2002) algorithm. In general, for nonconvex problems such as ER-SVM that have nonconvexity in both the objective and the constraint functions, it may not be easy to devise even local optimization algorithms, let alone global ones. However, algorithm 1 works well to find influential outliers for J, as the numerical experiments in Figure 7 show later.
To solve the CVaR minimization problem, equation 3.9, in the algorithm, we can use existing tools and software (Chang & Lin, 2001; Perez-Cruz et al., 2003; Takeda & Sugiyama, 2008) for solving ν-SVM or E-SVM. The algorithm is therefore easy to implement, because its main part is solving equation 3.9. When , we need to deal with equation 3.9 with the nonconvex set W, for which we can use a local solution algorithm as in Perez-Cruz et al. (2003) and Takeda and Sugiyama (2008). Algorithm 1 becomes more efficient by using a warm-start strategy for equation 3.9 (i.e., by using as an initial solution). The problem, equation 3.9, changes little from one iteration to the next when a small value is set to . In that case, only a small computation cost is needed to find the next solution, , with a warm start.
4 Geometric Interpretation for ER-SVM
We give geometric interpretations for ER-SVM, equation 3.1, by showing the dual formulation of it. The dual ER-SVM is formulated as the minimum distance problem to a set that consists of samples , . From theorem 1, we can give the same interpretation for robust SVM equation 2.4. The interpretation for robust SVM is not obvious from the formulation of equation 2.4 which uses truncated hinge losses. Furthermore, we also give a geometric interpretation for VaR minimization. Those interpretations help us to understand ER-SVM from the geometric viewpoint and recognize the difference of ER-SVM from VaR or CVaR-based methods.
4.1 Dual of ER-SVM
For equation 4.1, assume that and are representative points (or means) of each class. We can interpret the problem in the viewpoint of robust optimization (Ben-Tal, El-Ghaoui, & Nemirovski, 2009) by regarding and as uncertain inputs and preparing uncertainty sets where lie. Note that are convex sets constructed from training samples , , respectively. ER-SVM, equation 4.1, finds a solution that is robust with respect to changes in the realization of .
Bishop (2006) (see section 4.1.4 for Fisher’s linear discriminant) discusses the problem of maximizing in terms of for the sample means of each class. Although the problem selects a projection that maximizes the class separation, the resulting solution, , can induce considerable class overlap in the projected space, between and . The difficulty arises when the samples of each class are generated from class distributions with strongly nondiagonal covariances. Fisher’s linear discriminant finds a solution that gives a small variance within each class in addition to maximizing the class separation, thereby minimizing the class overlap. We can say that ER-SVM, equation 4.1, tries to overcome the difficulty in another way, by removing outliers with the use of J and considering the worst case in uncertainty sets for .
4.2 Transformation of Dual ER-SVM Depending on
In section 3.1, we referred to convex ER-SVM, equation 3.8, where the nonconvex constraint is replaced by the convex one, , without changing the optimality of ER-SVM as far as . In the range, the optimal value of ER-SVM is negative (see Figure 4). We can relate the sign of the optimal value of ER-SVM to the position of relative to for an optimal solution of ER-SVM. Indeed, holds if and only if the optimal value of ER-SVM is negative, and the interior of includes () if and only if the optimal value of ER-SVM is positive. We can prove them by mimicking the proof of lemma 1 in Takeda et al. (2013).
Therefore, if , we find that ER-SVM is equivalent to convex ER-SVM, equation 3.8. Otherwise, is essentially equivalent to the nonconvex inequality . We cannot detect whether can be made convex or not a priori by using , because the optimal solution is necessary. However, we can give geometric interpretations for ER-SVM in each case. Now we transform ER-SVM, equation 4.1, into two norm-minimization problems depending on .
Figure 5 depicts the hyperplane given by ER-SVM, equation 4.1, as well as the resulting and with the use of of equation 4.1. ER-SVM could detect all outliers shown with the star marks as . In this case, (or equivalently, ) holds, and therefore, ER-SVM, equation 4.1, equals the dual problem, equation 4.3, of convex ER-SVM. As theorem 3 implies, the hyperplane of equation 4.1 was obtained by maximizing the minimum distance between and with respect to J.
By fixing J, equation 4.3 reduces to a convex problem, whereas equation 4.4 remains a nonconvex problem because of the constraint . The former problem corresponds to convex ER-SVM whose parameter is in the convex range, . Indeed, the range was defined so that the optimal tr- is negative, but it is equivalently defined so that in the range. In other words, as long as , holds and ER-SVM, equation 4.1, is equivalent to convex ER-SVM, robust SVM, equation 2.4, and robust ν-SVM (see equation 3.4). As becomes smaller, the set becomes larger and the nonconvex case, , tends to happen.
4.3 Relation to CVaR and VaR Minimization Models
Figure 6 shows the reduced convex hulls, and , of equation 4.8. Because in this case, UCM, equation 4.7, with these equals E-SVM. The hyperplane was obtained by solving E-SVM, equation 2.2. The CVaR-based model takes into account all losses induced from training samples; therefore, the reduced convex hulls shown in the figure are influenced by outliers. As a result, the hyperplane is also influenced by outliers, in contrast to the hyperplane of ER-SVM (see Figure 5).
5 Numerical Results
5.1 Properties of ER-SVM
The intermediate model, ER-SVM, combines the strong points of CVaR minimization and VaR minimization: a strong point of CVaR minimization is that prediction is not very sensitive to the parameter choice of , and a strong point of VaR minimization is that its optimal solution is robust to outliers. We compared the performances of ER-SVM, equation 3.1, solved by algorithm 1; VaR minimization, equation 2.5, solved by running the algorithm with ; and CVaR minimization, equation 2.7. Recall that equation 2.7 equals E-SVM and, especially for , it reduces to ν-SVM (Schölkopf et al., 2000). VaR minimization, equation 2.5, reduces to VaR-SVM (Tsyurmasto et al., 2014) for large (precisely, as long as : see Figure 4). ER-SVM equals robust SVM, equation 2.4, when exceeds (see lemma 1 and theorem 1).
We used synthetic data generated by following Xu et al. (2006). We generated two-dimensional samples with label and from two normal distributions with different mean vectors and the same covariance matrix. The optimal hyperplane for the noiseless data set is with and . We added outliers only to the training set with label by drawing samples uniformly from a half-ring with center , inner radius , and outer radius in the space of . The training set contained 50 samples from each class (100 in total), including outliers. The ratio of outliers in the training set was set to one of the values from 0 to 5% (10% only for Figure 8). The test set has 500 samples from each class (1000 in total). We repeated all the experiments 100 times, drawing training and test sets every repetition.
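Drawing points uniformly from a half-ring can be sketched as follows. This is only an illustration under stated assumptions: the center, the radii, and which half of the ring is used are not taken from the experimental setup, so the upper half (angles between 0 and π) and the numeric values in the usage lines are hypothetical, as is the function name `sample_half_ring`.

```python
import math
import random
from typing import List, Tuple


def sample_half_ring(
    n: int,
    center: Tuple[float, float],
    r_inner: float,
    r_outer: float,
    rng: random.Random,
) -> List[Tuple[float, float]]:
    """Draw n points uniformly (by area) from the upper half of a ring."""
    points = []
    for _ in range(n):
        # Sampling the squared radius uniformly makes the density uniform in area.
        u = rng.random()
        radius = math.sqrt(r_inner ** 2 + u * (r_outer ** 2 - r_inner ** 2))
        angle = rng.uniform(0.0, math.pi)  # assumed: upper half of the ring
        points.append(
            (center[0] + radius * math.cos(angle), center[1] + radius * math.sin(angle))
        )
    return points


# Hypothetical parameter values, chosen only to exercise the function.
outliers = sample_half_ring(200, (0.0, 0.0), 2.0, 3.0, random.Random(0))
```

Sampling the radius itself uniformly would concentrate points near the inner circle, which is why the squared radius is sampled instead.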
The parameter setting of and in algorithm 1 for approximately solving ER-SVM is as follows: and (basic parameter setting throughout numerical experiments). This makes the number of removed samples, , unrelated to the choice of .
Figure 7 shows the influence of outliers on performances of three methods for . These figures show the average test errors with their estimation errors (the standard deviations divided by the square root of the number of trials, ) over the 100 runs. indicates the maximum convexity threshold for CVaR minimization (E-SVM) among 100 trials. This indicates that ER-SVM with is equivalent to robust-SVM. Figures 7a to 7d imply that ER-SVM (especially ER-SVM with ) achieved better performance as the ratio of outlier increases, while E-SVM’s performance became worse. We can confirm that the curve of the test error of ER-SVM was not so volatile with respect to compared to VaR minimization, and it achieved the lowest test errors among the three models in the presence of outliers.
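The estimation error described above, a standard deviation divided by the square root of the number of trials, can be computed as in the following sketch. Whether the sample or the population standard deviation was used is not stated, so the sample version from Python's `statistics` module is an assumption here, and `standard_error` is a hypothetical helper name.

```python
import math
import statistics
from typing import Sequence


def standard_error(values: Sequence[float]) -> float:
    """Standard deviation of the values divided by sqrt(number of values)."""
    if len(values) < 2:
        raise ValueError("at least two values are required")
    # Assumed: sample standard deviation (n - 1 in the denominator).
    return statistics.stdev(values) / math.sqrt(len(values))


# Illustrative numbers only; these are not results from the experiments.
example_errors = [1.0, 2.0, 3.0, 4.0]
example_se = standard_error(example_errors)
```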
We tested the precision and recall on the synthetic data including 10% outliers (i.e., 10 outliers are included in 100 samples) generated from the ring with inner radius . Precision is the number of outliers that algorithm 1 removed divided by the total number of removed samples, and recall is the number of outliers removed divided by the total number of existing outliers in the data set. Figure 8 (top) shows the precision recall curve with different values of that is used in the stopping criterion of algorithm 1. It varied from to . The upper panel demonstrates that larger produces a high precision rate, implying that the outliers are detected at the early stage (i.e., small k) of algorithm 1. Figure 8 (bottom) shows the test errors of the classifiers obtained by algorithm 1 with corresponding . The basic parameter setting used in the numerical experiments stresses the precision rather than the recall, which leads to high-prediction performances (i.e., small test errors).
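The precision and recall defined above can be written directly from the sets of removed and true outlier indices. This is a minimal sketch; the function name `removal_precision_recall` and the index values in the usage lines are hypothetical.

```python
from typing import Iterable, Tuple


def removal_precision_recall(
    removed: Iterable[int], true_outliers: Iterable[int]
) -> Tuple[float, float]:
    """Precision and recall of outlier removal, computed from index sets."""
    removed_set = set(removed)
    outlier_set = set(true_outliers)
    hits = len(removed_set & outlier_set)  # outliers that were actually removed
    precision = hits / len(removed_set) if removed_set else 0.0
    recall = hits / len(outlier_set) if outlier_set else 0.0
    return precision, recall


# Four samples removed, three of which are among the five true outliers.
precision, recall = removal_precision_recall([0, 1, 2, 7], [0, 1, 2, 3, 4])
```

Stopping the removal early tends to favor precision, while removing more samples tends to favor recall, which is the trade-off traced by the curve in Figure 8.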
5.2 Comparison to Existing Models
We compared the performance of ER-SVM to the following existing models: C-SVM, E-SVM, robust SVM (solved by the concave-convex procedure, CCCP; Collobert, Sinz, Weston, & Bottou, 2006), and VaR minimization. C-SVM is a well-known classification method. E-SVM, equation 2.2, is an extension of ν-SVM (equivalent to C-SVM), and VaR minimization, equation 2.5, is an extension of VaR-SVM (Tsyurmasto et al., 2014). We can say that ER-SVM is a robust variant of E-SVM, whereas robust SVM is a robust variant of C-SVM. We used for robust SVM, equation 2.4.
5.2.1 Synthetic Data
Table 1 shows the results (average error [%] standard deviations of test errors in 100 trials) of comparing four SVMs on the synthetic data set of the previous section. We found the best parameter setting from 9 candidates, , for ER-SVM, E-SVM, and VaR minimization and from for robust SVM, and C-SVM. Figure 9 depicts the results of Table 1: the average test error of each learning model by changing the ratio of outliers in data sets. ER-SVM, VaR minimization, and robust SVM were less influenced by outliers than standard SVM models such as C-SVM and E-SVM. Above all, the proposed model, ER-SVM, achieved a good prediction performance.
|Outlier Ratio|ER-SVM|robust SVM|E-SVM|C-SVM|VaR|
Notes: Average test errors [%] and standard deviations over 100 trials for synthetic data. The minimum average test error among the five models for each outlier ratio is shown in bold.
5.2.2 UCI Data Sets
We generated contaminated data sets from the original data sets of the UCI repository (Blake & Merz, 1998), which are shown in Table 2, by adding outliers as follows. We scaled all attributes of the original data set to the range from −1.0 to 1.0, generated outliers uniformly from a ring with center and radius R, and assigned the wrong label to by using optimal classifiers of E-SVM. The radius R for generating outliers was set so that the outliers have an impact on the test errors. In addition, the percentage of outliers (outlier ratio) was increased until the outliers had a large influence on the test errors.
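Scaling an attribute to the range from −1.0 to 1.0 can be done, for example, by min-max scaling. The sketch below assumes that convention, which is not confirmed by the description above, and `scale_to_unit_range` is a hypothetical helper name.

```python
from typing import List, Sequence


def scale_to_unit_range(column: Sequence[float]) -> List[float]:
    """Min-max scale one attribute column to the interval [-1.0, 1.0]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant attribute carries no information; map it to 0.0.
        return [0.0 for _ in column]
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in column]


scaled = scale_to_unit_range([0.0, 5.0, 10.0])
```

Each attribute would be scaled separately in this way before outliers are generated in the scaled space.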
We generated 10 contaminated data sets for each outlier ratio. Figure 10 shows the average of test errors on 10 contaminated data sets with the best parameter choice. The best parameter was chosen among 9 candidates at equal intervals from for ER-SVM, E-SVM, and VaR minimization and from 9 candidates (the same number of candidates for ) as for robust SVM and C-SVM. The plots on each line of ER-SVM, E-SVM, and VaR minimization indicate that the best parameter value was attained in the nonconvex range ; no plots indicate that it was attained in the convex range .
As the outlier ratio increases, ER-SVM tends to outperform robust SVM in most cases. In particular, for the liver-disorders and diabetes data sets, nonconvex ER-SVM achieved good prediction performance compared to the other methods. For the other data sets, the best parameter of ER-SVM was chosen in the convex range for almost all outlier ratios; therefore, the differences in performance between convex ER-SVM and robust SVM came from the settings of the parameters ν and C, and ν may be more easily adjusted than C. ER-SVM can thus achieve better prediction performance with proper parameter setting.
We proposed extended robust SVM (ER-SVM), which minimizes an intermediate risk measure between the CVaR and VaR, with the expectation that the resulting model is less sensitive to outliers than E-SVM. Our model, ER-SVM, is an extension of robust SVM (Xu et al., 2006; Wu & Liu, 2007). Indeed, if we set ν of ER-SVM to a value in the convex range, ER-SVM and robust SVM give the same classifier. Numerical experiments show the superior performance of our model over robust SVM, ν-SVM, and E-SVM in the presence of outliers. The extended parameter range of ER-SVM contributes to the superior performance over robust SVM, whereas ignoring samples with large losses contributes to the superior performance over E-SVM.
ER-SVM includes two parameters, and . We expect that if is set to the ratio of outliers, ER-SVM will achieve a good prediction performance, but it is hard to predict the ratio in practical problem setting. Therefore, in this letter, we proposed a heuristic algorithm for ER-SVM that automatically tunes these parameter values during execution.
Our algorithm has tuning parameters , , and . The prediction performance is not significantly affected by and . However, some guidelines for the choice of parameters would be helpful in practice. If we do not care about computation time, it is better to set a small value for . As for , to may be appropriate for scaled data sets (see Figure 8). A challenging issue is to develop a reliable cross-validation method for in the presence of outliers.
In the future, we want to investigate a practical way to set parameters directly for and of ER-SVM. The parameter controls the number of ignored samples, and controls the thickness of the margin. The superior performance of ER-SVM over robust SVM is due to the extended permissible range for . With small in the extended range, the margin of ER-SVM can be negative, different from robust SVM. We also need to check the problem settings and features of data sets where the negative margin works.