Newton methods can be applied in many supervised learning approaches. However, for large-scale data, using the whole Hessian matrix can be time-consuming. Recently, subsampled Newton methods have been proposed to reduce the computational cost by using only a subset of the data to calculate an approximation of the Hessian matrix. Unfortunately, we find that in some situations the subsampled method runs more slowly than the standard Newton method because cheaper but less accurate search directions are used. In this work, we propose novel techniques to improve the existing subsampled Hessian Newton method. The main idea is to solve a two-dimensional subproblem per iteration that adjusts the search direction to better minimize the second-order approximation of the function value. We prove the theoretical convergence of the proposed method. Experiments on logistic regression, linear SVM, maximum entropy, and deep networks indicate that our techniques significantly reduce the running time of the subsampled Hessian Newton method. The resulting algorithm becomes a compelling alternative to the standard Newton method for large-scale data classification.
Some studies other than Byrd et al. (2011) have considered subsampled Hessian Newton methods. For example, Martens (2010) proposed and applied a subsampled Hessian method for training neural networks. Chapelle and Erhan (2011) extended this approach to use preconditioned conjugate gradient methods for obtaining search directions.
In this work, we begin by pointing out in section 2 that for some classification problems, the subsampled Hessian Newton method may be slower than the full Hessian Newton method. The main reason is that by using only a subset S, the resulting search direction and step size are very different from the full Newton direction that minimizes the second-order approximation of the function reduction. Based on this observation, in section 3, we propose some novel techniques to improve the subsampled Hessian Newton method. The main idea is to solve a two-dimensional subproblem for adjusting the search direction so that the second-order approximation of the function value is better minimized. The theoretical convergence of the proposed methods is given in section 4. In section 5, we apply the proposed methods to several machine learning problems: logistic regression (LR), l2-loss support vector machines (SVM), maximum entropy (ME), and deep neural networks.
Our implementation for LR, SVM, and ME extends the software LIBLINEAR (Fan, Chang, Hsieh, Wang, & Lin, 2008), while for deep networks, we extend the implementation in Martens (2010). Experiments in section 6 show that the proposed methods are faster than the subsampled Newton method originally proposed in Byrd et al. (2011). Therefore, our improved subsampled Hessian Newton method can effectively train large-scale data. A supplementary file including additional analysis and experiments is available at http://www.mitpressjournals.org/doi/suppl/10.1162/NECO_a_00751.
2 Subsampled Hessian Newton-CG Method and Its Practical Performance
Although solving equation 2.1 with a subsampled Hessian is cheaper than with the full Hessian, the less accurate direction may result in slower convergence (i.e., more iterations of the Newton method). In Figure 1, we conduct a simple comparison between using the subsampled and the full Hessian. For fairness, all other implementation details are kept the same. We check the relationship between the closeness to the optimal objective value and the following measures: the number of iterations and the training time.
It can be clearly seen in Figure 1a that the implementation using the subsampled Hessian needs significantly more iterations to converge. We then check the running time in Figure 1b. The difference between the two settings becomes smaller because each iteration using the subsampled Hessian is cheaper. However, the subsampled-Hessian approach is still slower. Although Byrd et al. (2011) and our subsequent experiments show that the subsampled Newton method is faster for some other problems, our example here demonstrates that the opposite result may occur.
The slower convergence of the subsampled Hessian method in Figure 1a indicates that its direction is not as good as the full Newton direction. This situation is expected because the sampled set Sk may not represent the full set well. In section 6, we will see that as the size of Sk shrinks, the performance worsens.
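To make the baseline concrete, the following sketch (in Python/NumPy; all function names, the sampling rate, and the CG tolerance are ours and only illustrate the framework of algorithm 1, not the paper's implementation) computes a search direction for l2-regularized logistic regression. The gradient uses all training data, while CG uses Hessian-vector products over the sampled subset Sk only; a backtracking line search would then determine the step size along the returned direction.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lr_gradient(w, X, y, C):
    """Gradient of f(w) = 0.5*w'w + C * sum_i log(1 + exp(-y_i w'x_i)), full data."""
    z = y * (X @ w)
    return w - C * (X.T @ (y * sigmoid(-z)))

def sub_hessian_vec(v, w, X_S, y_S, C, scale):
    """Subsampled Hessian-vector product (I + scale * C * X_S' D_S X_S) v,
    where D_S has sigma(z_i)(1 - sigma(z_i)) on its diagonal and
    scale = l / |S| rescales the subset to full-data size."""
    z = y_S * (X_S @ w)
    d = sigmoid(z) * (1.0 - sigmoid(z))
    return v + scale * C * (X_S.T @ (d * (X_S @ v)))

def subsampled_newton_direction(w, X, y, C, sample_rate=0.05, cg_tol=0.1,
                                cg_max=250, rng=None):
    """CG on the subsampled Hessian to approximately solve H_S d = -grad f(w)."""
    rng = np.random.default_rng(0) if rng is None else rng
    l = X.shape[0]
    S = rng.choice(l, size=max(1, int(sample_rate * l)), replace=False)
    scale = l / len(S)
    g = lr_gradient(w, X, y, C)

    d = np.zeros_like(w)
    r, p = -g.copy(), -g.copy()
    rr = r @ r
    for _ in range(cg_max):
        if np.sqrt(rr) <= cg_tol * np.linalg.norm(g):   # relative residual test
            break
        Hp = sub_hessian_vec(p, w, X[S], y[S], C, scale)
        alpha = rr / (p @ Hp)
        d += alpha * p
        r -= alpha * Hp
        rr_new = r @ r
        p = r + (rr_new / rr) * p
        rr = rr_new
    return d, g
```

Here the subset Hessian is rescaled by l/|S| so that its magnitude matches the full Hessian; whether to rescale in this way is an implementation choice rather than something prescribed by the paper.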
3 Modified Subsampled Hessian Newton Directions
The main objective of this section is to adjust a subsampled Newton direction so that it gives a smaller value of the objective function in equation 2.5.
Once equation 3.5 is solved, we must choose an initial step size for the line search. Obviously, we can apply equation 3.3, which aims to find a suitable initial step size. Interestingly, the equality in equation 3.6 implies that applying the resulting direction to equation 3.3 recovers a particular step size, so that value is a reasonable initial step size for the line search.
For the first outer iteration, the required quantity from the previous iteration is not yet available, so we simply choose a value such that the initial step size is the same as that given by equation 3.3.
In equation 3.5, two products between the full Hessian and vectors are needed. However, with a careful implementation, the training instances are accessed once rather than twice (see the example in section 5.1).
The remaining issue is the selection of the second vector. One possible candidate is the direction obtained at the previous iteration; then information from both the previous subset and Sk is used in generating the direction of the current iteration. Another possibility is to use the negative gradient; then equation 3.4 attempts to combine the second-order information (i.e., the subsampled Newton direction) and the first-order information (i.e., the negative gradient). In section 6, we compare different choices of the second vector. A summary of our improved subsampled Hessian Newton method is given in algorithm 2.
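To make the adjustment concrete, note that once two full-Hessian-vector products are available, the two-variable subproblem reduces to a 2x2 linear system. The sketch below (Python/NumPy; our own notation and fallback handling, not the paper's code) combines the subsampled Newton direction with a second direction such as the previous direction or the negative gradient:

```python
import numpy as np

def combine_directions(d1, d2, g, hess_vec):
    """Solve  min over (beta1, beta2) of
         g'(beta1*d1 + beta2*d2) + 0.5*(beta1*d1 + beta2*d2)' H (beta1*d1 + beta2*d2),
    which is a 2x2 linear system.  hess_vec(v) must return H v with the FULL Hessian;
    the two products H d1 and H d2 are the only full-data work needed here."""
    Hd1, Hd2 = hess_vec(d1), hess_vec(d2)
    A = np.array([[d1 @ Hd1, d1 @ Hd2],
                  [d2 @ Hd1, d2 @ Hd2]])
    b = -np.array([g @ d1, g @ d2])
    try:
        beta = np.linalg.solve(A, b)
    except np.linalg.LinAlgError:          # e.g., d2 parallel to d1: fall back to d1 alone
        beta = np.array([-(g @ d1) / (d1 @ Hd1), 0.0])
    return beta[0] * d1 + beta[1] * d2, beta
```

The combined direction then replaces the subsampled Newton direction in the line search; the two candidate choices of the second vector correspond to the variants compared in section 6.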
3.1 Relation to Prior Works Using Second Directions
The main difference between our work and past studies is that our directions (the subsampled Newton direction and, if used, the previous direction) are obtained by using second-order information. The coefficients for combining the directions are then obtained by solving a two-variable optimization problem.
In this section, we discuss the convergence properties of the proposed methods. The proof is related to that in Byrd et al. (2011), but some essential modifications are needed. In addition, our analysis is more general in that it covers continuously differentiable functions, whereas Byrd et al. (2011) require twice differentiability. We begin by proving the convergence of using equation 3.4 because the proof for equation 3.7 is similar.
4.1 Convergence of Using Equation 3.4
We present the convergence result in the following theorem.
Let the objective function be continuously differentiable and assume the following conditions hold:
Finally, we have shown that all conditions in theorem 11.7 of Griva et al. (2009) are satisfied, so the convergence is obtained.
4.2 Convergence of Using Equation 3.7
We show that the convergence proof in section 4.1 can be modified if equation 3.7 is used. Because equation 3.7 differs from equation 3.4 only in the matrix used in place of Hk, all we must address are the places in theorem 1 that involve Hk. Clearly, we only need to check inequalities 4.5 and 4.11. We can easily see that they still hold, and the derivation is in fact simpler. Therefore, the convergence is established.
5 Examples: Logistic Regression, l2-Loss Linear SVM, Maximum Entropy, and Deep Neural Networks
In this section, we discuss how the proposed approach can be applied to various machine learning problems.
5.1 Logistic Regression
After the CG procedure, we must calculate the two full-Hessian-vector products required by equation 3.5 in order to apply our proposed approach. This step may be the bottleneck because the whole training set rather than a subset is used. From equation 5.2, we can calculate the two products together, so the number of data accesses remains the same as computing only one of them.3
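As an illustration of this shared pass (our own sketch, not the paper's code), the two full-Hessian-vector products for l2-regularized logistic regression can be formed by stacking the two vectors as columns, so that every multiplication with the data matrix serves both vectors at once:

```python
import numpy as np

def lr_two_hess_vec(w, X, y, C, d1, d2):
    """Full-Hessian products H d1 and H d2 for l2-regularized logistic regression,
    where H = I + C * X' D X and D_ii = sigma(y_i w'x_i) * (1 - sigma(y_i w'x_i)).
    Stacking d1 and d2 as columns lets each pass over X serve both vectors, so the
    number of data accesses matches that of a single Hessian-vector product."""
    z = y * (X @ w)
    sig = 1.0 / (1.0 + np.exp(-z))
    d = sig * (1.0 - sig)                    # diagonal of D
    V = np.column_stack([d1, d2])            # n x 2
    XV = X @ V                               # one pass over the data for both vectors
    HV = V + C * (X.T @ (d[:, None] * XV))   # second pass, again shared
    return HV[:, 0], HV[:, 1]
```

In a row-wise sparse implementation, the same effect is obtained by accumulating each instance's contribution to both products while the instance is loaded.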
For convergence, we check whether the assumptions in theorem 1 hold:
5.2 l2-Loss Linear SVM
5.3 Maximum Entropy
Regarding convergence, we prove that theorem 1 holds but leave the details to the supplementary materials.
5.4 Deep Neural Networks
We apply deep neural networks to multiclass classification, where the number of classes is k and the class labels are assumed to be 1, . . . , k. A deep neural network maps each feature vector to one of the class labels through the connections of nodes in a multilayer structure. Between two layers, a weight vector maps the inputs (the previous layer) to the outputs (the next layer). An illustration is in Figure 2.
Because the total number of parameters is large, to apply Newton methods, we need a Hessian-free approach. There are two major challenges:
In contrast to equation 5.2, the (subsampled) Hessian-vector product is now much more complicated because of the network structure.
The objective function is not convex.
Martens (2010) and Martens and Sutskever (2012) have designed a subsampled Hessian Newton method to handle these two difficulties. For fast Hessian-vector products, they employ the technique of forward differentiation (Wengert, 1964; Pearlmutter, 1994). We give more details in the supplementary materials.
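As a minimal illustration of the forward-differentiation idea (our own sketch for a one-hidden-layer sigmoid network with a linear output and squared-error loss, far simpler than the networks considered here; all names are ours), the code below computes a Gauss-Newton vector product, the positive semidefinite curvature approximation commonly used in place of the Hessian in Hessian-free training (Martens, 2010). An R-operator forward pass produces the directional derivative Jv of the network output, and backpropagating Jv gives Gv = J'Jv.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gauss_newton_vec(x, W1, b1, W2, b2, V1, vb1, V2, vb2):
    """Gauss-Newton vector product for a 1-hidden-layer sigmoid net with a linear
    output and squared-error loss, for a single example x.
    Step 1 (R-operator / forward differentiation): propagate the direction
    (V1, vb1, V2, vb2) forward to get Jv, the directional derivative of the output.
    Step 2: backpropagate Jv as if it were the output error to get G v = J' J v."""
    # ordinary forward pass
    s1 = W1 @ x + b1
    a1 = sigmoid(s1)
    ds1 = a1 * (1.0 - a1)                   # sigma'(s1)

    # R-forward pass: directional derivatives of s1, a1, and the output
    Rs1 = V1 @ x + vb1
    Ra1 = ds1 * Rs1
    Jv = V2 @ a1 + vb2 + W2 @ Ra1           # R{output} = J v

    # backward pass with Jv in place of the error signal
    gW2 = np.outer(Jv, a1)
    gb2 = Jv
    delta1 = (W2.T @ Jv) * ds1
    gW1 = np.outer(delta1, x)
    gb1 = delta1
    return gW1, gb1, gW2, gb2               # blocks of G v, same shapes as the weights
```

The cost is roughly one extra forward and one extra backward pass per example, which is why such matrix-vector products remain affordable even when forming the full curvature matrix is out of the question.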
One may argue that the direction obtained from the CG procedure is already the best solution for minimizing the second-order approximation defined in equation 5.10, so equation 3.7 should not give a better direction. However, because of the damping factor, that direction is not the optimal solution for minimizing the approximation in equation 5.10. Therefore, equation 5.11 may be useful for finding a more accurate solution and hence obtaining a better direction. We provide detailed experiments in section 6.2.
Because the objective function of deep networks is nonconvex and our algorithm has been extended to incorporate the LM procedure, we do not have theoretical convergence like that in section 4.
In this section, we conduct experiments on logistic regression, l2-loss linear SVM, maximum entropy, and deep neural networks. We compare the proposed approaches with existing subsampled Hessian Newton methods in Byrd et al. (2011) and Martens (2010).
Programs for experiments in this letter can be found at http://www.csie.ntu.edu.tw/∼cjlin/papers/sub_hessian/sub_hessian_exps.tar.gz. All data sets except are publicly available at http://www.csie.ntu.edu.tw/∼cjlin/libsvmtools/datasets/.
6.1 Logistic Regression and l2-Loss Linear SVM
We select some large, sparse, and two-class data for experiments. Such data sets are commonly used in evaluating linear classifiers such as SVM and logistic regression. Detailed data statistics are in Table 1.
Table 1: Data statistics. Columns: data set, n, the average number of nonzero features per instance, l, and the best C values for LR and l2-loss SVM.
Notes: n is the number of features, and the second column of statistics gives the average number of nonzero features per training example. l is the number of training examples. The last two columns are the regularization parameters C, among the candidate values considered, that achieve the best five-fold cross-validation accuracy for logistic regression and l2-loss linear SVM, respectively.
We compare five methods:
Full: The Newton method using the full Hessian. We do not impose a maximal number of CG iterations as a stopping condition of the CG procedure, so only equation 2.2 is used.
Full-CG: The Newton method using the full Hessian, with a limit on the maximal number of CG steps per CG procedure. For example, Full-CG10 means that at most 10 CG steps are taken.
Subsampled: The method proposed in Byrd et al. (2011), where the backtracking line search starts from a fixed initial step size. See also algorithm 1.
Method 1: The same as Subsampled, but the initial step size for the backtracking line search is chosen by equation 3.3. Although this modification is minor, we are interested in checking how important the initial step size of the line search is in a subsampled Newton method.
Method 2: The method proposed in section 3, which uses the combined direction obtained from the two-dimensional subproblem.
The constant in the sufficient decrease condition and the ratio of the backtracking line search (see equation 2.4) are fixed in all experiments. Our experimental framework is modified from the Newton-CG implementation of the software LIBLINEAR (Fan et al., 2008).
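For illustration, the backtracking line search shared by these methods can be sketched as follows (our own code; the constants eta and beta are placeholders, not the values used in the experiments). The helper quadratic_initial_step shows one common way to pick an adaptive initial step, namely the minimizer of the quadratic model along the direction; we do not reproduce equation 3.3, so this formula is only an assumption made for illustration.

```python
import numpy as np

def backtracking(fun, w, f0, g, d, init_step, eta=1e-4, beta=0.5, max_back=30):
    """Backtracking line search with a sufficient-decrease condition.
    eta, beta, and max_back are illustrative values, not the paper's settings."""
    step = init_step
    for _ in range(max_back):
        if fun(w + step * d) <= f0 + eta * step * (g @ d):
            return step
        step *= beta
    return step

def quadratic_initial_step(g, d, Hd):
    """One common adaptive choice: the step minimizing the quadratic model
    t*g'd + 0.5*t^2*d'Hd along d (an assumption; equation 3.3 is not shown here)."""
    return max(-(g @ d) / (d @ Hd), 1e-12)
```

Calling backtracking(fun, w, f0, g, d, 1.0) then corresponds to a fixed starting step as in Subsampled, whereas passing quadratic_initial_step(g, d, Hd) plays the role of an adaptive start in the spirit of Method 1.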
Second, when the size of the sampled subset Sk is reduced, the running time of the three subsampled Newton methods (Subsampled, Method 1, and Method 2) increases. This result is expected because a smaller subset leads to worse directions.
Third, a comparison between Full and Full-CG10 shows that the number of CG steps per outer iteration may significantly affect the running time. A smaller limit reduces the cost per outer iteration but may cause more outer iterations. In Figures 3 and 4, Full-CG10 is in general slower than Full, so selecting a larger limit seems to be necessary for these problems. On the other hand, except in Figures 3a and 4a, Subsampled-CG10 is faster than Full-CG10. This result is consistent with that in Byrd et al. (2011). However, Subsampled-CG10 is slower than Full, and the two differ not only in the use of the subsampled or the full Hessian but also in the stopping condition of the CG procedure. This example indicates that when comparing two methods, it is important to keep all settings the same except the one under analysis.
Finally, after using our proposed techniques, Method 2 becomes faster than Full and Full-CG10. The only exception is the data set with far more features than instances, for which using only a subset Sk may cause significant information loss. Therefore, subsampled Newton methods are less suitable for such data sets. In addition, the C value chosen for this set is relatively large. The Hessian matrix is more ill conditioned in this situation, so using the full Hessian can obtain better search directions.
The discussion indicates the importance of setting a proper value for this parameter, so in Figure 5, we analyze the results of using various values. The best value in Figure 5a differs from that in Figure 5b. Therefore, the best value is problem dependent, regardless of whether the full or the subsampled Hessian is used. Unfortunately, we do not have a good strategy for selecting a suitable value, so this is an issue for future investigation.
6.2 Maximum Entropy for Multiclass Classification
We select some large multiclass data sets for experiments. Detailed data statistics are in Table 2. For simplicity, a fixed C value is used for all problems.
Table 2: Data statistics. Columns: data set, n, the average number of nonzero features per instance, l, and the number of classes k.
Notes: n is the number of features, and the second column of statistics gives the average number of nonzero features per training example. l is the number of training examples. For one data set, we use only part of the data because in section 6.3, the remainder is used as the test set.
Second, except in Figure 6a, all subsampled methods (Subsampled, Method 1, and Method 2) are faster than Full or Full-CG10. This result is similar to what has been reported in Byrd et al. (2011). Therefore, for these multiclass problems, the subsampled Hessian method may already obtain a good enough direction, a situation consistent with our conclusion from the previous observation.
Finally, Full is much faster than Full-CG10 in Figure 6a but is slower in the others. This result confirms our earlier finding that a suitable limit on the number of CG steps is problem dependent.
6.3 Deep Learning for Multiclass Classification
Table 3: Data statistics and network structures. Columns: data set, n, l, lt, k, and the deep structure.
Notes: n is the number of features. l is the number of training instances. lt is the number of testing instances. For the last column, the first * means the number of features, and the last * means the number of classes. Note that either the data set we obtained is already within a bounded range or we conduct a feature-wise scaling on the data set.
Martens-sub: This method, proposed in Martens (2010), stores a subset of the CG iterates and selects the one whose objective value evaluated on the subset Sk is the smallest.
Martens-full: This method, proposed in Martens (2010), stores a subset of the CG iterates and selects the one whose objective value evaluated on the full data set is the smallest.
Comb2: The same as Comb1 except that the stopping condition, equation 2.2, is used for the CG procedure.
For simplicity, we do not implement some tricks used in Martens (2010):
No use of the previous solution as the initial guess in the CG procedure. The zero vector is used as the initial CG iterate.
For Comb2, the tolerance in equation 2.2 is set to a value smaller than that in section 6.1 because otherwise the CG stopping condition is too loose. In addition, we follow the same Levenberg-Marquardt settings as Martens (2010). We also use the subsampling rate and the related settings from Martens (2010) for equation 5.8.
From the results presented in Figure 7, we observe that the proposed Comb1 and Comb2 methods are faster than the other methods. Therefore, the optimization problem in equation 3.7 is useful for improving the quality of the search direction. The difference between Comb1 and Comb2 is generally small, but for two of the data sets, Comb2 is slightly better because of a smaller number of CG iterations per outer iteration. We observe that equation 2.2 with the tolerance used here is generally looser than equation 6.2, and earlier we mentioned that equation 2.2 with the tolerance of section 6.1 is too loose. Therefore, together with the earlier discussion on finding a suitable limit on CG steps, we clearly see the importance of using a CG stopping condition that is neither too loose nor too strict.
In this letter, we have proposed novel techniques to improve the subsampled Hessian Newton method. We demonstrate the effectiveness of our method on logistic regression, linear SVM, maximum entropy, and deep neural networks. The asymptotic convergence is proved, and the running time is shown to be shorter than that of the methods in Byrd et al. (2011) and Martens (2010). This work gives a compelling example showing that a little extra cost in finding the search direction may lead to a dramatic overall improvement.
This work was supported in part by the National Science Council of Taiwan grant 101-2221-E-002-199-MY3. We thank the anonymous reviewers for valuable comments.
For references on CG, see, for example, Golub and Van Loan (1996).
This property holds because the initial point of the CG procedure is the zero vector.
Note that from recent developments in linear classification, it is known that the number of data accesses may affect the running time more than the number of operations.
Note that n0 is the number of features and the number of nodes in the output layer is the number of classes.
In this letter, we use the sigmoid function as the activation function, that is, σ(z) = 1/(1 + exp(−z)).