Abstract

Newton methods can be applied in many supervised learning approaches. However, for large-scale data, the use of the whole Hessian matrix can be time-consuming. Recently, subsampled Newton methods have been proposed to reduce the computational time by using only a subset of data for calculating an approximation of the Hessian matrix. Unfortunately, we find that in some situations, the running speed is worse than that of the standard Newton method because cheaper but less accurate search directions are used. In this work, we propose some novel techniques to improve the existing subsampled Hessian Newton method. The main idea is to solve a two-dimensional subproblem per iteration to adjust the search direction to better minimize the second-order approximation of the function value. We prove the theoretical convergence of the proposed method. Experiments on logistic regression, linear SVM, maximum entropy, and deep networks indicate that our techniques significantly reduce the running time of the subsampled Hessian Newton method. The resulting algorithm becomes a compelling alternative to the standard Newton method for large-scale data classification.

1  Introduction

The problems we consider arise from supervised learning, which aims to train a model based on observed labeled training data and predict the labels of previously unseen data with the model. Given a set of training examples $(y_i, x_i)$, $i = 1, \ldots, l$, where $y_i$ is a label (class) and $x_i \in R^n$ is a feature vector, and a loss function $\xi(w; x, y)$ parameterized by a weight vector $w$, the goal is to minimize the average loss of training data:
$$\min_{w} \; f(w) = \frac{1}{l} \sum_{i=1}^{l} \xi(w; x_i, y_i). \tag{1.1}$$
We require that the loss function is convex. For large-scale problems, the leading obstacle is that it can be very costly to evaluate the objective function $f(w)$, the gradient $\nabla f(w)$, and the Hessian $\nabla^2 f(w)$ when all the training examples are used. To overcome this obstacle, Byrd, Chin, Neveitt, and Nocedal (2011) proposed a sampling framework. Because training examples are theoretically drawn from some probability distribution, $f(w)$ of equation 1.1 can be seen as an expected loss. Based on the fact that training examples are often redundant to some extent, we can employ a subset of training examples instead of the whole training data set in the optimization process. Let $R \subseteq \{1, \ldots, l\}$ be a training subset. The stochastic approximation of the objective function is defined as
$$f_R(w) = \frac{1}{|R|} \sum_{i \in R} \xi(w; x_i, y_i). \tag{1.2}$$
To avoid the expensive Hessian calculation, Byrd et al. (2011) further select a subset $S \subseteq R$ to define the following subsampled Hessian:
$$H_S(w) = \frac{1}{|S|} \sum_{i \in S} \nabla^2 \xi(w; x_i, y_i). \tag{1.3}$$
Byrd et al. (2011) then devise a subsampled Hessian Newton method. For a multinomial logistic regression problem, the running speed is shown to be better than that of using the full Hessian. Although these authors consider a general setting of having both subsets R and S, $R = \{1, \ldots, l\} \supseteq S$ is assumed in their analysis and experiments. We follow the same setting and focus on studying the use of the subsampled Hessian defined by S.
To avoid overfitting in machine learning practice, a regularization term is often considered together with the training loss. Then the objective function becomes
$$f(w) = \frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi(w; x_i, y_i),$$
where C is a given regularization parameter. With this modification, the subsampled function is no longer the average of subsampled training losses. Therefore, in our analysis, we do not restrict $f_S$ and $H_S$ to be the same as equations 1.2 and 1.3, respectively. Instead, we consider a more general setting by assuming that for the function $f(w)$ to be minimized, its subsampled Hessian $H_S$ and the full Hessian are related,
$$c_3 \, H_S(w_k) \preceq \nabla^2 f(w_k) \preceq c_4 \, H_S(w_k), \quad \forall k, \tag{1.4}$$
where $\{w_k\}$ is the sequence of iterates generated by the optimization procedure, $c_3, c_4 > 0$ are two constants, and $A \preceq B$ means that $B - A$ is positive semidefinite. For subsampled Hessian Newton methods considered in this letter, we show in section 5 that equation 1.4 easily holds for popular methods like logistic regression, SVM, and maximum entropy.

Some studies other than Byrd et al. (2011) have considered subsampled Hessian Newton methods. For example, Martens (2010) proposed and applied a subsampled Hessian method for training neural networks. Chapelle and Erhan (2011) extended this approach to use preconditioned conjugate gradient methods for obtaining search directions.

In this work, we begin by pointing out in section 2 that for some classification problems, the subsampled Hessian Newton method may be slower than the full Hessian Newton method. The main reason is that by using only a subset S, the resulting search direction and step size are very different from the full Newton direction that minimizes the second-order approximation of the function reduction. Based on this observation, in section 3, we propose some novel techniques to improve the subsampled Hessian Newton method. The main idea is to solve a two-dimensional subproblem for adjusting the search direction so that the second-order approximation of the function value is better minimized. The theoretical convergence of the proposed methods is given in section 4. In section 5, we apply the proposed methods to several machine learning problems: logistic regression (LR), l2-loss support vector machines (SVM), maximum entropy (ME), and deep neural networks.

Our implementation for LR, SVM, and ME extends from the software LIBLINEAR (Fan, Chang, Hsieh, Wang, & Lin, 2008), while for deep networks, we extend the implementation in Martens (2010). Experiments in section 6 show that the proposed methods are faster than the subsampled Newton method originally proposed in Byrd et al. (2011). Therefore, our improved subsampled Hessian Newton method can effectively train large-scale data. A supplementary file including additional analysis and experiments is available at http://www.mitpressjournals.org/doi/suppl/10.1162/NECO_a_00751.

2  Subsampled Hessian Newton-CG Method and Its Practical Performance

We briefly review the method in Byrd et al. (2011). At the kth iteration, from the current solution $w_k$ and a given subset $S_k$, we find a direction $d_k$ by solving the following linear system,
$$H_{S_k}(w_k) \, d_k = -\nabla f(w_k), \tag{2.1}$$
where $H_{S_k}(w_k)$ is the subsampled Hessian defined in equation 1.3. The main difference from the standard Newton method is that $H_{S_k}(w_k)$ rather than the full Hessian $\nabla^2 f(w_k)$ is used.
For large-scale problems, existing Newton methods often approximately solve the linear system by the conjugate gradient (CG) method, which mainly computes a sequence of Hessian-vector products. Byrd et al. (2011) adopt the same strategy, so subsampled Hessian-vector products are computed. They terminate the CG procedure after either a prespecified maximal number of CG iterations has been reached or the following inequality has been satisfied:
$$\left\| H_{S_k}(w_k) \, d_k + \nabla f(w_k) \right\| \le \sigma \left\| \nabla f(w_k) \right\|, \tag{2.2}$$
where $\sigma \in (0, 1)$ is a given tolerance. For many machine learning problems, the Hessian (or subsampled Hessian) takes a special form, and hence the Hessian-vector product in the CG procedure can be conducted without explicitly forming the Hessian. This type of Hessian-free Newton-CG method for machine learning applications has been used in, for example, Keerthi and DeCoste (2005) and Lin, Weng, and Keerthi (2008).
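To make the Hessian-free idea concrete, the following is a minimal sketch (not the authors' implementation) of a conjugate gradient solver that needs only a Hessian-vector-product callback; the names cg_solve and hess_vec, as well as the default parameter values, are illustrative assumptions.

```python
import numpy as np

def cg_solve(hess_vec, grad, sigma=0.1, cg_max=10):
    """Approximately solve H d = -grad by conjugate gradient, accessing
    H only through the matrix-vector product hess_vec(v).  Stops once
    ||H d + grad|| <= sigma * ||grad|| (cf. equation 2.2) or after
    cg_max iterations."""
    d = np.zeros_like(grad)      # zero initial iterate (see note 2)
    r = -grad.copy()             # residual of H d = -grad at d = 0
    p = r.copy()
    rs_old = r @ r
    tol = sigma * np.linalg.norm(grad)
    for _ in range(cg_max):
        if np.sqrt(rs_old) <= tol:   # relative-residual stopping rule
            break
        Hp = hess_vec(p)
        alpha = rs_old / (p @ Hp)
        d += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return d
```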
After obtaining the direction $d_k$, to ensure convergence, a line search is used to find a step size $\alpha_k$ that satisfies the following sufficient decrease condition,
$$f(w_k + \alpha_k d_k) \le f(w_k) + \eta \, \alpha_k \nabla f(w_k)^T d_k, \tag{2.3}$$
where $\eta \in (0, 1)$ is a prespecified constant. Note that the direction obtained after CG iterations is a descent direction (i.e., $\nabla f(w_k)^T d_k < 0$),1 so equation 2.3 is satisfied for some $\alpha_k$. Byrd et al. (2011) suggest using backtracking line search by sequentially trying
$$\alpha_k = 1, \, \beta, \, \beta^2, \ldots, \quad \beta \in (0, 1), \tag{2.4}$$
until equation 2.3 holds. Then $\alpha_k$ is used to update $w_k$ to $w_{k+1} = w_k + \alpha_k d_k$. The overall procedure is summarized in algorithm 1. Byrd et al. (2011) proved that if the subsampled Hessian satisfies some minor conditions, then algorithm 1 leads to the convergence of $\nabla f(w_k)$ to zero. In their experiments, the maximal number of CG iterations is set to
formula
formula
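As an illustration of the backtracking step in algorithm 1, the sketch below tries step sizes $1, \beta, \beta^2, \ldots$ until the sufficient decrease condition of equation 2.3 holds. The function name and the default constants are placeholders, not values taken from the paper; the objective f and the vectors are assumed to be a Python callable and NumPy arrays.

```python
import numpy as np

def backtracking(f, w, d, grad, eta=1e-2, beta=0.5, max_steps=60):
    """Return a step size alpha satisfying
        f(w + alpha d) <= f(w) + eta * alpha * grad^T d   (equation 2.3),
    tried over alpha = 1, beta, beta^2, ...               (equation 2.4)."""
    fw = f(w)
    gTd = grad @ d          # negative for a descent direction
    alpha = 1.0
    for _ in range(max_steps):
        if f(w + alpha * d) <= fw + eta * alpha * gTd:
            return alpha
        alpha *= beta
    return alpha            # fallback; rarely reached for descent directions
```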

Although solving equation 2.1 with subsampled Hessian is cheaper than full Hessian, the less accurate direction may result in slower convergence (i.e., more iterations in the Newton method). In Figure 1, we conduct a simple comparison between using subsampled and full Hessian. For fairness, all other implementation details are kept the same. We check the relationship between the closeness to the optimal objective value and the following measures: number of iterations and training time.

Figure 1:

A comparison between using the subsampled Hessian and the full Hessian. One data set with logistic regression (LR) is considered. See section 6 for details of the experimental settings. CG10 indicates that the maximal number of CG iterations in algorithm 1 is 10.


It can be clearly seen in Figure 1a that the implementation using the subsampled Hessian needs significantly more iterations to converge. We then check running time in Figure 1b. The difference between the two settings becomes smaller because each iteration of using the subsampled Hessian is cheaper. However, the approach of using the subsampled Hessian is still worse. Although Byrd et al. (2011) and our subsequent experiments show that the subsampled Newton method is faster for some other problems, our example here demonstrates that the opposite result may occur.

The slower convergence of the subsampled Hessian method in Figure 1a indicates that its direction is not as good as the full Newton direction. This situation is expected because the sampled set Sk may not represent the full set well. In section 6, we will see that as the size of Sk shrinks, the performance worsens.

Based on the above discussion, we aim at improving the subsampled Hessian method so that it is generally faster than the full Hessian setting. It is well known that the standard Newton method of using the full Hessian solves the following second-order approximation of $f(w_k + d) - f(w_k)$:
$$\min_{d} \; \nabla f(w_k)^T d + \frac{1}{2} d^T \nabla^2 f(w_k) \, d. \tag{2.5}$$
When $w_k$ is near the optimum, $f(w_k + d) - f(w_k)$ is almost the same as the approximation in equation 2.5. Therefore, not only does the full Newton direction (i.e., with step size $\alpha_k = 1$) satisfy the sufficient decrease condition in equation 2.3, but the method also enjoys fast quadratic convergence. We observe in the above experiment that when the subsampled Hessian is used, $\alpha_k = 1$ satisfies condition 2.3 only at a much later stage. Therefore, a crucial reason for the inferior running speed is that the obtained directions are not as good as the full Newton directions. In section 3, we propose novel techniques to improve the subsampled Hessian Newton method.

3  Modified Subsampled Hessian Newton Directions

The main objective of this section is to adjust a subsampled Newton direction so that it gives a smaller objective function value of equation 2.5.

To make the direction closer to the full Newton direction, extra computational operations or data accesses are needed. Let us check what we may be able to afford. At each iteration of a standard line-search Newton method, the number of times to access all data is
$$1 + \#(\text{CG iterations}) + \#(\text{line-search steps}). \tag{3.1}$$
The first term in equation 3.1 indicates that we need at least one full data access for calculating $f(w_k)$ and $\nabla f(w_k)$. In general, the second term is larger than the third, so the subsampled method aims at reducing the number of data accesses in the CG procedure by considering only a subset of data. Given that the whole data set must be accessed at least once, our idea is to use one extra access of data for either improving the direction or reducing the number of line-search steps. Specifically, we spend this extra data access on products between the full Hessian matrix and vectors.
To begin, we consider the initial step size for line search. While Byrd et al. (2011) start with $\alpha = 1$, another setting is to consider
$$f(w_k + \alpha d_k) \approx f(w_k) + \alpha \nabla f(w_k)^T d_k + \frac{\alpha^2}{2} d_k^T \nabla^2 f(w_k) \, d_k \tag{3.2}$$
and minimize the right-hand side of equation 3.2 to get
$$\alpha = \frac{-\nabla f(w_k)^T d_k}{d_k^T \nabla^2 f(w_k) \, d_k}. \tag{3.3}$$
Using this value as the initial step size can potentially reduce the number of line-search steps, but the extra cost is to calculate the product between the full Hessian and $d_k$ in equation 3.3.
By still paying one extra data access, we extend the above idea to minimize equation 2.5 as much as possible. We propose minimizing the following form,
$$\min_{\beta_1, \beta_2} \; \nabla f(w_k)^T (\beta_1 d_k + \beta_2 \bar{d}_k) + \frac{1}{2} (\beta_1 d_k + \beta_2 \bar{d}_k)^T \nabla^2 f(w_k) (\beta_1 d_k + \beta_2 \bar{d}_k), \tag{3.4}$$
where $d_k$ is the search direction obtained at the current iteration and $\bar{d}_k$ is a chosen vector. If $\bar{d}_k = 0$, then problem 3.4 reduces to 3.2, and the method in equation 3.3 becomes a special case here.
The function in equation 3.4 is convex in $\beta_1$ and $\beta_2$, so we can solve it by setting the gradient to zero:
$$\begin{bmatrix} d_k^T \nabla^2 f(w_k) d_k & d_k^T \nabla^2 f(w_k) \bar{d}_k \\ \bar{d}_k^T \nabla^2 f(w_k) d_k & \bar{d}_k^T \nabla^2 f(w_k) \bar{d}_k \end{bmatrix} \begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix} = \begin{bmatrix} -\nabla f(w_k)^T d_k \\ -\nabla f(w_k)^T \bar{d}_k \end{bmatrix}. \tag{3.5}$$
This two-variable linear system has a simple closed-form solution. The resulting direction $\hat{d}_k \equiv \beta_1 d_k + \beta_2 \bar{d}_k$ is a descent one because, from equation 3.5,
$$\nabla f(w_k)^T \hat{d}_k = -\hat{d}_k^T \nabla^2 f(w_k) \, \hat{d}_k < 0. \tag{3.6}$$
Therefore, the backtracking line search is guaranteed to terminate.
Using this new direction, $\hat{d}_k = \beta_1 d_k + \beta_2 \bar{d}_k$, we have
$$f(w_k + \hat{d}_k) - f(w_k) \approx \nabla f(w_k)^T \hat{d}_k + \frac{1}{2} \hat{d}_k^T \nabla^2 f(w_k) \, \hat{d}_k = \frac{1}{2} \nabla f(w_k)^T \hat{d}_k \le \eta \, \nabla f(w_k)^T \hat{d}_k$$
if $\eta \le 1/2$. Therefore, this modified subsampled Newton direction can easily satisfy the sufficient decrease condition 2.3 if equation 3.4 gives a good estimate of the function-value reduction.

Once equation 3.5 is solved, we must choose an initial step size for line search. Obviously we can apply equation 3.3, which aims to find a suitable initial $\alpha$. Interestingly, the equality in equation 3.6 implies that if we apply $\hat{d}_k$ to equation 3.3, then $\alpha = 1$ is obtained. This derivation indicates that a reasonable initial step size for line search is $\alpha = 1$.

For the first outer iteration, it is unclear what $\bar{d}_0$ should be. We can simply set $\bar{d}_0 = 0$, so the resulting direction is the same as that obtained by equation 3.3.

In equation 3.5, two products between the full Hessian and vectors are needed. However, with careful implementations, training instances are accessed once rather than twice (see an example in section 5.1).

The remaining issue is the selection of the vector $\bar{d}_k$. One possible candidate is $\bar{d}_k = d_{k-1}$, the direction at the previous iteration. Then information from both subsets $S_{k-1}$ and $S_k$ is used in generating the direction of the current iteration. Another possibility is to use $\bar{d}_k = -\nabla f(w_k)$. Then equation 3.4 attempts to combine the second-order information (i.e., the Newton direction $d_k$) and the first-order information (i.e., the negative gradient). In section 6, we compare different choices of $\bar{d}_k$. A summary of our improved subsampled Hessian Newton method is in algorithm 2.
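A minimal sketch of the direction-adjustment step follows: given the CG output, a second vector (e.g., the previous direction), the gradient, and a routine for full Hessian-vector products, we form the two-by-two system of equation 3.5 and solve it in closed form. The names adjust_direction and hess_vec are ours; for clarity the sketch calls hess_vec twice, whereas, as noted above, a careful implementation obtains both products in one pass over the data.

```python
import numpy as np

def adjust_direction(grad, d, d_bar, hess_vec):
    """Solve the two-variable subproblem of equation 3.4,
        min_{b1,b2} grad^T (b1 d + b2 d_bar)
                    + 0.5 (b1 d + b2 d_bar)^T H (b1 d + b2 d_bar),
    via the 2x2 optimality system of equation 3.5.  hess_vec(v) returns
    the product of the (full) Hessian with v."""
    Hd, Hdb = hess_vec(d), hess_vec(d_bar)
    A = np.array([[d @ Hd,     d @ Hdb],
                  [d_bar @ Hd, d_bar @ Hdb]])
    rhs = -np.array([grad @ d, grad @ d_bar])
    try:
        b1, b2 = np.linalg.solve(A, rhs)
    except np.linalg.LinAlgError:                 # e.g., d_bar = 0 at iteration 0
        b1 = -(grad @ d) / (d @ Hd)               # reduces to equation 3.3
        b2 = 0.0
    return b1 * d + b2 * d_bar                    # line search can start at alpha = 1
```

The returned combined direction satisfies equation 3.6, which is why an initial step size of 1 is a reasonable choice for the subsequent backtracking line search.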

formula
For problems with many variables (e.g., deep learning problems discussed in section 5.4), it is difficult to solve equation 3.4 by using the full Hessian. Instead, we can replace the full Hessian $H_k \equiv \nabla^2 f(w_k)$ with the subsampled $H_{S_k}$ to have the following optimization problem:
$$\min_{\beta_1, \beta_2} \; \nabla f(w_k)^T (\beta_1 d_k + \beta_2 \bar{d}_k) + \frac{1}{2} (\beta_1 d_k + \beta_2 \bar{d}_k)^T H_{S_k}(w_k) (\beta_1 d_k + \beta_2 \bar{d}_k). \tag{3.7}$$
More details are in section 5.4.

3.1  Relations with Prior Works of Using Second Directions

The concept of combining several search directions in one iteration has been applied in past optimization works. For example, in an early study, Polyak (1964) proposed a heavy ball method so that
formula
where $\alpha$, $\beta$, and $L$ are constants. It can be easily shown that
formula
Thus, the update from $w_k$ to $w_{k+1}$ involves all past gradients $\nabla f(w_0), \ldots, \nabla f(w_k)$. Drori and Teboulle (2014) showed that some other first-order optimization methods can be expressed in a similar setting of using all past gradients.
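For reference, a minimal sketch of the generic heavy-ball update is given below; the specific constants and function names are our own illustrative choices rather than Polyak's original parameter settings.

```python
import numpy as np

def heavy_ball(grad, w0, alpha=0.01, beta=0.9, iters=100):
    """Polyak-style heavy-ball iteration:
        w_{k+1} = w_k - alpha * grad(w_k) + beta * (w_k - w_{k-1}),
    so unrolling the recursion shows that every past gradient
    contributes to the current update."""
    w_prev, w = w0.copy(), w0.copy()
    for _ in range(iters):
        # simultaneous update: keep the old iterate as the new w_prev
        w, w_prev = w - alpha * grad(w) + beta * (w - w_prev), w
    return w
```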

The main difference between our study and past ones is that our directions ($d_k$, and $\bar{d}_k$ if $\bar{d}_k = d_{k-1}$ is used) are obtained from using second-order information. The coefficients for combining directions are then obtained by solving a two-variable optimization problem.

4  Convergence

In this section, we discuss the convergence properties of the proposed methods. The proof is related to that in Byrd et al. (2011), but some essential modifications are needed. In addition, our analysis more broadly covers continuously differentiable $f(w)$, because Byrd et al. (2011) require twice differentiability. We begin with proving the convergence of using equation 3.4 because the proof for equation 3.7 is similar.

4.1  Convergence of Using Equation 3.4

Following Byrd et al. (2011), we need $H_{S_k}$ to be uniformly positive definite. That is, there exists $c_1 > 0$ such that
$$d^T H_{S_k}(w_k) \, d \ge c_1 \|d\|^2, \quad \forall d, \; \forall k. \tag{4.1}$$
If a convex loss is used, this property can be easily achieved by adding an l2 regularization term to the objective function. We also need that there exists $c_2 > 0$ such that
$$\left\| H_{S_k}(w_k) \right\| \le c_2, \quad \forall k. \tag{4.2}$$
To connect $H_{S_k}$ and $H_k$, as mentioned in section 1, we require $c_3, c_4 > 0$ such that
$$c_3 \, H_{S_k}(w_k) \preceq H_k \preceq c_4 \, H_{S_k}(w_k), \quad \forall k. \tag{4.3}$$

We present the convergence result in the following theorem.

Theorem 1.

Let $f(w)$ be continuously differentiable and assume the following conditions hold:

  1. The sublevel set $\{w : f(w) \le f(w_0)\}$ is bounded, where $w_0$ is the initial point.

  2. $\nabla f(w)$ is Lipschitz continuous. That is, there exists $L > 0$ such that
    $$\|\nabla f(w) - \nabla f(\tilde{w})\| \le L \|w - \tilde{w}\|, \quad \forall w, \tilde{w}.$$
  3. Equations 4.1 to 4.3 hold.

Then the sequence $\{w_k\}$ generated by algorithm 2 using the direction of solving equation 3.4 satisfies
$$\lim_{k \to \infty} \nabla f(w_k) = 0.$$
Proof.
Like Byrd et al. (2011), we establish all conditions needed in theorem 11.7 of Griva, Nash, and Sofer (2009) for the convergence proof. We begin with showing that
formula
4.4
is bounded above zero for all k. From equations 3.6, 4.3, and 4.1,
formula
4.5
Following Byrd et al. (2011) to apply properties of CG, there exists a matrix Qk and a vector such that
formula
4.6
where $Q_k$'s columns form an orthonormal basis of the Krylov subspace that includes $\nabla f(w_k)$.2 Therefore, $d_k$ is in the range of $Q_k$'s columns, and hence
formula
4.7
From equations 4.7, 4.6, and 4.2, we can derive
formula
4.8
Using equations 4.2 and 4.3,
formula
4.9
formula
4.10
formula
4.11
where equation 4.9 is from the first equality in equation 3.5, equation 4.10 is from equation 4.6, and equation 4.11 is from equation 4.1. Therefore, equations 4.5, 4.8, and 4.11 imply that
formula
which means our modified direction is a sufficient descent direction.
The remaining condition needed is to show that the search direction is gradient related and is bounded in norm. From equations 4.11 and 4.8, we have
formula
From equation 4.5,
formula
so
formula
4.12
Because the sublevel set is bounded, from the continuity of $\nabla f(w)$, we have that $\|\nabla f(w_k)\|$ is bounded, and so is $\|\hat{d}_k\|$.

Finally, we have shown that all conditions in theorem 11.7 of Griva et al. (2009) are satisfied, so the convergence is obtained.

It can be seen that equations 4.5, 4.8, and 4.11 are the three major inequalities derived in the proof. While equation 4.8 is the same as that in Byrd et al. (2011), equations 4.5 and 4.11 are new. The key difference is that we must make a connection between $H_{S_k}$ and $H_k$.

4.2  Convergence of Using Equation 3.7

We show that the convergence proof in section 4.1 can be modified if equation 3.7 is used. Because equation 3.7 differs from 3.4 only in using $H_{S_k}$ rather than $H_k$, all we must address are the places in theorem 1 that involve $H_k$. Clearly we only need to check inequalities 4.5 and 4.11. We easily see that they still hold, and the derivation is in fact simpler. Therefore, the convergence is established.

5  Examples: Logistic Regression, l2-Loss Linear SVM, Maximum Entropy, and Deep Neural Networks

In this section, we discuss how the proposed approach can be applied to various machine learning problems.

5.1  Logistic Regression

For two-class data with label $y \in \{-1, +1\}$, the logistic regression (LR) loss is
$$\xi_{\mathrm{LR}}(w; x, y) = \log\left(1 + e^{-y w^T x}\right).$$
Furthermore, the optimization problem of regularized logistic regression is
$$\min_{w} \; f(w) = \frac{1}{2} w^T w + C \sum_{i=1}^{l} \log\left(1 + e^{-y_i w^T x_i}\right).$$
Note that $f(w)$ is twice continuously differentiable with the following Hessian,
$$\nabla^2 f(w) = I + C X^T D X, \tag{5.1}$$
where I is the identity matrix, $X \equiv [x_1, \ldots, x_l]^T$ is the data matrix, and D is a diagonal matrix with
$$D_{ii} = \frac{e^{-y_i w^T x_i}}{\left(1 + e^{-y_i w^T x_i}\right)^2}, \quad i = 1, \ldots, l.$$
By using a subset $S_k$, the subsampled Hessian-vector product is
$$H_{S_k}(w) \, d = d + \frac{C \, l}{|S_k|} \, X_{S_k}^T \left( D_{S_k} \left( X_{S_k} d \right) \right), \tag{5.2}$$
where $X_{S_k}$ and $D_{S_k}$ contain the rows of X and the diagonal entries of D corresponding to $S_k$. By a sequence of matrix-vector products in equation 5.2, we have a Hessian-free approach because $H_{S_k}(w)$ is never explicitly formed.

After the CG procedure, we must calculate $\nabla^2 f(w_k) d_k$ and $\nabla^2 f(w_k) \bar{d}_k$ in order to apply our proposed approach. This step may be the bottleneck because the whole training set rather than a subset is used. Following equation 5.2, we can calculate $X d_k$ and $X \bar{d}_k$ together, so the number of data accesses remains the same as that of using only $d_k$.3
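The sketch below illustrates this for logistic regression with the convention $f(w) = w^T w/2 + C\sum_i \xi_{\mathrm{LR}}$: both full-Hessian products are formed from the same multiplications with the data matrix. The dense NumPy representation and the function names are our assumptions; an actual implementation such as LIBLINEAR works with sparse data, and the subsampled product of equation 5.2 is obtained by restricting the rows of X and y to $S_k$ (with the appropriate scaling).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lr_hess_vec_pair(X, y, w, C, d, d_bar):
    """Full-Hessian products for regularized LR,
        H = I + C * X^T D X,  D_ii = tau_i * (1 - tau_i),
    with tau_i = sigmoid(y_i * w^T x_i).  Both H d and H d_bar reuse the
    same passes over X (cf. the discussion above)."""
    tau = sigmoid(y * (X @ w))
    Ddiag = tau * (1.0 - tau)                   # diagonal of D
    V = np.column_stack([d, d_bar])             # handle both vectors at once
    XV = X @ V                                  # X d and X d_bar together
    HV = V + C * (X.T @ (Ddiag[:, None] * XV))  # I*V + C X^T D X V
    return HV[:, 0], HV[:, 1]
```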

For convergence, we check if the assumptions in theorem 1 hold:

  1. Because of the regularization term $w^T w / 2$, the sublevel set is bounded. (See, for example, appendix A of Lin et al., 2008.)

  2. The gradient is Lipschitz continuous. The mean-value theorem implies
    formula
    where $\bar{w}$ is a point between $w$ and $\tilde{w}$. From equation 5.1 and the boundedness of $D_{ii}$, $\|\nabla^2 f(\bar{w})\|$ is bounded by
    formula
  3. The regularization term $w^T w / 2$ implies that equation 4.1 holds with $c_1 = 1$. For equation 4.2, because $0 \le D_{ii} \le 1$, we have
    formula
    Next, we check equation 4.3. Because $H_k$ is a special case of $H_{S_k}$ when $S_k = \{1, \ldots, l\}$, it also satisfies equations 4.1 and 4.2. Then equation 4.3 follows.

5.2  l2-Loss Linear SVM

For two-class data with label $y \in \{-1, +1\}$, the squared hinge loss (l2 loss) takes the following form:
$$\xi_{\mathrm{SVM}}(w; x, y) = \max\left(0, \, 1 - y w^T x\right)^2.$$
Two-class l2-loss SVM then solves the following optimization problem:
$$\min_{w} \; f(w) = \frac{1}{2} w^T w + C \sum_{i=1}^{l} \max\left(0, \, 1 - y_i w^T x_i\right)^2. \tag{5.4}$$
In equation 5.4, $y_i \in \{-1, +1\}$, C is the regularization parameter, and $x_i \in R^n$. It is known that equation 5.4 is differentiable but not twice differentiable. However, we can define the following generalized Hessian (Mangasarian, 2006),
$$\nabla^2 f(w) = I + 2 C X^T D X, \tag{5.5}$$
where I is the identity matrix, X is the data matrix defined as in equation 5.1, and D is a diagonal matrix with
$$D_{ii} = \begin{cases} 1 & \text{if } 1 - y_i w^T x_i > 0, \\ 0 & \text{otherwise.} \end{cases}$$
Then the Newton method can still be applied (see an example in Lin et al., 2008, that uses the full Hessian). For the subsampled Hessian, the diagonal matrix D is modified to have
formula

Regarding the convergence, most assumptions in theorem 1 hold by the same explanation as in section 5.1. The only exception is the Lipschitz continuity of $\nabla f(w)$, for which a proof can be found in equation 15 of Mangasarian (2002).
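For illustration, a minimal generalized-Hessian-vector product under the same convention $f(w) = w^T w/2 + C\sum_i \xi_{\mathrm{SVM}}$ is sketched below; only the active instances (those with $D_{ii} = 1$) contribute. The dense arrays and the function name are assumptions for the sketch.

```python
import numpy as np

def l2svm_hess_vec(X, y, w, C, v):
    """Generalized-Hessian product for l2-loss SVM (cf. equation 5.5):
        H v = v + 2 C X_A^T (X_A v),
    where A = {i : 1 - y_i w^T x_i > 0} indexes the active instances
    (D_ii = 1 on A and 0 elsewhere)."""
    active = (1.0 - y * (X @ w)) > 0
    Xa = X[active]                 # rows of X with nonzero generalized curvature
    return v + 2.0 * C * (Xa.T @ (Xa @ v))
```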

5.3  Maximum Entropy

For multiclass data with label $y \in \{1, \ldots, k\}$, maximum entropy (ME) is an extension of LR.4 The probability that an instance $x$ is associated with label y is
$$P(y \mid x) = \frac{e^{w_y^T x}}{\sum_{y'=1}^{k} e^{w_{y'}^T x}},$$
where y is the label of $x$ and $w \equiv [w_1^T, \ldots, w_k^T]^T$ collects the weight vectors of all classes.
Minimizing the negative log likelihood with a regularization term leads to the following optimization problem:
$$\min_{w} \; f(w) = \frac{1}{2} w^T w - C \sum_{i=1}^{l} \log P(y_i \mid x_i).$$
Byrd et al. (2011) refer to ME as multinomial logistic regression.
Notice that $f(w)$ is twice continuously differentiable. The Hessian-vector product of $\nabla^2 f(w)$ with a vector is
formula
5.6
and
formula
Details of the derivation are in the supplementary materials. If using the subsampled Hessian, we have
formula

Regarding the convergence, we prove that theorem 1 holds, but leave details in the supplementary materials.

5.4  Deep Neural Networks

We apply deep neural networks for multiclass classification, where the number of classes is k and the class labels are assumed to be $\{1, \ldots, k\}$. A deep neural network maps each feature vector to one of the class labels by the connection of nodes in a multilayer structure. Between two layers, a weight matrix maps inputs (the previous layer) to outputs (the next layer). An illustration is in Figure 2.

Figure 2:

An illustration of the deep neural networks, where W1 contains eight elements, W2 contains four, and W3 has eight.


Assume the network has L layers. We use $n_0\text{-}n_1\text{-}\cdots\text{-}n_L$ to denote a structure so that the number of neurons at the mth layer is $n_m$.5 The example in Figure 2 has a structure of 4-2-2-4. At the mth layer, the output is obtained by the following recurrence,
$$z_j^{m} = \sigma\left( \sum_{i=1}^{n_{m-1}} w_{ji}^{m} \, z_i^{m-1} \right), \quad j = 1, \ldots, n_m, \; m = 1, \ldots, L, \tag{5.7}$$
where $w_{ji}^{m}$ is the weight from node i of the $(m-1)$st layer to node j of the mth layer, $\sigma$ is an activation function,6 $z_j^{m}$ is the output of node j at the mth layer, and $z^{0} = x$ is the given input feature vector. Assume $\theta$ is the collection of all variables $w_{ji}^{m}$. A model is obtained by minimizing the following regularized training losses:
formula
5.8
where $\lambda$ is a regularization constant. We follow Martens (2010) to consider the least-square loss,
formula
where the loss compares the output of the last layer with the zero-one indicator vector of the true label. After the training procedure of constructing the network, the predicted class for a test instance is the label with the largest output value at the last layer,
$$\arg\max_{1 \le j \le k} \; z_j^{L}.$$
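A minimal dense forward pass following the recurrence in equation 5.7 and the arg-max prediction rule is sketched below. We assume the sigmoid activation of note 6 at every layer (including the output layer) and omit bias terms, so this is an illustration rather than the exact network of Martens (2010); the names forward and predict are ours.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(weights, x):
    """weights[m] is the n_m x n_{m-1} matrix of weights w^m in
    equation 5.7; x is the input feature vector (layer 0).  Returns the
    outputs of all layers; the last one has k components, one per class."""
    z = x
    outputs = [z]
    for W in weights:
        z = sigmoid(W @ z)      # recurrence z^m = sigma(W^m z^{m-1})
        outputs.append(z)
    return outputs

def predict(weights, x):
    """Predicted class = index of the largest output at the last layer."""
    return int(np.argmax(forward(weights, x)[-1]))
```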

Because the total number of parameters is large, to apply Newton methods, we need a Hessian-free approach. There are two major challenges:

  1. In contrast to equation 5.2, the (subsampled) Hessian-vector product is now much more complicated because of the network structure.

  2. The objective function of equation 5.8 is not a convex function.

Martens (2010) and Martens and Sutskever (2012) have designed a subsampled Hessian Newton method to handle these two difficulties. For fast Hessian-vector product, they employ the technique of forward-differentiation (Wengert, 1964; Pearlmutter, 1994). We give more details in the supplementary materials.

For a nonconvex objective function, $H_k$ may not be positive semidefinite, so Schraudolph (2002) has proposed using a generalized Gauss-Newton matrix to approximate the Hessian in deep learning problems. Though we do not give details, a generalized Gauss-Newton matrix is positive semidefinite and takes the following form,
$$G = J^T B J, \tag{5.9}$$
where J is the Jacobian matrix of the network outputs with respect to the parameters and B is a positive-definite matrix. We can clearly see that this form is similar to the Hessian matrices of LR and l2-loss SVM in equations 5.1 and 5.5, respectively. We can add a regularization term to ensure positive definiteness. The Hessian-vector product becomes the product between the generalized Gauss-Newton matrix and the vector. This calculation can be conducted by the $\mathcal{R}$-operators discussed above.
In algorithm 1, the line search procedure is used to ensure the convergence. Other optimization techniques for a similar purpose include the Levenberg-Marquardt method (Moré, 1978) and trust region methods (Conn, Gould, & Toint, 2000). Martens (2010) applies the Levenberg-Marquardt method prior to the line search procedure. Specifically, instead of the linear system in equation 2.1, the direction is obtained by solving the following linear system,
$$\left( G_k + \lambda_k I \right) d_k = -\nabla f(\theta_k),$$
where $G_k$ is the Gauss-Newton matrix in equation 5.9 at the current iteration, I is an identity matrix, and $\lambda_k$ is a damping factor. For the next iteration, $\lambda_{k+1}$ is updated by
formula
Note that the multiplicative factors in this update are constants and
formula
where
$$\nabla f(\theta_k)^T d + \frac{1}{2} d^T G_k \, d \tag{5.10}$$
and $\rho_k$ is the ratio between the actual and predicted function reductions. According to Martens and Sutskever (2012), this adjustment of the direction is practically useful in training deep networks.
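As an illustration of the Levenberg-Marquardt adjustment, the sketch below updates the damping factor from the reduction ratio $\rho_k$: the damping is enlarged when the quadratic model overestimates progress and shrunk when it is trustworthy. The thresholds 1/4 and 3/4 and the factors 3/2 and 2/3 follow a common choice associated with Martens (2010), but the exact constants should be treated as our assumption rather than this paper's setting.

```python
def update_damping(lmbda, rho, boost=1.5, drop=2.0 / 3.0):
    """Levenberg-Marquardt heuristic for the damping factor lambda:
    increase it when rho (actual/predicted reduction) is small and
    decrease it when rho is large."""
    if rho < 0.25:
        return boost * lmbda
    if rho > 0.75:
        return drop * lmbda
    return lmbda
```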
Next we discuss how to improve the direction using the methods developed in section 3. Because deep neural networks have a large number of variables, we solve equation 3.7 instead of equation 3.4. Further, $H_{S_k}$ in equation 3.7 is replaced with the Gauss-Newton matrix $G_k$. The optimization problem is convex in $\beta_1$ and $\beta_2$, so the following linear system is solved:
$$\begin{bmatrix} d_k^T G_k d_k & d_k^T G_k \bar{d}_k \\ \bar{d}_k^T G_k d_k & \bar{d}_k^T G_k \bar{d}_k \end{bmatrix} \begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix} = \begin{bmatrix} -\nabla f(\theta_k)^T d_k \\ -\nabla f(\theta_k)^T \bar{d}_k \end{bmatrix}. \tag{5.11}$$

One may argue that $d_k$ is already the best solution for minimizing the second-order approximation defined in equation 5.10, so equation 3.7 should not give a better direction. However, because of the damping factor $\lambda_k$, $d_k$ is not the optimal solution for minimizing equation 5.10. Therefore, equation 5.11 may be useful for finding a more accurate solution to minimize equation 5.10 and hence obtaining a better direction. We provide detailed experiments in section 6.3.

Because the objective function of deep networks is nonconvex and our algorithm has been extended to incorporate the LM procedure, we do not have theoretical convergence like that in section 4.

6  Experiments

In this section, we conduct experiments on logistic regression, l2-loss linear SVM, maximum entropy, and deep neural networks. We compare the proposed approaches with existing subsampled Hessian Newton methods in Byrd et al. (2011) and Martens (2010).

Programs for experiments in this letter can be found at http://www.csie.ntu.edu.tw/∼cjlin/papers/sub_hessian/sub_hessian_exps.tar.gz. All data sets except one are publicly available at http://www.csie.ntu.edu.tw/∼cjlin/libsvmtools/datasets/.

6.1  Logistic Regression and l2-Loss Linear SVM

We select some large, sparse, and two-class data for experiments. Such data sets are commonly used in evaluating linear classifiers such as SVM and logistic regression. Detailed data statistics are in Table 1.

Table 1:
Summary of Two-Class Sets Used for Experiments on Logistic Regression and l2-Loss Linear SVM.
Data Set    n            n̄      l             C_LR       C_SVM
—           1,355,191    455    19,996        64l        64l
—           3,052,939    340    368,444       32l        2l
—           20,216,830   36     8,407,752     0.0625l    0.015625l
—           29,890,095   29     19,264,097    0.0625l    0.015625l

Notes: n is the number of features, and n̄ is the average number of nonzero features per training example. l is the number of training examples. C_LR and C_SVM are the regularization parameters, among the candidate values considered, achieving the best five-fold cross-validation accuracy for logistic regression and l2-loss linear SVM, respectively.

We compare five methods:

  1. Full: The Newton method of using the full Hessian. We do not use $\mathrm{CG}_{\max}$ (the maximal number of CG iterations) as a stopping condition of the CG procedure, so only equation 2.2 is used.

  2. Full-CG: The Newton method of using the full Hessian. $\mathrm{CG}_{\max}$ is set as the maximal number of CG steps. For example, Full-CG10 means that $\mathrm{CG}_{\max} = 10$.

  3. Subsampled: The method proposed in Byrd et al. (2011), where the backtracking line search starts with $\alpha = 1$. See also algorithm 1.

  4. Method 1: The same as Subsampled, but the initial $\alpha$ for the backtracking line search is obtained by equation 3.3. Although this modification is minor, we are interested in checking how important the initial $\alpha$ of line search is in a subsampled Newton method.

  5. Method 2: The method proposed in section 3 by using the direction $\beta_1 d_k + \beta_2 \bar{d}_k$ with $\bar{d}_k = d_{k-1}$. Note that we set $\bar{d}_0 = 0$.

For the CG procedure at each iteration, we fix the tolerance $\sigma$ in the stopping condition, equation 2.2. Following Byrd et al. (2011), for all methods except Full, we set
formula
because in some situations equation 2.2 is difficult to satisfy.

The constant $\eta$ in the sufficient decrease condition and the ratio of the backtracking line search (see equation 2.4) are fixed in all experiments. Our experimental framework is modified from the Newton-CG implementation of the software LIBLINEAR (Fan et al., 2008).

We present the running time in Figure 3 (logistic regression) and Figure 4 (l2-loss linear SVM) by using two subsampling rates for the subset $S_k$ (one larger and one smaller); the two rates correspond to the left and right columns of each figure.

Figure 3:

Experiments on logistic regression. We present running time (in seconds) versus the relative difference to the optimal function value. Both the x-axis and the y-axis are log-scaled. The left and right columns correspond to the two subsampling rates.


Figure 4:

Experiments on l2-loss linear SVM. We present running time (in seconds) versus the relative difference to the optimal function value. Both the x-axis and the y-axis are log-scaled. The left and right columns correspond to the two subsampling rates.


In each panel, the x-axis shows the running time in seconds, while the y-axis gives the relative difference to the optimal value, defined as
$$\frac{f(w_k) - f(w^*)}{f(w^*)}, \tag{6.1}$$
where $w^*$ is an optimal solution and $w_k$ is the kth iterate. Because $f(w^*)$ is not available, we run one method long enough to obtain an approximate reference value. Note that in Figures 3 and 4, both the x-axis and the y-axis are log-scaled.
We can make four observations. First, among the three subsampled Hessian Newton methods, in general
$$\text{Method 2} \; \prec \; \text{Method 1} \; \prec \; \text{Subsampled},$$
where $A \prec B$ indicates that A converges faster than B. The difference is bigger when using a smaller $S_k$. In this situation, the obtained direction using the subsampled Hessian is very different from the full Newton direction, so some adjustments are very useful. Overall our approach (Method 2) significantly improves on the approach in Byrd et al. (2011). In Figures 4c and 4d, Method 1 is slightly worse than Subsampled. An investigation shows that at final iterations, the sufficient decrease condition, equation 2.3, holds at $\alpha = 1$. In such a situation, the effort by Method 1 to find a good initial $\alpha$ is not very cost-effective.

Second, when $|S_k|$ is reduced from the larger rate to the smaller one, the running time of the three subsampled Newton methods (Subsampled, Method 1, Method 2) increases. This result is expected because a smaller $|S_k|$ leads to worse directions.

Third, a comparison between Full and Full-CG10 shows that the number of CG steps per outer iteration may significantly affect the running time. A smaller $\mathrm{CG}_{\max}$ reduces the cost per outer iteration but may cause more outer iterations. In Figures 3 and 4, Full-CG10 is in general slower than Full, so selecting a larger $\mathrm{CG}_{\max}$ seems to be necessary for these problems. On the other hand, except in Figures 3a and 4a, Subsampled-CG10 is faster than Full-CG10. This result is consistent with that in Byrd et al. (2011). However, Subsampled-CG10 is slower than Full, where the two differ not only in the use of the subsampled or full Hessian but also in the stopping condition of the CG procedure. This example indicates that in comparing two methods, it is important to keep all settings except the one under analysis the same.

Finally, after using our proposed techniques, Method 2 becomes faster than Full and Full-CG10. The only exception is the data set whose number of features is much larger than its number of instances, so using only a subset $S_k$ may cause significant information loss. Therefore, subsampled Newton methods are less suitable for such data sets. In addition, the C value chosen for this set is relatively large. The Hessian matrix becomes more ill conditioned in this situation, so using the full Hessian can obtain better search directions.

The discussion indicates the importance of setting a proper $\mathrm{CG}_{\max}$ value, so in Figure 5, we analyze results of using various $\mathrm{CG}_{\max}$ values. The best $\mathrm{CG}_{\max}$ differs between Figures 5a and 5b, so the best value is problem dependent, regardless of using the full or subsampled Hessian. Unfortunately, we do not have a good strategy for selecting a suitable $\mathrm{CG}_{\max}$, so this is a future issue for investigation.

Figure 5:

A comparison between different $\mathrm{CG}_{\max}$ values. Two data sets, the subsampled Hessian, and logistic regression (LR) are considered. We present running time (in seconds) versus the relative difference to the optimal function value. Both the x-axis and the y-axis are log-scaled.


An important issue of the proposed method in section 3 is the selection of $\bar{d}_k$. While we have experimentally demonstrated that $\bar{d}_k = d_{k-1}$ is useful, we have also checked
$$\bar{d}_k = -\nabla f(w_k).$$
Because of space considerations, details are in the supplementary materials. Results clearly show that using $\bar{d}_k = d_{k-1}$ is much better than $\bar{d}_k = -\nabla f(w_k)$. We believe the superiority of $d_{k-1}$ comes from solving a subproblem that uses some second-order information.

6.2  Maximum Entropy for Multiclass Classification

We select some large multiclass data sets for experiments. Detailed data statistics are in Table 2. For simplicity, the same regularization parameter C is used for all problems.

Table 2:
Summary of Multiclass Data Sets Used for Experiments on Maximum Entropy.
Data Set    n         n̄      l          k
—           780       150    60,000     10
—           47,236    65     518,571    53
—           55,197    163    6412       105
—           128       31     86,400     1,000

Notes: n is the number of features, and n̄ is the average number of nonzero features per training example. l is the number of training examples, and k is the number of classes. For the last set, we use only 80% of the data because in section 6.3 the remaining 20% are used as the test set.

We compare the five settings considered in section 6.1, present results in Figure 6, and have the following observations.

Figure 6:

Experiments on maximum entropy. We present running time (in seconds) versus the relative difference to the optimal function value. Both the x-axis and the y-axis are log-scaled. The left and right columns correspond to the two subsampling rates.


First, as in section 6.1, we still have
$$\text{Method 2} \; \prec \; \text{Method 1} \; \prec \; \text{Subsampled}.$$
However, the difference between Method 2 and the other two is generally smaller. This seems to indicate that the direction obtained using the subsampled Hessian is good enough and that the use of two directions by Method 2 does not offer significant improvements. In this regard, the selection of a good initial step size for line search becomes more crucial, so in Figures 6c and 6d, Method 1 is much better than Subsampled, but it is almost the same as Method 2.

Second, except in Figure 6a, all subsampled methods (Subsampled, Method 1, and Method 2) are faster than Full or Full-CG10. This result is similar to what has been reported in Byrd et al. (2011). Therefore, for these multiclass problems, the subsampled Hessian method may have obtained a good enough direction, a situation consistent with our conclusion from the previous observation.

Finally, Full is much faster than Full-CG10 in Figure 6a but is slower in the others. This result confirms our earlier finding that a suitable $\mathrm{CG}_{\max}$ is problem dependent.

6.3  Deep Learning for Multiclass Classification

For deep neural networks, we consider some multiclass data sets listed in Table 3. They are different from those used in section 6.2 because our deep neural network implementation is not suitable for sparse data with many features.

Table 3:
Summary of the Data Sets Used for Experiments on Deep Networks.
Data Set    n      l         lt        k        Deep Structure
—           16     7494      3498      10       *-300-200-100-30-*
—           256    7291      2007      10       *-500-250-100-30-*
—           780    60,000    10,000    10       *-1000-500-250-30-*
—           128    86,400    21,600    1,000    *-1000-*

Notes: n is the number of features. l is the number of training instances. lt is the number of testing instances. k is the number of classes. For the last column, the first * means the number of features, and the last * means the number of classes. Note that either the set we obtained is already suitably scaled or we conduct a feature-wise scaling on the data set.

We compare the following methods. The first three are variants of the subsampled Hessian method considered in Martens (2010). They differ in selecting a CG iterate as the approximate Newton direction. Our experimental framework is modified from the code at http://www.cs.toronto.edu/∼jmartens/docs/HFDemo.zip. Note that Martens (2010) considers the following stopping condition rather than equation 2.2 for the CG procedure. Let $d^1, d^2, \ldots$ be the sequence of CG iterates. The CG procedure stops at the ith step, returning the ith iterate, if
formula
6.2
where the constants in this condition are prespecified.
  1. Martens-sub: This method, proposed in Martens (2010), stores a subset of CG iterates and selects the one such that the objective value of using the subset $S_k$ (i.e., $f_{S_k}$) is the smallest.

  2. Martens-last: This method, proposed in Martens (2010), selects the last CG iterate as the search direction. This setting corresponds to the Subsampled method in section 6.1.

  3. Martens-full: This method, proposed in Martens (2010), stores a subset of CG iterates and selects the one such that the objective function value of using the full data set (i.e., $f$) is the smallest.

  4. Comb1: The method proposed in section 3 by using the direction $\beta_1 d_k + \beta_2 d_{k-1}$. Note that equation 3.7 is used, and all other settings are the same as Martens-last. (For more details, see section 5.4.)

  5. Comb2: The same as Comb1 except that the stopping condition, equation 2.2, is used for the CG procedure.

For simplicity, we do not implement some tricks used in Martens (2010):

  • No preconditioning.

  • No use of the previous solution as the initial guess in the CG procedure. The zero vector is used as the initial CG iterate.

For Comb2, we set the tolerance $\sigma$ in equation 2.2 to a value smaller than that in section 6.1 because otherwise the CG stopping condition is too loose. In addition, for the Levenberg-Marquardt method, we consider the same setting as Martens (2010). We use the subsampling rate and the regularization constant from Martens (2010) for equation 5.8.

From the results presented in Figure 7, we observe that the proposed Comb1 and Comb2 methods are faster than the other methods. Therefore, the optimization problem, equation 3.7, is useful for improving the quality of the search direction. The difference between Comb1 and Comb2 is generally small, but for two of the data sets, Comb2 is slightly better because of a smaller number of CG iterations per outer iteration. We observe that equation 2.2 with the chosen tolerance is generally looser than equation 6.2, while as mentioned earlier, the tolerance used in section 6.1 would make equation 2.2 too loose here. Therefore, similar to the issue discussed earlier on finding a suitable $\mathrm{CG}_{\max}$, we clearly see the importance of using a CG stopping condition that is neither too loose nor too strict.

Figure 7:

Experiments on deep neural networks. We present running time (in seconds) versus test error.


7  Conclusion

In this letter, we have proposed novel techniques to improve the subsampled Hessian Newton method. We demonstrate the effectiveness of our method on logistic regression, linear SVM, maximum entropy, and deep neural networks. The asymptotic convergence is proved, and the running time is shown to be shorter than that of the methods in Byrd et al. (2011) and Martens (2010). This work gives a compelling example showing that a little extra cost in finding the search direction may lead to dramatic overall improvement.

Acknowledgments

This work was supported in part by the National Science Council of Taiwan grant 101-2221-E-002-199-MY3. We thank the anonymous reviewers for valuable comments.

References

Byrd, R. H., Chin, G. M., Neveitt, W., & Nocedal, J. (2011). On the use of stochastic Hessian information in optimization methods for machine learning. SIAM Journal on Optimization, 21(3), 977–995.

Chapelle, O., & Erhan, D. (2011). Improved preconditioner for Hessian free optimization. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning.

Conn, A. R., Gould, N. I. M., & Toint, P. L. (2000). Trust-region methods. Philadelphia: Society for Industrial and Applied Mathematics.

Drori, Y., & Teboulle, M. (2014). Performance of first-order methods for smooth convex minimization: A novel approach. Mathematical Programming, 145, 451–482.

Fan, R.-E., Chang, K.-W., Hsieh, C.-J., Wang, X.-R., & Lin, C.-J. (2008). LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9, 1871–1874.

Golub, G. H., & Van Loan, C. F. (1996). Matrix computations (3rd ed.). Baltimore, MD: Johns Hopkins University Press.

Griva, I., Nash, S. G., & Sofer, A. (2009). Linear and nonlinear optimization (2nd ed.). Philadelphia: SIAM.

Huang, F.-L., Hsieh, C.-J., Chang, K.-W., & Lin, C.-J. (2010). Iterative scaling and coordinate descent methods for maximum entropy. Journal of Machine Learning Research, 11, 815–848.

Keerthi, S. S., & DeCoste, D. (2005). A modified finite Newton method for fast solution of large scale linear SVMs. Journal of Machine Learning Research, 6, 341–361.

Lin, C.-J., Weng, R. C., & Keerthi, S. S. (2008). Trust region Newton method for large-scale logistic regression. Journal of Machine Learning Research, 9, 627–650.

Mangasarian, O. L. (2002). A finite Newton method for classification. Optimization Methods and Software, 17(5), 913–929.

Mangasarian, O. L. (2006). Exact 1-norm support vector machines via unconstrained convex differentiable minimization. Journal of Machine Learning Research, 7, 1517–1530.

Martens, J. (2010). Deep learning via Hessian-free optimization. In Proceedings of the 27th International Conference on Machine Learning (ICML). Madison, WI: Omnipress.

Martens, J., & Sutskever, I. (2012). Training deep and recurrent networks with Hessian-free optimization. In G. Montavon, G. B. Orr, & K.-R. Müller (Eds.), Neural networks: Tricks of the trade (pp. 479–535). Berlin: Springer.

Moré, J. J. (1978). The Levenberg-Marquardt algorithm: Implementation and theory. In G. Watson (Ed.), Numerical analysis, Lecture Notes in Mathematics 630 (pp. 105–116). New York: Springer-Verlag.

Pearlmutter, B. A. (1994). Fast exact multiplication by the Hessian. Neural Computation, 6(1), 147–160.

Polyak, B. T. (1964). Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5), 1–17.

Schraudolph, N. N. (2002). Fast curvature matrix-vector products for second-order gradient descent. Neural Computation, 14(7), 1723–1738.

Wengert, R. E. (1964). A simple automatic derivative evaluation program. Communications of the ACM, 7(8), 463–464.

Notes

1

See references of CG in, for example, Golub and Van Loan (1996).

2

This property holds because the initial point of the CG procedure is the zero vector.

3

Note that from the recent development of linear classification, it is known that the number of data accesses may affect the running time more than the number of operations.

4

For a detailed explanation connecting LR and ME, see, for example, section 5.2 of Huang, Hsieh, Chang, and Lin (2010).

5

Note that $n_0$ is the number of features and $n_L$ is the number of classes.

6

In this letter, we use the sigmoid function as the activation function, that is, $\sigma(z) = 1/(1 + e^{-z})$.
