Abstract

The Levenberg-Marquardt (LM) learning algorithm is a popular algorithm for training neural networks; however, for large neural networks, it becomes prohibitively expensive in terms of running time and memory requirements. The most time-critical step of the algorithm is the calculation of the Gauss-Newton matrix, which is formed by multiplying two large Jacobian matrices together. We propose a method that uses backpropagation to reduce the time of this matrix-matrix multiplication. This reduces the overall asymptotic running time of the LM algorithm by a factor of the order of the number of output nodes in the neural network.

1.  Introduction

A neural network is a smooth function $y = f(x, w)$ that maps an input column vector $x$ to an output column vector $y$, where $w$ is a parameter vector known as the weight vector.

For the specific input and output vectors $x_p$ and $y_p$, corresponding to a training pattern $p$, the Jacobian matrix of the neural network is defined to be $J_p = \partial y_p / \partial w$, which is a matrix with element $(i, j)$ equal to $\partial (y_p)_i / \partial w_j$. The Gauss-Newton matrix is defined to be $G = \sum_p G_p$, where $G_p = J_p^T J_p$. We define $n_o$ and $n_w$ to be the dimensions of the vectors $y$ and $w$, respectively, and $n_p$ as the number of training patterns. Then $J_p$ is an $n_o \times n_w$ matrix, and so forming the matrix $G$ by direct matrix multiplication and summation over all patterns would take $2 n_o n_p n_w^2$ floating-point operations (flops), ignoring lower-power terms.
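
For illustration, the following sketch (in Python with JAX; the array name jacobians and its layout are our own illustrative assumptions, not part of the note) forms $G$ by exactly this direct multiplication and summation:

    import jax.numpy as jnp

    def gauss_newton_direct(jacobians):
        # jacobians: assumed array of shape (n_p, n_o, n_w) holding J_p for
        # every pattern p.  Each term J_p.T @ J_p costs about 2*n_o*n_w**2
        # flops, giving the 2*n_o*n_p*n_w**2 total quoted above.
        n_p, n_o, n_w = jacobians.shape
        G = jnp.zeros((n_w, n_w))
        for J_p in jacobians:
            G = G + J_p.T @ J_p
        return G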

We describe a technique that can calculate the $G$ matrix in the faster time of approximately $3 n_p n_w^2$ flops (ignoring lower-power terms). This faster algorithm is related to the method of Schraudolph (2002) and exploits the fact that backpropagation (Werbos, 1974; Rumelhart, Hinton, & Williams, 1986) can be used to quickly multiply an arbitrary column vector on the left by $J_p^T$.

Forming the $G$ matrix is important because it is central to the Levenberg-Marquardt (LM) training algorithm (Levenberg, 1944; Marquardt, 1963). The LM algorithm uses a weight update that requires the inverse of $G$; details are given by Bishop (1995). Since $G$ is an $n_w \times n_w$ matrix, its inversion takes time $O(n_w^3)$, and since usually $n_p \gg n_w$, it turns out that the formation of the matrix $G$ is usually slower than its inversion. Hence, our algorithm reduces the asymptotic time of the most time-critical step of the LM algorithm. Previous research to speed up the formation of $G$ has concentrated on parallel implementations (Suri, Deodhare, & Nagabhushan, 2002).
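
As context, a minimal sketch of one LM weight update in this notation, using the standard damped form $\Delta w = -(G + \lambda I)^{-1} \nabla E$ (the function and variable names here are our own and purely illustrative):

    import jax.numpy as jnp

    def lm_update(w, G, grad_E, damping):
        # Solving the damped n_w x n_w linear system is the O(n_w^3) step
        # referred to above; a linear solve is used rather than forming an
        # explicit matrix inverse.
        n_w = w.shape[0]
        delta_w = -jnp.linalg.solve(G + damping * jnp.eye(n_w), grad_E)
        return w + delta_w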

2.  The Technique

Backpropagation is an algorithm that calculates the gradient $\partial E_p / \partial w$ very efficiently for a given pattern $p$ and error function $E_p$. If we assume the computations at the nodes of the network are dwarfed by those at the network weights, then the backpropagation algorithm takes $3 n_w$ flops per pattern.

By the chain rule, $\partial E_p / \partial w = J_p^T \, \partial E_p / \partial y$. Hence, we see that backpropagation can be used to multiply a column vector, $\partial E_p / \partial y$, very efficiently on the left by the transposed Jacobian matrix. The choice of column vector here is arbitrary; it does not have to be $\partial E_p / \partial y$. This is the trick we use to create our fast algorithm for calculating $G$.
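
In modern terms, this multiplication by $J_p^T$ is a vector-Jacobian product, which reverse-mode automatic differentiation provides directly. A minimal JAX sketch, where net(w, x) is a hypothetical network function returning the $n_o$-dimensional output for weights w and input x:

    import jax

    def jacobian_transpose_times_v(net, w, x_p, v):
        # One backward pass: jax.vjp returns the output y_p together with a
        # function mapping a length-n_o vector v to J_p^T v.
        y_p, vjp_fn = jax.vjp(lambda w_: net(w_, x_p), w)
        (jtv,) = vjp_fn(v)
        return jtv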

A standard method to calculate the Jacobian matrix is as follows. To calculate the $i$th row of $J_p$, we use backpropagation to multiply $J_p^T$ by the $i$th column of $I$, an $n_o \times n_o$ identity matrix. Repeating this for each of the $n_o$ outputs calculates the full $J_p$ matrix in $3 n_o n_w$ flops.
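
A sketch of this standard row-by-row Jacobian calculation, again with a hypothetical network function net(w, x):

    import jax
    import jax.numpy as jnp

    def jacobian_by_rows(net, w, x_p, n_o):
        # Backpropagating the i-th column of the n_o x n_o identity matrix
        # yields J_p^T e_i, i.e. the i-th row of J_p; n_o backward passes.
        _, vjp_fn = jax.vjp(lambda w_: net(w_, x_p), w)
        eye = jnp.eye(n_o)
        rows = [vjp_fn(eye[i])[0] for i in range(n_o)]
        return jnp.stack(rows)          # shape (n_o, n_w)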

The new method to calculate the $G_p$ matrix is as follows. Since $G_p = J_p^T J_p$, the $i$th column of $G_p$ is equal to the product of the matrix $J_p^T$ with the $i$th column of $J_p$. Hence, each column of $G_p$ can be calculated using one pass of backpropagation. Therefore, calculating the whole $G_p$ matrix from a given $J_p$ matrix takes $3 n_w^2$ flops.
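
A sketch of this column-by-column construction of $G_p$ from a previously computed $J_p$, using the same hypothetical net as above:

    import jax
    import jax.numpy as jnp

    def gauss_newton_per_pattern(net, w, x_p, J_p):
        # Column i of G_p = J_p^T J_p is J_p^T applied to column i of J_p,
        # so one backward pass per column: n_w passes of ~3*n_w flops each.
        _, vjp_fn = jax.vjp(lambda w_: net(w_, x_p), w)
        n_w = J_p.shape[1]
        cols = [vjp_fn(J_p[:, i])[0] for i in range(n_w)]
        return jnp.stack(cols, axis=1)  # shape (n_w, n_w)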

In addition to the time taken to calculate $J_p$ and $G_p$, we also need one initial forward pass through the network, which takes $2 n_w$ flops. Hence, the total flop count to calculate $G$, summing over all $n_p$ patterns, is $n_p (2 n_w + 3 n_o n_w + 3 n_w^2)$. Since usually $n_w \gg n_o$, the most significant term here is $3 n_p n_w^2$ flops.
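
As a worked example under assumed sizes (the numbers below are purely illustrative), the two flop counts compare as follows:

    # Illustrative sizes only.
    n_p, n_o, n_w = 10_000, 10, 1_000

    direct   = 2 * n_o * n_p * n_w**2                        # ~2.0e11 flops
    proposed = n_p * (2 * n_w + 3 * n_o * n_w + 3 * n_w**2)  # ~3.0e10 flops
    print(direct / proposed)                                 # ~6.6, about 2*n_o/3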

3.  Discussion

Since the work of Schraudolph (2002) allows fast multiplication of the $G$ matrix by an arbitrary column vector, in a time of $7 n_p n_w$ flops, it would be trivial to extend that work to form the full $G$ matrix column by column. This would give an algorithm asymptotically equivalent to ours, but with a slower absolute flop count of $7 n_p n_w^2$.

The calculation times of the direct multiplication method and of our method could both be further halved by exploiting the symmetry of $G$.

Our calculations indicate that while Strassen multiplication (Huss-Lederman, Jacobson, Tsao, Turnbull, & Johnson, 1996) is not useful in calculating $G_p$ for a single pattern, it does confer an asymptotic advantage when calculating $G$ for all patterns in a single outer product. However, doing so is memory intensive and significantly more complicated to implement than our method.

We have not considered hardware acceleration and caching issues, both of which would likely favor conventional matrix multiplication over our method.

4.  Conclusion

We have presented a way to use backpropagation to reduce the time taken to calculate the Gauss-Newton matrix in the Levenberg-Marquardt algorithm by a factor proportional to $n_o$. This reduces the most time-critical step of the LM algorithm and so could be a useful tool for optimizing any LM implementation where $n_o \gg 1$.

Acknowledgments

We are grateful to the anonymous reviewers for their suggestions for this note.

References

Bishop, C. M. (1995). Neural networks for pattern recognition. New York: Oxford University Press.

Huss-Lederman, S., Jacobson, E. M., Tsao, A., Turnbull, T., & Johnson, J. R. (1996). Implementation of Strassen's algorithm for matrix multiplication. In Proceedings of the 1996 ACM/IEEE Conference on Supercomputing (CDROM), Supercomputing '96. Washington, DC: IEEE Computer Society.

Levenberg, K. (1944). A method for the solution of certain non-linear problems in least squares. Quart. Appl. Math., 2, 164-168.

Marquardt, D. W. (1963). An algorithm for least-squares estimation of nonlinear parameters. Journal of the Society for Industrial and Applied Mathematics, 11, 431-441.

Rumelhart, D., Hinton, G., & Williams, R. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533-536.

Schraudolph, N. N. (2002). Fast curvature matrix-vector products for second-order gradient descent. Neural Computation, 14(7), 1723-1738.

Suri, N. N. R. R., Deodhare, D., & Nagabhushan, P. (2002). Parallel Levenberg-Marquardt-based neural network training on Linux clusters—a case study. In ICVGIP 2002, 3rd Indian Conference on Computer Vision, Graphics and Image Processing. N.p.: Allied Publishers.

Werbos, P. J. (1974). Beyond regression: New tools for prediction and analysis in the behavioral sciences. Unpublished doctoral dissertation, Harvard University.