## Abstract

The Levenberg-Marquardt (LM) learning algorithm is a popular algorithm for training neural networks; however, for large neural networks, it becomes prohibitively expensive in terms of running time and memory requirements. The most time-critical step of the algorithm is the calculation of the Gauss-Newton matrix, which is formed by multiplying two large Jacobian matrices together. We propose a method that uses backpropagation to reduce the time of this matrix-matrix multiplication. This reduces the overall asymptotic running time of the LM algorithm by a factor of the order of the number of output nodes in the neural network.

## 1. Introduction

A neural network is a smooth function *y* = *f*(*x*, *w*) that maps an input column vector *x* to an output column vector *y*, where *w* is a parameter vector known as the weight vector.

For the specific input and output vectors *x _{p}* and *y _{p}* corresponding to a training pattern *p*, the Jacobian matrix of the neural network is defined to be *J _{p}* = ∂*y _{p}*/∂*w*, a matrix with element (*i*, *j*) equal to ∂(*y _{p}*) _{i}/∂*w _{j}*. The Gauss-Newton matrix is defined to be *G* = ∑ _{p}*G _{p}*, where *G _{p}* = *J _{p}*^{T}*J _{p}*. We define *n _{o}* as the number of output nodes, *n _{w}* as the number of weights, and *n _{p}* as the number of training patterns. Then *J _{p}* is an *n _{o}* × *n _{w}* matrix, and so forming the matrix *G* by direct matrix multiplication and summation over all patterns would take 2*n _{o}n_{p}n_{w}*^{2} floating-point operations (flops), ignoring lower-power terms.
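As a concrete illustration of this direct method, the following NumPy sketch forms *G* by explicit multiplication and summation over patterns. The shapes are illustrative and the Jacobians are random stand-ins rather than matrices computed from a real network:

```python
import numpy as np

# Illustrative shapes; the Jacobians here are random stand-ins rather
# than ones computed from an actual network.
rng = np.random.default_rng(0)
n_p, n_o, n_w = 10, 5, 8
Js = rng.standard_normal((n_p, n_o, n_w))   # one n_o x n_w Jacobian per pattern

# Direct method: G = sum_p J_p^T J_p, costing ~2 n_o n_p n_w^2 flops.
G = sum(J.T @ J for J in Js)

assert G.shape == (n_w, n_w)
assert np.allclose(G, G.T)   # G is symmetric by construction
```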

We describe a technique that can calculate the *G* matrix in the faster time of approximately 3*n _{p}n_{w}*^{2} flops (ignoring lower-power terms). This faster algorithm is related to the method of Schraudolph (2002) and exploits the fact that backpropagation (Werbos, 1974; Rumelhart, Hinton, & Williams, 1986) can be used to quickly multiply an arbitrary column vector on the left by *J _{p}*^{T}.

Forming the *G* matrix is important because it is central to the Levenberg-Marquardt (LM) training algorithm (Levenberg, 1944; Marquardt, 1963). The LM algorithm uses a weight update that requires the inverse of *G*. Details are given by Bishop (1995). Since *G* is an *n _{w}* × *n _{w}* matrix, its inversion will take time O(*n _{w}*^{3}), and since usually *n _{p}* ≫ *n _{w}*, it turns out that the formation of the matrix *G* is usually slower than its inversion. Hence, our algorithm reduces the asymptotic time of the most time-critical step of the LM algorithm. Previous research to speed up the formation of *G* has concentrated on parallel implementations (Suri, Deodhare, & Nagabhushan, 2002).

## 2. The Technique

Backpropagation is an algorithm to calculate the gradient ∂*E _{p}*/∂*w* very efficiently for a given pattern *p* and error function *E _{p}*. If we assume the computations at the nodes of the network are dwarfed by those at the network weights, then the backpropagation algorithm takes 3*n _{w}* flops per pattern.

By the chain rule, ∂*E _{p}*/∂*w* = *J _{p}*^{T}(∂*E _{p}*/∂*y _{p}*). Hence, we see that backpropagation can be used to multiply a column vector, ∂*E _{p}*/∂*y _{p}*, very efficiently on the left by the transposed Jacobian matrix. The choice of column vector here is arbitrary; it does not have to specifically be ∂*E _{p}*/∂*y _{p}*. This is the trick we use to create our fast algorithm for calculating *G*.
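To make the trick concrete, here is a minimal sketch for a hypothetical one-hidden-layer network *y* = *W*_{2} tanh(*W*_{1}*x*) (the network, shapes, and the weight ordering [*W*_{1}; *W*_{2}] are our illustrative assumptions, not from the note): a single backward pass computes *J _{p}*^{T}*v* for an arbitrary *v*, checked against a finite-difference Jacobian.

```python
import numpy as np

# Toy network: y = W2 @ tanh(W1 @ x). The network, shapes, and weight
# ordering [W1; W2] are illustrative assumptions.
rng = np.random.default_rng(0)
n_in, n_hid, n_out = 3, 4, 2
W1 = rng.standard_normal((n_hid, n_in))
W2 = rng.standard_normal((n_out, n_hid))
x = rng.standard_normal(n_in)

def y_of_w(w):
    """Network output as a function of the flattened weight vector."""
    W1_ = w[:W1.size].reshape(W1.shape)
    W2_ = w[W1.size:].reshape(W2.shape)
    return W2_ @ np.tanh(W1_ @ x)

def backprop_JTv(v):
    """One backward pass: returns J_p^T v, where J_p = dy/dw."""
    h = np.tanh(W1 @ x)
    gW2 = np.outer(v, h)              # d(v.y)/dW2
    dh = (W2.T @ v) * (1 - h**2)      # back through tanh
    gW1 = np.outer(dh, x)             # d(v.y)/dW1
    return np.concatenate([gW1.ravel(), gW2.ravel()])

# Central-difference Jacobian, for verification only.
w0 = np.concatenate([W1.ravel(), W2.ravel()])
eps = 1e-6
J = np.stack([(y_of_w(w0 + eps * e) - y_of_w(w0 - eps * e)) / (2 * eps)
              for e in np.eye(w0.size)], axis=1)   # shape (n_out, n_w)

v = rng.standard_normal(n_out)
assert np.allclose(backprop_JTv(v), J.T @ v, atol=1e-5)
```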

A standard method to calculate the Jacobian matrix is as follows. To calculate the *i*th row of *J _{p}*, we use backpropagation to multiply *J _{p}*^{T} by the *i*th column of *I*, an *n _{o}* × *n _{o}* identity matrix. Repeating this for all *i* ∈ {1, 2, …, *n _{o}*} outputs will calculate the full *J _{p}* matrix in 3*n _{o}n_{w}* flops.

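A sketch of this standard method, using the same hypothetical one-hidden-layer network as an assumption (*y* = *W*_{2} tanh(*W*_{1}*x*), weights flattened as [*W*_{1}; *W*_{2}]), restated so the block is self-contained: each identity column is backpropagated to give one row of *J _{p}*.

```python
import numpy as np

# Illustrative toy network: y = W2 @ tanh(W1 @ x), weights flattened [W1; W2].
rng = np.random.default_rng(1)
n_in, n_hid, n_out = 3, 4, 2
W1 = rng.standard_normal((n_hid, n_in))
W2 = rng.standard_normal((n_out, n_hid))
x = rng.standard_normal(n_in)

h = np.tanh(W1 @ x)          # forward pass, reused by every backward pass

def JTv(v):
    """One backward pass: J_p^T v."""
    dh = (W2.T @ v) * (1 - h**2)
    return np.concatenate([np.outer(dh, x).ravel(),
                           np.outer(v, h).ravel()])

# Row i of J_p is (J_p^T e_i)^T: one backprop pass per output node.
J = np.stack([JTv(e) for e in np.eye(n_out)])   # shape (n_o, n_w)

assert J.shape == (n_out, W1.size + W2.size)
# Spot check: dy_0 / d(W2)_{0,0} = h_0.
assert np.isclose(J[0, W1.size], h[0])
```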
The new method to calculate the *G _{p}* matrix is as follows. Since *G _{p}* = *J _{p}*^{T}*J _{p}*, the *i*th column of *G _{p}* is equal to the product of the matrix *J _{p}*^{T} with the *i*th column of *J _{p}*. Hence each column of *G _{p}* can be calculated using one pass of backpropagation. Therefore, calculating the whole *G _{p}* matrix from a given *J _{p}* matrix takes 3*n _{w}*^{2} flops.

In addition to the time taken to calculate *J _{p}* and *G _{p}*, we also need one initial forward pass through the network, which will take 2*n _{w}* flops. Hence, the total flop count to calculate *G*, when summing over all *n _{p}* patterns, is *n _{p}*(2*n _{w}* + 3*n _{o}n_{w}* + 3*n _{w}*^{2}). Since usually *n _{w}* ≫ *n _{o}*, the most significant term here is 3*n _{p}n_{w}*^{2} flops.

## 3. Discussion

Since the work of Schraudolph (2002) allows fast multiplication of the *G* matrix by an arbitrary column vector, in time 7*n _{p}n_{w}* flops, it would be trivial to extend that work to form the full *G* matrix column by column. This would give an algorithm asymptotically equivalent to ours, but with a slower absolute flop count of 7*n _{p}n_{w}*^{2}.

The calculation time of the direct multiplication method and our method could both be halved further by exploiting the symmetry of *G*.

Our calculations indicate that while Strassen multiplication (Huss-Lederman, Jacobson, Tsao, Turnbull, & Johnson, 1996) is not useful in calculating *G _{p}* for a single pattern, it does confer an asymptotic advantage when calculating *G* for all patterns in a single outer product. However, doing so is memory intensive and significantly more complicated to implement than our method.

We have not considered hardware acceleration and caching issues, both of which would likely favor conventional matrix multiplication over our method.

## 4. Conclusion

We have presented a way to use backpropagation to reduce the time taken to calculate the Gauss-Newton matrix in Levenberg-Marquardt training by a factor proportional to *n _{o}*. This reduces the most time-critical step of the LM algorithm and so could be a useful tool to optimize any LM implementation where *n _{o}* ≫ 1.

## Acknowledgments

We are grateful to the anonymous reviewers for their suggestions for this note.