## Abstract

Deep learning has traditionally been computationally expensive, and advances in training methods have been a prerequisite for expanding its application to a variety of image classification problems. In this letter, we address the problem of efficiently training convolutional deep belief networks by learning the weights in the frequency domain, which eliminates the time-consuming calculation of convolutions. An essential consideration in the design of the algorithm is to minimize the number of transformations to and from frequency space. We have evaluated the running time improvements using two standard benchmark data sets, showing a speed-up of up to 8 times on 2D images and up to 200 times on 3D volumes. Our training algorithm makes practical the training of convolutional deep belief networks on 3D medical images with a resolution of up to 128 × 128 × 128 voxels, which opens new directions for applying deep learning to medical image analysis.

## 1 Introduction

Deep learning has traditionally been computationally expensive, and advances in training methods have been a prerequisite for expanding its application to a variety of image classification problems. The development of layer-wise training methods (Hinton, Osindero, & Teh, 2006) greatly improved the efficiency of training deep belief networks (DBNs), which has made feasible the use of large sets of small images (e.g., 28 × 28), such as those used for handwritten digit recognition. Subsequently, new directions for speeding up the training of deep models were opened by the advance of programmable graphics cards (GPUs), which can perform thousands of operations in parallel. Raina, Madhavan, and Ng (2009) demonstrated that by using graphics cards, training of restricted Boltzmann machines (RBMs) on small image patches (e.g., 24 × 24) can be performed up to 70 times faster than on the CPU, facilitating the application to larger training sets. However, the number of trainable weights of a DBN increases greatly with the resolution of the training images, which can make training on large images impracticable. In order to scale DBNs to high-resolution images, Lee, Grosse, Ranganath, and Ng (2009, 2011) introduced the convolutional deep belief network (convDBN), a deep generative model that uses weight sharing to reduce the number of trainable weights. They showed that a convDBN can be used to classify images with a resolution of up to 200 × 200 pixels. To speed up the training of convolutional neural networks (CNNs) on high-resolution images, Krizhevsky, Sutskever, and Hinton (2012) replaced the traditional convolutions of the first layer of their CNN with strided convolutions, a type of convolution that shifts the filter kernel as a sliding window with a fixed step size, or stride, greater than one. Through the use of strided convolutions, the number of hidden units in each convolutional layer is greatly reduced, which reduces both training time and required memory.
Using a highly optimized GPU implementation of convolutions, they were able to train a CNN on images with a resolution of 256 × 256 pixels, achieving state-of-the-art performance on the ILSVRC-2010 and ILSVRC-2012 competitions (Krizhevsky et al., 2012). An alternative approach was proposed by Mathieu, Henaff, and LeCun (2014) who sped up the training of CNNs by calculating convolutions between batches of images and filters using fast Fourier transforms (FFTs), albeit at the cost of additional memory required for storing the filters.

In recent years, deep learning has shown great potential for medical image analysis. For example, it has been used for segmentation (Ciresan, Giusti, & Schmidhuber, 2012; Liao, Gao, Oto, & Shen, 2013), brain parcellation (Lee, Laine, & Klein, 2011), cancer detection (Cruz-Roa, Arevalo Ovalle, Madabhushi, & González, 2013), and deformable registration (Wu, Kim, Wang, & Gao, 2013). However, due to the high computational costs of training large networks, applications have been mostly limited to the analysis of 2D images (Ciresan et al., 2012; Cruz-Roa et al., 2013; Lee, Laine et al., 2011) or small 3D images, frequently extracted as patches from images that are too large to train on directly (Liao et al., 2013; Wu et al., 2013).

Recently we showed the potential of convDBNs for discriminating between Alzheimer’s and healthy subjects using their 3D MRI scans (Brosch & Tam, 2013). In this letter, we detail our training algorithm and GPU implementation in full, with a much more thorough analysis of the running time on high-resolution 2D images (512 × 512) and 3D volumes (128 × 128 × 128), showing speed-ups of up to 8-fold and 200-fold, respectively. Our proposed method performs training in the frequency domain, which replaces the calculation of time-consuming convolutions with simple element-wise multiplications while adding only a small number of FFTs. In contrast to similar FFT-based approaches (e.g., Mathieu et al., 2014), our method does not use batch processing of the images as a means to reduce the number of FFT calculations; rather it minimizes FFTs even when processing a single image, which significantly reduces the required amount of scarce GPU memory. We show that our method can be efficiently implemented on multiple graphics cards, further improving the run-time performance over other GPU-accelerated training methods. In addition, we formalize the expression of the strided convolutional DBN (sconvDBN), a type of convDBN that uses strided convolutions to speed up training and reduce memory requirements, in terms of stride-1 convolutions, which enables the efficient training of sconvDBNs in the frequency domain.

## 2 Algorithm

In this section, we briefly review the fundamentals of convDBNs (Lee et al., 2009), followed by an introduction to sconvDBNs. Furthermore, we show how an sconvDBN can be mapped to an equivalent convDBN in order to learn the parameters of the sconvDBN using our efficient training algorithm, which performs contrastive divergence (CD) (Hinton, 2002) in the frequency domain. (For a comprehensive introduction to DBNs and convDBNs, see Hinton et al., 2006, and Lee, Grosse et al., 2011, respectively.) In this section, for notational simplicity, we assume the input images to be square 2D images, but the model generalizes with little modification to nonsquare images of higher dimensions.

### 2.1 Convolutional Deep Belief Networks

A convDBN is composed of a stack of convolutional restricted Boltzmann machines (convRBMs). The joint probability of the visible units $\mathbf{v}$ and hidden units $\mathbf{h}$ of a convRBM is defined in terms of the energy $E$ of the model as

$$P(\mathbf{v}, \mathbf{h}) = \frac{1}{Z} \exp\big(-E(\mathbf{v}, \mathbf{h})\big),$$

where $Z$ is a normalization constant. If the visible and hidden units are binary stochastic units, the energy of a convRBM is defined as

$$E(\mathbf{v}, \mathbf{h}) = -\sum_{j=1}^{N_k} \sum_{i=1}^{N_c} \mathbf{h}_j \bullet (\tilde{\mathbf{w}}_{ij} * \mathbf{v}_i) - \sum_{j=1}^{N_k} c_j \sum_{x,y} h_{j,x,y} - \sum_{i=1}^{N_c} b_i \sum_{x,y} v_{i,x,y}.$$

The key terms and notation are defined in Table 1. At the first layer, the number of channels $N_c$ is one for gray-scale images and three for RGB color images. For subsequent layers, $N_c$ is equal to the number of filters of the previous layer.

**Table 1:** Key terms and notation.

| Symbol | Description |
|---|---|
| $\mathbf{w}_{ij}$ | Weights of the filter kernels connecting visible units with hidden units |
| $b_i$ | Bias terms of the visible units |
| $c_j$ | Bias terms of the hidden units |
| $N_c$ | Number of channels of the visible units |
| $N_v^2$ | Number of visible units per channel |
| $N_k$ | Number of filters and feature maps |
| $N_h^2$ | Number of hidden units per feature map |
| $N_w^2$ | Number of weights per filter kernel |
| $\bullet$ | Element-wise product followed by summation |
| $*$ | Valid convolution |
| $\tilde{\mathbf{w}}_{ij}$ | Horizontally and vertically flipped version of $\mathbf{w}_{ij}$ |


Note: For notational simplicity, we assume the input images to be square 2D images.
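As a concrete illustration of the energy function of a binary convRBM, the following numpy sketch (illustrative only, not the letter's GPU implementation) computes $E(\mathbf{v}, \mathbf{h})$ directly from its definition. It uses the identity that valid convolution with the flipped kernel $\tilde{\mathbf{w}}$ equals valid cross-correlation with $\mathbf{w}$.

```python
import numpy as np

def valid_corr2d(a, k):
    """'Valid' cross-correlation of image a with kernel k.

    Equivalent to the valid convolution of a with the flipped kernel,
    so valid_corr2d(v, w) computes the term (w~ * v) in the energy.
    """
    kh, kw = k.shape
    oh, ow = a.shape[0] - kh + 1, a.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for y in range(oh):
        for x in range(ow):
            out[y, x] = np.sum(a[y:y + kh, x:x + kw] * k)
    return out

def convrbm_energy(v, h, w, b, c):
    """Energy of a binary convRBM.

    v: visible units, shape (N_c, N_v, N_v)
    h: hidden units, shape (N_k, N_h, N_h) with N_h = N_v - N_w + 1
    w: filter kernels, shape (N_c, N_k, N_w, N_w)
    b, c: bias terms of the visible and hidden units
    """
    energy = 0.0
    for j in range(h.shape[0]):
        for i in range(v.shape[0]):
            # h_j . (w~_ij * v_i): correlation replaces flipped convolution
            energy -= np.sum(h[j] * valid_corr2d(v[i], w[i, j]))
    energy -= np.sum(c * h.sum(axis=(1, 2)))   # hidden bias term
    energy -= np.sum(b * v.sum(axis=(1, 2)))   # visible bias term
    return energy
```

For example, with all-ones visible units (1 × 4 × 4), hidden units (1 × 3 × 3), and a 2 × 2 all-ones filter, each correlation entry is 4 and the filter term contributes −36 to the energy.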

### 2.2 Strided Convolutional Deep Belief Networks

An sconvRBM with stride $s$ can be expressed as a convRBM with stride $s = 1$ by reorganizing the values of $\mathbf{v}$ and $\mathbf{w}$ to $\mathbf{v}'$ and $\mathbf{w}'$, as illustrated in Figure 1. This reindexing scheme allows the energy function to be expressed in terms of conventional (stride-1) convolutions, which facilitates training in the frequency domain. The new indices $(c', x', y')$ of $\mathbf{v}'$ and $\mathbf{w}'$ can be calculated from the old indices $(c, x, y)$ of $\mathbf{v}$ and $\mathbf{w}$ as follows:

$$x' = \left\lfloor \frac{x}{s} \right\rfloor, \qquad y' = \left\lfloor \frac{y}{s} \right\rfloor, \qquad c' = s^2 c + s\,(y \bmod s) + (x \bmod s).$$

After we reorganize $\mathbf{v}$ and $\mathbf{w}$ to $\mathbf{v}'$ and $\mathbf{w}'$, the energy of the model can be rewritten as

$$E(\mathbf{v}', \mathbf{h}) = -\sum_{j=1}^{N_k} \sum_{c=1}^{N'_c} \mathbf{h}_j \bullet (\tilde{\mathbf{w}}'_{cj} \circledast \mathbf{v}'_c) - \sum_{j=1}^{N_k} c_j \sum_{x,y} h_{j,x,y} - \sum_{c=1}^{N'_c} b_c \sum_{x,y} v'_{c,x,y},$$

where $\circledast$ denotes periodic convolution with the filter kernels zero-padded to the size of the images. The number of channels, number of visible units per channel, and number of weights per channel after reorganization are given by $N'_c = s^2 N_c$, $N'^2_v = (N_v / s)^2$, and $N'^2_w = (N_w / s)^2$, respectively. The posteriors $P(\mathbf{h} \mid \mathbf{v}')$ and $P(\mathbf{v}' \mid \mathbf{h})$ can be derived from the energy equation and are given by

$$P(h_{j,x,y} = 1 \mid \mathbf{v}') = \sigma\Big(\Big(\textstyle\sum_{c} \tilde{\mathbf{w}}'_{cj} \circledast \mathbf{v}'_c\Big)_{x,y} + c_j\Big),$$

$$P(v'_{c,x,y} = 1 \mid \mathbf{h}) = \sigma\Big(\Big(\textstyle\sum_{j} \mathbf{w}'_{cj} \circledast \mathbf{h}_j\Big)_{x,y} + b_c\Big),$$

where $\sigma$ is the sigmoid function defined as $\sigma(x) = (1 + e^{-x})^{-1}$, $x \in \mathbb{R}$. To train an sconvRBM on a set of images, the weights and bias terms can be learned by CD. During each iteration of the algorithm, the gradient of each parameter is estimated, and a gradient step with a fixed learning rate is applied. The gradient of the filter weights can be approximated by

$$\Delta w'_{cj} \propto \tilde{\mathbf{v}}'^{(0)}_c \circledast \mathbf{h}^{(0)}_j - \tilde{\mathbf{v}}'^{(1)}_c \circledast \mathbf{h}^{(1)}_j,$$

where $\mathbf{v}'^{(0)}$ are reindexed images from the training set, $\mathbf{h}^{(0)}$ and $\mathbf{v}'^{(1)}$ are samples drawn from $P(\mathbf{h} \mid \mathbf{v}'^{(0)})$ and $P(\mathbf{v}' \mid \mathbf{h}^{(0)})$, and $\mathbf{h}^{(1)} = P(\mathbf{h} = \mathbf{1} \mid \mathbf{v}'^{(1)})$. To apply the model to real-valued data like certain types of images, the visible units can be modeled as gaussian units. When the visible units are mean-centered and standardized to unit variance, the expectation of the visible units is given by

$$\mathrm{E}[\mathbf{v}'_c \mid \mathbf{h}] = \sum_{j} \mathbf{w}'_{cj} \circledast \mathbf{h}_j + b_c.$$

A binary hidden unit can encode only two states. In order to increase the expressive power of the hidden units, we use noisy rectified linear units as the hidden units, which have been shown to improve the learning performance of RBMs (Nair & Hinton, 2010). The hidden units can be sampled with

$$\mathbf{h}_j = \max\big(0, \boldsymbol{\mu}_j + \mathcal{N}(0, \sigma(\boldsymbol{\mu}_j))\big), \qquad \boldsymbol{\mu}_j = \sum_{c} \tilde{\mathbf{w}}'_{cj} \circledast \mathbf{v}'_c + c_j,$$

where $\mathcal{N}(0, \sigma(\boldsymbol{\mu}_j))$ denotes gaussian noise. The learning algorithm in the spatial domain is summarized in Figure 2a.
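The reorganization of a stride-$s$ image into $s^2$ stride-1 channels can be sketched as follows. This is a minimal numpy illustration with 0-based indices; the channel-index formula shown is one consistent choice, and the letter's exact convention (Figure 1) may differ.

```python
import numpy as np

def reindex_strided(v, s):
    """Reorganize one channel of size N_v x N_v into s**2 channels of
    size (N_v // s) x (N_v // s), so that a stride-s convolution can be
    expressed as a sum of s**2 stride-1 convolutions.

    Index mapping (0-based; one consistent choice):
        x' = x // s,  y' = y // s,  c' = s * (y % s) + (x % s)
    """
    n = v.shape[0]
    assert n % s == 0, "image size must be divisible by the stride"
    out = np.empty((s * s, n // s, n // s), dtype=v.dtype)
    for y in range(n):
        for x in range(n):
            out[s * (y % s) + (x % s), y // s, x // s] = v[y, x]
    return out
```

Applying the same mapping to the filters yields the reorganized model sizes $N'_c = s^2 N_c$, $N'_v = N_v / s$, and $N'_w = N_w / s$ used in the energy above.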

### 2.3 Efficient Training of convRBMs in the Frequency Domain

By the convolution theorem, a periodic convolution in the spatial domain is equivalent to an element-wise multiplication in the frequency domain:

$$\mathbf{a} \circledast \mathbf{b} = \mathcal{F}^{-1}(\hat{\mathbf{a}} \odot \hat{\mathbf{b}}),$$

where $\hat{\mathbf{x}} = \mathcal{F}(\mathbf{x})$ denotes the representation of $\mathbf{x}$ in the frequency domain, $\mathcal{F}^{-1}$ denotes the inverse Fourier transform, and $\odot$ denotes element-wise multiplication. The algorithm for approximating the gradient in the frequency domain is summarized in Figure 2b.
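The convolution theorem underlying this mapping can be illustrated with numpy's FFT routines. This is a minimal sketch, not the letter's GPU implementation; the kernel is zero-padded to the image size, as required for the periodic convolution used during training.

```python
import numpy as np

def periodic_conv_fft(a, k):
    """Periodic (circular) convolution of image a with kernel k via the
    convolution theorem: a (*) k = F^-1(F(a) . F(k)).

    The kernel is zero-padded to the image size before transforming, so
    both operands have the same frequency-domain shape.
    """
    k_pad = np.zeros(a.shape)
    k_pad[:k.shape[0], :k.shape[1]] = k
    return np.real(np.fft.ifft2(np.fft.fft2(a) * np.fft.fft2(k_pad)))
```

In the training algorithm itself, the transforms are performed once and the intermediate results stay in the frequency domain, so only the element-wise product remains in the inner loop.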

The only operations that cannot be directly mapped to the frequency domain are the calculation of the maximum function, the generation of gaussian noise, and the trimming of the filter kernels. To perform the first two operations, an image needs to be mapped to the spatial domain and back. However, these operations need only be calculated once per feature map per iteration and are therefore not a significant contributor to the total running time. Because the filter kernels are padded to the input image size, the size of the learned filter kernels must be explicitly enforced by trimming. This is done by transferring the filter kernels to the spatial domain, setting the values outside the specified filter kernel size to zero, and then transforming the filter kernels back to the frequency domain. This procedure needs to be performed only once per mini-batch. Since the number of mini-batches is relatively small compared to the number of training images, trimming of the filter kernels also does not add significantly to the total running time of the training algorithm.
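The trimming step can be sketched as follows (an illustrative numpy version; the function name and the convention of storing the kernel in the top-left corner are assumptions, not the letter's code):

```python
import numpy as np

def trim_filter(w_hat, n_w):
    """Enforce an n_w x n_w spatial support on a filter stored in the
    frequency domain: transform to the spatial domain, zero all weights
    outside the kernel region, and transform back.
    """
    w = np.real(np.fft.ifft2(w_hat))   # padded filter in the spatial domain
    w[n_w:, :] = 0.0                   # zero rows outside the kernel
    w[:, n_w:] = 0.0                   # zero columns outside the kernel
    return np.fft.fft2(w)              # back to the frequency domain
```

Because this round trip happens once per mini-batch rather than once per image, its two FFTs contribute little to the total running time.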

### 2.4 GPU Implementation and Memory Considerations

To further reduce training times, the algorithm can be efficiently implemented on graphics cards, which can perform hundreds of operations in parallel. To optimize efficiency, it is crucial to maximize the utilization of the large number of available GPU cores while minimizing the required amount of GPU memory. Because our algorithm requires only a relatively small number of FFT calculations per iteration, the computational bottleneck is the calculation of the element-wise operations, which can be performed in parallel. We therefore distribute the processing of a single image over many independent threads, with one thread per element in the frequency domain. The large number of parallel threads results in high utilization of the GPU even when the images of a mini-batch are processed sequentially, which we do because it greatly reduces the amount of GPU memory required to store the visible and hidden units compared to processing batches of images in parallel.

If the size of an sconvDBN exceeds the available amount of memory of a single GPU, training can be distributed by assigning different channels to separate GPUs as illustrated in Figure 3. The inference of the hidden units of different channels and the estimation of filter increments of different filters are independent of each other. Therefore, the processing and storing of filters, filter increments, and hidden units can be distributed over multiple GPUs without the need to transfer memory content between GPUs, which would decrease the GPU utilization. To infer the reconstructed visible units , each GPU calculates a separate reconstruction from a distinct set of filters in parallel. Afterward, the separate reconstructions are added up sequentially to create the complete reconstruction, which requires the synchronization of GPUs and consequently decreases the GPU utilization. However, this operation is relatively fast compared to the large number of filtering operations and therefore does not add significantly to the total running time.
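The multi-GPU reconstruction step can be sketched in a single process as follows, with devices simulated by array partitions (an illustrative numpy sketch; the actual implementation places each partition on a physical GPU):

```python
import numpy as np

def reconstruct_distributed(h_hat, w_hat, n_devices):
    """Sketch of the distributed reconstruction step.

    Each 'device' holds a distinct subset of the filters and their
    feature maps and computes a partial reconstruction independently;
    the partials are then summed sequentially, which is the only step
    that requires synchronization between devices.

    h_hat: (N_k, n, n) feature maps in the frequency domain
    w_hat: (N_k, n, n) padded filters in the frequency domain
    """
    subsets = np.array_split(np.arange(h_hat.shape[0]), n_devices)
    # independent per-device work: element-wise products over its filters
    partials = [np.sum(h_hat[idx] * w_hat[idx], axis=0) for idx in subsets]
    recon_hat = np.zeros_like(partials[0])
    for p in partials:                 # sequential, synchronized summation
        recon_hat += p
    return recon_hat
```

Only the final summation crosses device boundaries, which is why the filters, filter increments, and hidden units never need to be transferred between GPUs.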

For spatial-domain training with batched convolutions (Krizhevsky, 2012), the amount of memory required for storing the visible and hidden units scales with the number of images per mini-batch, $N_b$. Depending on the choice of the batch size, this method requires more memory for storing the visible and hidden units than our approach while requiring less memory for storing the filters. Alternatively, Mathieu et al. (2014) proposed speeding up the training of CNNs by calculating convolutions between batches of images and filters using FFTs; this method stores the visible units, hidden units, and filters of an entire batch in the frequency domain, which further increases its memory requirements.

Table 2 shows a comparison of the memory per GPU required for storing the key variables when training the network used in previous work by Krizhevsky et al. (2012). For the first layer, a comparison with Mathieu et al.'s (2014) training method could not be performed because that method does not support strided convolutions, which would significantly reduce the memory required for storing the hidden units. In all layers, the proposed approach compensates for its increased filter memory requirements by processing one image at a time rather than a batch, and it still outperforms batched learning in the spatial domain in terms of speed (see section 3).

**Table 2:** Memory per GPU required for storing the key variables.

| Layer | $N_b$ | $N_v$ | $N_c$ | $N_w$ | $N_k$ | $s$ | Freq (MB) | Spat (MB) | B-FFT (MB) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 128 | 224 | 3 | 11 | 48 | 4 | 28.7 | 147.1 | — |
| 2 | 128 | 55 | 48 | 5 | 128 | 1 | 72.9 | 260.5 | 330.9 |
| 3 | 128 | 27 | 128 | 3 | 192 | 1 | 69.2 | 114.8 | 182.3 |
| 4 | 128 | 13 | 192 | 3 | 192 | 1 | 24.0 | 33.0 | 55.5 |
| 5 | 128 | 13 | 192 | 3 | 128 | 1 | 16.1 | 27.3 | 42.3 |


Notes: A comparison with Mathieu et al.’s method could not be made for the first layer because that method does not support strided convolutions. In all layers, our method consumes less memory for storing the key variables than the other two methods.

## 3 Evaluation

To demonstrate where the performance gains are produced, we trained a two-layer sconvDBN on 2D and 3D images using our frequency-domain method and three methods that compute convolutions on the GPU using different approaches:

1. Our spatial domain implementation, which convolves a single 2D or 3D image with a single 2D or 3D filter kernel at a time.
2. Krizhevsky's spatial domain convolution implementation (Krizhevsky, 2012), a widely used method (Hinton & Srivastava, 2012; Scherer, Müller, & Behnke, 2010; Zeiler & Fergus, 2013) that calculates the convolutions of batches of 2D images and 2D filter kernels in parallel. This method cannot be applied to 3D images, so it was used only for the 2D experiments.
3. Our implementation that calculates convolutions using FFTs but maps no other operations to frequency space, so the algorithm does not stay in the frequency domain between convolutions.

The parameters that we used for training convRBMs on 2D and 3D images are summarized in Table 3. The key parameters that we varied for our experiments are the filter size and stride size of the first layer and the filter size and the number of channels of the second layer. Because the number of channels of the second layer is equal to the number of filters of the first layer, we also varied the number of filters of the first layer in order to attain the desired number of channels. For all implementations, the training time is directly proportional to the number of filters; therefore, a detailed comparison of all four methods with a varying number of filters is not included in this letter. The hardware details of our test environment are summarized in Table 4.

**Table 3:** Training parameters used for the 2D (ImageNet) and 3D (OASIS) experiments.

| Parameter | ImageNet (2D), First Layer | ImageNet (2D), Second Layer | OASIS (3D), First Layer | OASIS (3D), Second Layer |
|---|---|---|---|---|
| Filter size | 5 to 52 | 5 to 13 | 3 to 36 | 3 to 9 |
| Stride size | 1, 2, 4 | 1 | 1, 2, 4 | 1 |
| Number of channels | 3 | 16, 32, 64 | 1 | 8, 16, 32 |
| Number of filters | 32 | 32 | 32 | 16 |
| Image dimension | 512 × 512 | | 128 × 128 × 128 | |
| Number of images | 256 | 256 | 100 | 100 |
| Batch size | 128 | 128 | 25 | 25 |


**Table 4:** Hardware details of the test environment.

| Component | Specification |
|---|---|
| Processor | Intel i7-3770 CPU @ 3.40 GHz |
| CPU memory | 8 GB |
| Graphics card | NVIDIA GeForce GTX 660 |
| GPU cores | 960 cores @ 1.03 GHz |
| GPU memory | 2 GB |


For the comparison on 2D images, we used a data set of 256 natural color images from the ImageNet data set (Deng et al., 2009). All images were resampled to a resolution of 512 × 512 pixels per color channel. For the evaluation on 3D images, we used 100 magnetic resonance images (MRIs) of the brain from the OASIS data set (Marcus et al., 2007). We resampled all volumes to a resolution of 128 × 128 × 128 voxels and a voxel size of 2 mm × 2 mm × 2 mm.

### 3.1 Running Time Analysis on 2D Color Images (ImageNet)

Figure 4a shows a comparison of running times for training the first sconvRBM layer on 256 images with varying filter and stride sizes. Due to internal limitations of Krizhevsky's convolution implementation, it cannot be applied to images with a resolution of 512 × 512 pixels when the stride size is smaller than four, so those comparisons could not be made. Our frequency domain implementation is 2 to 24 times faster than our convolution implementation, with the speed gains being larger for larger filter and stride sizes. For a stride of one, the impact of the convolution implementation on the total running time is relatively low because the computational bottleneck is the inference and sampling of the hidden units. As the number of hidden units decreases with larger strides, the running time becomes more dependent on the time spent calculating convolutions; hence, the differences between the four methods are more pronounced for larger strides. For a stride of four, training in the frequency domain is 8 to 24 times faster than training in the spatial domain using our convolution implementation and 2 to 7 times faster than using batched convolutions. Calculating convolutions by FFTs is the slowest method for all stride sizes and 2D filter sizes up to 44, largely due to the cost of calculating the Fourier transforms.

Figure 4b shows a similar comparison for training the second convRBM layer for a stride size of one and varying filter sizes and numbers of channels. In contrast to training the first layer, training times mostly depend on the calculation of convolutions, where the impact of calculating convolutions on the total running time increases with the increasing number of channels. Training in the frequency domain is 5 to 26 times faster than training in the spatial domain using single-image convolutions and 2 to 8 times faster than using batched convolutions. For all channel sizes, batched training is about 3 to 4 times faster than nonbatched training, and calculating convolutions using FFTs is much slower than batched training and training in the frequency domain. To summarize, training of 2D images in the frequency domain is much faster than training in the spatial domain even for small filter sizes. Using the largest filter kernels in both layers, we show that the proposed method yields a speed-up of 7 to 8 times compared to state-of-the-art GPU implementations.

### 3.2 Running Time Analysis on 3D Volumes (OASIS)

Figure 5 shows the comparison of running times for training a first- and second-layer sconvRBM on 3D volumes for varying filter sizes, stride sizes, and varying numbers of channels. In contrast to training on 2D images, the computational costs of calculating 3D convolutions break even with calculating FFTs for small filter sizes, because the number of multiplications and additions per convolution increases cubically instead of quadratically with the filter kernel size. As a result, simply training by convolutions in the frequency domain is faster than in the spatial domain. However, our proposed training algorithm still outperforms both other methods, even at the smallest filter size. For filter sizes of five and larger, our frequency domain implementation is 3.5 to 200 times faster than our spatial domain implementation using single-image convolutions and 2.7 to 17 times faster than calculating convolutions by FFTs. Similar to the results on 2D images, training times of the first layer using a stride of one depend strongly on the time required to calculate the expectation of the hidden units and sample the hidden units. Hence, performance improvements of our frequency domain method are more pronounced for larger strides and numbers of channels, where the impact of calculating convolutions on the total training time is also larger. This makes the proposed method particularly suitable for training sconvRBMs on high-resolution 3D volumes.

## 4 Conclusion

We have presented a fast training method for convolutional models that performs training in the frequency domain in order to replace the time-consuming computation of convolutions with simple element-wise multiplications. We have shown that it is also essential to map the other operations to frequency space wherever possible to minimize the number of Fourier transforms and that this greatly decreases training times over performing only the convolutions in the frequency domain. In addition, our method can be efficiently implemented on the GPU and is faster than a highly optimized GPU implementation of batched convolutions in the spatial domain. We have evaluated the running time improvements using two standard benchmark data sets, showing a speed-up of up to 8 times on 2D images from the ImageNet data set and up to 200 times on 3D volumes from the OASIS data set.

## Acknowledgments

This work was supported by the Natural Sciences and Engineering Research Council of Canada and the Milan and Maureen Ilich Foundation. In addition, data used in the evaluation of this work were provided by the Open Access Series of Imaging Studies (OASIS), which was funded by grants P50 AG05681, P01 AG03991, R01 AG021910, P20 MH071616, and U24 RR021382. We thank the anonymous reviewers for many helpful comments that improved the letter.