Abstract
The convolutional neural network (CNN), one of the deep learning models, has demonstrated outstanding performance in a variety of computer vision tasks. However, as network architectures become deeper and more complex, designing CNN architectures requires more expert knowledge and trial and error. In this article, we attempt to automatically construct high-performing CNN architectures for a given task. Our method uses Cartesian genetic programming (CGP) to encode the CNN architectures, adopting highly functional modules, such as a convolutional block and tensor concatenation, as the node functions in CGP. The CNN structure and connectivity represented by the CGP encoding are optimized to maximize accuracy using the evolutionary algorithm. We also introduce simple techniques to accelerate the architecture search: rich initialization and early termination of network training. We evaluated our method on the CIFAR-10 and CIFAR-100 datasets, achieving performance competitive with state-of-the-art models. Remarkably, our method can find competitive architectures at a reasonable computational cost compared with other automatic design methods that require considerably more computational time and machine resources.
1 Introduction
Deep learning, a machine learning approach using deep neural networks, is becoming popular for solving artificial intelligence tasks. Deep neural networks (DNNs) have been successful in various tasks such as image recognition (LeCun et al., 1998; Krizhevsky et al., 2012), speech recognition (Hinton et al., 2012), and reinforcement learning tasks (Mnih et al., 2013, 2015). Particularly, convolutional neural networks (CNNs) (LeCun et al., 1998) have demonstrated outstanding performance on image recognition tasks in the last few years and have been applied to a variety of computer vision applications (Vinyals et al., 2015; Zhang et al., 2016). A commonly used CNN architecture consists of a series of convolution and pooling layers followed by fully connected layers. Several recent studies have demonstrated significant progress by developing CNN architectures. Powerful CNN models (e.g., GoogLeNet (Szegedy et al., 2015), ResNet (He et al., 2016), and DenseNet (Huang et al., 2017)) have continued to show considerable improvement over the years, achieving state-of-the-art results on various benchmark datasets.
Despite their success, designing CNN architectures is still a difficult task because many design parameters exist, such as the depth of a network, the type and parameters of each layer, and the connectivity of the layers. As the successful networks have become deeper and more complex, they require a greater number of structural hyperparameters to be tuned to achieve the best performance on a specific dataset. Therefore, considerable trial and error or expert knowledge is required to construct suitable architectures for a target task. Considering this situation, automatic design methods for CNN architectures are highly beneficial.
Neural network architecture design can be viewed as a model selection problem in machine learning. The straightforward approach is to treat architecture design as a hyperparameter optimization problem: the hyperparameters regarding the network structure, such as the numbers of layers and neurons, are optimized using techniques based on Bayesian optimization or evolutionary computation (Snoek et al., 2012; Loshchilov and Hutter, 2016) to improve performance on the validation dataset.
Evolutionary computation has been traditionally applied to optimize both the network topology and the connection weights (Schaffer et al., 1992; Stanley and Miikkulainen, 2002). There are two types of encoding schemes for network representations: direct and indirect encoding (Yao, 1999). Direct encoding represents the number and connectivity of neurons directly as the genotype, whereas indirect encoding represents a generation rule for network architectures. Although almost all traditional approaches optimize the number and connectivity of low-level neurons, modern neural network architectures for deep learning have many units and various types of units to be optimized. Optimizing so many structural parameters in a reasonable amount of computational time may be difficult for traditional evolutionary neural network methods. A promising approach is the use of highly functional modules as a minimum unit to reduce the search space of deep architectures.
In this article, we attempt to design CNN architectures based on genetic programming (GP). We use the Cartesian genetic programming (CGP) (Miller and Thomson, 2000; Harding, 2008; Miller and Smith, 2006) encoding scheme, a direct encoding scheme, to represent the CNN structure and connectivity. As we aim to search for CNN architectures, the phenotype of GP should be a network structure; therefore, we adopt a graph-based GP rather than a tree-based GP. The advantage of the CGP-based method is its simplicity and flexibility: it can represent variable-length network structures, including skip connections, and encode the network as a fixed-length string. To reduce the search space, we also adopt relatively highly functional modules operating on three-dimensional tensors (e.g., a convolutional block and tensor concatenation) as the node functions in CGP. For instance, one of our node functions consists of a convolutional layer, batch normalization, and a rectified linear unit (ReLU), rather than a single function or a low-level neuron. To evaluate an architecture represented by CGP, we train the network on a training dataset with a usual stochastic gradient descent method, and then the performance on another training dataset is assigned as the fitness of the architecture. We call the former dataset the model training dataset and the latter the architecture evaluation dataset. Based on this fitness function, an evolutionary algorithm optimizes the CNN architectures. We evaluate our method on the CIFAR-10 and CIFAR-100 datasets (Krizhevsky and Hinton, 2009). The experimental results show that our method can find competitive CNN architectures at a reasonable computational cost compared with state-of-the-art models.
This article expands on the work in Suganuma et al. (2017). In this article, we additionally propose simple speed-up techniques for an architecture search and add the result for the CIFAR-100 dataset.
The rest of this article is organized as follows. The next section presents related work on the neural network architecture design as well as their limitations. In Section 3, we describe our genetic programming approach to designing CNN architectures. We test the performance of our method in Section 4. Finally, in Section 5, we describe our conclusion and future work.
2 Related Work
This section reviews related work on automatic neural network architecture design. We roughly divide the previous studies into two categories: optimization of learning and structural parameters, and optimization of neural network architectures. The former indicates the traditional hyperparameter optimization approach, whereas the latter aims to search for flexible architectures and is more relevant to this article.
2.1 Optimization of Learning and Model Parameters
The hyperparameters in machine learning methods, such as learning rate, regularization coefficients, and the number of neurons in neural networks, should be tuned to improve predictive performance. In general, we cannot obtain gradients of such hyperparameters. The naive methods for hyperparameter optimization are the grid search and the random search (Bergstra and Bengio, 2012). A more sophisticated approach is to use the sequential model-based global optimization methods such as Bayesian optimization (Snoek et al., 2012). Bayesian optimization is a global optimization method of black-box and noisy objective functions, and it maintains a surrogate model learned by using previously evaluated solutions. A Gaussian process is usually adopted as the surrogate model, which can easily handle the uncertainty and noise of the objective function.
Snoek et al. (2012) optimized nine hyperparameters in CNN, such as the number of epochs and the learning rate, and showed that an automatically tuned network outperforms networks tuned by a human expert. Snoek et al. (2015) succeeded in improving the scalability of hyperparameter search by using a deep neural network instead of the Gaussian process to reduce the computational cost for surrogate model building, and they optimized the learning hyperparameters of the fixed CNN architecture. Bergstra et al. (2011) proposed the tree-structured Parzen estimator (TPE) and showed better results than manual search and random search. They also proposed a meta-modeling approach (Bergstra et al., 2013) based on the TPE for supporting automatic hyperparameter optimization. Hutter et al. (2011) proposed an algorithm called sequential model-based algorithm configuration (SMAC) for general algorithm configuration. SMAC adopts a random forest as the surrogate model instead of a Gaussian process. Several studies optimized the hyperparameters of DNNs using SMAC-based methods (e.g., Domhan et al., 2015).
The hyperparameter optimization approach often tunes the learning parameters (e.g., learning rate, mini-batch size, and regularization parameters) and the predefined structural parameters (e.g., the numbers of layers and neurons, and the type of activation functions). In general, this approach requires predefined architectures, which means it is hard to design flexible architectures from scratch.
2.2 Optimization of Neural Network Architectures
2.2.1 Reinforcement Learning Approach
The automatic design of deep neural network architectures using reinforcement learning has been attempted recently (Zoph and Le, 2017; Baker et al., 2017). In Zoph and Le (2017), a recurrent neural network (RNN) was used to generate neural network architectures, and the RNN was trained with reinforcement learning to maximize the expected accuracy on a learning task. This method uses distributed training and asynchronous parameter updates with 800 graphics processing units (GPUs) to accelerate the reinforcement learning process. Baker et al. (2017) proposed a meta-modeling approach based on reinforcement learning to produce CNN architectures. A Q-learning agent explores and exploits a space of model architectures with an ε-greedy strategy and experience replay. Additionally, this method optimizes the number of layers, which is a useful feature because it allows the system to design suitable architectures for a target dataset.
From the evolutionary computation perspective, these methods adopt an indirect encoding scheme for network representation because they train generative rules for network architectures. These methods have succeeded in constructing competitive CNN architectures for image classification tasks. Unlike these methods, our approach uses direct encoding based on Cartesian genetic programming to design the CNN architectures, as described in Section 3. In addition, we introduce relatively highly functional modules, such as a convolutional block and tensor concatenation, to efficiently find better CNN architectures.
2.2.2 Evolutionary Computation Approach
Evolutionary algorithms have long been applied to optimizing neural network architectures (Schaffer et al., 1992; Stanley and Miikkulainen, 2002). Evolutionary neural network methods optimize the connection weights and/or the network structure of low-level neurons with an evolutionary algorithm. Morse and Stanley (2016) compared an evolutionary algorithm, stochastic gradient descent, and RMSProp for optimizing the weights of neural networks. The evolutionary algorithm showed competitive performance, but the networks in that experiment had only 1,500 weights, which is quite small compared to the modern deep neural networks used for computer vision tasks. Sun et al. (2018) proposed a neural network training method for unsupervised DNNs, such as the auto-encoder and the restricted Boltzmann machine, that combines the genetic algorithm and stochastic gradient descent; the method, however, was not applied to CNNs. Verbancsics and Harguess (2013, 2015) optimized the weights of artificial neural networks and CNNs by using the hypercube-based neuroevolution of augmenting topologies (HyperNEAT) (Stanley et al., 2009). However, to the best of our knowledge, methods based on HyperNEAT have not achieved performance competitive with the state-of-the-art methods. These methods seem difficult to scale to recent deep neural networks, which have a large number of neurons and connections.
To deal with large-scale architectures, an approach combining back-propagation and evolution is promising. In this approach, neural network architectures are designed by the evolutionary algorithm, and the network weights are optimized by a stochastic gradient descent method through back-propagation. Compared to the hyperparameter optimization approach, this approach can design more flexible architectures from scratch by using the evolutionary algorithm.
Real et al. (2017) optimized large-scale neural networks by using an evolutionary algorithm and achieved better performance than modern CNNs in image classification tasks. In this method, the CNN architecture is represented as a graph structure and optimized by the evolutionary algorithm. The connection weights of the reproduced architecture are optimized by stochastic gradient descent, as in usual neural network training, and the accuracy on the architecture evaluation dataset is assigned as the fitness. The individuals are initialized as small networks and become larger through evolution. However, this method was run on 250 computers and required approximately 10 days for optimization.
Miikkulainen et al. (2017) proposed a method called CoDeepNEAT, which is an extended version of NEAT, and showed good performance in image classification and image captioning tasks. This method designs the network architectures using blueprints and modules. The blueprint chromosome is a graph where each node has a pointer to a particular module species. Each module chromosome is a graph that represents a small DNN. Specifically, each node in the blueprint is replaced with a module selected from a particular species to which that node points. During the evaluation phase, the modules and blueprints are combined to generate assembled networks, and the networks are evaluated.
Xie and Yuille (2017) designed CNN architectures using the genetic algorithm with a binary string representation. They proposed a method for encoding a network structure, where the connectivity of each layer is defined by a binary string representation. The type of each layer, the number of channels, and the size of a receptive field are not evolved in this method.
These studies focus on designing large and flexible network architectures. In general, these methods require considerable computational cost to optimize the neural network architectures because they need to train the connection weights to calculate the fitness of each architecture. In this work, we propose using Cartesian genetic programming (CGP) to represent deep neural network architectures and adopting highly functional modules as the node functions to reduce the search space. In addition, we introduce simple techniques to reduce the computational cost of the architecture search.
3 Designing CNN Architectures Using Cartesian Genetic Programming
In our method, we directly encode the CNN architectures based on CGP (Miller and Thomson, 2000; Miller and Smith, 2006; Harding, 2008) and use highly functional modules as node functions. The CNN architecture defined by CGP is trained by stochastic gradient descent on a model training dataset, and the fitness value is assigned based on the accuracy on another training dataset (i.e., the architecture evaluation dataset). The architecture is then optimized to maximize the accuracy on the architecture evaluation dataset using the evolutionary algorithm. Figure 1 illustrates an overview of the proposed method. In this section, we describe the network representation and the evolutionary algorithm used in the proposed method. Additionally, we explain the simple speed-up techniques for the architecture search.
3.1 Representation of CNN Architectures
For the CNN architecture representation, we use the CGP encoding scheme, which represents the architecture of a CNN as a directed acyclic graph defined on a two-dimensional grid. CGP was proposed as a general form of genetic programming in Miller and Thomson (2000), and it is called Cartesian because CGP represents a program using a two-dimensional grid of nodes. The graph corresponding to a phenotype is encoded as a string called a genotype and optimized by the evolutionary algorithm.
Let us assume that the grid has N_r rows and N_c columns; then the number of intermediate nodes is N_r × N_c, and the numbers of inputs and outputs depend on the task. The genotype consists of a string of integers of fixed length, and each gene determines the function type of a node and the connections between nodes. A node in the c-th column is only allowed to take its inputs from nodes in the (c − l)-th to (c − 1)-th columns, where l is called the levels-back parameter. Figure 2 shows an example of the genotype, the phenotype, and the corresponding CNN architecture. As seen in Figure 2, the CGP encoding scheme allows some nodes to remain unconnected to the output nodes (e.g., node No. 5 in Figure 2). We call these nodes inactive nodes. Whereas the genotype in CGP is a fixed-length representation, the number of nodes in the phenotypic network varies because of the inactive nodes, which is a desirable feature because the number of layers can be determined by the evolutionary algorithm.
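To make the encoding concrete, the following minimal sketch shows how a genotype of this kind can be decoded into its active nodes. The gene layout assumed here (one function gene and two connection genes per intermediate node, plus an output gene) is an illustrative assumption rather than the exact representation used in our implementation.

```python
# Minimal sketch of CGP decoding: only nodes reachable from the output
# (the "active" nodes) become layers of the corresponding CNN.

def decode_active_nodes(genotype, n_inputs):
    """Return the indices of the active intermediate nodes."""
    *node_genes, output_gene = genotype
    active, stack = set(), [output_gene]
    while stack:                          # backward traversal from the output
        idx = stack.pop()
        if idx < n_inputs or idx in active:
            continue                      # program inputs carry no genes
        active.add(idx)
        _, in1, in2 = node_genes[idx - n_inputs]
        stack.extend([in1, in2])
    return active

# One input node (index 0) and three intermediate nodes (indices 1-3).
# Node 2 is never referenced on the path to the output, so it is inactive.
genotype = [(0, 0, 0),   # node 1: function 0, both connections to the input
            (1, 0, 1),   # node 2: inactive in this example
            (2, 1, 0),   # node 3: takes node 1 and the input
            3]           # output gene: the output reads from node 3
print(sorted(decode_active_nodes(genotype, n_inputs=1)))   # [1, 3]
```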
Referring to modern CNN architectures, we select highly functional modules as the node functions. The most frequently used operations in a CNN are convolution and pooling: convolution uses local connectivity and spatially shares learnable weights, and pooling performs nonlinear downsampling. We prepare six types of node functions, called ConvBlock, ResBlock, max pooling, average pooling, concatenation, and summation. These nodes operate on a three-dimensional (3-D) tensor (also known as a feature map) defined by the dimensions of rows, columns, and channels.
The ConvBlock consists of a convolutional layer with a stride of one followed by batch normalization (Ioffe and Szegedy, 2015) and a rectified linear unit (ReLU) (Nair and Hinton, 2010). To maintain the size of the input, we pad the input with zeros around the border before the convolution operation. Therefore, the ConvBlock takes an M × N × C tensor as input and produces an M × N × C' tensor, where M, N, C, and C' are the numbers of rows, columns, input channels, and output channels, respectively. We prepare several ConvBlocks with different numbers of output channels and receptive field sizes (kernel sizes) in the function set of CGP.
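As an illustration only (our implementation uses Chainer), the ConvBlock can be sketched in PyTorch-style code as follows; the zero padding of ⌊k/2⌋ combined with a stride of one is what preserves the spatial size.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Convolution (stride 1, zero padding) -> batch normalization -> ReLU.
    The spatial size is preserved; only the channel count may change."""
    def __init__(self, in_channels, out_channels, kernel_size):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size,
                              stride=1, padding=kernel_size // 2)
        self.bn = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.bn(self.conv(x)))

# A 32x32 input keeps its spatial size while the channels change from 3 to 64.
x = torch.randn(8, 3, 32, 32)
print(ConvBlock(3, 64, kernel_size=3)(x).shape)   # torch.Size([8, 64, 32, 32])
```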
As shown in Figure 3, the ResBlock is composed of a ConvBlock, batch normalization, a ReLU, and tensor summation. The ResBlock is a building block of modern successful CNN architectures, for example, He et al. (2016), Zagoruyko and Komodakis (2016), and Kupyn et al. (2018). Following this recent trend in human architecture design, we decided to use the ResBlock as a building block in our method. The ResBlock performs identity mapping through the shortcut connection, as described in He et al. (2016). The row and column sizes of the input are preserved in the same way as in the ConvBlock after convolution. As shown in Figure 3, the output feature maps of the ResBlock are calculated by the ReLU activation and the summation with the input. The ResBlock takes an M × N × C tensor as input and produces an M × N × C' tensor. We prepare several ResBlocks with different numbers of output channels and receptive field sizes (kernel sizes) in the function set of CGP.
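A corresponding sketch of the ResBlock is given below. The exact ordering of the inner layers and the zero-padding of the shortcut when the channel counts differ are assumptions made for this illustration, not a reproduction of our implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResBlock(nn.Module):
    """One plausible reading of the ResBlock: ConvBlock -> convolution ->
    batch normalization, summed with the shortcut input and passed through
    ReLU. Assumes out_channels >= in_channels for the shortcut padding."""
    def __init__(self, in_channels, out_channels, kernel_size):
        super().__init__()
        pad = kernel_size // 2
        self.convblock = nn.Sequential(              # ConvBlock: conv-BN-ReLU
            nn.Conv2d(in_channels, out_channels, kernel_size, padding=pad),
            nn.BatchNorm2d(out_channels), nn.ReLU(inplace=True))
        self.conv = nn.Conv2d(out_channels, out_channels, kernel_size,
                              padding=pad)
        self.bn = nn.BatchNorm2d(out_channels)

    def forward(self, x):
        h = self.bn(self.conv(self.convblock(x)))
        if h.shape[1] > x.shape[1]:                  # zero-pad shortcut channels
            x = F.pad(x, (0, 0, 0, 0, 0, h.shape[1] - x.shape[1]))
        return F.relu(h + x)

# Spatial size is preserved; 3 input channels grow to 64 output channels.
print(ResBlock(3, 64, kernel_size=3)(torch.randn(8, 3, 32, 32)).shape)
```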
The max and average poolings perform a max and an average operation, respectively, over the local neighborhood of the feature maps. We use pooling with a 2 × 2 receptive field size and a stride of two. The pooling layer takes an M × N × C tensor and produces an M' × N' × C tensor, where M' = ⌊M/2⌋ and N' = ⌊N/2⌋.
The concatenation function takes two feature maps and concatenates them in the channel dimension. When concatenating feature maps with different numbers of rows and columns, we downsample the larger feature map by max pooling so that the two inputs have the same spatial size. Let us assume that we have two inputs of size M_1 × N_1 × C_1 and M_2 × N_2 × C_2; then the size of the output feature map is min(M_1, M_2) × min(N_1, N_2) × (C_1 + C_2).
The summation performs the element-wise addition of two feature maps, channel by channel. As with the concatenation, when summing two feature maps with different numbers of rows and columns, we downsample the larger feature map by max pooling. In addition, if the inputs have different numbers of channels, we expand the channels of the feature map with the smaller channel count by filling with zeros. Let us assume that we have two inputs of size M_1 × N_1 × C_1 and M_2 × N_2 × C_2; then the size of the output feature map is min(M_1, M_2) × min(N_1, N_2) × max(C_1, C_2). In Figure 2, the summation node applies max pooling to downsample the first input to the same size as the second input. By using the summation and concatenation operations, our method can express shortcut connections or branching layers, such as those used in GoogLeNet (Szegedy et al., 2015) and the residual network (ResNet) (He et al., 2016).
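The following sketch illustrates how the concatenation and summation nodes can reconcile mismatched input shapes; the repeated 2 × 2 max pooling and the assumption of square feature maps whose sizes differ by powers of two are simplifications made for illustration.

```python
import torch
import torch.nn.functional as F

def _match_spatial(a, b):
    """Downsample the spatially larger feature map by 2x2 max pooling until
    both inputs have the same rows/columns (assumes square maps whose sizes
    differ by a power of two; a simplification for this sketch)."""
    while a.shape[2] > b.shape[2]:
        a = F.max_pool2d(a, kernel_size=2, stride=2)
    while b.shape[2] > a.shape[2]:
        b = F.max_pool2d(b, kernel_size=2, stride=2)
    return a, b

def concat_node(a, b):
    a, b = _match_spatial(a, b)
    return torch.cat([a, b], dim=1)              # channels add up: C1 + C2

def sum_node(a, b):
    a, b = _match_spatial(a, b)
    diff = a.shape[1] - b.shape[1]
    if diff > 0:                                 # zero-pad the smaller channels
        b = F.pad(b, (0, 0, 0, 0, 0, diff))
    elif diff < 0:
        a = F.pad(a, (0, 0, 0, 0, 0, -diff))
    return a + b                                 # channels: max(C1, C2)

x = torch.randn(1, 32, 16, 16)
y = torch.randn(1, 64, 8, 8)
print(concat_node(x, y).shape)   # torch.Size([1, 96, 8, 8])
print(sum_node(x, y).shape)      # torch.Size([1, 64, 8, 8])
```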
The output node represents the softmax function to produce a distribution over the target classes. The outputs fully connect to all elements of the input. The node functions used in the experiments are displayed in Table 1.
| Node type | Symbol | Variation |
|---|---|---|
| ConvBlock | CB(C', k) |  |
| ResBlock | RB(C', k) |  |
| Max pooling | MP | — |
| Average pooling | AP | — |
| Concatenation | Concat | — |
| Summation | Sum | — |

C': number of output channels
k: receptive field size (kernel size)
3.2 Evolutionary Algorithm
Following standard CGP, we use point mutation as the genetic operator. The function and the connections of each node are randomly changed to valid values according to the mutation rate. The fitness evaluation of a CNN architecture involves the CNN training and requires approximately 0.5 to 1 hour in our setting. Therefore, we need to evaluate several candidate solutions in parallel at each generation to use the computational resources efficiently. To this end, we repeatedly apply the mutation operator until an active node changes, and thereby obtain the candidate solutions to be evaluated. We call this mutation the forced mutation. Moreover, to maintain a neutral drift, which is effective for CGP evolution (Miller and Smith, 2006; Miller and Thomson, 2000), we modify the parent by the neutral mutation if the fitnesses of the offspring do not improve on it. The neutral mutation changes only the genes of inactive nodes, leaving the phenotype unmodified. We use the modified (1 + λ) evolution strategy (with λ = 2 in our experiments) incorporating the above modifications. The procedure of our evolutionary algorithm is listed in Algorithm 1.
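The following self-contained toy sketch illustrates the evolutionary loop of Algorithm 1. The integer genotype, the fixed set of "active" genes, and the toy fitness function are stand-ins for the CGP decoding and the CNN training described above; they are assumptions for illustration only.

```python
import copy
import random

random.seed(0)
GENES, ACTIVE = 20, set(range(10))        # pretend the first 10 genes are active

def fitness(geno):                        # stand-in for training the CNN and
    return sum(geno[i] for i in ACTIVE)   # measuring validation accuracy

def point_mutation(geno, rate, inactive_only=False):
    child = copy.copy(geno)
    for i in range(GENES):
        if inactive_only and i in ACTIVE:
            continue
        if random.random() < rate:
            child[i] = random.randint(0, 9)
    return child

def forced_mutation(parent, rate):
    """Re-apply point mutation until an active gene changes (new phenotype)."""
    while True:
        child = point_mutation(parent, rate)
        if any(child[i] != parent[i] for i in ACTIVE):
            return child

def one_plus_lambda(lam=2, rate=0.05, generations=100):
    parent = [random.randint(0, 9) for _ in range(GENES)]
    parent_fit = fitness(parent)
    for _ in range(generations):
        offspring = [forced_mutation(parent, rate) for _ in range(lam)]
        fits = [fitness(c) for c in offspring]     # evaluated in parallel on
        if max(fits) > parent_fit:                 # multiple GPUs in practice
            parent, parent_fit = offspring[fits.index(max(fits))], max(fits)
        else:                                      # neutral mutation: change only
            parent = point_mutation(parent, rate, inactive_only=True)
    return parent, parent_fit                      # inactive genes of the parent

print(one_plus_lambda())
```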
The (1 + λ) evolution strategy, the default evolutionary algorithm in CGP, has few strategy parameters, namely the mutation rate and the offspring size λ, meaning that we do not need to expend considerable effort tuning such strategy parameters. This is the reason we use the (1 + λ) evolution strategy in our method.
3.3 Speed-Up Techniques
The proposed CNN architecture optimization is time-consuming because it requires training the candidate CNN architectures in the usual way to assign the fitness value. We therefore introduce two speed-up techniques for the architecture search: rich initialization and early termination of network training. In the rich initialization, we initialize the individual with a ResNet-like (He et al., 2016) or DenseNet-like (Huang et al., 2017) structure and start the evolution from a good architecture. The early termination technique stops network training runs that are unlikely to reach better accuracy.
3.3.1 Rich Initialization
Starting from a good, hand-designed initial network architecture helps the architecture search more than starting from a randomly initialized one. By initializing the individual with a sophisticated architecture, we expect to find a better architecture in an early generation. In the experiment, we consider two well-known CNN architectures, the residual network (ResNet) (He et al., 2016) and the densely connected convolutional network (DenseNet) (Huang et al., 2017), as the initial individuals. Because we use a direct encoding of network architectures based on CGP, it is easy to edit the genotype string to represent the architectures we want. Specifically, to leverage the original node functions, CGP-CNN starts from modified ResNet and DenseNet architectures. The architectures of the modified ResNet and DenseNet are displayed in Table 2. The ResNet and DenseNet architectures can be represented using the ResBlock and the ConvBlock, respectively. Although these architectures are slightly different from those in He et al. (2016) and Huang et al. (2017), considering the pooling layers in ResNet and the transition layers in DenseNet, they are more promising initial individuals than randomly initialized ones. We denote this rich initialization technique as RichInit.
(a) The modified ResNet

| Layers | Node functions |
|---|---|
| ResBlock (1) | RB(32, 3) × 3 |
| Pooling Layer | MP |
| ResBlock (2) | RB(64, 3) × 3 |
| Pooling Layer | MP |
| ResBlock (3) | RB(128, 3) × 3 |
| Pooling Layer | MP |
| Classification Layer | Fully-connected |

(b) The modified DenseNet

| Layers | Node functions |
|---|---|
| Convolution | CB(32, 3) |
| Transition Layer (1) | CB(32, 3), AP |
| Transition Layer (2) | CB(64, 3), AP |
| Classification Layer | AP, Fully-connected |
3.3.2 Early Termination of Network Training
In the training procedure for the CNNs used in Suganuma et al. (2017), CNNs are trained for a fixed number of epochs. However, one can stop the training early to save computational costs if the architecture seems to have no chance of reaching good performance. Here, we introduce a simple early termination technique based on a reference curve.
The reference curve is constructed from the accuracy curves of previously trained networks and is used to decide whether the current architecture is promising. Let R^(t) = (R_1^(t), ..., R_T^(t)) denote the reference curve at the t-th generation, where T indicates the maximum number of epochs; R^(1) is initialized with the initial parent's accuracy curve on the architecture evaluation dataset. We terminate the CNN training if the accuracy of the current architecture on the architecture evaluation dataset falls below the corresponding value of the reference curve for K consecutive epochs. If the training is terminated early, the fitness of the architecture is assigned a value of zero. Figure 4 illustrates the concept of the early termination.
When the best fitness among the offspring exceeds the parent's fitness, the values of the reference curve are updated by taking the average of the reference curve and the accuracy curve of the best offspring, that is, R_i^(t+1) = (R_i^(t) + A_i) / 2 for i = 1, ..., T, where A = (A_1, ..., A_T) is the accuracy curve of the best offspring, and the values A_i are the accuracies on the architecture evaluation dataset at each epoch. The procedures for the early termination and the reference curve update are listed in Algorithms 2 and 3, respectively. We expect the evaluation process of the architectures to speed up without performance deterioration by using the early termination.
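The termination check and the reference curve update can be sketched as follows; the epoch-wise bookkeeping is simplified for illustration (in practice the check is performed during training rather than on a recorded history), and the numeric curves are toy values.

```python
def should_terminate(acc_history, reference, K):
    """Return True if the accuracy has been below the reference curve for K
    consecutive epochs (sketch of the early-termination check)."""
    consecutive = 0
    for epoch, acc in enumerate(acc_history):
        if acc < reference[epoch]:
            consecutive += 1
            if consecutive >= K:
                return True
        else:
            consecutive = 0
    return False

def update_reference(reference, best_offspring_curve):
    """Average the reference curve with the best offspring's accuracy curve
    (applied only when the best offspring improves on the parent)."""
    return [(r + a) / 2.0 for r, a in zip(reference, best_offspring_curve)]

# Toy usage: the candidate is below the reference for 3 consecutive epochs.
reference = [0.30, 0.45, 0.55, 0.62, 0.68]
candidate = [0.32, 0.40, 0.50, 0.58, 0.70]
print(should_terminate(candidate, reference, K=3))   # True
print(update_reference(reference, [0.35, 0.50, 0.60, 0.66, 0.72]))
```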
4 Experiments and Results
4.1 Dataset
We apply our method to the CIFAR-10 and CIFAR-100 datasets, consisting of 32 × 32 color images in 10 and 100 classes, respectively. Each dataset is split into a training set of 50,000 images and a test set of 10,000 images. We randomly sample 45,000 examples from the training set to train the CNN, and the remaining 5,000 examples are used for architecture evaluation (i.e., the fitness evaluation of CGP).
On the CIFAR-10 dataset, we also consider a small-data scenario, that is, a situation in which only a small amount of training data is available; we use only one-tenth of the dataset in this experiment. In general, the performance of CNN architectures depends strongly on the dataset, and it is difficult to manually design an appropriate network architecture for a new dataset. The hand-designed CNN architectures appear to be tuned to benchmark datasets such as CIFAR-10, and the purpose of the small-data scenario is to simulate a new dataset. In the small-data scenario, we randomly sample 4,500 images for model training and 500 images for architecture evaluation.
4.2 Experimental Setting
To assign the fitness value to a candidate CNN architecture, we train the CNN by mini-batch gradient descent with a mini-batch size of 128. The softmax cross-entropy loss is used as the loss function. We initialize the weights by the method described in He et al. (2015) and optimize them with the Adam optimizer (Kingma and Ba, 2015). We train each CNN for 50 epochs and use the maximum accuracy of the last ten epochs as the fitness value. We reduce the learning rate by a factor of 10 at the 30th epoch.
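The fitness evaluation can be summarized by the following PyTorch-style sketch (our implementation uses Chainer); the helper function is illustrative only, and the toy usage at the end runs on random data rather than CIFAR.

```python
import torch
import torch.nn as nn

def evaluate_fitness(model, train_loader, eval_loader, epochs=50, lr_drop=30):
    """Sketch of the fitness evaluation: train with Adam, divide the learning
    rate by 10 at `lr_drop`, and return the maximum accuracy on the
    architecture evaluation set over the last ten epochs."""
    opt = torch.optim.Adam(model.parameters())
    loss_fn = nn.CrossEntropyLoss()                  # softmax cross-entropy
    accuracies = []
    for epoch in range(epochs):
        if epoch == lr_drop:
            for g in opt.param_groups:
                g["lr"] /= 10.0
        model.train()
        for x, y in train_loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
        model.eval()
        correct = total = 0
        with torch.no_grad():
            for x, y in eval_loader:
                correct += (model(x).argmax(dim=1) == y).sum().item()
                total += y.numel()
        accuracies.append(correct / total)
    return max(accuracies[-10:])                     # fitness value

# Toy usage with random data; a real run would use CIFAR mini-batches.
xs, ys = torch.randn(64, 8), torch.randint(0, 2, (64,))
loader = [(xs, ys)]
print(evaluate_fitness(nn.Linear(8, 2), loader, loader, epochs=12, lr_drop=6))
```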
We preprocess the data with pixel-mean subtraction. To prevent overfitting, we apply weight decay. We also use data augmentation based on He et al. (2016): padding 4 pixels on each side and randomly cropping a 32 × 32 patch from the padded image or its horizontally flipped version.
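For illustration, this preprocessing and augmentation pipeline can be written with torchvision-style transforms as below; the per-channel mean values are placeholders, and using a per-channel rather than a per-pixel mean is a simplification of this sketch.

```python
import torchvision.transforms as T

# Per-channel pixel means should be computed from the training set; the values
# below are placeholders (a std of 1 means mean subtraction only, no scaling).
PIXEL_MEAN = (0.49, 0.48, 0.45)

train_transform = T.Compose([
    T.RandomCrop(32, padding=4),       # pad 4 pixels per side, random 32x32 crop
    T.RandomHorizontalFlip(),          # random horizontal flip
    T.ToTensor(),
    T.Normalize(mean=PIXEL_MEAN, std=(1.0, 1.0, 1.0)),  # pixel-mean subtraction
])

test_transform = T.Compose([
    T.ToTensor(),
    T.Normalize(mean=PIXEL_MEAN, std=(1.0, 1.0, 1.0)),
])
```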
The parameter setting for CGP is shown in Table 3. We use a relatively large number of columns to generate deep architectures. The number of active nodes in a CGP individual is restricted; therefore, we apply the mutation operator until a CNN architecture satisfying the restriction on the number of active nodes is generated. The offspring size λ is two, which equals the number of GPUs in our experimental machines. We test two node function sets, called ConvSet and ResSet, for our method. The ConvSet contains the ConvBlock, max pooling, average pooling, summation, and concatenation nodes in Table 1, and the ResSet contains the ResBlock, max pooling, average pooling, summation, and concatenation nodes. The difference between these two function sets is whether the set contains the ConvBlock or the ResBlock. The numbers of generations are 500 for ConvSet, 300 for ResSet, and 500 for the small-data scenario.
| Parameters | Values |
|---|---|
| Mutation rate | 0.05 |
| # Offspring (λ) | 2 |
| # Rows (N_r) | 5 |
| # Columns (N_c) | 30 |
| Minimum number of active nodes | 10 |
| Maximum number of active nodes | 50 |
| Levels-back (l) | 10 |
The best CNN architecture from the CGP process is retrained using all images in the training set, and then we compute the test accuracy. We optimize the weights of the obtained architecture for 500 epochs with a different training procedure: we use SGD with a momentum of 0.9, a mini-batch size of 128, and weight decay. Following the learning rate schedule in He et al. (2016), we start with a learning rate of 0.01, set it to 0.1 at the 5th epoch, and reduce it by a factor of 10 at the 250th and 370th epochs. We report the test accuracy at the 500th epoch as the final performance.
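The retraining learning-rate schedule can be expressed as a simple function of the epoch index; the exact epoch boundaries below are an interpretation of the schedule described above, not code from our implementation.

```python
def retrain_learning_rate(epoch):
    """Learning-rate schedule for retraining the best architecture: warm up
    at 0.01, jump to 0.1 at epoch 5, then divide by 10 at epochs 250 and 370
    (epochs indexed from 1; boundary handling is an assumption)."""
    if epoch < 5:
        return 0.01
    if epoch < 250:
        return 0.1
    if epoch < 370:
        return 0.01
    return 0.001

# The schedule over 500 epochs: 0.01 -> 0.1 -> 0.01 -> 0.001.
print([retrain_learning_rate(e) for e in (1, 5, 100, 250, 370, 500)])
```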
We implement the proposed method using the Chainer framework (Tokui et al., 2015) (version 1.16.0) and run it on machines with two NVIDIA GeForce GTX 1080 or two GTX 1080 Ti GPUs; we use the GTX 1080 and the GTX 1080 Ti for the experiments on the CIFAR-10 and CIFAR-100 datasets, respectively. Owing to the memory limitation, a candidate CNN occasionally exceeds the available GPU memory, and its training process fails with an out-of-memory error. In that case, we assign a zero fitness to the candidate architecture.
4.3 Result on the CIFAR-10 and 100 Datasets
We run the proposed method ten times on each dataset and report the classification errors. Here, we discuss the results of the proposed method without the speed-up techniques (the rich initialization and early termination described in Subsection 3.3). We compare the classification performance with state-of-the-art CNN models, including hand-designed CNNs and CNNs designed automatically by architecture search methods, on the CIFAR-10 and CIFAR-100 datasets. A summary of the classification performances is provided in Tables 4 and 5. We refer to the architectures constructed by the proposed method as CGP-CNN; for instance, CGP-CNN (ConvSet) means the proposed method with the ConvSet node function set. The models Maxout, Network in Network, VGG, ResNet, FractalNet, and Wide ResNet are hand-designed CNN architectures, whereas MetaQNN, Neural Architecture Search, Large-Scale Evolution, Genetic CNN, and CoDeepNEAT are models obtained by architecture search methods. Here, hand-designed means that the CNN architecture was designed by human experts. The values of the other models, except for VGG and ResNet on CIFAR-100, are taken from the literature. We implemented the VGG net and ResNet for CIFAR-100 because they were not applied to this dataset in Simonyan and Zisserman (2015) and He et al. (2016). The architecture of the VGG is identical to configuration D in Simonyan and Zisserman (2015); we denote this model as VGG in this article. In Tables 4 and 5, the numbers of learnable weight parameters in the models are also listed. For CGP-CNN, the number of learnable weight parameters of the best architecture is reported.
| Model | # params | Test error (%) | GPU days |
|---|---|---|---|
| Maxout (Goodfellow et al., 2013) | — | 9.38 | — |
| Network in Network (Lin et al., 2014) | — | 8.81 | — |
| VGG (Simonyan and Zisserman, 2015) | 15.2M | 7.94 | — |
| ResNet (He et al., 2016) | 1.7M | 6.61 | — |
| FractalNet (Larsson et al., 2017) | 38.6M | 5.22 | — |
| Wide ResNet (Zagoruyko and Komodakis, 2016) | 36.5M | 4.00 | — |
| CoDeepNEAT (Miikkulainen et al., 2017) | — | 7.30 | — |
| Genetic CNN (Xie and Yuille, 2017) | — | 7.10 | 17 |
| MetaQNN (Baker et al., 2017) | 3.7M | 6.92 |  |
| Large-Scale Evolution (Real et al., 2017) | 5.4M | 5.40 | 2750 |
| Neural Architecture Search (Zoph and Le, 2017) | 37.4M | 3.65 |  |
| CGP-CNN (ConvSet) | 1.50M | 5.92 | 31 |
| CGP-CNN (ResSet) | 2.01M | 5.01 | 30 |
| Model | # params | Test error (%) |
|---|---|---|
| Maxout (Goodfellow et al., 2013) | — | 38.57 |
| Network in Network (Lin et al., 2014) | — | 35.68 |
| VGG (Simonyan and Zisserman, 2015) | 15.2M | 33.45 |
| ResNet (He et al., 2016) | 1.7M | 32.40 |
| FractalNet (Larsson et al., 2017) | 38.6M | 23.30 |
| Wide ResNet (Zagoruyko and Komodakis, 2016) | 36.5M | 19.25 |
| CoDeepNEAT (Miikkulainen et al., 2017) | — | — |
| Neural Architecture Search (Zoph and Le, 2017) | 37.4M | — |
| Genetic CNN (Xie and Yuille, 2017) | — | 29.03 |
| MetaQNN (Baker et al., 2017) | 3.7M | 27.14 |
| Large-Scale Evolution (Real et al., 2017) | 40.4M | 23.0 |
| CGP-CNN (ConvSet) | 2.01M |  |
| CGP-CNN (ResSet) | 4.60M |  |
On the CIFAR-10 dataset, CGP-CNNs outperform most of the hand-designed models and strike a good balance between classification error and the number of parameters. CGP-CNN (ResSet) shows better performance than CGP-CNN (ConvSet). Compared with the other architecture search methods, CGP-CNN (ConvSet and ResSet) outperforms MetaQNN (Baker et al., 2017), Genetic CNN (Xie and Yuille, 2017), and CoDeepNEAT (Miikkulainen et al., 2017), and the best architecture of CGP-CNN (ResSet) outperforms Large-Scale Evolution (Real et al., 2017). Neural Architecture Search (Zoph and Le, 2017) achieved the best error rate, but this method used 800 GPUs and required considerable computational cost to search for the best architecture. Table 4 also shows the number of GPU days (the computational time multiplied by the number of GPUs used in the experiments) for the architecture search. As seen in this table, our method can find a good architecture at a reasonable computational cost. We surmise that the highly functional modules reduce the search space and allow better architectures to be found in early iterations. The CIFAR-100 dataset is a very challenging task because it has many classes. CGP-CNN finds competitive network architectures in a reasonable computational time. Even though our models do not reach the level of the state-of-the-art architectures, they achieve a good balance between classification error and the number of parameters.
The error rates of the architecture search methods (not only ours) do not reach that of the Wide ResNet, which is a human-designed architecture. However, such human-designed architectures are developed with tremendous human effort. An advantage of architecture search methods is that they can automatically find a good architecture for a new dataset. Another advantage of CGP-CNN is that the discovered architectures have fewer weight parameters than the human-designed architectures, which is beneficial when we want to implement a CNN on a mobile device. Note that we did not introduce any criterion for architecture complexity into the fitness function. It might be possible to find more compact architectures by introducing a penalty term into the fitness function, which is important future work.
Figure 5 illustrates examples of the CNN architectures obtained by the proposed method, CGP-CNN (ConvSet and ResSet). As seen in Figure 5, we observe complex architectures that would be hard to design by hand. Specifically, CGP-CNN (ConvSet) uses the summation and concatenation nodes, leading to a wide network and allowing the formation of skip connections; therefore, the CGP-CNN (ConvSet) architecture is wider than that of CGP-CNN (ResSet). Additionally, we observe that CGP-CNN (ResSet) has a structure similar to that of ResNet (He et al., 2016). ResNet consists of a series of two types of modules: modules with several convolutions and shortcut connections without downsampling, and downsampling convolutions with a stride of 2. Although our method cannot downsample within the ConvBlock or the ResBlock, CGP-CNN (ResSet) uses pooling layers as an alternative to the downsampling convolution. We can say that our method can also find architectures similar to those designed by human experts.
In addition, we conducted the Wilcoxon rank sum test to statistically compare ResSet and ConvSet. On both the CIFAR-10 and CIFAR-100 datasets, the p-values for the comparison between ResSet and ConvSet were below the significance level, indicating that the architectures found with ResSet outperform those found with ConvSet. We cannot conduct a statistical comparison between CGP-CNN and the other architecture search methods because the detailed experimental results of those methods are not available.
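Such a comparison can be reproduced with SciPy as sketched below; the error-rate samples in the snippet are made up for illustration and are not the values obtained in our experiments.

```python
from scipy.stats import mannwhitneyu  # Wilcoxon rank sum / Mann-Whitney U test

# Illustrative only: made-up error rates for ten runs of each function set;
# these are NOT the values reported in this article.
resset_errors = [5.0, 5.1, 4.9, 5.3, 5.2, 5.0, 5.4, 5.1, 5.2, 5.0]
convset_errors = [5.9, 6.1, 5.8, 6.0, 6.2, 5.9, 6.3, 6.0, 5.8, 6.1]

stat, p_value = mannwhitneyu(resset_errors, convset_errors,
                             alternative="less")   # are ResSet errors lower?
print(f"U = {stat}, p = {p_value:.4f}")            # reject H0 if p < alpha
```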
4.4 The Effect of the Rich Initialization
To investigate the effect of the rich initialization described in Subsection 3.3.1, we compare the CGP-CNN (ConvSet) and CGP-CNN (ResSet) models with rich initialization to the same models without it. Since DenseNet- and ResNet-like architectures can be constructed using our ConvSet and ResSet, respectively, we initialize CGP-CNN (ConvSet) with the modified DenseNet and CGP-CNN (ResSet) with the modified ResNet. A summary of the error rates and computational times is provided in Table 6. We observe that rich initialization contributes to improving the performance on both the CIFAR-10 and CIFAR-100 datasets. Additionally, the ResNet-like initialization shows better performance than the DenseNet-like initialization.
| Model | # params | Time (days) | Error rate (CIFAR-10) | Error rate (CIFAR-100) |
|---|---|---|---|---|
| CGP-CNN (ConvSet) | 1.50M | 15.6 |  | — |
| with RichInit (DenseNet) | 2.01M | 14.5 |  | — |
| CGP-CNN (ConvSet) | 2.04M | 13.0 | — |  |
| with RichInit (DenseNet) | 2.95M | 15.7 | — |  |
| CGP-CNN (ResSet) | 3.52M | 14.7 |  | — |
| with RichInit (ResNet) | 2.72M | 12.4 |  | — |
| CGP-CNN (ResSet) | 3.43M | 10.9 | — |  |
| with RichInit (ResNet) | 4.34M | 11.7 | — |  |
We additionally conducted the Wilcoxon rank sum test for two independent samples to analyze the effect of rich initialization. Based on the statistical test, the p-values for ResSet with and without rich initialization on CIFAR-10 and CIFAR-100, and likewise for ConvSet, were all below the significance level; thus, rich initialization significantly improves the performance over that of the original models. We note, however, that rich initialization of the architecture may cause the search to become stuck in a locally optimal solution. Therefore, we may need to take the diversity of the population into account when rich initialization is used.
4.5 The Effect of the Early Termination
We conducted an experiment introducing the early termination of network training on the CIFAR-10 dataset to check its effect. The early-termination parameter K is varied over three settings, the largest being 10. The other parameter settings are the same as those described in Subsection 4.2. Table 7 shows the error rates and the computational times of the proposed method with and without early termination, along with the average number of epochs for network training. We also report the case in which both speed-up techniques, rich initialization and early termination, are used with ResSet.
| Method | Error rate (%) | Time (days) | Average # epochs |
|---|---|---|---|
| CGP-CNN (ConvSet) |  | 15.6 | 50.0 |
| with early termination |  | 3.32 | 8.97 |
| with early termination |  | 6.73 | 18.8 |
| with early termination |  | 11.4 | 30.6 |
| CGP-CNN (ResSet) |  | 14.7 | 50.0 |
| with early termination |  | 2.39 | 7.96 |
| with early termination |  | 4.93 | 15.1 |
| with early termination |  | 6.05 | 23.9 |
| CGP-CNN (ResSet) with RichInit |  | 11.7 | 50.0 |
| with early termination |  | 1.29 | 6.88 |
| with early termination |  | 2.86 | 12.4 |
| with early termination |  | 6.36 | 22.2 |
From Table 7, early termination reduces the optimization time without significant performance deterioration, and the optimization time decreases as K decreases. The average number of epochs is roughly half that of the original method for the largest K setting and decreases approximately linearly with K. The computational time likewise decreases as the average number of epochs decreases. For CGP-CNN (ResSet) with the smallest K setting, the computational time becomes 2.39 days (i.e., approximately 16% of that of the original CGP-CNN (ResSet)), which is reasonable for most users. Additionally, we observe that the combination of early termination and rich initialization is more effective considering both computational time and performance; that is, the performance is better than that of the architectures without rich initialization.
In addition, we conducted the Kruskal-Wallis rank sum test over four independent samples—the CGP-CNN runs with the three early-termination settings and the runs without early termination—to analyze the effect of the early termination technique. For ConvSet, ResSet, and ResSet with RichInit, the p-values were all large (above the significance level), indicating that the models achieved similar classification performance. As a result, our early termination technique can reduce the computational time without performance deterioration.
4.6 Result on the Small-Data Scenario
In the small-data scenario, we compare our method with VGG and ResNet. We trained the VGG and ResNet models with the same settings used in the retraining process of the proposed method. Table 8 shows the comparison of error rates in the small-data scenario. We observe that our methods, CGP-CNN (ConvSet) and CGP-CNN (ResSet), can find better architectures than VGG and ResNet.
| Model | Error rate (%) | # params |
|---|---|---|
| VGG (Simonyan and Zisserman, 2015) |  | 15.2M |
| ResNet (He et al., 2016) |  | 1.70M |
| CGP-CNN (ConvSet) |  | 1.94M |
| CGP-CNN (ResSet) |  | 0.92M |
VGG and ResNet are designed and tuned for a relatively large amount of data and have millions of parameters to train. Therefore, the number of samples in the small-data scenario seems too small to prevent overfitting. Meanwhile, our method can automatically tune the architecture to the dataset and achieves better performance even on small datasets. The numbers of learnable weight parameters are relatively small, suggesting that the proposed method finds compact architectures that prevent overfitting. Figure 6 illustrates the architectures constructed by the proposed method. From this figure, our method finds relatively wide structures compared with the architectures shown in Figure 5. The computational time of the proposed method in the small-data scenario is a few days using an NVIDIA GeForce GTX 1080 Ti.
We additionally conducted Wilcoxon rank sum tests to compare our models with the other models under the same significance level: CGP-CNN (ResSet) against VGG (Simonyan and Zisserman, 2015) and ResNet (He et al., 2016), and CGP-CNN (ConvSet) against VGG and ResNet.
5 Conclusion
In this article, we proposed a CGP-based approach to designing deep CNN architectures and verified its potential. The proposed method constructs CNN architectures based on the CGP encoding scheme with highly functional modules and uses the modified (1 + λ) evolutionary algorithm to efficiently find good architectures. Moreover, we introduced simple speed-up techniques, rich initialization and early termination of network training, to reduce the computational time. We constructed CNN architectures for the image classification task on the CIFAR-10 and CIFAR-100 datasets and considered two data-size settings. The experimental results showed that the proposed method can find CNN architectures competitive with the state-of-the-art models. Regarding the speed-up techniques, rich initialization improves the performance of the discovered architectures, and early termination reduces the computational time without significant performance deterioration; combining the two reduces the computational time while improving performance. In the small-data scenario, the proposed method also finds better and more compact architectures than the existing hand-designed architectures.
The bottleneck of the architecture search for DNNs is the computational cost. Another possible speed-up technique is to start with a small data size and increase the amount of training data for the neural networks as the generations progress. Moreover, to obtain simpler and more compact CNN architectures, we may introduce regularization techniques into the architecture search process, or we may be able to manually simplify the obtained CNN architectures by removing redundant or less effective layers.
Another possible future topic would be to apply other evolutionary algorithms, such as the standard genetic algorithm used in Real et al. (2017), to our proposed method. While we employed a simple evolution strategy in this article, we did not discuss how other evolutionary algorithms affect the performance of the architecture search. We would like to investigate this point in the future.