Abstract
Evolution-based neural architecture search methods have shown promising results, but they require high computational resources because they train each candidate architecture from scratch before evaluating its fitness, which results in long search times. Covariance Matrix Adaptation Evolution Strategy (CMA-ES) has shown promising results in tuning hyperparameters of neural networks but has not been used for neural architecture search. In this work, we propose a framework called CMANAS, which applies the faster convergence property of CMA-ES to the deep neural architecture search problem. Instead of training each individual architecture separately, we use the accuracy of a trained one shot model (OSM) on the validation data as a prediction of the fitness of the architecture, resulting in reduced search time. We also use an architecture-fitness table (AF table) to keep a record of the already evaluated architectures, thus further reducing the search time. The architectures are modeled using a normal distribution, which is updated using CMA-ES based on the fitness of the sampled population. Experimentally, CMANAS achieves better results than previous evolution-based methods while reducing the search time significantly. The effectiveness of CMANAS is shown on two different search spaces using four datasets: CIFAR-10, CIFAR-100, ImageNet, and ImageNet-16-120. All the results show that CMANAS is a viable alternative to previous evolution-based methods and extends the application of CMA-ES to the deep neural architecture search field.
1 Introduction
Evolutionary algorithm (EA)-based NAS updates a population of architectures on the basis of the performance of the architectures obtained from the performance estimation process. Reinforcement learning (RL)-based NAS uses an RL agent that samples architectures from the search space and is updated depending on the performance of the sampled architecture, as determined by the performance estimation process. Both types of methods require huge computational resources, resulting in long search times. For example, the method proposed in Real et al. (2019) required 3,150 GPU days of evolution, and that discussed in Zoph et al. (2018) required 1,800 GPU days of RL search. This is attributed to the performance estimation process (Figure 1) wherein each architecture is trained from scratch for a certain number of epochs in order to evaluate its performance on the validation data. Recently proposed gradient-based methods such as Liu et al. (2019), Dong and Yang (2019b), Xie et al. (2019), Dong and Yang (2019a), and Chen et al. (2019) have reduced the search time by sharing weights among the architectures. However, these gradient-based methods are highly dependent on the given search space and suffer from premature convergence to a local optimum, as shown in Chen et al. (2019) and Zela et al. (2020).
Our contributions can be summarized as follows:
We designed a framework of applying the covariance matrix adaptation evolution strategy (CMA-ES) to the NAS problem where the architecture is represented by a 2D matrix. The entries in the matrix select an architecture by giving higher weights to that architecture in the search space. The matrix is updated using CMA-ES with more details given in Section 3. Instead of training each architecture in the population from scratch, we used a trained one shot model (OSM) (a supergraph that treats all architectures as subgraphs) for evaluating the performance/fitness of an architecture, resulting in reduced computational requirements.
We used an architecture-fitness table (AF table) for maintaining the records of the already evaluated architectures in order to skip the process of re-evaluating an already evaluated architecture and thus reducing the search time.
We also used a NAS benchmark, NAS-Bench-201 (Dong and Yang, 2020), which provides the fitness value of each architecture in the search space. This allowed us to simulate the process of using our method without OSM and guide the search process by training and evaluating each architecture in the population from scratch.
We also created a visualization of the architecture search performed by CMANAS to get insights into the search process. We found that the search begins with giving equal weights to all architectures in the search space and as the search progresses and converges to an architecture, CMANAS increases the weights to the converged architecture. We also found that the first phase of the search is predominantly an exploration phase wherein CMANAS explores the given search space. This is followed by an exploitation phase (i.e., convergence to an architecture).
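The AF table described in the contributions above can be pictured as a lookup keyed by the discrete architecture. The following is a minimal sketch under stated assumptions, not the authors' implementation: the class name `AFTable`, the tuple encoding of an architecture (one operation index per edge), and the `toy_fitness` function standing in for the OSM validation accuracy are all hypothetical.

```python
from typing import Callable, Dict, Tuple

# Hypothetical encoding: an architecture is the tuple of operation indices,
# one per edge. The paper does not specify the key format.
Arch = Tuple[int, ...]


class AFTable:
    """Architecture-fitness table: caches the fitness of evaluated architectures."""

    def __init__(self, evaluate: Callable[[Arch], float]) -> None:
        self._evaluate = evaluate        # expensive call, e.g. OSM validation accuracy
        self._table: Dict[Arch, float] = {}
        self.evaluations = 0             # number of real (non-cached) evaluations

    def fitness(self, arch: Arch) -> float:
        if arch not in self._table:      # only unseen architectures are evaluated
            self._table[arch] = self._evaluate(arch)
            self.evaluations += 1
        return self._table[arch]


def toy_fitness(arch: Arch) -> float:
    """Stand-in for the OSM validation accuracy (illustrative only)."""
    return sum(arch) / (4.0 * len(arch))


table = AFTable(toy_fitness)
population = [(0, 1, 2, 3, 4, 0), (0, 1, 2, 3, 4, 0), (4, 4, 4, 4, 4, 4)]
scores = [table.fitness(arch) for arch in population]
```

In this toy population the second architecture repeats the first, so only two of the three fitness requests trigger an actual evaluation.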
The code for our paper can be found here: https://github.com/nightstorm0909/CMANAS.
2 Related Work
Searching for a neural architecture automatically by using an algorithm (i.e., NAS) is an alternative to using architectures designed by humans, and in recent years, NAS methods have attracted increasing interest because of their promise of an automatic and efficient search for architectures specific to a task. Early NAS approaches (Stanley and Miikkulainen, 2002; Stanley et al., 2009) optimized both the neural architectures and the weights of the network using evolution. However, their usage was limited to shallow networks. Recent NAS methods (Zoph and Le, 2016; Pham et al., 2018; Real et al., 2019; Zoph et al., 2018; Real et al., 2017; Liu, Simonyan, et al., 2018; Xie and Yuille, 2017) perform the architecture search separately while using gradient descent for optimizing the weights of the architecture for its evaluation, which has made the search of deep networks possible. The various NAS methods can be classified into two categories on the basis of the search strategy used (Figure 1): gradient-based methods and non-gradient-based methods.
Gradient-Based Methods: These methods begin with a random neural architecture, which is then updated using the gradient information on the basis of its performance on the validation data. In general, these methods (Liu et al., 2019; Dong and Yang, 2019b; Xie et al., 2019; Dong and Yang, 2019a) relax the discrete architecture search space to a continuous search space by using a one shot model (OSM). The performance of the OSM on the validation data is used for updating the architecture using gradients. As the OSM shares weights among all architectures in the search space, these methods take less time in the performance estimation process in Figure 1 and thus have a shorter search time. However, these methods suffer from the overfitting problem wherein the resultant architecture shows good performance on the validation data but exhibits poor performance on the test data. This can be attributed to their preference for parameterless operations in the search space, as it leads to rapid gradient descent (Chen et al., 2019). Some regularization techniques have been introduced to tackle this problem, such as early stopping (Zela et al., 2020), search space regularization (Chen et al., 2019) and architecture refinement (Chen et al., 2019). In contrast to these gradient-based methods, our method does not suffer from the overfitting problem because of its stochastic nature and does not need any regularization to arrive at a good solution.
Non-Gradient-Based Methods: These methods include reinforcement learning (RL) methods and evolutionary algorithm (EA) methods. In the RL methods, an agent is used for generating neural architectures. The agent is then trained to generate architectures that maximize the expected accuracy on the validation data (calculated in the performance estimation process in Figure 1). In Zoph and Le (2016) and Zoph et al. (2018), a recurrent neural network (RNN) is used as an agent for sampling the neural architectures. These sampled architectures are then trained from scratch to convergence in order to get their accuracies on the validation data (i.e., performance estimation process in Figure 1). These accuracies are then used for updating the weights of the RNN agent by using policy gradient methods. Because of the huge computational requirement of training the architectures from scratch in the performance estimation process, both of these methods suffered from long search times. This was improved in Pham et al. (2018) by using a single directed acyclic graph (DAG) for sharing the weights among all the sampled architectures, thus reducing the computational resources required.
The EA-based NAS methods begin with a population of architectures, each of which is evaluated on the basis of its performance on the validation data (performance estimation process in Figure 1). The population is then evolved on the basis of the performance of the population. Methods such as those proposed in Real et al. (2019) and Xie and Yuille (2017) used gradient descent for optimizing the weights of each architecture in the population from scratch in order to determine their accuracies on the validation data as their fitness during the performance estimation process, resulting in huge computational requirements. In order to speed up the training process, Real et al. (2017) introduced weight inheritance wherein the architectures in the next generation population inherit the weights of the previous generation population, thereby bypassing the training from scratch. However, the speed-up gained is small, as the weights of each architecture still need to be optimized. Methods such as that proposed in Sun, Wang, et al. (2019) use a random forest for predicting the performance of the architecture during the performance estimation process, resulting in a high speed-up as compared to previous EA methods. However, its performance was far from the state-of-the-art results. In contrast, our method achieved better results than previous EA methods while using significantly less computational resources. CMA-ES has shown good performance in many high-dimensional continuous optimization problems such as fine-tuning the hyperparameters of the CNN (Loshchilov and Hutter, 2016). However, to the best of our knowledge, CMA-ES has not been applied to the NAS problem because of the discrete nature of the problem.
3 Proposed Approach
3.1 Search Space
3.1.1 Search Space 1 (S1)
S1 is similar to that used in Liu et al. (2019), which allows us to compare the performance of our method with other NAS methods. Here, we search for both normal and reduction cells in Figure 3a, where each node maps two inputs to one output. The two inputs for in cell are picked from the outputs from previous nodes in cell (i.e., ), output from previous cell , and output from the previous-to-previous cell .
3.1.2 Search Space 2 (S2)
S2 is a smaller search space with a total of 15,625 architectures in the search space and is similar to that used in NAS-Bench-201 (Dong and Yang, 2020), where we search only for the normal cell in Figure 3a. Here, each node is connected to the previous node (i.e., ). NAS-Bench-201 provides a unified benchmark for almost any up-to-date NAS algorithm by providing the results of each architecture in the search space on CIFAR-10, CIFAR-100, and ImageNet-16-120. It provides an API that can be used to query accuracies on both validation and test sets for all the architectures in the search space. The API provides two types of accuracies for each architecture, that is, the accuracy after training the architecture for 12 epochs and the accuracy after training it for 200 epochs. The accuracies of the architectures after 200 epochs are used as the performance measurement of various NAS algorithms. NAS-Bench-201 (Dong and Yang, 2020) (i.e., S2) provides the search results for two types of NAS methods: weight-sharing-based and non-weight-sharing-based. In the weight-sharing-based NAS methods, all the architectures in the search space share their weights to reduce the search time (e.g., Pham et al., 2018; Liu et al., 2019; Dong and Yang, 2019b, 2019a; Li and Talwalkar, 2020). In the non-weight-sharing-based NAS methods (e.g., Real et al., 2019; Bergstra and Bengio, 2012; Williams, 1992; Falkner et al., 2018), the architectures in the search space do not share their weights, and during the architecture search, the performance of each architecture is evaluated on the basis of the accuracy on the validation data after training for 12 epochs, which is provided by the API.
3.2 Representation of Architecture
As illustrated in Figure 3b, a cell in the architecture is represented by an architecture parameter, . Each for a normal cell and a reduction cell is represented by a matrix with columns representing the weights of different operations from the operation space (i.e., the search space of NAS) and rows representing the edge between two nodes. For example, in Figure 3b, represents the edge between node 0 and node 1 and the entries in the row represent the weights given to the three different operations. The operations shown in Figure 3 are generic operations to provide an overview. So, for number of operations in the search space and edges between nodes in the cell, has parameters. is modelled with a multivariate normal distribution, , with a mean vector, , of size , representing the parameters in , and a covariance matrix, C, of size , which is used for guiding the search process using CMA-ES. So, the representation for the search spaces are as follows:
Search space 1 (S1): Here, each cell has 7 nodes, with the first 2 nodes being the outputs from previous cells and the last node being the output node, resulting in 14 edges () among them. There are eight operations () considered in S1, which are as follows: 3 × 3 and 5 × 5 dilated separable convolutions, 3 × 3 and 5 × 5 separable convolutions, max pooling, average pooling, skip connect, and zero. Therefore, an architecture is represented by two 14 × 8 matrices, one each for a normal cell and a reduction cell. The values in these two matrices are modeled with a multivariate normal distribution with a mean vector, , of size 224, and a covariance matrix, C, of size 224 × 224.
Search space 2 (S2): Here, each cell has 4 nodes, with the first node being the input node and the last node being the output node, resulting in six edges () among them. The five operations () considered in S2 are as follows: 1 × 1 and 3 × 3 convolutions, average pooling, skip connect, and zero. Therefore, an architecture is represented by a 6 × 5 matrix for the normal cell. The values in the matrix are modeled with a multivariate normal distribution, , with a mean vector, , of size 30, and a covariance matrix, C, of size 30 × 30.
An architecture is derived from for the two search spaces through a mapping process (discussed in Section 3.3).
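The sizes quoted above follow from simple counting, which the short sketch below reproduces. The helper name `num_parameters` is hypothetical. The edge counts assume that every non-input node can receive an edge from each earlier node, which matches the 14 and 6 edges stated in the text.

```python
def num_parameters(num_edges: int, num_ops: int, num_cell_types: int) -> int:
    """One weight per (edge, operation) pair for every searched cell type."""
    return num_edges * num_ops * num_cell_types


# S1: 2 input nodes and 4 intermediate nodes; intermediate node k can take
# its inputs from the 2 cell inputs plus the k earlier intermediate nodes.
s1_edges = sum(range(2, 6))      # 2 + 3 + 4 + 5
# S2: 4 nodes, each connected to all previous nodes.
s2_edges = sum(range(1, 4))      # 1 + 2 + 3

s1_dim = num_parameters(s1_edges, num_ops=8, num_cell_types=2)  # normal + reduction
s2_dim = num_parameters(s2_edges, num_ops=5, num_cell_types=1)  # normal cell only

# In S2 each of the 6 edges independently takes one of 5 operations.
s2_num_architectures = 5 ** s2_edges

print(s1_edges, s2_edges)                    # 14 6
print(s1_dim, s2_dim, s2_num_architectures)  # 224 30 15625
```

The covariance matrix C is square in the dimension of the mean vector, so it has 224 × 224 entries for S1 and 30 × 30 entries for S2.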
3.3 Performance Estimation
The trained OSM from Algorithm 1 is then used to evaluate an architecture on the basis of its accuracy on the validation data, also known as the fitness of the architecture. The process of evaluation follows two steps sequentially (as illustrated in Figure 4):
Mapping process: Here, the architecture, , is derived from the architecture parameter , on the basis of the search space used.
Search space 1 (S1): Each node maps two inputs to one output, so for each node, the top two distinct input nodes are chosen from all previous nodes on the basis of the weights of all the operations in the search space. For example, in Figure 4, the bottom 4 rows in represent all possible connections to node 2 from the previous nodes using all three generic operations. In the mapping process, node 2 is connected to node 0 and node 1 through operation because they have the top two weights.
Search space 2 (S2): Each node is connected to all previous nodes, so for each edge between any two nodes, the top operation is chosen on the basis of the weights of all the operations in the search space. For example, in Figure 4, the bottom 3 rows in represent all possible connections to node 3 from previous nodes using all three generic operations. In the mapping process, node 3 is connected to node 0 through , node 1 through and node 2 through because they have the highest weights in each row.
Figure 4 illustrates the mapping process with three operations in both S1 and S2 and three nodes in S1 and four nodes in S2.
Discretization process: The derived architecture, , from the mapping process is then used to create a new architecture parameter called discrete architecture parameter, , with the entries given in Equation (2), where represents an operation in the operation space , is used to select an operation from between node and , and is used to select an architecture in the one shot model. For example, in discretization for S2 in Figure 4, the row represents the three different operations that can be chosen between node 0 and 1. Since is present in the architecture between node 0 and 1 (as seen in the mapping part of Figure 4), only the entry for in the row for is 1. is then sent to the OSM for evaluation of its accuracy on the validation data, which is the fitness of the architecture . This process gives a higher, equal weight to the operations of the architecture while giving lower, equal weights to the other operations. This results in a high contribution from the architecture and a very low contribution from the other architectures during the fitness evaluation.
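The two steps above can be sketched for a single cell as follows. This is a minimal illustration, not the authors' implementation. The function names (`map_s2`, `map_s1_node`, `discretize`) and the list-of-rows encoding are hypothetical. Setting the non-selected entries of the discrete parameter to 0 is inferred from the statement that only the selected entry is 1, and scoring each candidate edge in S1 by its largest operation weight is an assumption.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def argmax(row: List[float]) -> int:
    """Index of the largest entry (first one on ties)."""
    return max(range(len(row)), key=row.__getitem__)


def map_s2(alpha: Matrix) -> List[int]:
    """S2 mapping: on every edge (row), keep the operation with the highest weight."""
    return [argmax(row) for row in alpha]


def map_s1_node(rows: Matrix) -> List[Tuple[int, int]]:
    """S1 mapping for one node: rows[k] holds the operation weights of the edge
    coming from input node k. Returns the two (input_node, operation) pairs whose
    best operation weight is largest (assumed edge score)."""
    best = [(max(row), k, argmax(row)) for k, row in enumerate(rows)]
    top_two = sorted(best, reverse=True)[:2]
    return sorted((k, op) for _, k, op in top_two)


def discretize(arch: List[int], num_ops: int) -> List[List[int]]:
    """Discrete architecture parameter: 1 for the selected operation, 0 elsewhere."""
    return [[1 if op == chosen else 0 for op in range(num_ops)] for chosen in arch]


# Toy example with three generic operations, as in the figure's illustration.
alpha_s2 = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.2, 0.6]]
arch_s2 = map_s2(alpha_s2)                    # [1, 0, 2]
alpha_bar = discretize(arch_s2, num_ops=3)    # [[0, 1, 0], [1, 0, 0], [0, 0, 1]]

rows_to_node = [[0.1, 0.9, 0.0], [0.3, 0.2, 0.8], [0.4, 0.1, 0.2]]
inputs = map_s1_node(rows_to_node)            # [(0, 1), (1, 2)]
```

The one-hot rows in `alpha_bar` play the role of the discrete architecture parameter that is passed to the OSM for the fitness evaluation.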
3.4 CMANAS
4 Experiments and Results
4.1 Baselines
In order to illustrate the effectiveness of CMANAS, we compared the architecture returned by CMANAS with the other architectures reported in various peer-reviewed NAS methods. These peer-reviewed NAS methods are broadly classified into five categories: architectures designed by humans (reported as manual), RL-based methods (reported as RL), gradient-based methods (reported as grad. based), EA-based methods (reported as EA), and others. The others include the random search and sequential model-based optimization (SMBO) wherein the architecture is searched in the increasing order of complexity of its structure. The effectiveness of the reported architectures is measured in terms of the classification accuracy, and the computational requirement is measured in terms of search time on a single GPU reported as GPU days/GPU hours.
4.2 Dataset Settings
Both CIFAR-10 and CIFAR-100 (Krizhevsky et al., 2009) have 50,000 training images and 10,000 testing images, classified into 10 classes and 100 classes, respectively. ImageNet (Deng et al., 2009) is a popular benchmark for image classification and contains 1.28 million training images and 50,000 test images, which are classified into 1,000 classes. ImageNet-16-120 (Chrabaszcz et al., 2017) is a down-sampled version of ImageNet wherein the images in the original ImageNet dataset are downsampled to 16 × 16 pixels, with 120 classes, to construct the ImageNet-16-120 dataset. The settings used for the datasets in S1 are as follows:
CIFAR-10: We split 50,000 training images into two sets of size 25,000 each, with one set acting as the training set and the other set as the validation set.
CIFAR-100: We split 50,000 training images into two sets. One set of size 40,000 images becomes the training set and the other set of size 10,000 images becomes the validation set.
We followed the settings used in Dong and Yang (2020) for the datasets in S2, which are as follows:
CIFAR-10: The same settings as those used for S1 are used here as well.
CIFAR-100: The 50,000 training images remain as the training set, and the 10,000 testing images are split into two sets of size 5,000 each, with one set acting as the validation set and the other set as the test set.
ImageNet-16-120: It has 151,700 training images, 3,000 validation images, and 3,000 test images.
A tabular version of the dataset split for all the tasks in S1 and S2 is provided in the supplementary appendix. The training set is used for training the OSM and the validation set is used for estimating the fitness of the sampled architecture during the search process (Section 3.3).
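The S1 splits described above can be sketched as a shuffle of the training indices followed by a cut. This is only an illustration: the text does not state how the images are assigned to each set, so the seeded random shuffle and the helper name `split_indices` are assumptions.

```python
import random
from typing import List, Tuple


def split_indices(n: int, n_train: int, seed: int = 0) -> Tuple[List[int], List[int]]:
    """Shuffle range(n) and split it into disjoint training and validation indices."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)   # assumed: random assignment with a fixed seed
    return indices[:n_train], indices[n_train:]


# S1 settings from the text: CIFAR-10 is split 25,000 / 25,000,
# CIFAR-100 is split 40,000 / 10,000.
c10_train, c10_val = split_indices(50_000, 25_000)
c100_train, c100_val = split_indices(50_000, 40_000)

print(len(c10_train), len(c10_val))     # 25000 25000
print(len(c100_train), len(c100_val))   # 40000 10000
```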
4.3 Implementation Details
4.3.1 Training Settings
The training process is executed two times in our method, as follows:
One shot model (OSM) training: In general, the OSM suffers from high memory requirements, which makes it difficult to fit it in a single GPU. For S1, we follow Liu et al. (2019) and Li and Talwalkar (2020) and use a smaller OSM, called a proxy model, which is created with 8 stacked cells and 16 initial channels for both CIFAR-10 and CIFAR-100 datasets. It is then trained with SGD for 100 epochs on both CIFAR-10 and CIFAR-100 with the same settings, that is, batch size of 96, weight decay , cutout (DeVries and Taylor, 2017), initial learning rate (annealed down to 0 by using a cosine schedule without restart; Loshchilov and Hutter, 2017) and momentum . For S2, we do not use a proxy model, as the size of the OSM is sufficiently small to be fitted in a single GPU. So, the OSM in S2 is created by stacking 5 normal cells for all three datasets. For training, we follow the same settings as those used in S1 for CIFAR-10, CIFAR-100, and ImageNet-16-120, except for a batch size of 256.
Architecture evaluation: Here, the discovered architecture, (i.e., discovered cells), at the end of the architecture search is trained on the dataset to evaluate its performance for comparison with other NAS methods. For S1, we follow the training settings used in DARTS (Liu et al., 2019). Here, is created with 20 stacked cells and 36 initial channels for both CIFAR-10 and CIFAR-100 datasets. It is then trained for 600 epochs on both the datasets with the same settings as the ones used in the OSM training above. Following recent works (Pham et al., 2018; Real et al., 2019; Zoph et al., 2018; Liu et al., 2019; Liu, Zoph, et al., 2018), we use an auxiliary tower with 0.4 as its weights, path dropout probability of 0.2, and cutout (DeVries and Taylor, 2017) for additional enhancements. For ImageNet, is created with 14 cells and 48 initial channels in the mobile setting, wherein the input image size is 224 × 224 and the number of multiply-add operations in the model is restricted to less than 600M. It is trained on 8 NVIDIA V100 GPUs by following the training settings used in Chen et al. (2019).
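The cosine schedule without restart mentioned in the OSM training settings above anneals the learning rate from its initial value to 0 over the whole run. A minimal sketch is given below. The function name `cosine_lr` is hypothetical, and `INITIAL_LR` is an illustrative placeholder rather than the value used in the paper.

```python
import math


def cosine_lr(epoch: int, total_epochs: int, initial_lr: float) -> float:
    """Cosine annealing from initial_lr down to 0 over total_epochs, without restarts."""
    return 0.5 * initial_lr * (1.0 + math.cos(math.pi * epoch / total_epochs))


INITIAL_LR = 0.1   # placeholder value for illustration only
TOTAL_EPOCHS = 100  # OSM training length stated in the text

schedule = [cosine_lr(epoch, TOTAL_EPOCHS, INITIAL_LR) for epoch in range(TOTAL_EPOCHS + 1)]
```

The schedule starts at the initial learning rate, passes through half of it at the midpoint, and reaches 0 at the final epoch.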
4.3.2 Architecture Search Settings
4.4 Results
4.4.1 Search Space 1 (S1)
Architecture | Top-1 Acc. (%) | Params (M) | GPU Days | # Arch. Evaluated | Search Method |
---|---|---|---|---|---|
ResNet (He et al., 2016) | 95.39 | 1.7 | - | - | manual |
DenseNet-BC (Huang et al., 2017) | 96.54 | 25.6 | - | - | manual |
ShuffleNet (Zhang et al., 2018) | 90.87 | 1.06 | - | - | manual |
PNAS (Liu, Zoph, et al., 2018) | 96.59 | 3.2 | 225 | - | SMBO |
RSPS (Li and Talwalkar, 2020) | 97.14 | 4.3 | 2.7 | - | random |
NASNet-A (Zoph et al., 2018) | 97.35 | 3.3 | 1800 | - | RL |
ENAS (Pham et al., 2018) | 97.14 | 4.6 | 0.45 | - | RL |
DARTS (Liu et al., 2019) | 97.24 | 3.3 | 4 | - | grad. based |
GDAS (Dong and Yang, 2019b) | 97.07 | 3.4 | 0.83 | - | grad. based |
SNAS (Xie et al., 2019) | 97.15 | 2.8 | 1.5 | - | grad. based |
SETN (Dong and Yang, 2019a) | 97.31 | 4.6 | 1.8 | - | grad. based |
AmoebaNet-A (Real et al., 2019) | 96.66 | 3.2 | 3150 | 20,000 | EA |
Large-scale Evo. (Real et al., 2017) | 94.60 | 5.4 | 2750 | - | EA |
Hierarchical Evo. (Liu, Simonyan, et al., 2018) | 96.25 | 15.7 | 300 | 7,000 | EA |
CNN-GA (Sun et al., 2020) | 96.78 | 2.9 | 35 | 400 | EA |
CGP-CNN (Suganuma et al., 2017) | 94.02 | 1.7 | 27 | 600 | EA |
AE-CNN (Sun, Xue, et al., 2019) | 95.7 | 2.0 | 27 | 400 | EA |
AE-CNN+E2EPP (Sun, Wang, et al., 2019) | 94.70 | 4.3 | 7 | 400 | EA |
EvNAS (Sinha and Chen, 2021) | 97.37 | 3.4 | 3.83 | 10,000 | EA |
SI-ENAS (Zhang et al., 2020) | 95.93 | - | 1.8 | - | EA |
CMANAS-C10A | 97.44 | 3.8 | 0.45 | 1,021 | EA |
CMANAS-C10B | 97.35 | 3.2 | 0.45 | 1,040 | EA |
CMANAS-C10C | 97.35 | 3.3 | 0.45 | 1,052 | EA |
CMANAS-C10rand | 97.11 | 3.11 | 0.66 | 2,000 | random |
Architecture | Top-1 Acc. (%) | Params (M) | GPU Days | Search Method |
---|---|---|---|---|
ResNet (He et al., 2016) | 77.90 | 1.7 | - | manual |
DenseNet-BC (Huang et al., 2017) | 82.82 | 25.6 | - | manual |
ShuffleNet (Zhang et al., 2018) | 77.14 | 1.06 | - | manual |
PNAS (Liu, Zoph, et al., 2018) | 80.47 | 3.2 | 225 | SMBO |
MetaQNN (Baker et al., 2017) | 72.86 | 11.2 | 90 | RL |
ENAS (Pham et al., 2018) | 80.57 | 4.6 | 0.45 | RL |
DARTS (Liu et al., 2019) | 82.46 | 3.3 | 4 | grad. based |
GDAS (Dong and Yang, 2019b) | 81.62 | 3.4 | 0.83 | grad. based |
SETN (Dong and Yang, 2019a) | 82.75 | 4.6 | 1.8 | grad. based |
AmoebaNet-A (Real et al., 2019) | 81.07 | 3.2 | 3150 | EA |
Large-scale Evo. (Real et al., 2017) | 77.00 | 40.4 | 2750 | EA |
CNN-GA (Sun et al., 2020) | 79.47 | 4.1 | 40 | EA |
AE-CNN (Sun, Xue, et al., 2019) | 79.15 | 5.4 | 36 | EA |
Genetic CNN (Xie and Yuille, 2017) | 70.95 | - | 17 | EA |
AE-CNN+E2EPP (Sun, Wang, et al., 2019) | 77.98 | 20.9 | 10 | EA |
EvNAS (Sinha and Chen, 2021) | 83.14 | 3.4 | 3.83 | EA |
SI-ENAS (Zhang et al., 2020) | 81.36 | - | 1.8 | EA |
CMANAS-C100A | 83.24 | 3.4 | 0.60 | EA |
CMANAS-C100B | 83.09 | 3.47 | 0.63 | EA |
CMANAS-C100C | 82.73 | 2.97 | 0.62 | EA |
CMANAS-C100rand | 82.35 | 3.17 | 0.67 | random |
We also provide the number of architectures evaluated during the search process for the EA-based methods in Table 1 (reported as “# Arch. Evaluated”), as it is the bottleneck that increases the computational cost (measured in GPU days). From Table 1, we can see that the accuracy of the searched architecture increases with the number of evaluated architectures (e.g., AmoebaNet-A (Real et al., 2019) and Hierarchical Evolution (Liu, Simonyan, et al., 2018)). In comparison, our method gives better results while evaluating significantly fewer architectures, which results in a significant reduction of the search time. Also, note that methods such as CNN-GA and CGP-CNN evaluate fewer architectures than our method but have a longer search time. This is because these methods train each architecture from scratch for its evaluation, whereas our method trains the one shot model once and uses it to evaluate every architecture. The top cells discovered by CMANAS on CIFAR-10 and CIFAR-100 (i.e., CMANAS-C10A and CMANAS-C100A) are shown in Figure 7. The cells discovered by the other runs of CMANAS on CIFAR-10 and CIFAR-100 are provided in the supplementary appendix.
We followed Pham et al. (2018), Real et al. (2019), Zoph et al. (2018), Liu et al. (2019), and Liu, Zoph, et al. (2018) to compare the transfer capability of CMANAS with that of the other NAS methods, wherein the discovered architecture on a dataset was transferred to another dataset (i.e., ImageNet) by retraining the architecture from scratch on the new dataset. The best discovered architectures from the architecture search on CIFAR-10 and CIFAR-100 (i.e., CMANAS-C10A and CMANAS-C100A) were then evaluated on the ImageNet dataset in mobile setting, and the results are provided in Table 3. The results show that the cells discovered by CMANAS on CIFAR-10 and CIFAR-100 can be successfully transferred to ImageNet, achieving better results than those of human-designed, RL-based, gradient-based, and EA-based methods while using significantly less computational time.
Architecture | Top-1 Test Acc. (%) | Top-5 Test Acc. (%) | Params (M) | Multiply-adds (M) | GPU Days | Search Method |
---|---|---|---|---|---|---|
MobileNet-V2 (Sandler et al., 2018) | 72.0 | 91.0 | 3.4 | 300 | - | manual |
PNAS (Liu, Zoph, et al., 2018) | 74.2 | 91.9 | 5.1 | 588 | 225 | SMBO |
NASNet-A (Zoph et al., 2018) | 74.0 | 91.6 | 5.3 | 564 | 1800 | RL |
NASNet-B (Zoph et al., 2018) | 72.8 | 91.3 | 5.3 | 488 | 1800 | RL |
NASNet-C (Zoph et al., 2018) | 72.5 | 91.0 | 4.9 | 558 | 1800 | RL |
DARTS (Liu et al., 2019) | 73.3 | 91.3 | 4.7 | 574 | 4 | grad. based |
GDAS (Dong and Yang, 2019b) | 74.0 | 91.5 | 5.3 | 581 | 0.83 | grad. based |
SNAS (Xie et al., 2019) | 72.7 | 90.8 | 4.3 | 522 | 1.5 | grad. based |
SETN (Dong and Yang, 2019a) | 74.3 | 92.0 | 5.4 | 599 | 1.8 | grad. based |
AmoebaNet-A (Real et al., 2019) | 74.5 | 92.0 | 5.1 | 555 | 3150 | EA |
AmoebaNet-B (Real et al., 2019) | 74.0 | 91.5 | 5.3 | 555 | 3150 | EA |
AmoebaNet-C (Real et al., 2019) | 75.7 | 92.4 | 6.4 | 570 | 3150 | EA |
NSGANetV1-A2 (Lu et al., 2020) | 74.5 | 92.0 | 4.1 | 466 | 27 | EA |
EvNAS (Sinha and Chen, 2021) | 74.9 | 92.2 | 4.9 | 547 | 3.83 | EA |
CMANAS-C10A | 75.3 | 92.6 | 5.3 | 589 | 0.45 | EA |
CMANAS-C100A | 74.8 | 92.2 | 4.8 | 531 | 0.60 | EA |
4.4.2 Search Space 2 (S2)
We performed architecture search on the CIFAR-10, CIFAR-100, and ImageNet16-120 datasets for both types of NAS methods given in S2 (Dong and Yang, 2020):
Weight-sharing-based NAS: Here, the architecture evaluation in CMANAS used the trained OSM (as discussed in Section 3.3). Following Dong and Yang (2020), we performed the architecture search three times on all three datasets and compared the results with those of other weight-sharing-based NAS methods because of the weight-sharing nature of the OSM. The results are reported as CMANAS in Table 4. To make a fair comparison in terms of search time, we re-ran all the methods using OSM on our setup, that is, RTX 3090, for the search space S2 and compared them with our method in Figure 8b. From the figure, we find that CMANAS is able to find good solutions at a lower search cost than the other methods using OSM. In Figure 9, we compare the progression of the search of the weight-sharing-based CMANAS with that of other NAS methods and find that CMANAS converges to a good solution at a much faster rate.
Non-weight-sharing-based NAS: Here, the fitness of an architecture was taken to be its accuracy on the validation data after training for 12 epochs from scratch, which is provided by the API in S2. This allows us to simulate the process of using CMANAS without the OSM, wherein each architecture in the population is trained from scratch for 12 epochs and then evaluated on the validation data. CMANAS therefore updates the architecture parameter using the validation accuracy provided by the API in S2. Following Dong and Yang (2020), we performed the architecture search 500 times on all three datasets for 25 generations each and compared the results with those of the other non-weight-sharing-based NAS methods; the corresponding results are reported as CMANAS-h12-Ep25 in Table 4. We also performed another architecture search 500 times on all three datasets for 100 generations each and reported the corresponding results as CMANAS-h12-Ep100 in Table 4; we found no significant improvement over the 25-generation version.
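The non-weight-sharing setting described above can be sketched as a loop that samples architecture parameters from a normal distribution, decodes each sample into a discrete cell, and queries a precomputed fitness instead of training. The sketch below is a deliberate simplification: it updates only the mean from the better half of the population with a fixed step size, and it is not the full CMA-ES update used by CMANAS. The constants `NUM_EDGES` and `NUM_OPS`, the population size, and the function `lookup_fitness` (a toy stand-in for the benchmark query) are all assumptions made for illustration.

```python
import random

NUM_EDGES = 6  # assumed cell size, for illustration only
NUM_OPS = 5    # assumed number of candidate operations per edge


def decode(alpha):
    """Map real-valued weights to a discrete cell: best-weighted op per edge."""
    return tuple(max(range(NUM_OPS), key=lambda op: edge[op]) for edge in alpha)


def lookup_fitness(arch):
    """Hypothetical stand-in for a tabular benchmark query; returns a toy score."""
    return sum(1 for op in arch if op == 3) / len(arch)


def search(generations=25, population=10, sigma=0.5, seed=0):
    rng = random.Random(seed)
    # Equal weights for every operation on every edge at generation 0.
    mean = [[0.0] * NUM_OPS for _ in range(NUM_EDGES)]
    best_arch, best_fit = None, float("-inf")
    for _ in range(generations):
        scored = []
        for _ in range(population):
            alpha = [[m + sigma * rng.gauss(0.0, 1.0) for m in edge] for edge in mean]
            arch = decode(alpha)
            fit = lookup_fitness(arch)
            scored.append((fit, alpha))
            if fit > best_fit:
                best_arch, best_fit = arch, fit
        # Simplified update (not CMA-ES): move the mean to the average
        # of the better half of the sampled population.
        scored.sort(key=lambda item: item[0], reverse=True)
        elite = [alpha for _, alpha in scored[: population // 2]]
        mean = [
            [sum(a[e][o] for a in elite) / len(elite) for o in range(NUM_OPS)]
            for e in range(NUM_EDGES)
        ]
    return best_arch, best_fit


best_arch, best_fit = search()
print(best_arch, best_fit)
```

Replacing `lookup_fitness` with a real benchmark query would leave the loop unchanged, which is the point of the simulation: the search logic is independent of how the fitness is obtained.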
Method | Search (seconds) | CIFAR-10 validation | CIFAR-10 test | CIFAR-100 validation | CIFAR-100 test | ImageNet-16-120 validation | ImageNet-16-120 test | Search Method |
---|---|---|---|---|---|---|---|---|
RSPS (Li and Talwalkar, 2020) | 7587.12 | | | | | | | random |
DARTS-V1 (Liu et al., 2019) | 10889.87 | | | | | | | grad. based |
DARTS-V2 (Liu et al., 2019) | 29901.67 | | | | | | | grad. based |
GDAS (Dong and Yang, 2019b) | 28925.91 | | | | | | | grad. based |
SETN (Dong and Yang, 2019a) | 31009.81 | | | | | | | grad. based |
ENAS (Pham et al., 2018) | 13314.51 | 0.00 | | | | | | RL |
CMANAS | 13896 | 89.06±0.4 | 92.05±0.26 | 67.43±0.42 | 67.81±0.15 | 39.54±0.91 | 39.77±0.57 | EA |
AmoebaNet (Real et al., 2019) | 0.02 | | | | | | | EA |
RS (Bergstra and Bengio, 2012) | 0.01 | | | | | | | random |
REINFORCE (Williams, 1992) | 0.12 | | | | | | | RL |
BOHB (Falkner et al., 2018) | 3.59 | | | | | | | grad. based |
CMANAS-h12-Ep25 | 3.64 | 91.23±0.40 | 94.00±0.39 | 72.16±1.19 | 72.11±1.10 | 45.69±0.84 | 45.7±0.79 | EA |
CMANAS-h12-Ep100 | 7.12 | 91.28±0.35 | 94.06±0.34 | 72.40±0.96 | 72.26±0.84 | 45.74±0.86 | 45.69±0.88 | EA |
ResNet | N/A | 90.83 | 93.97 | 70.42 | 70.86 | 44.53 | 43.63 | manual |
Optimal | N/A | 91.61 | 94.37 | 73.49 | 73.51 | 46.77 | 47.31 | N/A |
The results show that CMANAS outperforms most of the weight-sharing NAS methods except GDAS (Dong and Yang, 2019b). However, GDAS performs worse when the size of the search space increases, as can be seen for S1 in Table 1. CMANAS also dominates the non-weight-sharing-based NAS methods. Notably, the non-weight-sharing-based CMANAS (i.e., CMANAS-h12-Ep25) also dominates the weight-sharing-based CMANAS, which shows that the trained OSM provides a noisy estimate of the fitness/performance of an architecture.
4.4.3 CMANAS vs Gradient-Based Methods
In Figure 9a, we compare the progression of the search of the weight-sharing-based CMANAS with that of the other gradient-based NAS methods. Gradient-based methods such as DARTS (Liu et al., 2019) suffer from the overfitting problem, wherein the search converges to operations that give faster gradient descent, that is, the skip-connect operation because of its parameter-less nature, as reported in Chen et al. (2019), Zela et al. (2020), and Dong and Yang (2020). This leads to a higher number of skip-connect operations in the final discovered cell, which is a local optimum (as shown in Figure 9a). To resolve the overfitting problem, Chen et al. (2019) used a regularization method that restricts the number of skip-connect operations in the final normal cell to a specific number for the search space S1. This optimal number of skip-connect operations is a search-space-dependent value, and thus the same method cannot be applied to the search space S2. In contrast, CMANAS does not suffer from the overfitting problem because of its stochastic nature and does not need any regularization specific to a search space, which makes it independent of the search space.
5 Further Analysis
In the following sections, we will analyze the different aspects of CMANAS.
5.1 Visualizing the Architecture Search
For analyzing the search, we visualize the search process by using the following techniques:
- We plotted the number of unique architectures sampled in every generation for both the weight-sharing-based CMANAS and the non-weight-sharing-based CMANAS (i.e., CMANAS-h12-Ep100) in S2 for CIFAR-10, and for the weight-sharing-based CMANAS in S1 for both CIFAR-10 and CIFAR-100, as shown in Figure 10. From the figure, we made the following observations:
  - The number of unique architectures sampled at the beginning of the architecture search was equal to the population size. This part can be considered the exploration phase of CMANAS, wherein the algorithm explores the search space by sampling unique architectures.
  - As the search progressed, the number of unique sampled architectures decreased because of the dominance of some architecture solutions in the population. This part can be considered the exploitation phase of CMANAS, wherein the algorithm keeps sampling the already evaluated good solutions.
  - The non-weight-sharing-based CMANAS (i.e., CMANAS-h12-Ep100) converged considerably faster to a better solution than the weight-sharing-based CMANAS in S2. This shows that the trained OSM provides a noisy estimate of the fitness of an architecture.
  - Because S1 is larger than S2, the exploration phase of CMANAS was longer in S1 than in S2.
- As the architecture parameter is modeled using the mean of the normal distribution, we can visualize its progression by visualizing the mean throughout the architecture search process. In Figure 11, the mean of the distribution in a generation is visualized using a bar plot, wherein each bar in the plot represents the edge between two nodes in the cell (x-axis). All the operations in the search space are represented by different colors, and the weight associated with any operation between two nodes is represented by the width of the color associated with that operation in the bar for that specific edge. From the figure, we observed that the search began with equal weights for all the operations on all the edges at generation 0, and as the search progressed, CMANAS changed the weights according to the fitness estimation of the architectures in the population. As the search converged to an architecture, CMANAS increased the weights of the operations of this architecture. Figure 11 shows the search progression for both the weight-sharing-based CMANAS and the non-weight-sharing-based CMANAS for S2 on the CIFAR-10 dataset only. For the CIFAR-100 and ImageNet16-120 datasets, please refer to the supplementary appendix.
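The unique-architecture curve discussed above can be computed directly from the per-generation populations. The following is a minimal sketch on toy data; the function names and the string labels standing in for architectures are assumptions, and the rule used to delimit the exploration phase (every sample in the generation is distinct) is one possible reading of the description above.

```python
def unique_per_generation(history):
    """Count distinct architectures in each generation's population.

    `history` is a list of populations; each population is a list of
    hashable architecture encodings.
    """
    return [len(set(population)) for population in history]


def exploration_length(history):
    """Number of leading generations in which every sample is unique."""
    length = 0
    for population in history:
        if len(set(population)) < len(population):
            break
        length += 1
    return length


# Toy history with a population size of 4 (labels are placeholders).
history = [
    ["arch-A", "arch-B", "arch-C", "arch-D"],
    ["arch-A", "arch-B", "arch-B", "arch-D"],
    ["arch-B", "arch-B", "arch-B", "arch-D"],
    ["arch-B", "arch-B", "arch-B", "arch-B"],
]
counts = unique_per_generation(history)
explore = exploration_length(history)
print(counts, explore)
```

On this toy history the count falls from the population size toward one as a single architecture comes to dominate the population, which mirrors the transition from exploration to exploitation.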
5.2 Observation on Sampled Architectures
The API provided in the search space, S2, (i.e., NAS-Bench-201, Dong and Yang, 2020) allows a faster way to analyze a NAS algorithm. To discover the patterns associated with the sampled architectures, we plotted the frequency of all the operations in the search space S2 for the top 20 sampled architectures (in terms of their estimated fitness) in both the non-weight-sharing-based CMANAS (i.e., CMANAS-h12-Ep100) and the weight-sharing-based CMANAS as shown in Figure 12. They were compared with the frequency of all the operations in the search space S2 for the top 20 architectures (in terms of the test accuracy after 200 epochs), and we observed the following:
- Convolution operations dominated the top 20 architectures of S2. The same was seen in the non-weight-sharing-based CMANAS, whereas the weight-sharing-based CMANAS showed only a marginal dominance.
- The frequency of the average pooling operation was very low in the top 20 architectures of S2. The same was seen in the non-weight-sharing-based CMANAS, whereas the weight-sharing-based CMANAS showed a higher frequency.
All these observations show that the trained OSM provides a noisy estimate of the fitness of an architecture and with a better estimator, CMANAS can yield better results.
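The frequency analysis above amounts to ranking the sampled architectures by fitness, keeping the top k, and counting how often each operation appears. The following is a minimal sketch on toy data; the operation names, the fitness values, and the function name `operation_frequency` are assumptions for illustration and are not taken from the benchmark.

```python
from collections import Counter


def operation_frequency(scored_archs, top_k=20):
    """Relative frequency of each operation among the top_k architectures.

    `scored_archs` is a list of (architecture, fitness) pairs, where an
    architecture is a tuple of operation names, one per edge.
    """
    top = sorted(scored_archs, key=lambda item: item[1], reverse=True)[:top_k]
    counts = Counter(op for arch, _ in top for op in arch)
    total = sum(counts.values())
    return {op: n / total for op, n in counts.items()}


# Toy data: three 3-edge architectures with made-up fitness values.
scored = [
    (("conv_3x3", "conv_3x3", "skip"), 0.93),
    (("conv_3x3", "avg_pool", "skip"), 0.90),
    (("avg_pool", "avg_pool", "none"), 0.40),
]
freq = operation_frequency(scored, top_k=2)
print(freq)
```

Running the same function on the top architectures ranked by estimated fitness and on those ranked by final test accuracy gives the two histograms that are compared in the observations above.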
5.3 Ablation
Comparison with Random Search: Here, we do not update the normal distribution after estimating the fitness of the individual architectures in the population (i.e., no “update distribution” block as shown in Figure 2). The sampled architecture with the best fitness is returned as the searched architecture after 100 generations. This is essentially a random search and is reported in Table 1 for CIFAR-10 as CMANAS-C10rand and in Table 2 for CIFAR-100 as CMANAS-C100rand. We found that CMANAS-C10rand shows results similar to those reported for RSPS (Li and Talwalkar, 2020) (random search with parameter sharing). We also found that CMANAS took less time than the random search while outperforming it on both datasets. The random search takes longer than CMANAS because it evaluates a higher number of architectures during the search process (Table 1). Also, in Figure 9c, we compare the progression of the search of the weight-sharing-based CMANAS with that of the random search (Li and Talwalkar, 2020) in S2 and find that a mere random search in S2 cannot provide a good solution.
Effectiveness of the AF Table: To illustrate the effectiveness of the AF table used during the evaluation of the fitness of the architecture, we plotted the total number of architectures evaluated during the CMANAS search with and without the AF table as shown in Figure 13, and made the following observations:
- On average, the use of the AF table reduces the number of architecture evaluations by 50% for S1 and 90% for S2 (as shown in Figure 13). This reduces the search time of CMANAS in both search spaces on all the datasets, as summarized in Table 5.
- Because S2 is smaller than S1, CMANAS requires fewer architecture evaluations to converge on an optimal solution in S2 than in S1.
- The speedup is much bigger for CIFAR-10 than for the other datasets in both S1 and S2 because of the bigger size of the validation data used for CIFAR-10 (Section 4.2), which results in a longer fitness evaluation time.

Table 5: Search time of CMANAS with and without the AF table.

Setting | S1 CIFAR-10 (GPU days) | S1 CIFAR-100 (GPU days) | S2 CIFAR-10 (seconds) | S2 CIFAR-100 (seconds) | S2 ImageNet16-120 (seconds) |
---|---|---|---|---|---|
Without AF table | 0.66 | 0.67 | 21807 | 27648 | 14220 |
With AF table | 0.45 | 0.62 | 13824 | 25992 | 13536 |
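The saving from the AF table comes from never evaluating the same architecture twice. The following is a minimal sketch of that idea as a dictionary-backed cache; the class name `AFTable`, its counters, and the toy evaluator are assumptions for illustration and do not reproduce the authors' implementation.

```python
class AFTable:
    """Architecture-fitness cache: evaluate each architecture at most once."""

    def __init__(self, evaluate):
        self._evaluate = evaluate  # expensive fitness function
        self._table = {}           # architecture -> cached fitness
        self.lookups = 0           # fitness requests made by the search
        self.evaluations = 0       # requests that needed a real evaluation

    def fitness(self, arch):
        self.lookups += 1
        if arch not in self._table:
            self._table[arch] = self._evaluate(arch)
            self.evaluations += 1
        return self._table[arch]


def toy_evaluate(arch):
    """Placeholder for an expensive validation pass; returns a toy score."""
    return len(arch) / 10.0


table = AFTable(toy_evaluate)
# Later generations tend to resample architectures that were already scored.
for arch in ["a", "bb", "a", "a", "ccc", "bb"]:
    table.fitness(arch)

saved = 1.0 - table.evaluations / table.lookups
print(table.lookups, table.evaluations, saved)
```

The fraction saved grows as the search moves into the exploitation phase, because the population then contains more repeated architectures.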
6 Conclusion and Future Work
The goal of this paper was to develop a framework that utilizes the faster convergence property of the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) and extends its applicability to the NAS problem while using significantly less computational time than the previous evolution-based NAS methods. This was achieved by using a trained one shot model (OSM) for evaluating the architectures in the population, which allowed us to skip the training of each individual architecture from scratch for its fitness evaluation. We applied our method (CMANAS) to two different search spaces to show that it generalizes to any cell-based search space, that is, that it is search-space agnostic. Experimentally, CMANAS reduced the architecture search time significantly, by factors of ten to a thousand, while achieving better results on the CIFAR-10, CIFAR-100, and ImageNet datasets than the previous evolutionary algorithms. The reduction in search time was achieved through the use of the one shot model, the AF table, and the ability of CMA-ES to converge to a solution using a smaller number of architecture evaluations. CMANAS also avoids the overfitting problem present in the gradient-based NAS methods because of its stochastic nature. We also created a visualization of the architecture search performed by CMANAS and found that the search begins by giving equal weights to all architectures in the search space and increases the weights of the converged architecture as the search progresses. We also analyzed the search process and found that the first part of the architecture search in CMANAS acts as the exploration phase, wherein it explores the search space, and the later part acts as the exploitation phase, wherein the search converges to an architecture. A possible future direction to improve the performance of the algorithm is to use a better fitness estimator, as we found that the trained OSM provides only a noisy estimate of the fitness and that a better estimator would allow CMANAS to achieve its full potential.
Acknowledgments
This work was supported in part by the National Science and Technology Council of Taiwan (111-2628-E-A49-003-MY2 and 111-2634-F-A49-010-). Furthermore, we are grateful to the National Center for High-Performance Computing for computer time and facilities.
Based on the implementation of SGD with Nesterov momentum provided in PyTorch.