The paper introduces NAS-Bench-101, the first public architecture dataset for NAS research. 423k unique convolutional architectures were trained and evaluated three times on CIFAR-10 and saved as tabular data. This allows querying the precomputed dataset of a diverse range of models in milliseconds. Because NAS-Bench-101 exhaustively evaluates a search space, it permits, for the first time, a comprehensive analysis of a NAS search space as a whole. Also, NAS-Bench-101 provides benchmark results, including evolutionary approaches, random search, and Bayesian optimization.
URL: arxiv.org/pdf/1902.09635.pdf
Author: Chris Ying et al. (Google Brain)
Conference: ICML 2019
Code: github.com/google-research/nasbench
1. NASBench Overview
The NAS-Bench-101 dataset is a table, which maps neural network architectures to their training and evaluation metrics.
1.1 Architectures
We stack each cell 3 times, followed by a downsampling layer, in which the size is halved via max-pooling and the channel count is doubled. We repeat this pattern 3 times, followed by global average pooling and a final dense softmax layer. The initial layer of the model is a stem consisting of one 3 × 3 convolutions with 128 output channels. The variation in the architectures arises from the cells which are repeated in stacks.
The space of cell architectures consists of all possible directed acyclic graphs on V nodes, where each possible node has one of L labels, representing the corresponding operation. Two of the vertices are specially labeled as operation IN and OUT, respectively. All convolutions utilize batch normalization followed by ReLU. We intentionally use convolutions instead of separable convolutions. In order to limit the size of the space to allow exhaustive enumeration, we impose the following constraints:
• We set L = 3, using only the following operations:
– 3 × 3 convolution
– 1 × 1 convolution
– 3 × 3 max-pool
• We limit V ≤ 7.
• We limit the maximum number of edges to 9.
1.2 Cell Encoding
For most of our experiments, we chose to use a very general encoding: a 7-vertex directed acyclic graph, represented by a 7×7 upper-triangular binary matrix, and a list of 5 labels, one for each of the 5 intermediate vertices (recall that the input and output vertices are fixed) Since there are 21 possible edges in the matrix and 3 possible operations for each label, there are $2^21 ∗ 3^5$ ≈ 510M total unique models in this encoding.
However, a large number of models in this space are invalid (i.e., there is no path from the input vertex, or the number of total edges exceeds 9). Also, different graphs in this encoding may not be computationally unique. After the de-duplication, there are approximately 423k unique graphs in the search space.
1.3 Training
All models are trained and evaluated on CIFAR-10 (40k training examples, 10k validation examples, 10k testing examples). The learning rate is annealed via cosine decay (Loshchilov & Hutter, 2017) to 0. Training is performed via RMSProp (Tieleman & Hinton, 2012) on the cross-entropy loss with L2 weight decay. For 423,624 models, each has been trained for 4, 12, 36, and 108 epochs in 3 times (to count for variation), and report training accuracy, validation accuracy, testing accuracy, parameter numbers, and training time. Only metrics on the training and validation set should be used to search models within a single NAS algorithm, and testing accuracy should only be used for an offline evaluation.
1.4 On benchmark methods
The goal of NAS algorithms is to find architectures that have high testing accuracy at epoch $E_{\max } .$ To do this, we repeatedly query the dataset at $\left(A, E_{\text {stop }}\right)$ pairs, where $A$ is an architecture in the search space and $E_{\text {stop }}$ is an allowed number of epochs $\left(E_{\text {stop }} \in\{4,12,36,108\}\right)$. Each query does a look-up using a random trial index, drawn uniformly at random from {1, 2, 3}, to simulate the stochasticity of SGD training.
While searching, we keep track of the best architecture $\hat{A}_{i}$ the algorithm has found after each function evaluation $i$, as ranked by its validation accuracy. To best simulate real-world computational constraints, we stop the search run when the total "training time" exceeds a fixed limit. After each complete search rollout, we query the corresponding mean test accuracy $f(\hat{A}_{i})$ for that model. Then we compute the immediate test regret: $r(\hat{A}_{i})=f(\hat{A}_{i}) - f(A^*)$, where $A^{*}$ denotes the model with the highest mean test accuracy in the entire dataset. This regret becomes the score for the search run.
2. NASBench as a Dataset
In this section, we analyze the NAS-Bench-101 dataset as a whole to gain some insight into the role of neural network operations and cell topology in the performance of convolutional neural networks. In doing so, we hope to shed light on the loss landscape that is traversed by NAS algorithms.
2.1 Dataset statistics
First, we study the empirical cumulative distribution (ECDF) of various metrics across all architectures in Figure 2. The validation accuracy and test accuracy are both above 90% for a majority of models. The best architecture in our dataset (Figure 1) achieved a mean test accuracy of 94.32%. We observed that the correlation between validation and test accuracy is extremely high (r = 0.999) at 108 epochs which suggests that strong optimizers are unlikely to overfit on the validation error. Due to the stochastic nature of the training process, training and evaluating the same architecture will generally lead to a small amount of noise in the accuracy. We also observe, as expected, that the noise between runs is lower at longer training epochs.
Figure 3 investigates the relationship between the number of parameters, training time, and validation accuracy of models in the dataset. The left plot suggests that there is a positive correlation between all of these quantities. However, parameter count and training time are not the only factors since the best cell in the dataset is not the most computationally intensive one.
2.2 Locality
NASBench exhibits locality, a property by which architectures that are “close by” tend to have similar performance metrics. We define “closeness” in terms of edit-distance: the smallest number of changes required to turn one architecture into another. A popular measure of locality is the random-walk autocorrelation (RWA), defined as the autocorrelation of the accuracies of points visited as we perform a long walk of random changes through space (Weinberger, 1990; Stadler, 1996).
The RWA (Figure 6, left) shows high correlations for lower distances, indicating locality. The correlations become indistinguishable from noise beyond a distance of about 6. While the RWA aggregates across the whole space, we can also consider regions of particular interest. For example, Figure 3 (right) displays the neighbors of the Inception-like cell, indicating a degree of locality too, especially in terms of accuracy.
3. Comparing NAS algorithms
We benchmarked a small set of NAS and hyperparameter optimization (HPO) algorithms with publicly available implementations: random search (RS) (Bergstra & Bengio, 2012), regularized evolution (RE) (Real et al., 2018), SMAC (Hutter et al., 2011), TPE (Bergstra et al., 2011), Hyperband (HB) (Li et al., 2018), and BOHB (Falkner et al., 2018). We also include our own implementation of reinforcement learning (RL) as an additional baseline, since an official implementation is not available. However, instead of using an LSTM controller, which we found to perform worse, we used a categorical distribution for each parameter and optimized the probability values directly with REINFORCE. Supplement S2 has additional implementation details for all methods. NAS algorithms based on weight sharing (Pham et al., 2018; Liu et al., 2018b) or network morphisms (Cai et al., 2018; Elsken et al., 2018) cannot be directly evaluated on the dataset, so we did not include them.
Figure 7 (left) shows the mean performance of each of these NAS/HPO algorithms across 500 independent trials. The x-axis shows the estimated wall-clock time, counting the evaluation of each architecture with the time that the corresponding training run took. Figure 7 (right) shows the empirical cumulative distribution of the regret after 10M seconds across all 500 runs of each method for comparing the robustness. RE, BOHB, and SMAC showed the most robust performance. Now, we make the following observation:
• RE, BOHB, and SMAC perform best and start to outperform RS after roughly 50,000 TPU seconds (the equivalent of roughly 25 evaluated architectures);
• The other Bayesian optimization method, TPE, struggles with this benchmark, with its performance falling back to random search.
• Even though RL starts outperforming RS at roughly the same time as the other methods, it converges much slower towards the global optimum.