[2020.06] FBNetV3: Joint Architecture-Recipe Search using Predictor Pretraining

Previously proposed Neural Architecture Search (NAS) methods focus mainly on finding an architecture that performs well. Hyperparameter Optimization (HPO), on the other hand, automatically searches for hyperparameters such as the learning rate and weight decay that yield good performance. This paper differs substantially from prior work in that it proposes searching for both at once (whether it is the first to do so, I am not sure). Note that the paper refers to these hyperparameters as the training recipe.

If we could search a wider architecture-recipe space effectively, we would naturally obtain better performance. The real question is how to reduce the search cost efficiently while keeping that tradeoff acceptable, and the paper proposes an accuracy-predictor method combined with an evolutionary algorithm to achieve this. Accuracy predictors have been proposed many times before, and some may argue there is little novelty here, but from a practical standpoint I think the proposed method is well worth it. FBNetV3 trains an accuracy predictor that jointly scores architecture-recipe pairs. Pretraining this predictor with labels such as FLOPs and the number of parameters improves its subsequent search efficiency and prediction quality by roughly 5x. With the trained predictor, an iterative optimization process based on evolutionary search then finds an optimal architecture-recipe pair within a short amount of time.

FBNetV3 (which has little to do with the earlier FBNetV1 and FBNetV2) matches EfficientNet with 2.0x fewer FLOPs and ResNeSt with 7.1x fewer FLOPs at the same accuracy. On the object detection task as well, it outperforms EfficientNet-based architectures with less computation.

Figure 1: ImageNet accuracy vs. model FLOPs comparison of FBNetV3 with other efficient convolutional neural networks. FBNetV3 achieves 80.8% (82.8%) top-1 accuracy with 557M (2.1G) FLOPs, setting a new SOTA for accuracy-efficiency trade-offs.

URL: https://arxiv.org/pdf/2006.02049.pdf

1. Introduction

One category of NAS is differentiable neural architecture search (DNAS). These path-finding algorithms are efficient; however, DNAS cannot search for non-architecture hyperparameters, which are crucial to a model's performance. Furthermore, supernet-based NAS methods suffer from a limited search space, as the entire supergraph must fit into memory to avoid slow convergence [5] or paging. Other methods include reinforcement learning (RL) [45] and evolutionary algorithms [41]. However, these methods share several drawbacks: (1) they ignore the associated training hyperparameters (i.e., the "training recipe"), (2) they support only one-time use, and (3) the search space becomes prohibitively large if training recipes are included.

Table 1: Different training recipes could switch the ranking of architectures.

To overcome these challenges, we propose Neural Architecture-Recipe Search (NARS) to address the above limitations. Our insight is three-fold: (1) To support re-use of NAS results for multiple resource constraints, we train an accuracy predictor, then use the predictor to find architecture-recipe pairs for new resource constraints in just CPU minutes. (2) To avoid the pitfalls of architecture-only or recipe-only searches, this predictor scores both training recipes and architectures simultaneously. (3) To avoid prohibitive growth in predictor training time, we pretrain the predictor on proxy datasets to predict architecture statistics (e.g., FLOPs, #Parameters) from architecture representations. After sequentially performing predictor pretraining, constrained iterative optimization, and predictor-based evolutionary search, NARS produces generalizable training recipes and compact models that attain state-of-the-art performance on ImageNet, outperforming all other automatically searched neural networks.

2. Methods

Our goal is to find the most accurate architecture and training recipe combination. To cope with the resulting large search space, we train an accuracy predictor that accepts architecture and training recipe representations (Sec 2.1). To do so, we employ a three-stage pipeline (Algorithm 1): (1) Pretrain the predictor using architecture statistics, significantly improving its accuracy and sample efficiency (Sec 2.2). (2) Train the predictor using constrained iterative optimization (Sec 2.3). (3) For each set of resource constraints, run the predictor-based evolutionary search in just CPU minutes to produce high-accuracy architecture-recipe pairs (Sec 2.4).

2.1. Predictor

Our predictor aims to predict accuracy given representations of an architecture and a training recipe. Both architectures and training recipes are encoded using one-hot categorical variables (e.g., for block types) and min-max normalized continuous values (e.g., for channel counts). See the full search space in Table 2.

Our search space consists of both training recipes and architecture configurations. The search space for training recipes covers the optimizer type, initial learning rate, weight decay, mixup ratio, dropout ratio, stochastic depth drop ratio [20], and whether or not to use model exponential moving average (EMA). The architecture configuration search space is based on the inverted residual block and includes input resolution, kernel size, expansion, number of channels per layer, and depth, as detailed in Table 2. In recipe-only experiments, we tune only the training recipe on a fixed architecture. For joint search, we search over both training recipes and architectures within the search space in Table 2. Overall, the space contains about $10^{17}$ architecture candidates and $10^{7}$ possible training recipes.

Table 2: The network architecture configuration and search space in our experiments. 
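To make this encoding concrete, here is a minimal sketch of how an architecture-recipe pair could be turned into the predictor's input vector. The field names and value ranges below are illustrative toy choices, not the exact Table 2 entries:

```python
import numpy as np

# Toy search space: categorical fields become one-hot vectors,
# continuous fields are min-max normalized to [0, 1].
CATEGORICAL = {
    "optimizer": ["sgd", "rmsprop", "adam"],   # illustrative choices
    "ema": [False, True],
}
CONTINUOUS = {
    "learning_rate": (0.01, 0.4),              # (min, max), illustrative ranges
    "weight_decay": (1e-5, 1e-3),
    "input_resolution": (160, 288),
    "expansion": (1, 6),
}

def encode(config):
    """Encode an architecture-recipe dict into a flat feature vector."""
    feats = []
    for name, choices in CATEGORICAL.items():
        one_hot = np.zeros(len(choices))
        one_hot[choices.index(config[name])] = 1.0
        feats.append(one_hot)
    for name, (lo, hi) in CONTINUOUS.items():
        feats.append(np.array([(config[name] - lo) / (hi - lo)]))
    return np.concatenate(feats)

example = {"optimizer": "rmsprop", "ema": True, "learning_rate": 0.1,
           "weight_decay": 1e-4, "input_resolution": 224, "expansion": 4}
print(encode(example))   # one-hot blocks followed by normalized continuous values
```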

2.2. Stage 1: Predictor pretraining

Figure 3: Pretrain to predict architecture statistics (top). Train to predict accuracy from architecture-recipe pairs (bottom)

Training an accuracy predictor can be computationally expensive. To alleviate this, our insight is to first pretrain on a proxy task. The pretraining step helps the predictor form a good internal representation of the inputs, thereby reducing the number of accuracy-architecture-recipe samples needed. To construct a proxy task for pretraining, we can use a "free" source of labels for architectures: namely, FLOPs and the number of parameters. After this pretraining step, we transfer the pretrained embedding layer to initialize the accuracy predictor (Fig. 3). This leads to significant improvements in the final predictor's sample efficiency and prediction reliability. For example, the pretrained predictor requires about 5× fewer samples than its counterpart without pretraining to reach the same prediction mean square error (MSE).

The predictor architecture is a multi-layer perceptron (Fig. 3) consisting of several fully-connected layers and two heads: (1) An auxiliary “proxy” head, used for pretraining the encoder, predicts architecture statistics (e.g., FLOPs and #Parameters) from architecture representations; and (2) the accuracy head, fine-tuned in constrained iterative optimization (Sec 2.3), predicts accuracy from joint representations of the architecture and training recipe.
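As a rough sketch, such a two-head predictor could look like the PyTorch module below. The layer widths, embedding size, and method names are assumptions chosen for illustration, not values or code from the paper:

```python
import torch
import torch.nn as nn

class AccuracyPredictor(nn.Module):
    """Two-head MLP: an auxiliary proxy head pretrained to regress architecture
    statistics (FLOPs, #Parameters), and an accuracy head that scores
    architecture-recipe pairs. All dimensions are illustrative."""
    def __init__(self, arch_dim, recipe_dim, embed_dim=128):
        super().__init__()
        # Shared encoder (the "embedding layer" transferred after pretraining).
        self.arch_encoder = nn.Sequential(
            nn.Linear(arch_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, embed_dim), nn.ReLU())
        # Proxy head: predicts [FLOPs, #Params] during Stage 1 pretraining.
        self.proxy_head = nn.Linear(embed_dim, 2)
        # Accuracy head: fine-tuned in Stage 2 on measured accuracies.
        self.accuracy_head = nn.Sequential(
            nn.Linear(embed_dim + recipe_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, 1))

    def forward_proxy(self, arch):
        return self.proxy_head(self.arch_encoder(arch))

    def forward_accuracy(self, arch, recipe):
        z = self.arch_encoder(arch)                   # pretrained embedding
        return self.accuracy_head(torch.cat([z, recipe], dim=-1))

# Example shapes: 64-dim architecture encoding, 16-dim recipe encoding.
model = AccuracyPredictor(arch_dim=64, recipe_dim=16)
arch, recipe = torch.rand(8, 64), torch.rand(8, 16)
stats = model.forward_proxy(arch)                     # Stage 1: regress FLOPs/#Params
score = model.forward_accuracy(arch, recipe)          # Stage 2/3: predicted accuracy
```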

2.3. Stage 2: Training Predictor

In this step, we train the predictor and generate a set of high-promise candidates. We formulate the architecture search as a constrained optimization problem:

$\max _{(A, h) \in \Omega} \operatorname{acc}(A, h)$, s.t. $g_{i}(A) \leqslant C_{i}, \quad i=1, \ldots, \gamma \quad (1) $

where $A, h$, and $\Omega$ refer to the neural network architecture, training recipe, and designed search space, respectively. $g_{i}(A)$ and $\gamma$ refer to the formula and count of resource constraints, such as computational cost, storage cost, and run-time latency. 

Constrained iterative optimization: We first use Quasi Monte-Carlo (QMC) [37] sampling to generate a sample pool of architecture-recipe pairs from the search space. Then, we train the predictor iteratively: We (a) shrink the candidate space by selecting a subset of favorable candidates based on predicted accuracy, (b) train and evaluate the candidates using an early-stopping heuristic, and (c) fine-tune the predictor with the Huber loss. This iterative shrinking of the candidate space avoids unnecessary evaluations and improves exploration efficiency.
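The loop below sketches this stage under heavy simplifications: the QMC sample pool is replaced by uniform random encodings, the predictor is a toy MLP, and `train_and_evaluate` is a stub standing in for the early-stopped proxy training runs. Names and hyperparameters are illustrative, not the paper's:

```python
import random
import torch
import torch.nn as nn

def train_and_evaluate(pair):
    # Placeholder: in reality, train the architecture with its recipe on the
    # proxy dataset (200 ImageNet classes) using an early-stopping heuristic.
    return random.random()

def constrained_iterative_optimization(predictor, sample_pool, iters=3, top_k=32):
    optim = torch.optim.Adam(predictor.parameters(), lr=1e-3)
    huber = nn.HuberLoss()
    pool = sample_pool                  # constraint-feasible candidate pairs
    for _ in range(iters):
        # (a) Shrink the candidate space: keep the pairs the predictor scores highest.
        pool = sorted(pool, key=lambda p: predictor(p["x"]).item(), reverse=True)[:top_k]
        # (b) Train and evaluate the selected candidates (stubbed out here).
        labels = [train_and_evaluate(p) for p in pool]
        # (c) Fine-tune the predictor on the measured accuracies with the Huber loss.
        x = torch.stack([p["x"] for p in pool])
        y = torch.tensor(labels).unsqueeze(1)
        optim.zero_grad()
        loss = huber(predictor(x), y)
        loss.backward()
        optim.step()
    return predictor, pool

# Toy usage: a tiny MLP predictor over 16-dim encoded architecture-recipe pairs.
dim = 16
predictor = nn.Sequential(nn.Linear(dim, 32), nn.ReLU(), nn.Linear(32, 1))
sample_pool = [{"x": torch.rand(dim)} for _ in range(200)]
predictor, shortlist = constrained_iterative_optimization(predictor, sample_pool)
```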

2.4. Stage 3: Using Predictor

The third stage of the proposed method is an iterative process based on adaptive genetic algorithms [44]. The best performing architecture-recipe pairs from the second stage are inherited as part of the first generation candidates. In each iteration, we introduce mutations to the candidates and generate a set of children $\mathcal{C} \subset \Omega$ subject to given constraints. We evaluate the score for each child with the pretrained accuracy predictor $u$ and select the top $K$ highest-scoring candidates for the next generation. We compute the gain of the highest score after each iteration and terminate the loop when the improvement saturates. 

* [44] Adaptive probabilities of crossover and mutation in genetic algorithms. IEEE Trans. Systems, Man, and Cybernetics, 1994
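A minimal sketch of this predictor-based evolutionary search is shown below. For brevity it uses a fixed mutation rate instead of the adaptive crossover/mutation probabilities of [44], candidates are plain encoded vectors, and the `flops` cost model is a placeholder; all names are illustrative:

```python
import torch
import torch.nn as nn

def mutate(x, sigma=0.05):
    # Perturb the encoded candidate; real mutations edit discrete/continuous fields.
    return torch.clamp(x + sigma * torch.randn_like(x), 0.0, 1.0)

def flops(x):
    # Placeholder cost model so the constraint check runs; in practice FLOPs are
    # computed analytically from the decoded architecture.
    return float(x.mean()) * 1e9

def evolutionary_search(predictor, parents, flops_limit,
                        top_k=16, tol=1e-4, max_iters=50):
    best = -float("inf")
    for _ in range(max_iters):
        # Mutate parents into children and drop constraint-violating candidates.
        children = [mutate(p) for p in parents for _ in range(4)]
        feasible = [c for c in parents + children if flops(c) <= flops_limit]
        # Score every candidate with the cheap accuracy predictor, keep the top K.
        scored = sorted(feasible, key=lambda c: predictor(c).item(), reverse=True)
        parents = scored[:top_k]
        top = predictor(parents[0]).item()
        if top - best < tol:            # stop once the best score saturates
            break
        best = top
    return parents[0]

# Toy usage with a random predictor over 16-dim encodings and a 550M FLOPs limit.
dim = 16
predictor = nn.Sequential(nn.Linear(dim, 32), nn.ReLU(), nn.Linear(32, 1))
parents = [torch.rand(dim) for _ in range(16)]
best_candidate = evolutionary_search(predictor, parents, flops_limit=550e6)
```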

3. Experiments

In this section, we first validate our search method in a narrowed search space to discover the training recipe for a given network. Then, we evaluate our search method for joint search over architecture and training recipes. In the search process, we randomly sample 200 classes from the entire dataset to reduce the training time. Then, we randomly withhold 10K images from the 200-class training set as the validation set.

3.1. Recipe-only Search

To establish that even a modern NAS-produced architecture's performance can be further improved with a better training recipe, we optimize over training recipes for a fixed architecture. We then extend the NARS-searched training recipe to other commonly-used neural networks to further validate its generality. Although the NARS-searched training recipe was tailored to FBNetV2-L3, it generalizes surprisingly well to other popular networks. Notably, it is possible to achieve even better performance by searching for a specific training recipe for each neural network, which would increase the search cost.

3.2. Neural Architecture-Recipe Search (NARS)

Next, we perform a joint search over architectures and training recipes to discover compact neural networks. We pretrain the architecture embedding layer using 80% of the sample pool, which contains 20K samples, and plot the validation results on the remaining 20% in Fig. 4. In the predictor-based evolutionary search, we set four different FLOPs constraints (450M, 550M, 650M, and 750M) and discover four models (namely FBNetV3-B/C/D/E) with the same accuracy predictor. We further scale the smallest and largest models down and up, respectively, using the compound scaling proposed in [46] to generate FBNetV3-A and FBNetV3-F/G for additional use scenarios.

We compare our searched models against other relevant NAS baselines and hand-crafted compact neural networks in Fig. 1 and list the detailed performance metrics in Table 4, where models are grouped by top-1 accuracy. In the low-computation regime, FBNetV3-A achieves 79.1% top-1 accuracy with only 357M FLOPs (2.5% higher accuracy than MobileNetV3-1.25x with similar FLOPs). In the high-accuracy regime, FBNetV3-E achieves 0.2% higher top-1 accuracy with over 7× fewer FLOPs compared to ResNeSt-50. Note that we have further improved the accuracy of FBNetV3 by using larger teacher models for distillation, as shown in Appendix A.7.