Theory-Inspired Network Architecture Search

By: Pan Zhou

TL;DR: We theoretically analyze the differential architecture search (DARTS) for understanding the role and impact of skip connections, which inspires a new method for Neural Architecture Search (NAS) using group-structured sparse gates and path-depth-wise regularization to overcome the limitation of existing NAS methods for AutoML.


In our work [1], we investigate the problem of Network architecture search (NAS), an important family of techniques for Automated Machine Learning (AutoML). We theoretically analyze the performance of one popular NAS method, namely differential architecture search (DARTS) [2], which inspires us to design a more effective NAS method. For the first time, our work theoretically proves that skip (identity) connections can help networks converge faster and have superior competitive advantages over other types of operations. This well explains why DARTS often selects network architectures with dominated skip connections which often leads to unsatisfactory performance. Based on our theory, we propose a new path-regularized DARTS with two novel components, group-structured sparse gate and path-depth-wise regularization, which both alleviate the unfair competition among operations and thus help finding better network architectures. Our work advances NAS in both theory and practical algorithm design to advance the frontiers of automated machine learning.


1. What is network architecture search (NAS)?

Network architecture search (NAS) is an important branch of machine learning techniques in AutoML. As shown in Fig. 1, given a network graph in the left figure, NAS aims to address the problem of automatically searching the best network, that is, how to automatically select proper operations from a predefined operation set (e.g., 3x3/5x5 convolution/pooling, skip connection, zero operation, etc) between any two data nodes in the graph. After the NAS search, one can expect to obtain the well-defined network in the right figure of Fig. 1 without human experts.

image.png


2. Why network architecture search (NAS) is important?
Network architecture search (NAS) [3] is an effective approach for automating network architecture design, with many successful applications witnessed to image recognition and language modelling. Unlike expert-designed architectures which require substantial efforts from experts by trial and error, NAS can automatically design the network architectures and thus greatly alleviates the design efforts of experts. Moreover, NAS can also alleviate the design bias brought by experts which could prohibit achieving better performance.  This is demonstrated by the fact that the networks designed by NAS outperform these designed by human experts and achieves state-of-the-art performance in many applications, e.g. image classification.

3. What is the issue in the differential architecture search (DARTS)?
DARTS [2] is a recently developed leading NAS approach. Previous reinforcement learning (RL) and evolutionary algorithm (EA) based NAS methods reformulate the network search problem as a discrete optimization problem, since one needs to select proper operations from a predefined operation sets to connect any two data nodes in the network. However, such discrete optimization has extremely search space (total 1e+16 operation choices for the network of 7 nodes), and suffers from very high computational cost (thousands of GPU days). To solve this issue, DARTS relaxes the discrete network architecture search process into finding continuous architecture parameters. That is,  DARTS defines weights for each operations between any two nodes, and optimizes these operation weights to decide the final network architecture.  In this way, it can optimize the architecture parameters via gradient descent and thus greatly reduces the high search cost in RL and EA searching approaches. However, this differential NAS family, including DARTS and its variants, typically selects many skip connections (identity mapping, one kinds of operations for connecting two data nodes) which dominate over other types of operations [4,5,6]. Consequently, the searched networks are observed to have unsatisfactory performance.  

4. What is the intrinsic theoretical reason for the dominated skip connection in DARTS?
Though the dominance of skip connections in DARTS is observed in many works [4,5,6], theoretical understandings on this issue remain absent, hindering the development of more advanced methods in a principled way. In this work, we solve this problem by theoretically analyzing the effects of various types of operations, e.g. convolution, skip connection and zero operation, to the network optimization.  By imposing proper conditions on the activation functions, network parameter initializations, and input data, we prove that the architectures with more skip connections can converge faster than the other candidates, and thus are selected by DARTS. This result, for the first time, theoretically and explicitly reveals the impact of skip connections to fast network optimization and its competitive advantage over other types of operations in DARTS.

5. What method can resolve the issue of the dominated skip connection in DARTS?
Inspired by our theory, we propose a path-regularized DARTS that consists of two key modules: (i) a differential group-structured sparse binary gate introduced for each operation to avoid unfair competition among operations, and (ii) a path-depth-wise regularization used to incite search exploration for deep architectures that often converge slower than shallow ones and are not well explored during search.

(i) Differential group-structured sparse binary gate
By analysis, we find that the reason for the dominated skip connection is that skip connection and other types of operations share one softmax distribution (their sum is one). When the weights of skip connections become larger, the weights of other operations become smaller.  To resolve this issue, we introduce independent binary gate implemented by Bernoulli distribution for each operation between two nodes to avoid the direct competition between skip connection and other operations. If the binary gate is one for the operation, then the operation will be used to connect two data nodes. Otherwise, the operation will not be used. However, we prove that independent distribution for each operation gives dense architectures which will lead to information loss when pruning the network to a small network after the search.  

To solve this issue, we propose a group-structured sparse binary gate for each operation. Specifically, we first define one binary Bernoulli gate variable for each operation. Then we pass the gate variable into a hard thresholding function such that when the gate variable is smaller than a threshold, it becomes zero. In this way, the gate variables in the network become sparse. Moreover, we can compute the probability of activated gates of the whole network. So we collect all skip connections in the cell as a skip-connection group and take the remaining operations into non-skip-connection group. Then we compute the average activation probability of these two groups and respectively penalize them to obtain sparse selected network. As a result, as shown in Fig. 1 (a) and (b), penalizing skip connections heavier than other types of operations can rectify the competitive advantage of skip connections over other operations and avoids skip-connection-dominated cell.

Snip20200706_7.png

Moreover, the above group-structured sparse binary gates can gradually and automatically prune redundancy and unnecessary connections, which reduces the information loss of pruning at the end of search. Finally,  sparse binary gates defined on the whole cell can encourage global competition and cooperation of all operations in the cell, which differs from DARTS that only introduces local competition among the operations between two nodes.  

(ii) Path-depth-wise Regularizer on Operation Gates
Except for the above advantages, the above independent sparse gates also introduce one issue: they prohibit the method to select deep networks. Without dominated skip connections in the network, other types of operations, e.g. zero operation, become freer and are widely used. Accordingly, the the search algorithm can easily transform a deep network to a shallow network. For example, it can use  skip connections to connect the intermediate nodes to input nodes and cut off the connection between intermediate neighbouring nodes. Meanwhile, we prove that gradient descent algorithm prefers shallow networks than deep ones, as shallow networks often have more smooth function landscape and can be faster optimized. So these two factors together lead to a bias of search algorithm to shallow cells.  

To resolve this network-selection bias, we propose a path-depth-wise regularization to rectify the unfair competition between shallow and deep networks. We first compute the probability that any two neighbouring nodes are connected with each other by parameterized operations, namely various types of convolutions. Then to rectify the competitive advantage of shallow networks over deep ones, we impose path-depth-wised regularization on the stochastic gates to encourage more exploration to deep networks and then decide the depth of networks instead of greedily choosing shallow cell at the beginning of search. As a result, as shown in Fig. 1 (a) and (c), our path-depth-wise regularization can help find deeper networks which usually have better data fitting ability than shallow networks. Finally, by combining group-structured sparse gates and the path-depth-wise regularization, we can obtain our final model which is shown to be superior than DARTS in Fig. 1 (d) and (e).


6. What results the developed method has achieved?
To evaluate the performance of our method, following conventional setting we train our model on the search the CIFAR10 dataset to find network architecture, and then test the selected network on the CIFAR10, CIFAR 100 and ImageNet datasets. The classification results in Table 1 show that our method achieves the state-of-the-art results on both CIFAR10 and CIFAR100.

Snip20200706_8.png


The results on ImageNet in Table 2 also demonstrate the superiority of our method. All these results show the superiority, robustness and transferability of the selected network architecture by our method.

Snip20200706_9.png


7. Takeaway Remarks and What is Next?

This work first provides theoretical understanding of existing DARTS methods, which is rarely investigated though heavily desired. Then we develop an effective network architecture search approach which well resolves the issue in DARTS.  Our work pushes the frontiers of both theoretical performance analysis and practical algorithm design of NAS, which not only helps us better theoretically understand current NAS approaches and also brings new practical design insights for NAS. We will release our code and models for anyone who is interested in NAS and AutoML research and applications.

If you are interested in learning more, please check out our paper and feel free to contact us.

Resources:
Paper: Theory-Inspired Path-Regularized Differential Network Architecture Search
Code: We will release the codes and models soon.

When referencing this work, please cite:
@inproceedings{zhou2020nas,
 title={Theory-Inspired Path-Regularized Differential Network Architecture Search},
 author={Pan Zhou, Caiming Xiong, Richard Socher, and Steven Hoi},
 booktitle={Neural Information Processing Systems},
 year={2020}
}

References
[1] P. Zhou, C. Xiong, R. Socher, S. Hoi, Theory-Inspired Path-Regularized Differential Network Architecture Search, Neural Information Processing Systems (NeurIPS), 2020 (oral).

[2] H. Liu, K. Simonyan, and Y. Yang. DARTS: Differentiable architecture search. In Int’l Conf. Learning Representations, 2018.

[3] B. Zoph and Q. Le. Neural architecture search with reinforcement learning. In Int’l Conf. Learning Representations, 2017.

[4] X. Chen, L. Xie, J. Wu, and Q. Tian. Progressive differentiable architecture search: Bridging the depth gap between search and evaluation. In IEEE International Conference on Computer Vision, pages 1294–1303, 2019.

[5] X. Chu, T. Zhou, B. Zhang, and J. Li. Fair DARTS: Eliminating unfair advantages in differentiable architecture search. arXiv preprint arXiv:1911.12126, 2019.

[6] T. Arber Zela, T. Saikia, Y. Marrakchi, T. Brox, and F. Hutter. Understanding and robustifying differentiable architecture search. In Int’l Conf. Learning Representations, 2020.