Download the full version of these notes, with more details and images, here.

EfficientNet V1 (2019)

Idea & Background

The paper[0] observed that modern CNNs were developed with more layers (deeper), more channels (wider), and higher-resolution input images. However, monotonically scaling up any single one of these dimensions to a large value does not benefit the model much; at some point the accuracy saturates.

The paper evaluated CNN models while monotonically scaling up width \(w\), depth \(d\), and resolution \(r\) individually: the improvement is significant only until the scaling reaches a certain limit.

The paper concludes that:

  • Observation 1 – Scaling up any dimension of network width, depth, or resolution improves accuracy, but the accuracy gain diminishes for bigger models.

Intuitively, a higher input resolution needs a wider network to capture the more fine-grained patterns contained in the additional pixels, as well as a deeper network so that the larger receptive fields can cover similar features that now span more pixels.

These results lead to the second observation:

  • Observation 2 – In order to pursue better accuracy and efficiency, it is critical to balance all dimensions of network width, depth, and resolution during ConvNet scaling.

Since existing methods adjust width, depth, and resolution manually, the paper proposes a new scaling method that uniformly scales all three dimensions using a simple yet highly effective compound coefficient.

Compound Coefficient Scaling

Define a compound coefficient \(\phi\) that uniformly scales network width, depth, and resolution in a principled way:

depth: \(d = \alpha^\phi\)

width: \(w = \beta^\phi\)

resolution: \(r = \gamma^\phi\)

s.t. \(\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2\)

\[\alpha \ge 1, \beta \ge 1, \gamma \ge 1\]

Here, \(\alpha, \beta, \gamma\) are constants determined by a small grid search, and \(\phi\) is a user-specified coefficient that controls how many more resources are available for model scaling.

The paper points out that the FLOPs of a regular convolution operation are proportional to \(d\), \(w^2\), and \(r^2\); in other words, doubling the network depth doubles the FLOPs, but doubling the width or resolution increases the FLOPs by 4x. Because convolutional layers usually dominate the computation cost in ConvNets, scaling a ConvNet with the equations above increases the total FLOPs by approximately \((\alpha \cdot \beta^2 \cdot \gamma^2)^\phi\). The method constrains \(\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2\) so that, for any \(\phi\), the total FLOPs increase by approximately \(2^\phi\).
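For concreteness, the sketch below (plain Python; the helper name compound_scaling is ours, not from the paper) computes the three multipliers and the approximate FLOPs ratio for a given \(\phi\). The default \(\alpha, \beta, \gamma\) are the EfficientNet-B0 values found by the grid search described in the next section.

```python
# A minimal sketch of the compound scaling rule (not the official implementation).
# Given the constants alpha, beta, gamma and a user-chosen phi, it returns the
# depth/width/resolution multipliers and the approximate FLOPs ratio.

def compound_scaling(phi, alpha=1.2, beta=1.1, gamma=1.15):
    """Return (depth_mult, width_mult, resolution_mult, flops_ratio) for a given phi."""
    depth_mult = alpha ** phi         # d = alpha^phi
    width_mult = beta ** phi          # w = beta^phi
    resolution_mult = gamma ** phi    # r = gamma^phi
    # FLOPs scale roughly with d * w^2 * r^2 = (alpha * beta^2 * gamma^2)^phi ~ 2^phi
    flops_ratio = (alpha * beta ** 2 * gamma ** 2) ** phi
    return depth_mult, width_mult, resolution_mult, flops_ratio

if __name__ == "__main__":
    for phi in range(1, 4):
        d, w, r, f = compound_scaling(phi)
        print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}, FLOPs ~x{f:.2f}")
```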

EfficientNet Architecture

The EfficientNet-B0 baseline is obtained in the same way as MnasNet (by platform-aware neural architecture search [2]), except that EfficientNet-B0 is slightly bigger because its FLOPs target is set to 400M.

Its main building block is MBConv, the mobile inverted bottleneck convolution, with a squeeze-and-excitation (SE) block added to each MBConv block as an optimization.
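The following is a rough PyTorch sketch of an MBConv block with SE, purely for illustration: the class name is ours, and details of the official implementation such as drop connect and the exact BatchNorm/padding settings are simplified away.

```python
# A rough PyTorch sketch of MBConv with squeeze-and-excitation (illustration only;
# drop connect and the official BatchNorm/padding details are omitted).
import torch.nn as nn

class MBConv(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1, expand_ratio=6, se_ratio=0.25):
        super().__init__()
        mid_ch = in_ch * expand_ratio
        self.use_residual = stride == 1 and in_ch == out_ch
        # 1x1 expansion convolution (skipped when expand_ratio == 1)
        self.expand = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.SiLU(),
        ) if expand_ratio != 1 else nn.Identity()
        # depthwise convolution
        self.dwconv = nn.Sequential(
            nn.Conv2d(mid_ch, mid_ch, kernel_size, stride, kernel_size // 2,
                      groups=mid_ch, bias=False),
            nn.BatchNorm2d(mid_ch), nn.SiLU(),
        )
        # squeeze-and-excitation: global pool -> bottleneck -> channel-wise gates
        se_ch = max(1, int(in_ch * se_ratio))
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(mid_ch, se_ch, 1), nn.SiLU(),
            nn.Conv2d(se_ch, mid_ch, 1), nn.Sigmoid(),
        )
        # 1x1 projection back to out_ch (no activation)
        self.project = nn.Sequential(
            nn.Conv2d(mid_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        h = self.expand(x)
        h = self.dwconv(h)
        h = h * self.se(h)  # reweight channels
        h = self.project(h)
        return x + h if self.use_residual else h
```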

Starting with EfficientNet-B0, the compound scaling method is applied in two steps:

  • STEP 1: first fix \(\phi = 1\), assuming twice as many resources are available, and do a small grid search over \(\alpha, \beta, \gamma\) based on the equations above. In particular, the paper found the best values for EfficientNet-B0 to be \(\alpha = 1.2\), \(\beta = 1.1\), \(\gamma = 1.15\), under the constraint \(\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2\).
  • STEP 2: fix \(\alpha, \beta, \gamma\) as constants and scale up the baseline network with different values of \(\phi\) using the equations above, to obtain EfficientNet-B1 to B7 (see the worked example below).
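Plugging the searched constants into the formulas gives, for example at \(\phi = 2\) (a worked illustration of the formulas, not necessarily the exact released configuration, since the published models round these values):

\[
d = 1.2^2 = 1.44, \qquad w = 1.1^2 = 1.21, \qquad r = 1.15^2 \approx 1.32,
\]

\[
\text{FLOPs ratio} \approx (1.2 \cdot 1.1^2 \cdot 1.15^2)^2 \approx 2^2 = 4.
\]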

EfficientNet V2 (2021)

Background & Idea

EfficientNet V2 improves both training speed and accuracy over EfficientNet V1. This upgraded version optimizes not only accuracy and parameter/FLOPs efficiency, but also training efficiency.

The paper[1] identifies several problems with the previous EfficientNet V1:

  • Training with very large image sizes is slow. (Proposed Solution: Progressive Learning)
  • Depthwise Convolutions are slow in early layers but effective in later layers, since the depthwise convolutions often cannot fully utilize modern accelerators. (Proposed Solution: Replacing MBConv layers by Fused-MBConv via NAS)
  • Equally scaling up every stage is sub-optimal. (Proposed Solution: New Scaling Rule and Restriction)

Fused-MBConv (which better utilizes mobile or server accelerators) replaces the 1x1 expansion layer and the depthwise convolution layer of MBConv with a single regular 3x3 convolution. The paper found that using Fused-MBConv in the early stages (1-3) accelerates training with only a small overhead in parameters and FLOPs, and NAS is used to automatically search for the best combination of the two block types.
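For illustration, here is a minimal PyTorch sketch of Fused-MBConv under the same simplifications as the MBConv sketch above (SE and drop connect omitted; the class name is ours):

```python
# A minimal PyTorch sketch of Fused-MBConv (illustration only): the 1x1 expansion
# and depthwise 3x3 of MBConv are fused into a single regular 3x3 convolution,
# followed by a 1x1 projection when an expansion is used.
import torch.nn as nn

class FusedMBConv(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1, expand_ratio=4):
        super().__init__()
        mid_ch = in_ch * expand_ratio
        self.use_residual = stride == 1 and in_ch == out_ch
        if expand_ratio == 1:
            # no expansion: a single fused 3x3 convolution does all the work
            self.block = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, stride, 1, bias=False),
                nn.BatchNorm2d(out_ch), nn.SiLU(),
            )
        else:
            self.block = nn.Sequential(
                nn.Conv2d(in_ch, mid_ch, 3, stride, 1, bias=False),   # fused 3x3 expansion
                nn.BatchNorm2d(mid_ch), nn.SiLU(),
                nn.Conv2d(mid_ch, out_ch, 1, bias=False),             # 1x1 projection
                nn.BatchNorm2d(out_ch),
            )

    def forward(self, x):
        h = self.block(x)
        return x + h if self.use_residual else h
```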

Training-Aware NAS and Scaling

The Training-Aware NAS is based on the platform-aware NAS of [2]; its search space is also a stage-based factorized space, with the following search options (summarized in the snippet after the list):

Convolution Ops : {MBConv, Fused-MBConv}

Kernel Size: {3x3, 5x5}

Expansion Ratio: {1, 4, 6}
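As a rough illustration, the per-stage search options above can be written as a small dictionary (a hypothetical structure, not the actual NAS code):

```python
# Hypothetical summary of the per-stage search options listed above
# (not the actual NAS implementation).
SEARCH_SPACE = {
    "conv_op": ["MBConv", "FusedMBConv"],
    "kernel_size": [3, 5],
    "expansion_ratio": [1, 4, 6],
}
```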

However, in Training-Aware NAS the paper removes unnecessary search options such as skip ops, and reuses the same channel sizes from the backbone since they have already been searched. In addition, the search reward combines the model accuracy \(A\), the normalized training step time \(S\), and the parameter size \(P\) using a simple weighted product \(A \cdot S^w \cdot P^v\), where \(w = -0.07\) and \(v = -0.05\) are empirically determined to balance the trade-offs, similar to [2].
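In code, this reward is just a weighted product; the sketch below uses hypothetical normalized values purely to show the trade-off:

```python
# A small sketch of the training-aware search reward described above:
# accuracy A, normalized training step time S, and parameter size P
# combined as A * S^w * P^v with w = -0.07 and v = -0.05.
def search_reward(accuracy, step_time, param_size, w=-0.07, v=-0.05):
    """Weighted-product reward; step_time and param_size are normalized values."""
    return accuracy * (step_time ** w) * (param_size ** v)

# Example: a candidate with 80% accuracy, normalized step time 1.2, and
# normalized parameter size 0.9 (hypothetical numbers for illustration).
print(search_reward(0.80, 1.2, 0.9))
```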

As for scaling, EfficientNetV2-S is scaled up to EfficientNetV2-M/L using compound scaling with a few optimizations: (1) restrict the maximum inference image size to 480; (2) gradually add more layers to later stages to increase network capacity without adding much runtime overhead.

Progressive Learning

The main idea of progressive learning is to increase the image size and the regularization magnitude together during training. The paper argues that the accuracy drop seen when only the input image size is progressively enlarged is caused by unbalanced regularization: smaller images call for weaker regularization, while larger images call for stronger regularization.
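A minimal sketch of such a schedule is shown below, assuming simple linear interpolation across training stages (the size and dropout ranges are made-up placeholders; the paper adapts image size together with regularizers such as dropout, RandAugment, and mixup):

```python
# A minimal sketch of a progressive-learning schedule (simplified, assuming
# linear interpolation): image size and regularization strength grow together
# from the first training stage to the last.
def progressive_schedule(stage, num_stages, size_range=(128, 300), dropout_range=(0.1, 0.3)):
    """Return (image_size, dropout_rate) for training stage `stage` in [0, num_stages - 1]."""
    t = stage / max(1, num_stages - 1)  # 0.0 at the first stage, 1.0 at the last
    image_size = int(size_range[0] + t * (size_range[1] - size_range[0]))
    dropout = dropout_range[0] + t * (dropout_range[1] - dropout_range[0])
    return image_size, dropout

for s in range(4):
    print(progressive_schedule(s, 4))  # small images + weak regularization first
```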

References

[0] Tan, M., & Le, Q. (2019). EfficientNet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning (pp. 6105-6114). PMLR.

[1] Tan, M., & Le, Q. (2021). EfficientNetV2: Smaller models and faster training. In International Conference on Machine Learning (pp. 10096-10106). PMLR.

[2] Tan, M., Chen, B., Pang, R., Vasudevan, V., Sandler, M., Howard, A., & Le, Q. V. (2019). MnasNet: Platform-aware neural architecture search for mobile. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 2820-2828).

[3] Brock, A., De, S., Smith, S. L., & Simonyan, K. (2021). High-performance large-scale image recognition without normalization. In International Conference on Machine Learning (pp. 1059-1071). PMLR.