
A visual deep-dive into the building blocks of MobileNetV3


Introduction

The purpose of this post is to isolate and understand the main layers constituting the MobileNetV3 (MNV3) architecture, released by Google in November 2019. The more I study Deep Learning models, the more I find the saying “a picture is worth a thousand words” to be profoundly true: reading a textual description of a layer’s structure is orders of magnitude less effective than seeing its inner workings laid out in a diagram. Therefore, as I already did for the TabNet article, I invested in detailed and (hopefully) clear visuals to shed light on the MNV3 architecture. Let’s get started.

As the paper’s introduction puts it: “Efficient neural networks are becoming ubiquitous in mobile applications enabling entirely new on-device experiences. They are also a key enabler of personal privacy allowing a user to gain the benefits of neural networks without needing to send their data to the server to be evaluated. Advances in neural network efficiency not only improve user experience via higher accuracy and lower latency, but also help preserve battery life through reduced power consumption. This paper describes the approach we took to develop MobileNetV3 Large and Small models in order to deliver the next generation of high accuracy efficient neural network models to power on-device computer vision.”

Figure 1: (directly from the paper) ImageNet Top-1 accuracy (y-axis) vs. number of multiply-add operations (x-axis) vs. model size as number of parameters (bubble size).

First, I’d like to spend a couple of words on what this post is NOT. This write-up does not aim to explain WHY MNV3 is more efficient than its siblings, nor HOW the researchers leveraged Neural Architecture Search (NAS) to arrive at its specific scheme. Those are obviously very interesting topics, which I might address in future articles.

To put things in context, it is important to keep in mind that most of what comes next is not a novelty introduced in the MNV3 paper. The swish nonlinearity, squeeze-and-excitation layers, depthwise and pointwise convolutions were already there. As the authors write: “For MobileNetV3, we use a combination of these layers as building blocks in order to build the most effective models.”

Researchers didn’t reinvent the wheel and instead focused primarily on tweaking and optimizing what already existed. Their major contributions consist of:

  • the introduction of the h-swish nonlinearity (Figure 3);
  • a slightly novel NAS approach;
  • a redesign of the computationally expensive layers at the beginning and the end of the network (Figure 2);
  • tuning the inner workings of specific layers, e.g. moving the Squeeze-And-Excite layer inside the Residual connection (Figure 6) as opposed to outside of it.
Figure 2: (directly from the paper) “Comparison of original last stage and efficient last stage. This more efficient last stage is able to drop three expensive layers at the end of the network at no loss of accuracy.”

The following sections illustrate the components needed to build up the main block of MobileNetV3:

  • We start by defining the hard(h)-swish nonlinearity, a computationally cheaper alternative to swish.
  • We then move to the Squeeze-And-Excite layer. You can check this wonderful post for an in-depth analysis of the operation.
  • We cover Depthwise and Pointwise Convolutions and how they differ from the standard (more expensive) Conv2D. Take a look at this video for an analysis of the computational cost of those layers.
  • We eventually put everything together in the main building block of the MNV3 architecture, and show how stacking those blocks leads to the Large version of the model.

As already stated, I will mostly let the diagrams speak for themselves and just add a few supporting words where needed.

The h-swish nonlinearity

From the paper again: “In […] a nonlinearity called swish was introduced that when used as a drop-in replacement for ReLU, that significantly improves the accuracy of neural networks. The nonlinearity is defined as swish(x) = x \cdot \sigma(x). While this nonlinearity improves accuracy, it comes with non-zero cost in embedded environments as the sigmoid function is much more expensive to compute on mobile devices. […] We replace sigmoid function with its piece-wise linear hard analog: \frac{\mathrm{ReLU6}(x+3)}{6}. The minor difference is we use ReLU6 rather than a custom clipping constant. Similarly, the hard version of swish becomes: h\text{-}swish(x) = x \cdot \frac{\mathrm{ReLU6}(x+3)}{6}. A similar version of hard-swish was also recently proposed in […]. Our choice of constants was motivated by simplicity and being a good match to the original smooth version. In our experiments, we found hard-version of all these functions to have no discernible difference in accuracy, but multiple advantages from a deployment perspective. First, optimized implementations of ReLU6 are available on virtually all software and hardware frameworks. Second, in quantized mode, it eliminates potential numerical precision loss caused by different implementations of the approximate sigmoid. Finally, in practice, h-swish can be implemented as a piece-wise function to reduce the number of memory accesses driving the latency cost down substantially.”

Figure 3
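
To make the formula concrete, below is a minimal h-swish sketch in PyTorch. It is my own illustration rather than the authors’ reference code; recent PyTorch versions also expose the same operation directly as nn.Hardswish.

```python
import torch
import torch.nn.functional as F

def h_swish(x: torch.Tensor) -> torch.Tensor:
    # h-swish(x) = x * ReLU6(x + 3) / 6
    return x * F.relu6(x + 3.0) / 6.0

x = torch.linspace(-6.0, 6.0, steps=7)
print(h_swish(x))              # piece-wise linear approximation of swish
print(x * torch.sigmoid(x))    # the original (smooth) swish, for comparison
```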

Squeeze-And-Excite (SE) layer

This post from Aman Arora is a very good starting point for a deep dive into the SE layer. In a nutshell, it is generally used to process the output of a Convolutional block (such as the one below, of shape 1 \times 16 \times 56 \times 56) by explicitly modeling channel interdependencies. This is achieved by squeezing the Conv output to one number per filter (AvgPooling), passing the result through a series of Linear layers plus nonlinearities, and multiplying what comes out of it with the original input (excitation).

Figure 4: The visualized tensor shapes are not made up: they actually come from one of the layers of the MNV3-Small architecture and are hard-coded to make the slide more concrete and easier to grasp.
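
For the more code-minded, here is a rough SE sketch matching the 1 \times 16 \times 56 \times 56 input of Figure 4. The reduction factor of 4 and the hard-sigmoid gate are my assumptions of how MNV3 wires the layer, not the paper’s reference code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SqueezeExcite(nn.Module):
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)
        self.fc2 = nn.Linear(channels // reduction, channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, _, _ = x.shape
        s = x.mean(dim=(2, 3))                 # squeeze: global AvgPool -> (N, C)
        s = F.relu(self.fc1(s))                # bottleneck Linear + nonlinearity
        s = F.relu6(self.fc2(s) + 3.0) / 6.0   # hard-sigmoid gate in [0, 1]
        return x * s.view(n, c, 1, 1)          # excite: rescale each channel

x = torch.randn(1, 16, 56, 56)                 # shape from Figure 4
print(SqueezeExcite(16)(x).shape)              # torch.Size([1, 16, 56, 56])
```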

Depthwise + Pointwise Convolutions

Depthwise followed by Pointwise 1 \times 1 Convolutions are a less expensive alternative to the standard Conv2D (they use fewer parameters and operations). The principle is illustrated in Figure 5, and this video walks through it in greater detail. This post, instead, provides a nice overview of the most commonly used Convolutional layers in Deep Learning.

Figure 5: The visualized tensor shapes are not made up: they actually come from one of the layers of the MNV3-Small architecture and are hard-coded to make the slide more concrete and easier to grasp.
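
To get a feel for the parameter savings, here is a small comparison sketch. The 16-channel, 56 \times 56 input is an arbitrary example of mine, not the exact tensor of Figure 5:

```python
import torch
import torch.nn as nn

standard = nn.Conv2d(16, 16, kernel_size=3, padding=1)       # plain Conv2D
separable = nn.Sequential(
    nn.Conv2d(16, 16, kernel_size=3, padding=1, groups=16),  # depthwise: one filter per channel
    nn.Conv2d(16, 16, kernel_size=1),                        # pointwise: 1x1 conv mixes channels
)

x = torch.randn(1, 16, 56, 56)
print(standard(x).shape, separable(x).shape)    # both torch.Size([1, 16, 56, 56])

n_params = lambda m: sum(p.numel() for p in m.parameters())
print(n_params(standard), n_params(separable))  # 2320 vs 432 parameters
```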

MobileNetV3 main block

We are now able to better understand the MNV3 main building block (a minimal code sketch follows Figure 6). As shown in Figure 6 with sample tensors, the idea is to:

  • expand the input feature space with a 1 \times 1 Convolution (from 40 \times 14 \times 14 to 240 \times 14 \times 14 below);
  • pass the result through a Depthwise Conv with variable kernel size (3 \times 3 or 5 \times 5);
  • optionally pass it through a SE layer (otherwise through an Identity);
  • apply a Pointwise Conv to shrink the feature space back (from 240 \times 14 \times 14 to 40 \times 14 \times 14 below);
  • when input and output tensors have equal shape, i.e. when the Expansion and Pointwise steps map to and from the same feature space, add a ResNet-style skip connection to the mix.
Figure 6: The visualized tensor shapes are not made up: they actually come from one of the layers of the MNV3-Small architecture and are hard-coded to make the slide more concrete and easier to grasp.
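
Here is the promised code sketch of the block, wired for the 40 → 240 → 40 example of Figure 6. It is a simplification of my own (batch normalization, strides and the per-stage choice of nonlinearity are left out), not the reference implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MNV3Block(nn.Module):
    """Simplified MNV3 block: 1x1 expand -> depthwise -> (SE) -> 1x1 project."""
    def __init__(self, channels, expanded, kernel_size=3, use_se=True):
        super().__init__()
        self.expand = nn.Conv2d(channels, expanded, kernel_size=1)
        self.depthwise = nn.Conv2d(expanded, expanded, kernel_size,
                                   padding=kernel_size // 2, groups=expanded)
        # Squeeze-And-Excite gate (AvgPool -> 1x1 convs -> hard-sigmoid), or nothing
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(expanded, expanded // 4, 1), nn.ReLU(),
            nn.Conv2d(expanded // 4, expanded, 1), nn.Hardsigmoid(),
        ) if use_se else None
        self.project = nn.Conv2d(expanded, channels, kernel_size=1)

    def forward(self, x):
        out = F.hardswish(self.expand(x))        # expansion + nonlinearity
        out = F.hardswish(self.depthwise(out))   # depthwise conv + nonlinearity
        if self.se is not None:
            out = out * self.se(out)             # rescale channels with the SE gate
        out = self.project(out)                  # shrink back to the input width
        return x + out                           # skip: input/output shapes match here

x = torch.randn(1, 40, 14, 14)                   # shapes from Figure 6
print(MNV3Block(40, 240)(x).shape)               # torch.Size([1, 40, 14, 14])
```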

Putting everything together

MobileNetV3 (Large version shown below in Figure 7) is obtained by stacking together:

  1. a 3 \times 3 Conv2D;
  2. multiple “main blocks”, as explained in the previous section, using either ReLU6 or h-swish, 3 \times 3 or 5 \times 5 kernels in the Depthwise step, and optional SE layers;
  3. a classification head made of 1 \times 1 Convolutions, mapping to the desired final number of classes.
Figure 7: (directly from the paper – original caption edited) Specification for MobileNetV3-Large. SE denotes whether there is a Squeeze-And-Excite in that block. NL denotes the type of nonlinearity used. Here, HS denotes h-swish and RE denotes ReLU. NBN denotes no batch normalization. s denotes stride. exp size denotes the #convolutional 1×1 filters used to expand the feature space from the layer’s input (e.g. from 40 to 240 in Figure 6).
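
If you want to poke at a concrete implementation, torchvision ships a MobileNetV3-Large whose features attribute stacks the stem conv, the main blocks and the final 1 \times 1 conv much like Figure 7. A quick sketch, assuming a reasonably recent torchvision is installed:

```python
import torch
from torchvision.models import mobilenet_v3_large

model = mobilenet_v3_large()                       # randomly initialized weights
print(len(model.features))                         # stem + main blocks + last conv
print(model(torch.randn(1, 3, 224, 224)).shape)    # torch.Size([1, 1000])
```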

That’s it! Hope it was useful to somebody out there. Happy learning and see you in the next post!
