[toc]

Convolutional neural networks (CNNs) are now widely used in computer vision and have achieved strong results. Figure 1 shows how CNNs have performed in the ImageNet competition in recent years: in the pursuit of classification accuracy, models have become deeper and more complex; the deep residual network (ResNet), for example, has as many as 152 layers.

However, in real-world settings such as mobile or embedded devices, such large and complex models are hard to deploy. First, the models are too big and run into memory limits; second, these scenarios demand low latency, i.e. fast responses: imagine what could happen if the pedestrian-detection system of a self-driving car were slow. So research on small and efficient CNN models is crucial in these settings, at least for now, even though hardware will keep getting faster. Current research falls into two directions: one compresses a trained complex model into a small one; the other designs and trains a small model directly. Either way, the goal is to reduce model size (parameter count) and increase speed (low latency) while preserving accuracy. The subject of this post, MobileNet, belongs to the second category: it is a small and efficient CNN model proposed by Google that trades off between accuracy and latency.

DW & PW Convolutions

MobileNet_V1 introduces DW (depthwise) and PW (pointwise) convolutions, which reduce both the computation and the number of parameters.


Depthwise Separable Convolution

The basic building block of MobileNet is the depthwise separable convolution.

A depthwise separable convolution is a kind of factorized convolution: it decomposes a standard convolution into two smaller operations, a depthwise convolution and a pointwise convolution, as shown in Figure 2.

Depthwise convolution differs from standard convolution: a standard convolution applies each kernel across all input channels, whereas a depthwise convolution uses a separate kernel for each input channel, i.e. one kernel per input channel, which is why it is a per-channel (depth-level) operation. A pointwise convolution is simply an ordinary convolution with a $1\times 1$ kernel.

(Figure: depthwise convolution vs. pointwise convolution)

The figure above shows the two operations more clearly: DW convolution on the left and PW convolution on the right.

A depthwise separable convolution first applies a depthwise convolution to each input channel separately, then uses a pointwise convolution to combine those outputs. The overall effect is roughly equivalent to a standard convolution, but the computation and the number of parameters are greatly reduced.
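To make the two steps concrete, here is a minimal TensorFlow/Keras sketch (the tensor and channel sizes are arbitrary illustrative values): the depthwise convolution filters each channel spatially without changing the channel count, and the $1\times 1$ pointwise convolution then mixes the channels.

```python
import tensorflow as tf
from tensorflow.keras import layers

x = tf.random.normal([1, 112, 112, 32])   # N, H, W, M with M = 32 input channels

dw = layers.DepthwiseConv2D(kernel_size=3, padding='same', use_bias=False)  # one 3x3 kernel per input channel
pw = layers.Conv2D(filters=64, kernel_size=1, use_bias=False)               # 1x1 conv combines channels

y = dw(x)   # (1, 112, 112, 32): spatial filtering, channel count unchanged
z = pw(y)   # (1, 112, 112, 64): channel mixing, projects to N = 64 output channels
print(y.shape, z.shape)
```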

Assume the kernel size is $D_K\times D_K$, there are $M$ input channels and $N$ output channels, and the feature map has size $D_F \times D_F$.

Computation compression:

$$
\frac{D_K\cdot D_K\cdot M\cdot D_F\cdot D_F + M\cdot N\cdot D_F\cdot D_F}{D_K\cdot D_K\cdot M\cdot N\cdot D_F\cdot D_F} = \frac{1}{N}+\frac{1}{D_K^2}
$$
With a $3\times 3$ kernel this yields roughly 8~9 times less computation (since $1/N$ is usually much smaller than $1/D_K^2 = 1/9$).

Parameter compression:

$$
\frac{D_K^2\times M + M\times N}{D_K^2\times M\times N} = \frac{1}{N}+\frac{1}{D_K^2}
$$
If $D_K = 3$, the number of parameters is reduced to roughly 1/8 ~ 1/9 of the original.
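As a quick sanity check of the two ratios above, the following sketch plugs a hypothetical layer ($D_K=3$, $M=32$, $N=64$, $D_F=112$; the numbers are only for illustration) into both formulas:

```python
D_K, M, N, D_F = 3, 32, 64, 112   # hypothetical layer sizes, for illustration only

# multiply-accumulates (computation)
std_macs = D_K * D_K * M * N * D_F * D_F
sep_macs = D_K * D_K * M * D_F * D_F + M * N * D_F * D_F

# parameters (biases ignored)
std_params = D_K * D_K * M * N
sep_params = D_K * D_K * M + M * N

print(sep_macs / std_macs, 1 / N + 1 / D_K ** 2)      # both ~0.127, i.e. roughly 8x fewer MACs
print(sep_params / std_params, 1 / N + 1 / D_K ** 2)  # the same ratio holds for the parameters
```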

Conv Structure

In the paper everything is implemented with convolutions and no pooling layers are used, which saves a certain amount of computation.
Below is the MobileNet convolution block:

(Figure: MobileNet V1 convolution block: 3×3 depthwise conv, BN, ReLU6, then 1×1 conv, BN, ReLU6)

Conv/s2, i.e. a convolution with stride 2, replaces the MaxPooling + Conv combination: the parameter count stays the same, the convolution's computation drops to roughly 1/4 of a stride-1 convolution over the full-resolution input, and the MaxPool computation is avoided entirely (see the sketch below).
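A small Keras comparison (layer sizes chosen arbitrarily for illustration) shows that the two variants have identical parameter counts and output shapes; the difference is only where the downsampling happens, which is what saves the multiply-accumulates.

```python
import tensorflow as tf
from tensorflow.keras import layers, Sequential

x = tf.random.normal([1, 112, 112, 32])

# Variant A: stride-2 convolution, downsampling happens inside the conv
conv_s2 = layers.Conv2D(64, kernel_size=3, strides=2, padding='same')

# Variant B: stride-1 convolution on the full-resolution map, then max pooling
conv_pool = Sequential([
    layers.Conv2D(64, kernel_size=3, strides=1, padding='same'),
    layers.MaxPool2D(pool_size=2),
])

print(conv_s2(x).shape, conv_pool(x).shape)              # both (1, 56, 56, 64)
print(conv_s2.count_params(), conv_pool.count_params())  # identical parameter counts
# Variant A evaluates its conv on the 56x56 output grid, variant B on the full
# 112x112 grid (about 4x the MACs) before pooling discards 3/4 of the activations.
```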

MobileNet Model Slimming

What we described above is the baseline MobileNet model, but sometimes you need an even smaller model, in which case MobileNet has to be slimmed down. Two hyperparameters are introduced for this: the width multiplier and the resolution multiplier.

  • Width Multiplier($\alpha$): Thinner Models

    The first hyperparameter, the width multiplier, scales the number of channels proportionally. It is denoted $\alpha$ and takes values in $(0,1]$; the numbers of input and output channels become $\alpha M$ and $\alpha N$, and the computation of a depthwise separable convolution becomes:

    $$
    D_{K}\times D_{K}\times \alpha M\times D_{F}\times D_{F}+ \alpha M\times \alpha N\times D_{F}\times D_{F}
    $$

    • The number of channels in every layer is multiplied by $\alpha$ (with rounding); the model size drops to roughly $\alpha^2$ of the original, and so does the computation
    • $\alpha \in (0,1]$, with typical values 1, 0.75, 0.5, 0.25; it reduces the width of the model
  • Resolution Multiplier($\rho$): Reduced Representation

    Since the dominant cost lies in the second term, the width multiplier reduces the computation roughly in proportion, and the parameter count actually drops as well. The second hyperparameter, the resolution multiplier, scales down the feature-map size. It is denoted $\rho$; for example, an input feature map of $224\times 224$ can be reduced to $192\times 192$. With the resolution multiplier, the computation of a depthwise separable convolution becomes:

    $$
    D_{K}\times D_{K}\times \alpha M\times \rho D_{F}\times \rho D_{F}+ \alpha M\times \alpha N\times \rho D_{F}\times \rho D_{F}
    $$

    • The resolution of the input layer is multiplied by $\rho$ (with rounding), which is equivalent to multiplying the resolution of every layer by $\rho$; the model size is unchanged, while the computation drops to roughly $\rho^2$ of the original
    • $\rho \in (0,1]$; it reduces the resolution of the input image

Note that the resolution multiplier only affects the computation; it does not change the number of parameters. Introducing these two hyperparameters will inevitably reduce MobileNet's accuracy; see the paper for the detailed experiments. In short, they trade off accuracy against computation and accuracy against model size (the sketch below shows how the cost scales with $\alpha$ and $\rho$).
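The sketch below (again with arbitrary illustrative layer sizes) evaluates the formula above for a few settings of $\alpha$ and $\rho$; the cost shrinks roughly by $\alpha^2\rho^2$ because the pointwise term dominates.

```python
def separable_conv_macs(d_k, m, n, d_f, alpha=1.0, rho=1.0):
    """MACs of one depthwise separable conv with width/resolution multipliers applied."""
    m, n = int(alpha * m), int(alpha * n)   # width multiplier scales the channels
    d_f = int(rho * d_f)                    # resolution multiplier scales the feature map
    return d_k * d_k * m * d_f * d_f + m * n * d_f * d_f

base = separable_conv_macs(3, 32, 64, 112)
for alpha in (1.0, 0.75, 0.5):
    for rho in (1.0, 192 / 224):
        ratio = separable_conv_macs(3, 32, 64, 112, alpha, rho) / base
        print(f"alpha={alpha:.2f}, rho={rho:.3f}: {ratio:.2f}x of baseline "
              f"(alpha^2 * rho^2 = {alpha ** 2 * rho ** 2:.2f})")
```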

MobileNet_V1 Network Architecture

(Figure: MobileNet V1 network architecture)

MobileNetV2: Inverted Residuals and Linear Bottlenecks

Main Improvements

  • Introduce the inverted residual structure, which first expands and then reduces the dimensionality; this strengthens gradient propagation and significantly reduces the memory footprint needed during inference (Inverted Residuals)
  • Remove the ReLU after the narrow layers (low dimension or depth) to preserve feature diversity and strengthen the network's expressive power (Linear Bottlenecks)
  • The network is fully convolutional, so the model can handle images of different sizes; the ReLU6 activation (output capped at 6) makes the model more robust under low-precision computation (see the short sketch after this list)
  • The MobileNetV2 inverted residual block is shown below; when downsampling is needed, the DW convolution uses a stride of 2
  • Small networks should use a small expansion factor and larger networks a slightly larger one; the recommended range is 5~10, and the paper uses t = 6
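As a quick illustration (the input values here are just examples), ReLU6 is the standard ReLU clipped at 6, i.e. $y = \min(\max(x, 0), 6)$:

```python
import tensorflow as tf
from tensorflow.keras import layers

x = tf.constant([-3.0, 2.0, 8.0])
print(tf.nn.relu6(x))                 # [0. 2. 6.] -- negatives clipped to 0, large values capped at 6
print(layers.ReLU(max_value=6.0)(x))  # same result; this is the form used in the code below
```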

Inverted Residual Block

The ResNet bottleneck reduces dimensions -> convolves -> expands, so it is wide at both ends and narrow in the middle (an hourglass shape).

MobileNetV2 does the opposite: it first expands (by 6×) -> convolves -> reduces, so it is narrow at both ends and wide in the middle (a spindle shape).

(Figure: inverted residual block)

Unlike MobileNetV1, MobileNetV2 uses the following convolution block:

(Figure: MobileNet V2 convolution block: 1×1 expansion conv, 3×3 depthwise conv, 1×1 linear projection conv)

Because a DW convolution does not change the number of channels, if the previous layer has few channels, DW can only extract features in a low-dimensional space and works poorly. So V2 adds a PW layer before the DW convolution to expand the dimensionality.

At the same time, V2 removes the activation after the second PW and uses a linear activation instead: a nonlinearity adds useful expressiveness in high-dimensional space but destroys information in low-dimensional space. Since the second PW's main job is to reduce dimensionality, it should not be followed by ReLU6.

A shortcut connection is used only when strides=1 and the input and output feature maps have the same shape.

(Figure: stride-1 block with shortcut vs. stride-2 block without shortcut)

MobileNet_V2 implemented with TensorFlow 2

import tensorflow as tf
import numpy as np
from tensorflow.keras import layers, Sequential, Model
class ConvBNReLU(layers.Layer):
    def __init__(self, out_channel, kernel_size=3, strides=1, **kwargs):
        super(ConvBNReLU, self).__init__(**kwargs)
        self.conv = layers.Conv2D(filters=out_channel,
                                  kernel_size=kernel_size,
                                  strides=strides,
                                  padding='SAME',
                                  use_bias=False,
                                  name='Conv2d')
        self.bn = layers.BatchNormalization(momentum=0.9, epsilon=1e-5, name='BatchNorm')
        self.activation = layers.ReLU(max_value=6.0)  # ReLU6

    def call(self, inputs, training=False, **kwargs):
        x = self.conv(inputs)
        x = self.bn(x, training=training)
        x = self.activation(x)
        return x
class InvertedResidualBlock(layers.Layer):
    def __init__(self, in_channel, out_channel, strides, expand_ratio, **kwargs):
        super(InvertedResidualBlock, self).__init__(**kwargs)
        self.hidden_channel = in_channel * expand_ratio
        self.use_shortcut = (strides == 1) and (in_channel == out_channel)

        layer_list = []
        # the first bottleneck (expand_ratio == 1) does not need the 1x1 expansion conv
        if expand_ratio != 1:
            # 1x1 pointwise conv (expansion)
            layer_list.append(ConvBNReLU(out_channel=self.hidden_channel, kernel_size=1, name='expand'))
        layer_list.extend([
            # 3x3 depthwise conv
            layers.DepthwiseConv2D(kernel_size=3, padding='SAME', strides=strides, use_bias=False, name='depthwise'),
            layers.BatchNormalization(momentum=0.9, epsilon=1e-5, name='depthwise/BatchNorm'),
            layers.ReLU(max_value=6.0),

            # 1x1 pointwise conv (linear): y = x, i.e. no activation function
            layers.Conv2D(filters=out_channel, kernel_size=1, strides=1, padding='SAME', use_bias=False, name='project'),
            layers.BatchNormalization(momentum=0.9, epsilon=1e-5, name='project/BatchNorm')
        ])

        self.main_branch = Sequential(layer_list, name='expanded_conv')

    def call(self, inputs, **kwargs):
        if self.use_shortcut:
            return inputs + self.main_branch(inputs)
        else:
            return self.main_branch(inputs)
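
A quick shape check of the block above (assuming the ConvBNReLU and InvertedResidualBlock classes just defined are in scope; the tensor sizes are illustrative):

```python
x = tf.random.normal([1, 56, 56, 24])

# strides == 1 and in_channel == out_channel -> the shortcut branch is used
block_keep = InvertedResidualBlock(in_channel=24, out_channel=24, strides=1, expand_ratio=6)
print(block_keep(x).shape)   # (1, 56, 56, 24)

# strides == 2 (downsampling) -> no shortcut
block_down = InvertedResidualBlock(in_channel=24, out_channel=32, strides=2, expand_ratio=6)
print(block_down(x).shape)   # (1, 28, 28, 32)
```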

MobileNet_V2 Network Architecture

(Table: MobileNet V2 architecture; each row lists t, c, n, s)

def _make_divisible(ch, divisor=8, min_ch=None):
    """
    This function is taken from the original tf repo.
    It ensures that all layers have a channel number that is divisible by 8
    It can be seen here:
    https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet/mobilenet.py
    """
    if min_ch is None:
        min_ch = divisor
    new_ch = max(min_ch, int(ch + divisor / 2) // divisor * divisor)
    # Make sure that round down does not go down by more than 10%.
    if new_ch < 0.9 * ch:
        new_ch += divisor
    return new_ch
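
For example, with alpha = 0.75 (the value used in main() further below), the rounded channel numbers match the ones shown in the model summary at the end of this post:

```python
print(_make_divisible(32 * 0.75))    # 24  -> channels of the first conv layer ('Conv')
print(_make_divisible(16 * 0.75))    # 16  -> 12 is rounded up to the nearest multiple of 8
print(_make_divisible(1280 * 0.75))  # 960 -> channels of the last conv layer ('Conv_1')
```
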
def MobileNet_V2(im_height=224,
                 im_width=224,
                 num_classes=1000,
                 alpha=1.0,
                 round_nearest=8,
                 include_top=True):

    block = InvertedResidualBlock
    input_channel = _make_divisible(32 * alpha, round_nearest)
    last_channel = _make_divisible(1280 * alpha, round_nearest)
    # t (expansion factor), c (output channels), n (repeats), s (stride of first repeat)
    inverted_residual_setting = [
        [1, 16, 1, 1],
        [6, 24, 2, 2],
        [6, 32, 3, 2],
        [6, 64, 4, 2],
        [6, 96, 3, 1],
        [6, 160, 3, 2],
        [6, 320, 1, 1]
    ]

    input_image = layers.Input(shape=(im_height, im_width, 3), dtype=tf.float32)
    # conv1
    x = ConvBNReLU(input_channel, strides=2, name='Conv')(input_image)
    # building inverted residual blocks
    for idx, (t, c, n, s) in enumerate(inverted_residual_setting):
        output_channel = _make_divisible(c * alpha, round_nearest)
        for i in range(n):
            strides = s if i == 0 else 1
            x = block(x.shape[-1],  # input channels (tensor layout: n, h, w, c)
                      output_channel,
                      strides,
                      expand_ratio=t)(x)

    # building last several layers
    x = ConvBNReLU(last_channel, kernel_size=1, name='Conv_1')(x)

    if include_top is True:
        # building classifier
        x = layers.GlobalAveragePooling2D()(x)  # pool + flatten
        x = layers.Dropout(0.2)(x)
        output = layers.Dense(num_classes, name='Logits')(x)
    else:
        output = x

    model = Model(inputs=input_image, outputs=output)

    return model
def main():
    # let GPU memory grow on demand instead of pre-allocating all of it
    gpus = tf.config.experimental.list_physical_devices("GPU")
    if gpus:
        try:
            for gpu in gpus:
                tf.config.experimental.set_memory_growth(gpu, True)
                print(gpu)
        except RuntimeError as e:
            print(e)
            exit(-1)

    model = MobileNet_V2(im_height=224,
                         im_width=224,
                         num_classes=1000,
                         alpha=0.75,
                         round_nearest=8,
                         include_top=True)
    model.build((None, 224, 224, 3))
    model.summary()

    data = tf.random.normal([4, 224, 224, 3])
    pred = model.predict(data)
    print("input shape:\n", data.shape)
    print("output shape:\n", pred.shape)
if __name__ == '__main__':
    main()
PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')
Model: "model_2"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_5 (InputLayer) [(None, 224, 224, 3)] 0
_________________________________________________________________
Conv (ConvBNReLU) (None, 112, 112, 24) 744
_________________________________________________________________
inverted_residual_block_52 ( (None, 112, 112, 16) 760
_________________________________________________________________
inverted_residual_block_53 ( (None, 56, 56, 24) 5568
_________________________________________________________________
inverted_residual_block_54 ( (None, 56, 56, 24) 9456
_________________________________________________________________
inverted_residual_block_55 ( (None, 28, 28, 24) 9456
_________________________________________________________________
inverted_residual_block_56 ( (None, 28, 28, 24) 9456
_________________________________________________________________
inverted_residual_block_57 ( (None, 28, 28, 24) 9456
_________________________________________________________________
inverted_residual_block_58 ( (None, 14, 14, 48) 13008
_________________________________________________________________
inverted_residual_block_59 ( (None, 14, 14, 48) 32736
_________________________________________________________________
inverted_residual_block_60 ( (None, 14, 14, 48) 32736
_________________________________________________________________
inverted_residual_block_61 ( (None, 14, 14, 48) 32736
_________________________________________________________________
inverted_residual_block_62 ( (None, 14, 14, 72) 39744
_________________________________________________________________
inverted_residual_block_63 ( (None, 14, 14, 72) 69840
_________________________________________________________________
inverted_residual_block_64 ( (None, 14, 14, 72) 69840
_________________________________________________________________
inverted_residual_block_65 ( (None, 7, 7, 120) 90768
_________________________________________________________________
inverted_residual_block_66 ( (None, 7, 7, 120) 185520
_________________________________________________________________
inverted_residual_block_67 ( (None, 7, 7, 120) 185520
_________________________________________________________________
inverted_residual_block_68 ( (None, 7, 7, 240) 272400
_________________________________________________________________
Conv_1 (ConvBNReLU) (None, 7, 7, 960) 234240
_________________________________________________________________
global_average_pooling2d_3 ( (None, 960) 0
_________________________________________________________________
dropout_3 (Dropout) (None, 960) 0
_________________________________________________________________
Logits (Dense) (None, 1000) 961000
=================================================================
Total params: 2,264,984
Trainable params: 2,238,984
Non-trainable params: 26,000
_________________________________________________________________
input shape:
(4, 224, 224, 3)
output shape:
(4, 1000)