
ResNet(Deep Residual Learning for Image Recognition)

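Paper: https://arxiv.org/abs/1512.03385

If images or formulas do not render completely, the same post is available on the following blog:

CSDN: http://blog.csdn.net/chunfengyanyulove/article/details/79253656

ResNet won the ImageNet (ILSVRC) 2015 competition. It performs well not only on classification but also on object detection. ResNet pushes network depth much further, reaching more than 1000 layers in the CIFAR-10 experiments.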

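1. Degradation

  • Conventional wisdom says that, as long as gradients neither vanish nor explode, a deeper network should perform better. The authors found instead that as depth increases, accuracy saturates and then falls below that of a shallower network. Training error rises as well, so overfitting is not the cause. They call this the degradation problem, shown in the figure below.
    http://img.blog.csdn.net/20151216155525063

  • If a few layers are simply appended to a shallower network, then, setting overfitting aside, the deeper network should do at least as well, because the added layers could just learn identity mappings. In practice the result is not necessarily better, which indicates that not all networks are equally easy to optimize.

  • To address this, the paper introduces a deep residual learning framework. Instead of asking each stack of layers to fit a desired mapping directly, it lets the layers fit a residual mapping. Let H(X) denote the desired underlying mapping; the stacked layers fit F(X) = H(X) - X, so the original mapping becomes F(X) + X. The authors hypothesize that the residual mapping is easier to optimize than the original one. In the extreme case where an identity mapping is optimal, pushing the residual to zero is much easier than fitting an identity mapping with a stack of nonlinear layers.

  • The residual network implements this with the structure below.

http://img.blog.csdn.net/20151216160852064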
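
The "push the residual to zero" argument above can be checked numerically. The following is a minimal NumPy sketch, not the paper's code: the function names `plain_block` and `residual_block` are made up for illustration, and the layers are plain matrix multiplications without biases.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def plain_block(x, w1, w2):
    # Plain stack: the two layers must represent H(x) directly.
    return w2 @ relu(w1 @ x)

def residual_block(x, w1, w2):
    # Residual stack: the two layers represent F(x) = H(x) - x,
    # and the shortcut adds x back.
    return plain_block(x, w1, w2) + x

x = np.array([1.0, -2.0, 3.0])
zeros = np.zeros((3, 3))

plain_out = plain_block(x, zeros, zeros)        # the input is lost
residual_out = residual_block(x, zeros, zeros)  # the block acts as an identity mapping
print(plain_out, residual_out)
```

With all weights at zero the residual block already returns its input, whereas the plain block would have to learn specific non-zero weights to approximate the identity.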

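2. Deep Residual Learning

Identity Mapping by Shortcuts

The figure above shows a building block, defined as:

$$y=F(x,\{W_{i}\})+x$$

  • The block in the figure contains 2 layers, so

$$F=W_2\sigma(W_1x)$$

where σ denotes ReLU and the bias terms are omitted.

  • When F and x have different dimensions, a linear projection W_s on the shortcut matches them:

$$y=F(x,\{W_{i}\}) + W_s x$$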
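
As a rough illustration of the projection formula, here is a NumPy sketch for vectors. The helper name `residual_block_projected` and the sizes 4 and 8 are arbitrary choices for this example, and fully connected layers stand in for convolutions.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_block_projected(x, w1, w2, ws):
    """y = W2 * relu(W1 x) + Ws x, with biases omitted as in the text."""
    return w2 @ relu(w1 @ x) + ws @ x

rng = np.random.default_rng(0)
x = rng.standard_normal(4)          # 4-dimensional input
w1 = rng.standard_normal((8, 4))    # first layer maps 4 -> 8
w2 = rng.standard_normal((8, 8))    # second layer keeps 8, so F(x) is 8-dimensional
ws = rng.standard_normal((8, 4))    # shortcut projection maps x from 4 -> 8

y = residual_block_projected(x, w1, w2, ws)
print(y.shape)
```

Without `ws` the sum would fail, because an 8-dimensional F(x) cannot be added to a 4-dimensional x.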

Network Architectures

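The plain network is mainly inspired by VGG nets and uses mostly 3×3 filters, but it has fewer filters and lower complexity than VGG. It follows two design rules:

  • Layers with the same output feature map size have the same number of filters.
  • When the feature map size is halved, the number of filters is doubled to preserve the computational complexity per layer. Downsampling is performed by convolutions with stride 2. The network has 34 weight layers in total.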
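
The second rule can be verified with simple arithmetic. The sketch below counts multiply-accumulates for one 3×3 convolution per stage; the stage sizes (56×56 with 64 filters down to 7×7 with 512 filters) are the ones the 34-layer network uses for a 224×224 input, and `conv_macs` is a helper name introduced here.

```python
def conv_macs(height, width, c_in, c_out, kernel=3):
    """Multiply-accumulates of one convolution producing a height x width output."""
    return height * width * c_in * c_out * kernel * kernel

# (feature map side, number of filters) for the four stages
stages = [(56, 64), (28, 128), (14, 256), (7, 512)]
costs = [conv_macs(side, side, ch, ch) for side, ch in stages]
for (side, ch), cost in zip(stages, costs):
    print(f"{side}x{side}, {ch} filters: {cost} MACs")
```

Halving the side cuts the number of output positions by 4, and doubling both input and output channels multiplies the per-position cost by 4, so every stage costs the same per layer.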

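ResNet starts from the plain network and inserts shortcut connections, turning it into its residual counterpart.

The authors propose three options for the shortcuts:

  1. Use identity mappings; when the input and output dimensions of a residual block differ, pad the extra dimensions with zeros.
  2. Use identity mappings when the input and output dimensions match, and a linear projection when they differ, so that the dimensions agree.
  3. Use a linear projection for every block.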
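
The first two options can be sketched on a single feature map in NumPy. This is an illustrative sketch only: the function names are hypothetical, arrays are laid out as (channels, height, width), and the projection is modelled as a 1×1 convolution with stride 2, which is how the MXNet listing later in this post implements it.

```python
import numpy as np

def shortcut_zero_pad(x, out_channels):
    # Option 1: subsample spatially with stride 2, then append all-zero channels.
    x = x[:, ::2, ::2]
    extra = out_channels - x.shape[0]
    return np.pad(x, ((0, extra), (0, 0), (0, 0)))

def shortcut_projection(x, ws):
    # Options 2 and 3: a 1x1 convolution with stride 2, i.e. the same matrix
    # ws (out_channels x in_channels) applied at every kept position.
    x = x[:, ::2, ::2]
    return np.einsum('oc,chw->ohw', ws, x)

x = np.random.default_rng(0).standard_normal((64, 56, 56))
padded = shortcut_zero_pad(x, 128)
projected = shortcut_projection(x, np.eye(128, 64))
print(padded.shape, projected.shape)
```

With `ws` set to an identity stacked on zeros, the projection reproduces the zero-padded shortcut exactly, so option 1 is a parameter-free special case of option 2.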

http://img.blog.csdn.net/20151216164510071

Implementation

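  • Image resizing: the shorter side is sampled randomly from [256, 480]; a 224×224 crop is sampled randomly from the image or its horizontal flip, and the mean is subtracted.
  • Batch normalization is applied between each convolution and its activation.
  • Mini-batch size: 256
  • Learning rate: starts at 0.1 and is divided by 10 whenever the error plateaus
  • Weight decay: 0.0001
  • Momentum: 0.9
  • No dropout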
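
The geometry of the augmentation in the first bullet might look like the sketch below. It only computes the resized size, the crop box, and the flip decision; actual image resizing and mean subtraction are left out, and the function name and return layout are assumptions made for this example.

```python
import random

def sample_augmentation(width, height, rng, crop=224, min_side=256, max_side=480):
    """Pick a resize target, a crop box (left, top, right, bottom) and a flip flag."""
    short = rng.randint(min_side, max_side)       # new length of the shorter side
    scale = short / min(width, height)
    new_w = max(short, round(width * scale))      # keep the aspect ratio
    new_h = max(short, round(height * scale))
    left = rng.randint(0, new_w - crop)           # random crop position
    top = rng.randint(0, new_h - crop)
    flip = rng.random() < 0.5                     # horizontal flip half of the time
    return (new_w, new_h), (left, top, left + crop, top + crop), flip

rng = random.Random(0)
size, box, flip = sample_augmentation(500, 375, rng)
print(size, box, flip)
```

Because the shorter side is never below 256, a 224×224 crop always fits inside the resized image.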

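Experiments

http://img.blog.csdn.net/20151217083446928

The experiments show the following:

1. 34-layer vs. 18-layer networks:

  • Throughout training, the 34-layer plain net has a higher error than the 18-layer plain net.
  • The 34-layer residual net has a lower error than the 18-layer residual net, and its top-1 error is 3.5% lower than that of the 34-layer plain net.
  • The 18-layer residual net converges faster than the 18-layer plain net.

2. Comparison of shortcut options:

A. When F(x) and x differ in dimension, pad with zeros.
B. When F(x) and x differ in dimension, use a projection W_s.
C. Every shortcut uses a projection W_s.
Error: A > B > C, that is, C is slightly better than B, and B is slightly better than A.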

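3. Replacing the two-layer block with a three-layer block:

The three layers are 1×1, 3×3, and 1×1 convolutions. The 1×1 layers first reduce and then restore the channel dimension, which leaves the 3×3 layer as a bottleneck with smaller input and output dimensions.
[Figure: the bottleneck building block]
Recognition accuracy is shown in the figures below: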
http://img.blog.csdn.net/20151217083726957
http://img.blog.csdn.net/20151217083734116
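
To see why the bottleneck design keeps deeper networks affordable, the weight counts can be compared directly. The sketch below assumes a block with 256 input/output channels and a 64-channel bottleneck, matching the `num_filter*0.25` factor in the MXNet listing later in this post; `conv_params` is a helper introduced here, and biases and batch-norm parameters are ignored.

```python
def conv_params(c_in, c_out, kernel):
    """Number of weights in one convolution layer (no bias)."""
    return c_in * c_out * kernel * kernel

# Two-layer block: two 3x3 convolutions at 64 channels.
basic_64 = 2 * conv_params(64, 64, 3)

# Bottleneck block: 1x1 reduces 256 -> 64, 3x3 works at 64, 1x1 restores 64 -> 256.
bottleneck_256 = (conv_params(256, 64, 1)
                  + conv_params(64, 64, 3)
                  + conv_params(64, 256, 1))

# Two-layer block applied directly at 256 channels, for comparison.
basic_256 = 2 * conv_params(256, 256, 3)

print(basic_64, bottleneck_256, basic_256)  # 73728 69632 1179648
```

The bottleneck block handles 256-channel features with roughly the weight budget of a 64-channel two-layer block, about 17 times fewer weights than two 3×3 layers at 256 channels.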

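The authors then analyze the networks further on CIFAR-10. They also build an even more extreme 1202-layer network. A network this deep is still not hard to optimize, but it overfits, which is to be expected; the authors note that they will study improvements to the 1202-layer model in future work.

The results of the analysis are shown in the figure below: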
https://upload-images.jianshu.io/upload_images/145616-8bb29032c4783264.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/700
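
The unusual depth 1202 follows from the CIFAR-10 layout used in the paper: three stages of n two-layer blocks, plus the first convolution and the final fully connected layer, give 6n + 2 weight layers. A small sketch (the function name is made up for this example):

```python
def cifar_depth(n):
    """Weight layers in the CIFAR-10 ResNet: 3 stages x n blocks x 2 layers, plus 2."""
    return 6 * n + 2

depths = {n: cifar_depth(n) for n in (3, 18, 200)}
print(depths)  # {3: 20, 18: 110, 200: 1202}
```

The `(num_layers-2)//6` expression in `get_symbol` in the MXNet listing later in this post inverts the same formula.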

Appendix

Caffe implementation of ResNet: https://github.com/KaimingHe/deep-residual-networks


Supplement: an MXNet implementation of ResNet


import mxnet as mx
import numpy as np


def residual_unit(data, num_filter, stride, dim_match, name, bottle_neck=True, bn_mom=0.9, workspace=256, memonger=False):
    """Return ResNet Unit symbol for building ResNet
    Parameters
    ----------
    data : Symbol
        Input data
    num_filter : int
        Number of output channels
    stride : tuple
        Stride used in convolution
    dim_match : Boolean
        True means channel number between input and output is the same, otherwise means differ
    name : str
        Base name of the operators
    bottle_neck : Boolean
        True builds the 1x1-3x3-1x1 bottleneck unit, whose inner convolutions use num_filter*0.25 channels
    workspace : int
        Workspace used in convolution operator
    """

    # ResNet-50 and deeper use the bottleneck structure

    if bottle_neck:
        # the same as https://github.com/facebook/fb.resnet.torch#notes, differs slightly from the original paper
        bn1 = mx.sym.BatchNorm(data=data, fix_gamma=False, eps=2e-5, momentum=bn_mom, name=name + '_bn1')
        act1 = mx.sym.Activation(data=bn1, act_type='relu', name=name + '_relu1')
        conv1 = mx.sym.Convolution(data=act1, num_filter=int(num_filter*0.25), kernel=(1,1), stride=(1,1), pad=(0,0),
                                   no_bias=True, workspace=workspace, name=name + '_conv1')
        bn2 = mx.sym.BatchNorm(data=conv1, fix_gamma=False, eps=2e-5, momentum=bn_mom, name=name + '_bn2')
        act2 = mx.sym.Activation(data=bn2, act_type='relu', name=name + '_relu2')
        conv2 = mx.sym.Convolution(data=act2, num_filter=int(num_filter*0.25), kernel=(3,3), stride=stride, pad=(1,1),
                                   no_bias=True, workspace=workspace, name=name + '_conv2')
        bn3 = mx.sym.BatchNorm(data=conv2, fix_gamma=False, eps=2e-5, momentum=bn_mom, name=name + '_bn3')
        act3 = mx.sym.Activation(data=bn3, act_type='relu', name=name + '_relu3')
        conv3 = mx.sym.Convolution(data=act3, num_filter=num_filter, kernel=(1,1), stride=(1,1), pad=(0,0), no_bias=True,
                                   workspace=workspace, name=name + '_conv3')
        if dim_match:
            # identity shortcut
            shortcut = data
        else:
            # projection shortcut: 1x1 convolution that matches channels and stride
            shortcut = mx.sym.Convolution(data=act1, num_filter=num_filter, kernel=(1,1), stride=stride, no_bias=True,
                                          workspace=workspace, name=name+'_sc')
        if memonger:
            shortcut._set_attr(mirror_stage='True')
        return conv3 + shortcut
    else:
        bn1 = mx.sym.BatchNorm(data=data, fix_gamma=False, momentum=bn_mom, eps=2e-5, name=name + '_bn1')
        act1 = mx.sym.Activation(data=bn1, act_type='relu', name=name + '_relu1')
        conv1 = mx.sym.Convolution(data=act1, num_filter=num_filter, kernel=(3,3), stride=stride, pad=(1,1),
                                   no_bias=True, workspace=workspace, name=name + '_conv1')
        bn2 = mx.sym.BatchNorm(data=conv1, fix_gamma=False, momentum=bn_mom, eps=2e-5, name=name + '_bn2')
        act2 = mx.sym.Activation(data=bn2, act_type='relu', name=name + '_relu2')
        conv2 = mx.sym.Convolution(data=act2, num_filter=num_filter, kernel=(3,3), stride=(1,1), pad=(1,1),
                                   no_bias=True, workspace=workspace, name=name + '_conv2')
        if dim_match:
            # identity shortcut
            shortcut = data
        else:
            # projection shortcut: 1x1 convolution that matches channels and stride
            shortcut = mx.sym.Convolution(data=act1, num_filter=num_filter, kernel=(1,1), stride=stride, no_bias=True,
                                          workspace=workspace, name=name+'_sc')
        if memonger:
            shortcut._set_attr(mirror_stage='True')
        return conv2 + shortcut

def resnet(units, num_stages, filter_list, num_classes, image_shape, bottle_neck=True, bn_mom=0.9, workspace=256, dtype='float32', memonger=False):
    """Return ResNet symbol of
    Parameters
    ----------
    units : list
        Number of units in each stage
    num_stages : int
        Number of stage
    filter_list : list
        Channel size of each stage
    num_classes : int
        Output size of symbol
    image_shape : list
        Input shape as (channels, height, width); a height of 32 or less selects the CIFAR-style stem
    workspace : int
        Workspace used in convolution operator
    dtype : str
        Precision (float32 or float16)
    """
    num_unit = len(units)
    assert(num_unit == num_stages)
    data = mx.sym.Variable(name='data')
    if dtype == 'float32':
        data = mx.sym.identity(data=data, name='id')
    else:
        if dtype == 'float16':
            data = mx.sym.Cast(data=data, dtype=np.float16)
    data = mx.sym.BatchNorm(data=data, fix_gamma=True, eps=2e-5, momentum=bn_mom, name='bn_data')
    (nchannel, height, width) = image_shape
    if height <= 32:            # such as cifar10
        body = mx.sym.Convolution(data=data, num_filter=filter_list[0], kernel=(3, 3), stride=(1,1), pad=(1, 1),
                                  no_bias=True, name="conv0", workspace=workspace)
    else:                       # often expected to be 224 such as imagenet
        body = mx.sym.Convolution(data=data, num_filter=filter_list[0], kernel=(7, 7), stride=(2,2), pad=(3, 3),
                                  no_bias=True, name="conv0", workspace=workspace)
        body = mx.sym.BatchNorm(data=body, fix_gamma=False, eps=2e-5, momentum=bn_mom, name='bn0')
        body = mx.sym.Activation(data=body, act_type='relu', name='relu0')
        body = mx.sym.Pooling(data=body, kernel=(3, 3), stride=(2,2), pad=(1,1), pool_type='max')

    # The first stage already receives 56*56 feature maps (for a 224*224 input) and does not
    # need to change their size, so its first unit uses stride 1 and later stages use stride 2:
    # hence "1 if i==0 else 2". The first unit of every stage uses a projection shortcut.
    for i in range(num_stages):
        body = residual_unit(body, filter_list[i+1], (1 if i==0 else 2, 1 if i==0 else 2), False,
                             name='stage%d_unit%d' % (i + 1, 1), bottle_neck=bottle_neck, workspace=workspace,
                             memonger=memonger)
        for j in range(units[i]-1):
            body = residual_unit(body, filter_list[i+1], (1,1), True, name='stage%d_unit%d' % (i + 1, j + 2),
                                 bottle_neck=bottle_neck, workspace=workspace, memonger=memonger)
    bn1 = mx.sym.BatchNorm(data=body, fix_gamma=False, eps=2e-5, momentum=bn_mom, name='bn1')
    relu1 = mx.sym.Activation(data=bn1, act_type='relu', name='relu1')
    # Although kernel is not used here when global_pool=True, we should put one
    pool1 = mx.sym.Pooling(data=relu1, global_pool=True, kernel=(7, 7), pool_type='avg', name='pool1')
    flat = mx.sym.Flatten(data=pool1)
    fc1 = mx.sym.FullyConnected(data=flat, num_hidden=num_classes, name='fc1')
    if dtype == 'float16':
        fc1 = mx.sym.Cast(data=fc1, dtype=np.float32)
    return mx.sym.SoftmaxOutput(data=fc1, name='softmax')


def get_symbol(num_classes, num_layers, image_shape, conv_workspace=256, dtype='float32', **kwargs):
    """
    Adapted from https://github.com/tornadomeet/ResNet/blob/master/train_resnet.py
    Original author Wei Wu
    """
    image_shape = [int(l) for l in image_shape.split(',')]
    (nchannel, height, width) = image_shape
    if height <= 28:
        num_stages = 3
        if (num_layers-2) % 9 == 0 and num_layers >= 164:
            per_unit = [(num_layers-2)//9]
            filter_list = [16, 64, 128, 256]
            bottle_neck = True
        elif (num_layers-2) % 6 == 0 and num_layers < 164:
            per_unit = [(num_layers-2)//6]
            filter_list = [16, 16, 32, 64]
            bottle_neck = False
        else:
            raise ValueError("no experiments done on num_layers {}, you can do it yourself".format(num_layers))
        units = per_unit * num_stages
    else:
        if num_layers >= 50:
            filter_list = [64, 256, 512, 1024, 2048]
            bottle_neck = True
        else:
            filter_list = [64, 64, 128, 256, 512]
            bottle_neck = False
        num_stages = 4
        if num_layers == 18:
            units = [2, 2, 2, 2]
        elif num_layers == 34:
            units = [3, 4, 6, 3]
        elif num_layers == 50:
            units = [3, 4, 6, 3]
        elif num_layers == 101:
            units = [3, 4, 23, 3]
        elif num_layers == 152:
            units = [3, 8, 36, 3]
        elif num_layers == 200:
            units = [3, 24, 36, 3]
        elif num_layers == 269:
            units = [3, 30, 48, 8]
        else:
            raise ValueError("no experiments done on num_layers {}, you can do it yourself".format(num_layers))

    return resnet(units       = units,
                  num_stages  = num_stages,
                  filter_list = filter_list,
                  num_classes = num_classes,
                  image_shape = image_shape,
                  bottle_neck = bottle_neck,
                  workspace   = conv_workspace,
                  dtype       = dtype)
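
As a sanity check on the `units` table in `get_symbol`, the nominal depth can be recomputed from the unit counts. This is a standalone sketch that does not need MXNet; it assumes the usual counting convention (2 or 3 convolutions per unit, plus the first convolution and the final fully connected layer, with projection shortcuts not counted), and `depth_from_units` is a name introduced here.

```python
# Unit counts copied from the ImageNet branch of get_symbol above.
IMAGENET_UNITS = {
    18: [2, 2, 2, 2],
    34: [3, 4, 6, 3],
    50: [3, 4, 6, 3],
    101: [3, 4, 23, 3],
    152: [3, 8, 36, 3],
    200: [3, 24, 36, 3],
    269: [3, 30, 48, 8],
}

def depth_from_units(units, bottle_neck):
    convs_per_unit = 3 if bottle_neck else 2   # bottleneck: 1x1, 3x3, 1x1
    return convs_per_unit * sum(units) + 2     # + conv0 and fc1

recomputed = {
    num_layers: depth_from_units(units, bottle_neck=(num_layers >= 50))
    for num_layers, units in IMAGENET_UNITS.items()
}
print(recomputed)
```

Each recomputed depth matches its key. ResNet-34 and ResNet-50 share the same unit counts and differ only in the block type.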