Semantic segmentation is an important application area of deep learning. Eight years have passed since UNet was proposed, and many innovative segmentation models have appeared since. This post briefly summarizes the basic characteristics of ten semantic segmentation models (UNet++, UNet 3+, HRNet, LinkNet, PSPNet, DeepLabv3, multi-scale attention, HarDNet, SegFormer, and SegNeXt) and groups their innovations by category.
1、拓?fù)浣Y(jié)構(gòu)改進(jìn)
1.1 UNet++
Compared with UNet, UNet++ adds dense intermediate skip connections, effectively embedding an ensemble of UNets of different depths in one model. It also introduces deep supervision for this topology: a 1x1 conv head is attached to each of the newly added full-resolution nodes (the x0-level nodes that are never downsampled) and their outputs are supervised jointly.
The authors further argue that this structure makes UNet++ especially well suited to model pruning: the deeper branches can be cut away at inference time with little accuracy loss.
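The nested topology can be sketched in a few lines (an illustrative helper of my own, not the authors' code): node X(i, j) at depth i and horizontal position j concatenates all earlier same-depth nodes plus the upsampled node from one level deeper.

```python
def unetpp_node_inputs(i, j):
    """Names of the feature maps concatenated at UNet++ node X(i, j):
    all earlier nodes at the same depth plus the upsampled deeper node."""
    if j == 0:
        return []  # column 0 is the plain encoder path
    same_level = [f"X({i},{k})" for k in range(j)]   # dense intra-level skips
    from_below = f"up(X({i + 1},{j - 1}))"           # upsampled from depth i+1
    return same_level + [from_below]

# Deep supervision: a 1x1 conv head on every full-resolution node X(0, j), j >= 1.
heads = [f"conv1x1(X(0,{j}))" for j in range(1, 5)]
```

Pruning drops the rightmost columns: a model supervised at X(0, 2) can discard everything that only feeds X(0, 3) and X(0, 4).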
1.2 UNet 3+
UNet 3+ (i) introduces full-scale skip connections to fully exploit multi-scale features: each decoder node combines low-level detail with high-level semantics from feature maps at all scales, while using fewer parameters than UNet++; (ii) applies deep supervision to the feature maps aggregated at every scale, learning hierarchical representations and optimizing a hybrid loss that sharpens organ boundaries; (iii) proposes a classification-guided module, jointly trained with image-level classification, to reduce over-segmentation on organ-free images (noisy data).
The paper gives a simplified overview of UNet, UNet++, and the proposed UNet 3+. Compared with the first two, UNet 3+ redesigns the skip connections (dropping the short nested connections of UNet++) and combines multi-scale features with full-scale deep supervision, yielding fewer parameters but more accurate, position-aware, boundary-enhanced segmentation maps.
Reference: https://hpg123.blog.csdn.net/article/details/125950195
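The full-scale skip connections can be sketched as follows (an illustrative helper under my own naming, with 5 scales as in the paper): the decoder node at scale d gathers encoder features from every shallower or equal scale (downsampled by max-pooling) and decoder features from every deeper scale (upsampled), each passed through its own 3x3 conv before concatenation.

```python
NUM_SCALES = 5

def full_scale_sources(d):
    """List (source, resampling) pairs feeding the UNet 3+ decoder node at
    scale d (0 = full resolution). Every source is conv'd to 64 channels,
    so the concatenated node has 5 * 64 = 320 channels."""
    sources = []
    for s in range(NUM_SCALES):
        if s < d:
            sources.append((f"enc{s}", f"maxpool /{2 ** (d - s)}"))
        elif s == d:
            sources.append((f"enc{s}", "identity"))
        else:
            # deeper inputs come from decoder nodes (the deepest is the encoder bottleneck)
            name = f"enc{s}" if s == NUM_SCALES - 1 else f"dec{s}"
            sources.append((name, f"upsample x{2 ** (s - d)}"))
    return sources
```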
Classification-guided module
因?yàn)橛跋裰写嬖诒尘霸肼?,可能存在干擾[對(duì)于沒有器官的圖像,模型可能會(huì)把背景識(shí)別為器官
]。故對(duì)每一層次的特征圖,都使用Classification-guided Module進(jìn)行分類引導(dǎo)訓(xùn)練。
After a series of operations (dropout, 1x1 conv, max-pooling, and sigmoid), the deepest feature map produces a two-element vector whose entries represent the probability of organ / no organ. Benefiting from the richest semantic information, this classification result then guides each segmentation side output in two steps. First, with an argmax, the vector is converted to a single {0, 1} output meaning organ present / absent. Then this scalar is multiplied into each side segmentation output. Because the binary classification task is easy, the module reaches accurate results effortlessly under a binary cross-entropy loss, correcting the over-segmentation of organ-free images.
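The gating step can be sketched in a few lines of NumPy (shapes and names are my own; the real module also includes dropout and is trained with BCE):

```python
import numpy as np

def cgm_gate(deepest_feat, seg_logits, w):
    """deepest_feat: (C, H, W); w: (C, 2) weights of a 1x1 conv;
    seg_logits: one side output (K, H', W'). Returns the gated side output."""
    pooled = deepest_feat.max(axis=(1, 2))        # global max-pooling -> (C,)
    cls_prob = 1 / (1 + np.exp(-(pooled @ w)))    # 1x1 conv + sigmoid -> (2,)
    has_organ = np.argmax(cls_prob)               # {0, 1}: organ absent / present
    return seg_logits * has_organ                 # zeroes maps of organ-free images

feat = np.ones((8, 4, 4))
seg = np.full((2, 16, 16), 0.7)
w_pos = np.tile([[-1.0, 1.0]], (8, 1))   # toy classifier voting "organ present"
w_neg = np.tile([[1.0, -1.0]], (8, 1))   # toy classifier voting "organ absent"
```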
1.3 HRNet
HRNet maintains several parallel branches at different resolutions and repeatedly exchanges information between them, achieving strong semantics and precise localization at the same time. It performs well on tasks that require accurate, well-localized segmentation.
HRNet's view of cross-scale information flow is as follows: unlike the multi-scale attention segmentation model NVIDIA proposed in 2020 [https://hpg123.blog.csdn.net/article/details/126385231], which fuses scales only at the output, HRNet keeps information flowing between the different resolutions throughout the forward pass of every branch.
HRNet went through three versions. HRNetV1 outputs only the top (highest-resolution) branch; HRNetV2 concatenates the outputs of all four branches; HRNetV2p builds a feature pyramid for object detection by downsampling the top-level output into several scales.
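A toy single-channel NumPy version of one HRNet exchange unit (the real network resamples with strided 3x3 convs going down and bilinear upsampling plus 1x1 convs going up; here plain average pooling and nearest-neighbour upsampling stand in):

```python
import numpy as np

def exchange(branches):
    """Each output branch is the sum of every input branch resampled to its
    resolution, so information keeps flowing between all scales."""
    def to_res(x, size):
        if x.shape[0] >= size:                      # downsample: average pooling
            f = x.shape[0] // size
            return x.reshape(size, f, size, f).mean(axis=(1, 3))
        f = size // x.shape[0]                      # upsample: nearest neighbour
        return np.kron(x, np.ones((f, f)))
    return [sum(to_res(x, b.shape[0]) for x in branches) for b in branches]

high = np.ones((4, 4))        # high-resolution branch
low = np.full((2, 2), 2.0)    # low-resolution branch
fused = exchange([high, low])
```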
HarDNet decoder code
The HarDNet authors originally proposed the model only for image classification. When PaddleSeg uses HarDNet for semantic segmentation, the main new design is the decoder; as the implementation below shows, the decoder simply uses HarDBlocks to build a UNet-like symmetric network.
class Decoder(nn.Layer):
    """The Decoder implementation of FC-HarDNet 70.

    Args:
        n_blocks (int): The number of blocks in the Encoder module.
        in_channels (int): The number of input channels.
        skip_connection_channels (tuple|list): The channels of shortcut layers in encoder.
        gr (tuple|list): The growth rate in each HarDBlock, which is k in the paper.
        grmul (float): The channel multiplying factor in HarDBlock, which is m in the paper.
        n_layers (tuple|list): The number of layers in each HarDBlock.
    """

    def __init__(self,
                 n_blocks,
                 in_channels,
                 skip_connection_channels,
                 gr,
                 grmul,
                 n_layers,
                 align_corners=False):
        super().__init__()
        prev_block_channels = in_channels
        self.n_blocks = n_blocks
        self.dense_blocks_up = nn.LayerList()
        self.conv1x1_up = nn.LayerList()
        for i in range(n_blocks - 1, -1, -1):
            cur_channels_count = prev_block_channels + skip_connection_channels[i]
            conv1x1 = layers.ConvBNReLU(
                cur_channels_count,
                cur_channels_count // 2,
                kernel_size=1,
                bias_attr=False)
            blk = HarDBlock(
                base_channels=cur_channels_count // 2,
                growth_rate=gr[i],
                grmul=grmul,
                n_layers=n_layers[i])
            self.conv1x1_up.append(conv1x1)
            self.dense_blocks_up.append(blk)
            prev_block_channels = blk.get_out_ch()
        self.out_channels = prev_block_channels
        self.align_corners = align_corners

    def forward(self, x, skip_connections):
        for i in range(self.n_blocks):
            skip = skip_connections.pop()  # deepest skip first
            x = F.interpolate(
                x,
                size=paddle.shape(skip)[2:],
                mode="bilinear",
                align_corners=self.align_corners)
            x = paddle.concat([x, skip], axis=1)  # UNet-style concat fusion
            x = self.conv1x1_up[i](x)             # halve the channel count
            x = self.dense_blocks_up[i](x)        # HarDBlock refinement
        return x

    def get_out_channels(self):
        return self.out_channels
Network construction code
Looking closely at HarDNet's forward flow, the first step is the stem preprocessing (which includes one downsampling). As for how encoder and decoder features are connected, it is a plain UNet structure; if the topology were swapped for the UNet 3+ style, HarDNet might well take another leap.
class HarDNet(nn.Layer):
    """
    [Real Time] The FC-HarDNet 70 implementation based on PaddlePaddle.
    The original article refers to
    Chao, Ping, et al. "HarDNet: A Low Memory Traffic Network"
    (https://arxiv.org/pdf/1909.00948.pdf)

    Args:
        num_classes (int): The unique number of target classes.
        stem_channels (tuple|list, optional): The number of channels before the encoder. Default: (16, 24, 32, 48).
        ch_list (tuple|list, optional): The number of channels at each block in the encoder. Default: (64, 96, 160, 224, 320).
        grmul (float, optional): The channel multiplying factor in HarDBlock, which is m in the paper. Default: 1.7.
        gr (tuple|list, optional): The growth rate in each HarDBlock, which is k in the paper. Default: (10, 16, 18, 24, 32).
        n_layers (tuple|list, optional): The number of layers in each HarDBlock. Default: (4, 4, 8, 8, 8).
        align_corners (bool): An argument of F.interpolate. It should be set to False when the output size of feature
            is even, e.g. 1024x512, otherwise it is True, e.g. 769x769. Default: False.
        pretrained (str, optional): The path or url of pretrained model. Default: None.
    """

    def __init__(self,
                 num_classes,
                 stem_channels=(16, 24, 32, 48),
                 ch_list=(64, 96, 160, 224, 320),
                 grmul=1.7,
                 gr=(10, 16, 18, 24, 32),
                 n_layers=(4, 4, 8, 8, 8),
                 align_corners=False,
                 pretrained=None):
        super().__init__()
        self.align_corners = align_corners
        self.pretrained = pretrained
        encoder_blks_num = len(n_layers)
        decoder_blks_num = encoder_blks_num - 1
        encoder_in_channels = stem_channels[3]
        self.stem = nn.Sequential(
            layers.ConvBNReLU(
                3, stem_channels[0], kernel_size=3, bias_attr=False),
            layers.ConvBNReLU(
                stem_channels[0],
                stem_channels[1],
                kernel_size=3,
                bias_attr=False),
            layers.ConvBNReLU(
                stem_channels[1],
                stem_channels[2],
                kernel_size=3,
                stride=2,
                bias_attr=False),
            layers.ConvBNReLU(
                stem_channels[2],
                stem_channels[3],
                kernel_size=3,
                bias_attr=False))
        self.encoder = Encoder(encoder_blks_num, encoder_in_channels, ch_list,
                               gr, grmul, n_layers)
        skip_connection_channels = self.encoder.get_skip_channels()
        decoder_in_channels = self.encoder.get_out_channels()
        self.decoder = Decoder(decoder_blks_num, decoder_in_channels,
                               skip_connection_channels, gr, grmul, n_layers,
                               align_corners)
        self.cls_head = nn.Conv2D(
            in_channels=self.decoder.get_out_channels(),
            out_channels=num_classes,
            kernel_size=1)
        self.init_weight()

    def forward(self, x):
        input_shape = paddle.shape(x)[2:]
        x = self.stem(x)
        x, skip_connections = self.encoder(x)
        x = self.decoder(x, skip_connections)
        logit = self.cls_head(x)
        logit = F.interpolate(
            logit,
            size=input_shape,
            mode="bilinear",
            align_corners=self.align_corners)
        return [logit]

    def init_weight(self):
        if self.pretrained is not None:
            utils.load_entire_model(self, self.pretrained)
1.4 LinkNet
The network structure is a simple U-shaped network. The only change relative to UNet is that channel concatenation is replaced by element-wise addition, which reduces computation and parameters in the decoding path. The encoder starts with a stem that convolves the input image with a 7x7 kernel at stride 2, followed by 3x3 spatial max-pooling at stride 2. Paper notes: https://hpg123.blog.csdn.net/article/details/125849870
Key features
Replacing UNet's concat with add saves decoder computation and parameters.
Skip connections inside each block make full use of the ResNet structure.
4x downsampling before the first block shrinks the feature maps and speeds up inference.
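The saving from add-fusion can be seen with a quick shape-and-parameter sketch (my own arithmetic, not LinkNet code): after concat the next 3x3 conv sees 2C input channels, after add it sees only C.

```python
import numpy as np

def fuse_concat(dec, skip):
    return np.concatenate([dec, skip], axis=0)  # (2C, H, W): doubles channels

def fuse_add(dec, skip):
    return dec + skip                           # (C, H, W): channels unchanged

C = 64
dec = np.zeros((C, 8, 8))
skip = np.ones((C, 8, 8))
# weights of a following 3x3 conv back to C channels (bias ignored):
params_after_concat = (2 * C) * C * 3 * 3
params_after_add = C * C * 3 * 3
```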
2. Feature-Context Improvements
Common ways to fuse multi-scale features: image pyramids, encoder-decoder U-networks (skip connections from encoder features), atrous (dilated) convolution, and SPP (spatial pyramid pooling).
2.1 PSPNet
The pyramid scene parsing module fuses features under four different pyramid scales. The coarsest level (highlighted in red in the paper's figure) is global pooling producing a single output. The levels below divide the feature map into sub-regions and form pooled representations for different locations. The outputs of the different levels therefore contain feature maps of different sizes. To maintain the weight of the global feature, if the pyramid has N levels, a 1x1 convolution after each level reduces the context representation to 1/N of the original channel dimension. The low-dimensional feature maps are then upsampled by bilinear interpolation back to the size of the original feature map. Finally, the features of all levels are concatenated with the original map as the final pyramid-pooling global feature. Reference: https://hpg123.blog.csdn.net/article/details/125810356
Implementation of the pyramid pooling module: the module holds one branch per pool scale (the default is pool_scales=(1, 2, 3, 6)); each branch runs AdaptiveAvgPool2d (adaptive pooling to the given output size) -> ConvModule (1x1 conv for cross-channel fusion) -> resize (upsampling).
class PPM(nn.ModuleList):
    """Pooling Pyramid Module used in PSPNet.

    Args:
        pool_scales (tuple[int]): Pooling scales used in Pooling Pyramid
            Module.
        in_channels (int): Input channels.
        channels (int): Channels after modules, before conv_seg.
        conv_cfg (dict|None): Config of conv layers.
        norm_cfg (dict|None): Config of norm layers.
        act_cfg (dict): Config of activation layers.
        align_corners (bool): align_corners argument of F.interpolate.
    """

    def __init__(self, pool_scales, in_channels, channels, conv_cfg, norm_cfg,
                 act_cfg, align_corners, **kwargs):
        super(PPM, self).__init__()
        self.pool_scales = pool_scales
        self.align_corners = align_corners
        self.in_channels = in_channels
        self.channels = channels
        self.conv_cfg = conv_cfg
        self.norm_cfg = norm_cfg
        self.act_cfg = act_cfg
        for pool_scale in pool_scales:
            self.append(
                nn.Sequential(
                    nn.AdaptiveAvgPool2d(pool_scale),
                    ConvModule(
                        self.in_channels,
                        self.channels,
                        1,
                        conv_cfg=self.conv_cfg,
                        norm_cfg=self.norm_cfg,
                        act_cfg=self.act_cfg,
                        **kwargs)))

    def forward(self, x):
        """Forward function."""
        ppm_outs = []
        for ppm in self:
            ppm_out = ppm(x)
            upsampled_ppm_out = resize(
                ppm_out,
                size=x.size()[2:],
                mode='bilinear',
                align_corners=self.align_corners)
            ppm_outs.append(upsampled_ppm_out)
        return ppm_outs
PSPHead implementation
@HEADS.register_module()
class PSPHead(BaseDecodeHead):
    """Pyramid Scene Parsing Network.

    This head is the implementation of
    `PSPNet <https://arxiv.org/abs/1612.01105>`_.

    Args:
        pool_scales (tuple[int]): Pooling scales used in Pooling Pyramid
            Module. Default: (1, 2, 3, 6).
    """

    def __init__(self, pool_scales=(1, 2, 3, 6), **kwargs):
        super(PSPHead, self).__init__(**kwargs)
        assert isinstance(pool_scales, (list, tuple))
        self.pool_scales = pool_scales
        self.psp_modules = PPM(
            self.pool_scales,
            self.in_channels,
            self.channels,
            conv_cfg=self.conv_cfg,
            norm_cfg=self.norm_cfg,
            act_cfg=self.act_cfg,
            align_corners=self.align_corners)
        self.bottleneck = ConvModule(
            self.in_channels + len(pool_scales) * self.channels,
            self.channels,
            3,
            padding=1,
            conv_cfg=self.conv_cfg,
            norm_cfg=self.norm_cfg,
            act_cfg=self.act_cfg)

    def _forward_feature(self, inputs):
        """Forward function for feature maps before classifying each pixel with
        ``self.cls_seg`` fc.

        Args:
            inputs (list[Tensor]): List of multi-level img features.

        Returns:
            feats (Tensor): A tensor of shape (batch_size, self.channels,
                H, W) which is feature map for last layer of decoder head.
        """
        x = self._transform_inputs(inputs)
        psp_outs = [x]
        psp_outs.extend(self.psp_modules(x))
        psp_outs = torch.cat(psp_outs, dim=1)
        feats = self.bottleneck(psp_outs)
        return feats

    def forward(self, inputs):
        """Forward function."""
        output = self._forward_feature(inputs)
        output = self.cls_seg(output)
        return output
2.2 DeepLabv3
Standard atrous (dilated) convolution.
Atrous convolution applied deep in the network: block5, block6, and block7 are copies of block4.
The multi-grid method assigns different dilation rates to block5, block6, and block7.
Atrous Spatial Pyramid Pooling (ASPP): feature pyramid pooling built from atrous convolutions.
Reference: https://hpg123.blog.csdn.net/article/details/125853032
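The reach of an atrous convolution is easy to compute: a k x k kernel with dilation rate r spans k + (k - 1)(r - 1) input pixels. A small sketch with the ASPP rates used at output stride 16:

```python
def atrous_span(k, rate):
    """Number of input pixels spanned by a kxk conv with dilation `rate`."""
    return k + (k - 1) * (rate - 1)

# DeepLabv3's ASPP runs in parallel: one 1x1 conv, three 3x3 atrous convs
# with rates (6, 12, 18), and image-level global average pooling; the five
# branch outputs are concatenated and fused by a final 1x1 conv.
aspp_spans = [atrous_span(3, r) for r in (6, 12, 18)]
```

With rate 1 the formula reduces to the plain kernel size, so the same layer type covers both cases.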
Model evaluation
Once the model is trained, output stride 8 is applied at inference. As Table 6 of the paper shows, evaluating at output stride 8 improves over output stride 16 by 1.3%; multi-scale inputs and adding left-right flipped images further improve performance by 0.94% and 0.32% respectively. The best ASPP model reaches 79.77%, beating the best cascaded-atrous-convolution model (79.35%), and is therefore chosen as the final model for test-set evaluation.
The paper also runs ablations on the effect of output stride on mIoU (evaluating IoU on resized images), of batch size, of crop size, and so on.
2.3 Multi-scale attention
Multi-scale inference is a common way to improve segmentation results: the image is passed through the network at several scales and the results are combined by average or max pooling. In this work NVIDIA proposes an attention-based method to combine the multi-scale predictions. The attention mechanism is hierarchical, which makes training about 4x more efficient than other recent approaches. During training, the input image is scaled by a factor r, where r = 0.5 means 2x downsampling, r = 2.0 means 2x upsampling, and r = 1 leaves the image unchanged; training uses r = 0.5 and r = 1.0. The two images at r = 1 and r = 0.5 are passed through a shared backbone, producing semantic logits L and, per scale, an attention mask (alpha) used to fuse the logits between the two scales. At test time the scales are set to 0.5, 1.0, and 2.0. Compared with single-scale training of all test scales, this multi-scale attention scheme saves a large amount of training cost. Reference: https://hpg123.blog.csdn.net/article/details/126385231
Objects of different sizes may need inputs of different sizes when being segmented. The essence is that, in the forward pass, the same conv at the same depth has a different relative receptive field for inputs of different scales. For large objects, shrinking the image enlarges the conv's receptive field relative to the whole image (the absolute receptive field at a fixed depth is constant, but its ratio to the image changes); for small objects, enlarging the image shrinks the relative receptive field, letting the conv focus on the object's interior and ignore outside interference.
Potential implementation issues
The method essentially turns an ordinary segmentation model into a Siamese network and adds a spatial-attention branch to the original model (predicting, per position, the probability with which the current scale's result should be used in the fusion).
The attention map produced by the spatial-attention branch only needs shape (b, 1, h, w); the authors do not prescribe how the branch is implemented, so it need not follow the conventional QKV attention form (which would build a (b, h*w, h*w) attention matrix and demand huge amounts of memory). A plain convolutional head could produce this attention map; alternatively it could be implemented with Criss-Cross Attention or Interlaced Sparse Self-Attention. For more on attention see https://hpg123.blog.csdn.net/article/details/126538242.
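The fusion itself is just a pixel-wise convex combination. A minimal NumPy sketch (shapes assumed: (b, 1, h, w) for the attention map and (b, K, h, w) for the logits, after both scales are resized to a common resolution):

```python
import numpy as np

def fuse_two_scales(logits_lo, logits_hi, attn_lo):
    """Attention-weighted fusion of two scales: attn_lo in [0, 1] says how much
    each pixel should trust the low-scale (r = 0.5) prediction."""
    return attn_lo * logits_lo + (1 - attn_lo) * logits_hi

lo = np.full((1, 3, 4, 4), 1.0)    # logits from the r = 0.5 pass (upsampled)
hi = np.full((1, 3, 4, 4), 3.0)    # logits from the r = 1.0 pass
attn = np.full((1, 1, 4, 4), 0.25)
```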
3. Network-Architecture Improvements
3.1 HarDNet
HarDNet is a general-purpose topology improvement. Studying the skip connections in ResNet and DenseNet, it targets low MACs (multiply-accumulate operations) and low memory traffic, and defines Convolutional Input/Output (CIO): the sum, over convolution layers, of the input and output tensor sizes, a simple approximation of real DRAM traffic. CIO can be minimized trivially by using a few very large convolution kernels, but that hurts computational efficiency and ultimately adds latency that outweighs the gain. HarDNet therefore argues that keeping computational efficiency high is still necessary: only when a layer's MACs-over-CIO ratio (MoC) falls below a platform-dependent threshold does CIO dominate inference time.
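CIO and MoC are simple enough to compute directly (helper functions of my own, following the paper's definitions):

```python
def conv_cio(c_in, h_in, w_in, c_out, h_out, w_out):
    """CIO of one conv layer: input tensor size plus output tensor size,
    used in the HarDNet paper as a proxy for DRAM traffic."""
    return c_in * h_in * w_in + c_out * h_out * w_out

def conv_macs(c_in, c_out, k, h_out, w_out):
    """Multiply-accumulates of a plain kxk conv (no groups, bias ignored)."""
    return c_in * c_out * k * k * h_out * w_out

# MoC = MACs / CIO; the paper argues CIO dominates latency only for layers
# whose MoC is below a platform-dependent threshold.
cio = conv_cio(64, 56, 56, 64, 56, 56)
macs = conv_macs(64, 64, 3, 56, 56)
moc = macs / cio
```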
As a general-purpose topology change, HarDNet can be used in many models. Among PaddleSeg's lightweight models HarDNet achieves excellent results; in PaddleSeg's accuracy-vs-speed table it at one point surpasses SegFormer-B1. For detailed comparisons see https://gitee.com/paddlepaddle/PaddleSeg/blob/release/2.5/docs/model_zoo_overview_cn.md
The full implementation is at https://gitee.com/paddlepaddle/PaddleSeg/blob/release/2.5/paddleseg/models/hardnet.py
3.2 SegFormer
SegFormer unifies Transformers with a lightweight multilayer-perceptron (MLP) decoder. It has two appealing features: 1) a novel hierarchical Transformer encoder that outputs multi-scale features and needs no positional encoding, avoiding the positional-code interpolation that degrades performance when the test resolution differs from the training resolution; 2) no complex decoder: the proposed MLP decoder aggregates information from different layers, combining local and global attention into a powerful representation. A translation of the SegFormer paper: https://hpg123.blog.csdn.net/article/details/126040514
SegFormer downsamples with Overlapped Patch Merging, implemented below. In essence, the stride of the patch-embedding convolution controls the downsampling. As the architecture diagram shows, the first OverlapPatchEmbed performs 1/4 downsampling.
class OverlapPatchEmbed(nn.Layer):
    """Image to Patch Embedding."""

    def __init__(self,
                 img_size=224,
                 patch_size=7,
                 stride=4,
                 in_chans=3,
                 embed_dim=768):
        super().__init__()
        img_size = to_2tuple(img_size)
        patch_size = to_2tuple(patch_size)
        self.img_size = img_size
        self.patch_size = patch_size
        self.H, self.W = img_size[0] // patch_size[0], img_size[1] // patch_size[1]
        self.num_patches = self.H * self.W
        self.proj = nn.Conv2D(
            in_chans,
            embed_dim,
            kernel_size=patch_size,
            stride=stride,  # stride < kernel_size => overlapping patches
            padding=(patch_size[0] // 2, patch_size[1] // 2))
        self.norm = nn.LayerNorm(embed_dim)
        self.apply(self._init_weights)

    def forward(self, x):
        x = self.proj(x)
        x_shape = paddle.shape(x)
        H, W = x_shape[2], x_shape[3]
        x = x.flatten(2).transpose([0, 2, 1])  # (B, C, H, W) -> (B, H*W, C)
        x = self.norm(x)
        return x, H, W
The Efficient Self-Attention used in SegFormer is identical to the spatial-reduction attention in PVT [see https://hpg123.blog.csdn.net/article/details/126538242]: the spatial (H*W) extent of K is reduced by a factor R, its information folded into the channel dimension, which cuts the memory cost of the attention matrix multiplication. Stages one to four use R = [64, 16, 4, 1].
mmseg implements SegFormer's Efficient Self-Attention as follows; the stride of the sr conv compresses the data, greatly reducing the size of the K and V inputs.
# excerpt from EfficientMultiheadAttention.__init__:
self.sr_ratio = sr_ratio
if sr_ratio > 1:
    self.sr = Conv2d(
        in_channels=embed_dims,
        out_channels=embed_dims,
        kernel_size=sr_ratio,
        stride=sr_ratio)
    # the strided conv turns a W*H map into (W/sr_ratio)*(H/sr_ratio)
    # The ret[0] of build_norm_layer is norm name.
    self.norm = build_norm_layer(norm_cfg, embed_dims)[1]

# handle the BC-breaking from https://github.com/open-mmlab/mmcv/pull/1418  # noqa
from mmseg import digit_version, mmcv_version
if mmcv_version < digit_version('1.3.17'):
    warnings.warn('The legacy version of forward function in'
                  'EfficientMultiheadAttention is deprecated in'
                  'mmcv>=1.3.17 and will no longer support in the'
                  'future. Please upgrade your mmcv.')
    self.forward = self.legacy_forward

def forward(self, x, hw_shape, identity=None):
    x_q = x
    if self.sr_ratio > 1:
        x_kv = nlc_to_nchw(x, hw_shape)
        x_kv = self.sr(x_kv)      # spatial reduction of K/V
        x_kv = nchw_to_nlc(x_kv)
        x_kv = self.norm(x_kv)
    else:
        x_kv = x
    if identity is None:
        identity = x_q
    # Because the dataflow('key', 'query', 'value') of
    # ``torch.nn.MultiheadAttention`` is (num_query, batch,
    # embed_dims), We should adjust the shape of dataflow from
    # batch_first (batch, num_query, embed_dims) to num_query_first
    # (num_query, batch, embed_dims), and recover ``attn_output``
    # from num_query_first to batch_first.
    if self.batch_first:
        x_q = x_q.transpose(0, 1)
        x_kv = x_kv.transpose(0, 1)
    out = self.attn(query=x_q, key=x_kv, value=x_kv)[0]
    if self.batch_first:
        out = out.transpose(0, 1)
    return identity + self.dropout_layer(self.proj_drop(out))
3.3 SegNeXt
SegNeXt is a simple convolutional architecture for semantic segmentation. By improving conventional convolution blocks it surpasses Transformer models at comparable parameter budgets, exceeding their mIoU by more than 2 points on ADE20K, Cityscapes, COCO-Stuff, Pascal VOC, Pascal Context, and iSAID. Its strength lies in the encoder (backbone): it imports some Transformer ingredients into plain convolution (patch embedding, MLP blocks) and proposes the MSCA attention module, which holds an advantage among spatial attention designs for segmentation. Reference: https://blog.csdn.net/a486259/article/details/129402562
The backbone implementation is below; it does nothing exotic for feature extraction, simply stacking OverlapPatchEmbed and MSCAN blocks. For feature fusion it concatenates feature maps from three scales, much like SegFormer, and uses the lightweight Hamburger module to further model global context.
@BACKBONES.register_module()
class MSCAN(BaseModule):
    def __init__(self,
                 in_chans=3,
                 embed_dims=[64, 128, 256, 512],
                 mlp_ratios=[4, 4, 4, 4],
                 drop_rate=0.,
                 drop_path_rate=0.,
                 depths=[3, 4, 6, 3],
                 num_stages=4,
                 norm_cfg=dict(type='SyncBN', requires_grad=True),
                 pretrained=None,
                 init_cfg=None):
        super(MSCAN, self).__init__(init_cfg=init_cfg)
        assert not (init_cfg and pretrained), \
            'init_cfg and pretrained cannot be set at the same time'
        if isinstance(pretrained, str):
            warnings.warn('DeprecationWarning: pretrained is deprecated, '
                          'please use "init_cfg" instead')
            self.init_cfg = dict(type='Pretrained', checkpoint=pretrained)
        elif pretrained is not None:
            raise TypeError('pretrained must be a str or None')
        self.depths = depths
        self.num_stages = num_stages

        dpr = [x.item() for x in torch.linspace(0, drop_path_rate,
                                                sum(depths))]  # stochastic depth decay rule
        cur = 0

        for i in range(num_stages):
            if i == 0:
                patch_embed = StemConv(3, embed_dims[0], norm_cfg=norm_cfg)
            else:
                patch_embed = OverlapPatchEmbed(
                    patch_size=7 if i == 0 else 3,
                    stride=4 if i == 0 else 2,
                    in_chans=in_chans if i == 0 else embed_dims[i - 1],
                    embed_dim=embed_dims[i],
                    norm_cfg=norm_cfg)
            block = nn.ModuleList([
                Block(dim=embed_dims[i], mlp_ratio=mlp_ratios[i],
                      drop=drop_rate, drop_path=dpr[cur + j],
                      norm_cfg=norm_cfg)
                for j in range(depths[i])
            ])
            norm = nn.LayerNorm(embed_dims[i])
            cur += depths[i]

            setattr(self, f"patch_embed{i + 1}", patch_embed)
            setattr(self, f"block{i + 1}", block)
            setattr(self, f"norm{i + 1}", norm)

    def forward(self, x):
        B = x.shape[0]
        outs = []
        for i in range(self.num_stages):
            patch_embed = getattr(self, f"patch_embed{i + 1}")
            block = getattr(self, f"block{i + 1}")
            norm = getattr(self, f"norm{i + 1}")
            x, H, W = patch_embed(x)
            for blk in block:
                x = blk(x, H, W)
            x = norm(x)
            x = x.reshape(B, H, W, -1).permute(0, 3, 1, 2).contiguous()
            outs.append(x)
        return outs
SegNeXt's memory consumption is enormous: although it wins on FLOPs, it is not advantageous for training or deployment. In a quick test with MSCAN-S, loading the model consumed 650 MiB of GPU memory, a forward pass on a single 512x512 image raised this to 2315 MiB, and training on 512x512 images at batch size 8 reported 1200 MiB. Memory use on this scale is disastrous for engineering practice. Judging from the memory consumption and the network structure, the MLP blocks (fc1 -> dwconv -> fc2) are responsible: their hidden widths are mlp_ratio = 4 times [64, 128, 256, 512], so the widest hidden layer works on 2048 channels, which makes the block heavy in both parameters and activations.
(mlp): Mlp(
  (fc1): Conv2d(512, 2048, kernel_size=(1, 1), stride=(1, 1))
  (dwconv): Conv2d(2048, 2048, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=2048)
  (act): GELU()
  (fc2): Conv2d(2048, 512, kernel_size=(1, 1), stride=(1, 1))
  (drop): Dropout(p=0.0, inplace=False)
)
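A quick back-of-the-envelope parameter count for this printed Mlp (my own arithmetic; Conv2d parameters = Cin/groups * Cout * k * k + Cout) shows where the weight actually sits:

```python
def conv2d_params(c_in, c_out, k, groups=1):
    """Parameter count of a Conv2d with bias."""
    return (c_in // groups) * c_out * k * k + c_out

fc1 = conv2d_params(512, 2048, 1)              # 1x1 expansion: ~1.05M params
dwconv = conv2d_params(2048, 2048, 3, 2048)    # depthwise 3x3: only ~20K params
fc2 = conv2d_params(2048, 512, 1)              # 1x1 projection: ~1.05M params
# the 1x1 expansion and projection carry the bulk of the parameters, while the
# depthwise conv is cheap in weights but operates on 2048-channel activations
```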