
[Paper Notes] Swin UNETR: Semantic Segmentation of Brain Tumors in MRI Images


Author: Sijin Yu

[1] Ali Hatamizadeh, Vishwesh Nath, Yucheng Tang, Dong Yang, Holger R. Roth, and Daguang Xu. Swin UNETR: Swin Transformers for Semantic Segmentation of Brain Tumors in MRI Images. MICCAI, 2022.

Open-source code link

1. Abstract

  • Semantic segmentation of brain tumors is a fundamental medical image analysis task involving multiple MRI imaging modalities; it can assist clinicians in diagnosing patients and in subsequently studying the progression of malignant entities.
  • In recent years, Fully Convolutional Neural Networks (FCNNs) have become the de facto standard for 3D medical image segmentation.
  • The popular "U-shaped" network architecture has achieved state-of-the-art benchmarks on various 2D and 3D semantic segmentation tasks and across different imaging modalities.
  • However, because the convolution kernels in FCNNs have limited size, their ability to model long-range information is sub-optimal, which can cause deficiencies when segmenting tumors of varying sizes.
  • Transformer models, on the other hand, have demonstrated an exceptional ability to capture long-range information in multiple domains, including natural language processing and computer vision.
  • Inspired by the success of ViT and its variants, we propose a new segmentation model named Swin UNEt TRansformers (Swin UNETR).
  • Specifically, the 3D brain tumor semantic segmentation task is reformulated as a sequence-to-sequence prediction problem, in which multi-modal input data is projected into a 1D sequence of embeddings and used as input to a hierarchical Swin Transformer encoder.
  • The Swin Transformer encoder computes self-attention with shifted windows, extracts features at five different resolutions, and is connected to an FCNN-based decoder at each resolution via skip connections.
  • We participated in the BraTS 2021 segmentation challenge, where the proposed model ranked among the top-performing approaches in the validation phase.

2. Motivation & Contribution

2.1 Motivation

  • In healthcare AI, and brain tumor analysis in particular, more advanced segmentation techniques are needed to accurately delineate tumors for diagnosis and pre-operative planning.
  • Current CNN-based brain tumor segmentation methods struggle to capture long-range dependencies because of their limited receptive fields.
  • ViTs have shown promise in capturing long-range information across various domains, suggesting their applicability to improving medical image segmentation.

2.2 Contribution

  • A novel architecture, Swin UNEt TRansformers (Swin UNETR), is proposed, combining a Swin Transformer encoder with a U-shaped CNN decoder for multi-modal 3D brain tumor segmentation.
  • The effectiveness of Swin UNETR is demonstrated in the 2021 Multimodal Brain Tumor Segmentation Challenge (BraTS), where it achieved a top ranking in the validation phase and competitive performance on the test set.

3. Model

[Fig. 1: Overview of the Swin UNETR architecture (figure not reproduced here)]

  1. Partition the input image into patches.

    The input image is $X\in\mathbb{R}^{H\times W\times D\times S}$, where $S$ is the number of input channels (modalities). Each patch has resolution $(H', W', D')$ and therefore shape $\mathbb{R}^{H'\times W'\times D'\times S}$.

    The image thus becomes a sequence of patches of length $\lceil\frac{H}{H'}\rceil\times\lceil\frac{W}{W'}\rceil\times\lceil\frac{D}{D'}\rceil$.

    In this paper, the patch size is $(H', W', D') = (2, 2, 2)$.

    Each patch is mapped to a token with embedding dimension $C$, yielding a 3D grid of tokens with resolution $\left(\lceil\frac{H}{H'}\rceil, \lceil\frac{W}{W'}\rceil, \lceil\frac{D}{D'}\rceil\right)$.

  2. Apply a Swin Transformer to the 3D tokens.

    Each Swin Transformer block consists of two sub-layers: W-MSA and SW-MSA.

    After each Swin Transformer stage, the 3D token resolution along every axis is halved and the number of channels is doubled; see the lower-left corner of Fig. 1. (A small worked sketch of these shapes is given after the figure below.)

    W-MSA and SW-MSA are window-based multi-head self-attention with regular and cyclically shifted window partitioning, respectively, as illustrated in the figure below.

[Figure: regular (W-MSA) vs. cyclically shifted (SW-MSA) window partitioning (not reproduced here)]
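
To make the shapes concrete, here is a minimal sketch (my own, not from the paper or its code release) that traces the token-grid resolution and channel count through patch embedding and the four Swin stages. The input size 128³, S = 4 modalities, and embedding dimension C = 48 are assumptions chosen to match a typical BraTS configuration.

import math

H = W = D = 128        # assumed input volume size (for illustration only)
S = 4                  # number of MRI modalities (input channels)
Hp, Wp, Dp = 2, 2, 2   # patch size (H', W', D')
C = 48                 # assumed embedding dimension (feature_size)

# Patch embedding: every (2 x 2 x 2 x S) patch becomes one C-dimensional token.
res = (math.ceil(H / Hp), math.ceil(W / Wp), math.ceil(D / Dp))
ch = C
print("after patch embedding:", res, "channels:", ch)   # (64, 64, 64), 48

# Each Swin stage halves the token resolution per axis and doubles the channels.
for stage in range(1, 5):
    res = tuple(r // 2 for r in res)
    ch *= 2
    print(f"after stage {stage}:", res, "channels:", ch)
# after stage 4: (4, 4, 4), channels 768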

4. Experiment

4.1 Dataset

  • BraTS 2021

4.2 Comparison Experiments

[Figure: comparison with other methods on BraTS 2021 (not reproduced here)]

5. Code

The following link provides a tutorial on BraTS21 brain tumor segmentation with the Swin UNETR model:

Below are annotated excerpts of the core code.

5.1 Data Preprocessing and Augmentation

from monai import transforms

train_transform = transforms.Compose(
  [
    # Load the image and label volumes
    transforms.LoadImaged(keys=["image", "label"]),
    # Convert the single-channel label map into a multi-channel format, one channel per
    # tumor class (before conversion, all class labels share a single-channel image)
    transforms.ConvertToMultiChannelBasedOnBratsClassesd(keys="label"),
    # Crop away the background region surrounding the image
    transforms.CropForegroundd(
        keys=["image", "label"],
        source_key="image",
        k_divisible=[roi[0], roi[1], roi[2]],
    ),
    # Randomly crop the image to the specified size
    transforms.RandSpatialCropd(
        keys=["image", "label"],
        roi_size=[roi[0], roi[1], roi[2]],
        random_size=False,
    ),
    # Random flip along axis 0
    transforms.RandFlipd(keys=["image", "label"], prob=0.5, spatial_axis=0),
    # Random flip along axis 1
    transforms.RandFlipd(keys=["image", "label"], prob=0.5, spatial_axis=1),
    # Random flip along axis 2
    transforms.RandFlipd(keys=["image", "label"], prob=0.5, spatial_axis=2),
    # Per-channel intensity normalization, ignoring zero-valued voxels
    transforms.NormalizeIntensityd(keys="image", nonzero=True, channel_wise=True),
    # Random intensity scaling: img = img * (1 + eps)
    transforms.RandScaleIntensityd(keys="image", factors=0.1, prob=1.0),
    # Random intensity shift: img = img + eps
    transforms.RandShiftIntensityd(keys="image", offsets=0.1, prob=1.0),
  ]
)

val_transform = transforms.Compose(
  [
    transforms.LoadImaged(keys=["image", "label"]),
    transforms.ConvertToMultiChannelBasedOnBratsClassesd(keys="label"),
    transforms.NormalizeIntensityd(keys="image", nonzero=True, channel_wise=True),
  ]
)
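
For reference, a minimal usage sketch of the pipeline above. The file names are placeholders and `roi` is an assumption (the tutorial defines it, typically a cubic crop such as 128³); it must be defined before `train_transform` is built, since the transform references it.

# Illustrative only: roi must already be defined when train_transform is constructed above.
roi = (128, 128, 128)   # assumed training crop size

sample = {
    # Placeholder paths: one file per MRI modality plus the segmentation label.
    "image": [
        "BraTS2021_00000_flair.nii.gz",
        "BraTS2021_00000_t1ce.nii.gz",
        "BraTS2021_00000_t1.nii.gz",
        "BraTS2021_00000_t2.nii.gz",
    ],
    "label": "BraTS2021_00000_seg.nii.gz",
}

out = train_transform(sample)
# Expected (approximately): image of shape (4, 128, 128, 128), one channel per modality,
# and label of shape (3, 128, 128, 128), one channel per tumor sub-region.
print(out["image"].shape, out["label"].shape)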

5.2 Swin UNETR Model Architecture

def forward(self, x_in):
  if not torch.jit.is_scripting():
    self._check_input_size(x_in.shape[2:])
  # Hierarchical features from the Swin Transformer encoder (five resolutions)
  hidden_states_out = self.swinViT(x_in, self.normalize)
  # CNN encoders applied to the input and to each hierarchical feature map
  enc0 = self.encoder1(x_in)
  enc1 = self.encoder2(hidden_states_out[0])
  enc2 = self.encoder3(hidden_states_out[1])
  enc3 = self.encoder4(hidden_states_out[2])
  # Bottleneck on the deepest feature map
  dec4 = self.encoder10(hidden_states_out[4])
  # Decoder: upsample step by step, concatenating the matching skip connection at each resolution
  dec3 = self.decoder5(dec4, hidden_states_out[3])
  dec2 = self.decoder4(dec3, enc3)
  dec1 = self.decoder3(dec2, enc2)
  dec0 = self.decoder2(dec1, enc1)
  out = self.decoder1(dec0, enc0)
  # Final 1x1x1 convolution producing per-voxel class logits
  logits = self.out(out)
  return logits
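
This `forward` is from MONAI's SwinUNETR implementation. A minimal end-to-end sketch follows; the constructor arguments are assumptions matching a typical BraTS setup (4 modalities in, 3 tumor sub-region channels out, feature_size=48), and the exact signature may vary across MONAI versions (img_size is deprecated in newer releases).

import torch
from monai.networks.nets import SwinUNETR

# Assumed BraTS-style configuration; adjust to your MONAI version if needed.
model = SwinUNETR(
    img_size=(128, 128, 128),   # required by older MONAI versions, deprecated in newer ones
    in_channels=4,              # 4 MRI modalities
    out_channels=3,             # 3 tumor sub-region channels
    feature_size=48,
    use_checkpoint=True,
)

x = torch.randn(1, 4, 128, 128, 128)   # (batch, modalities, H, W, D)
with torch.no_grad():
    logits = model(x)
print(logits.shape)   # (1, 3, 128, 128, 128): per-voxel logits for each sub-region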

The components are defined as follows:

self.normalize = normalize

self.swinViT = SwinTransformer(
  in_chans=in_channels,
  embed_dim=feature_size,
  window_size=window_size,
  patch_size=patch_sizes,
  depths=depths,
  num_heads=num_heads,
  mlp_ratio=4.0,
  qkv_bias=True,
  drop_rate=drop_rate,
  attn_drop_rate=attn_drop_rate,
  drop_path_rate=dropout_path_rate,
  norm_layer=nn.LayerNorm,
  use_checkpoint=use_checkpoint,
  spatial_dims=spatial_dims,
  downsample=look_up_option(downsample, MERGING_MODE) if isinstance(downsample, str) else downsample,
  use_v2=use_v2,
)

self.encoder1 = UnetrBasicBlock(
  spatial_dims=spatial_dims,
  in_channels=in_channels,
  out_channels=feature_size,
  kernel_size=3,
  stride=1,
  norm_name=norm_name,
  res_block=True,
)

self.encoder2 = UnetrBasicBlock(
  spatial_dims=spatial_dims,
  in_channels=feature_size,
  out_channels=feature_size,
  kernel_size=3,
  stride=1,
  norm_name=norm_name,
  res_block=True,
)

self.encoder3 = UnetrBasicBlock(
  spatial_dims=spatial_dims,
  in_channels=2 * feature_size,
  out_channels=2 * feature_size,
  kernel_size=3,
  stride=1,
  norm_name=norm_name,
  res_block=True,
)

self.encoder4 = UnetrBasicBlock(
  spatial_dims=spatial_dims,
  in_channels=4 * feature_size,
  out_channels=4 * feature_size,
  kernel_size=3,
  stride=1,
  norm_name=norm_name,
  res_block=True,
)

self.encoder10 = UnetrBasicBlock(
  spatial_dims=spatial_dims,
  in_channels=16 * feature_size,
  out_channels=16 * feature_size,
  kernel_size=3,
  stride=1,
  norm_name=norm_name,
  res_block=True,
)

self.decoder5 = UnetrUpBlock(
  spatial_dims=spatial_dims,
  in_channels=16 * feature_size,
  out_channels=8 * feature_size,
  kernel_size=3,
  upsample_kernel_size=2,
  norm_name=norm_name,
  res_block=True,
)

self.decoder4 = UnetrUpBlock(
  spatial_dims=spatial_dims,
  in_channels=feature_size * 8,
  out_channels=feature_size * 4,
  kernel_size=3,
  upsample_kernel_size=2,
  norm_name=norm_name,
  res_block=True,
)

self.decoder3 = UnetrUpBlock(
  spatial_dims=spatial_dims,
  in_channels=feature_size * 4,
  out_channels=feature_size * 2,
  kernel_size=3,
  upsample_kernel_size=2,
  norm_name=norm_name,
  res_block=True,
)
self.decoder2 = UnetrUpBlock(
  spatial_dims=spatial_dims,
  in_channels=feature_size * 2,
  out_channels=feature_size,
  kernel_size=3,
  upsample_kernel_size=2,
  norm_name=norm_name,
  res_block=True,
)

self.decoder1 = UnetrUpBlock(
  spatial_dims=spatial_dims,
  in_channels=feature_size,
  out_channels=feature_size,
  kernel_size=3,
  upsample_kernel_size=2,
  norm_name=norm_name,
  res_block=True,
)

self.out = UnetOutBlock(spatial_dims=spatial_dims, in_channels=feature_size, out_channels=out_channels)
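
Tracing the channel counts through these components (hand-computed under the assumptions of a 128³ input and feature_size = 48; not output produced by the code):

  • hidden_states_out[0..4] from swinViT have 48, 96, 192, 384, and 768 channels at resolutions 64³, 32³, 16³, 8³, and 4³.
  • enc0 = encoder1(x_in) has 48 channels at 128³; enc1, enc2, enc3 keep 48, 96, and 192 channels at 64³, 32³, and 16³.
  • dec4 = encoder10(hidden_states_out[4]) is the bottleneck: 768 channels at 4³.
  • decoder5 through decoder1 halve the channels and double the resolution at each step (768 → 384 → 192 → 96 → 48 → 48), concatenating the matching skip feature before each convolution.
  • self.out maps the final 48 channels to out_channels with a 1×1×1 convolution.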

5.2.1 SwinTransformer
class SwinTransformer(nn.Module):
  """
  Swin Transformer based on: "Liu et al.,
  Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
  <https://arxiv.org/abs/2103.14030>"
  https://github.com/microsoft/Swin-Transformer
  """

  def __init__(
    self,
    in_chans: int,
    embed_dim: int,
    window_size: Sequence[int],
    patch_size: Sequence[int],
    depths: Sequence[int],
    num_heads: Sequence[int],
    mlp_ratio: float = 4.0,
    qkv_bias: bool = True,
    drop_rate: float = 0.0,
    attn_drop_rate: float = 0.0,
    drop_path_rate: float = 0.0,
    norm_layer: type[LayerNorm] = nn.LayerNorm,
    patch_norm: bool = False,
    use_checkpoint: bool = False,
    spatial_dims: int = 3,
    downsample="merging",
    use_v2=False,
  ) -> None:
  """
  Args:
    in_chans: dimension of input channels.
    embed_dim: number of linear projection output channels.
    window_size: local window size.
    patch_size: patch size.
    depths: number of layers in each stage.
    num_heads: number of attention heads.
    mlp_ratio: ratio of mlp hidden dim to embedding dim.
    qkv_bias: add a learnable bias to query, key, value.
    drop_rate: dropout rate.
    attn_drop_rate: attention dropout rate.
    drop_path_rate: stochastic depth rate.
    norm_layer: normalization layer.
    patch_norm: add normalization after patch embedding.
    use_checkpoint: use gradient checkpointing for reduced memory usage.
    spatial_dims: spatial dimension.
    downsample: module used for downsampling, available options are `"mergingv2"`, `"merging"` and a
        user-specified `nn.Module` following the API defined in :py:class:`monai.networks.nets.PatchMerging`.
        The default is currently `"merging"` (the original version defined in v0.9.0).
    use_v2: using swinunetr_v2, which adds a residual convolution block at the beginning of each swin stage.
  """
    super().__init__()
    self.num_layers = len(depths)
    self.embed_dim = embed_dim
    self.patch_norm = patch_norm
    self.window_size = window_size
    self.patch_size = patch_size
    self.patch_embed = PatchEmbed(
        patch_size=self.patch_size,
        in_chans=in_chans,
        embed_dim=embed_dim,
        norm_layer=norm_layer if self.patch_norm else None,  # type: ignore
        spatial_dims=spatial_dims,
    )
    self.pos_drop = nn.Dropout(p=drop_rate)
    dpr = [x.item() for x in torch.linspace(0, drop_path_rate, sum(depths))]
    self.use_v2 = use_v2
    self.layers1 = nn.ModuleList()
    self.layers2 = nn.ModuleList()
    self.layers3 = nn.ModuleList()
    self.layers4 = nn.ModuleList()
    if self.use_v2:
      self.layers1c = nn.ModuleList()
      self.layers2c = nn.ModuleList()
      self.layers3c = nn.ModuleList()
      self.layers4c = nn.ModuleList()
    down_sample_mod = look_up_option(downsample, MERGING_MODE) if isinstance(downsample, str) else downsample
    for i_layer in range(self.num_layers):
      layer = BasicLayer(
        dim=int(embed_dim * 2**i_layer),
        depth=depths[i_layer],
        num_heads=num_heads[i_layer],
        window_size=self.window_size,
        drop_path=dpr[sum(depths[:i_layer]) : sum(depths[: i_layer + 1])],
        mlp_ratio=mlp_ratio,
        qkv_bias=qkv_bias,
        drop=drop_rate,
        attn_drop=attn_drop_rate,
        norm_layer=norm_layer,
        downsample=down_sample_mod,
        use_checkpoint=use_checkpoint,
        )
      if i_layer == 0:
        self.layers1.append(layer)
      elif i_layer == 1:
        self.layers2.append(layer)
      elif i_layer == 2:
        self.layers3.append(layer)
      elif i_layer == 3:
        self.layers4.append(layer)
      if self.use_v2:
        layerc = UnetrBasicBlock(
          spatial_dims=3,
          in_channels=embed_dim * 2**i_layer,
          out_channels=embed_dim * 2**i_layer,
          kernel_size=3,
          stride=1,
          norm_name="instance",
          res_block=True,
        )
        # register the extra residual conv block only when use_v2 is enabled (layerc is defined above)
        if i_layer == 0:
          self.layers1c.append(layerc)
        elif i_layer == 1:
          self.layers2c.append(layerc)
        elif i_layer == 2:
          self.layers3c.append(layerc)
        elif i_layer == 3:
          self.layers4c.append(layerc)
    self.num_features = int(embed_dim * 2 ** (self.num_layers - 1))

  def proj_out(self, x, normalize=False):
    if normalize:
      x_shape = x.size()
      if len(x_shape) == 5:
        n, ch, d, h, w = x_shape
        x = rearrange(x, "n c d h w -> n d h w c")
        x = F.layer_norm(x, [ch])
        x = rearrange(x, "n d h w c -> n c d h w")
      elif len(x_shape) == 4:
        n, ch, h, w = x_shape
        x = rearrange(x, "n c h w -> n h w c")
        x = F.layer_norm(x, [ch])
        x = rearrange(x, "n h w c -> n c h w")
    return x

  def forward(self, x, normalize=True):
    x0 = self.patch_embed(x)
    x0 = self.pos_drop(x0)
    x0_out = self.proj_out(x0, normalize)
    if self.use_v2:
      x0 = self.layers1c[0](x0.contiguous())
    x1 = self.layers1[0](x0.contiguous())
    x1_out = self.proj_out(x1, normalize)
    if self.use_v2:
      x1 = self.layers2c[0](x1.contiguous())
    x2 = self.layers2[0](x1.contiguous())
    x2_out = self.proj_out(x2, normalize)
    if self.use_v2:
      x2 = self.layers3c[0](x2.contiguous())
    x3 = self.layers3[0](x2.contiguous())
    x3_out = self.proj_out(x3, normalize)
    if self.use_v2:
      x3 = self.layers4c[0](x3.contiguous())
    x4 = self.layers4[0](x3.contiguous())
    x4_out = self.proj_out(x4, normalize)
    # hierarchical features at 1/2, 1/4, 1/8, 1/16, and 1/32 of the input resolution
    return [x0_out, x1_out, x2_out, x3_out, x4_out]

5.2.2 UnetrBasicBlock
class UnetrBasicBlock(nn.Module):
  """
  A CNN module that can be used for UNETR, based on: "Hatamizadeh et al.,
  UNETR: Transformers for 3D Medical Image Segmentation <https://arxiv.org/abs/2103.10504>"
  """

  def __init__(
    self,
    spatial_dims: int,
    in_channels: int,
    out_channels: int,
    kernel_size: Sequence[int] | int,
    stride: Sequence[int] | int,
    norm_name: tuple | str,
    res_block: bool = False,
  ) -> None:
    """
    Args:
      spatial_dims: number of spatial dimensions.
      in_channels: number of input channels.
      out_channels: number of output channels.
      kernel_size: convolution kernel size.
      stride: convolution stride.
      norm_name: feature normalization type and arguments.
      res_block: bool argument to determine if residual block is used.
    """

    super().__init__()

    if res_block:
      self.layer = UnetResBlock(
        spatial_dims=spatial_dims,
        in_channels=in_channels,
        out_channels=out_channels,
        kernel_size=kernel_size,
        stride=stride,
        norm_name=norm_name,
      )
    else:
      self.layer = UnetBasicBlock(  # type: ignore
        spatial_dims=spatial_dims,
        in_channels=in_channels,
        out_channels=out_channels,
        kernel_size=kernel_size,
        stride=stride,
        norm_name=norm_name,
      )

  def forward(self, inp):
    return self.layer(inp)

5.2.3 UnetrUpBlock
class UnetrUpBlock(nn.Module):
  """
  An upsampling module that can be used for UNETR: "Hatamizadeh et al.,
  UNETR: Transformers for 3D Medical Image Segmentation <https://arxiv.org/abs/2103.10504>"
  """

  def __init__(
    self,
    spatial_dims: int,
    in_channels: int,
    out_channels: int,
    kernel_size: Sequence[int] | int,
    upsample_kernel_size: Sequence[int] | int,
    norm_name: tuple | str,
    res_block: bool = False,
  ) -> None:
    """
    Args:
      spatial_dims: number of spatial dimensions.
      in_channels: number of input channels.
      out_channels: number of output channels.
      kernel_size: convolution kernel size.
      upsample_kernel_size: convolution kernel size for transposed convolution layers.
      norm_name: feature normalization type and arguments.
      res_block: bool argument to determine if residual block is used.
    """
    super().__init__()
    upsample_stride = upsample_kernel_size
    self.transp_conv = get_conv_layer(
      spatial_dims,
      in_channels,
      out_channels,
      kernel_size=upsample_kernel_size,
      stride=upsample_stride,
      conv_only=True,
      is_transposed=True,
    )

    if res_block:
      self.conv_block = UnetResBlock(
        spatial_dims,
        out_channels + out_channels,
        out_channels,
        kernel_size=kernel_size,
        stride=1,
        norm_name=norm_name,
      )
    else:
      self.conv_block = UnetBasicBlock(  # type: ignore
        spatial_dims,
        out_channels + out_channels,
        out_channels,
        kernel_size=kernel_size,
        stride=1,
        norm_name=norm_name,
      )

  def forward(self, inp, skip):
    # the number of channels of `skip` should equal out_channels
    out = self.transp_conv(inp)
    out = torch.cat((out, skip), dim=1)
    out = self.conv_block(out)
    return out
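
A quick shape check for this block, configured like decoder5 above (in_channels = 16 × feature_size = 768, out_channels = 8 × feature_size = 384, assuming feature_size = 48); the tensors are random and only illustrate the upsample-concatenate-convolve pattern.

import torch
from monai.networks.blocks import UnetrUpBlock

up = UnetrUpBlock(
    spatial_dims=3,
    in_channels=768,     # 16 * feature_size, assuming feature_size = 48
    out_channels=384,    # 8 * feature_size
    kernel_size=3,
    upsample_kernel_size=2,
    norm_name="instance",
    res_block=True,
)

dec4 = torch.randn(1, 768, 4, 4, 4)    # bottleneck-shaped feature
skip = torch.randn(1, 384, 8, 8, 8)    # hidden_states_out[3]-shaped skip feature
out = up(dec4, skip)
print(out.shape)   # (1, 384, 8, 8, 8)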

5.2.4 UnetOutBlock
class UnetOutBlock(nn.Module):
  def __init__(
    self, spatial_dims: int, in_channels: int, out_channels: int, dropout: tuple | str | float | None = None
  ):
    super().__init__()
    self.conv = get_conv_layer(
      spatial_dims,
      in_channels,
      out_channels,
      kernel_size=1,
      stride=1,
      dropout=dropout,
      bias=True,
      act=None,
      norm=None,
      conv_only=False,
    )

  def forward(self, inp):
    return self.conv(inp)
