Zhangzhe's Blog

ULSAM: Ultra-Lightweight Subspace Attention Module for Compact Convolutional Neural Networks

Posted on 2021-08-01 Edited on 2026-01-25 In CNN Architecture Design

URL

https://arxiv.org/pdf/2006.15102.pdf

TL;DR

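ULSAM is an ultra-lightweight subspace attention module suited to compact networks such as MobileNet and ShuffleNet
It suits fine-grained image classification: it cuts FLOPs by about 13% and params by about 25%, and lowers Top-1 error by 0.27% on ImageNet-1K and by 1% on three other fine-grained classification datasets
It is somewhat similar to SENet: SENet applies attention along the C dimension, whereas ULSAM applies attention over the HW dimensions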

Algorithm

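Network structure

Split the input tensor F into g groups along the channel dimension: CHW --> gGHW, $F = [F_1,F_2,...,F_g]$; each group $F_n$ is called a subspace
Apply the following operations to each subspace $F_n$:
- Depth-wise Conv(kernel_size = 1)
- MaxPool2d(kernel_size = 3, stride = 1, padding = 1); this step enlarges the receptive field while reducing variance
- Point-wise Conv(kernel_size = 1), kernels = 1
- softmax
- out = x + x * softmax
Concatenate the results of all subspaces to form the output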

Formulas

$dw_n = {DW}^{1*1}(F_n)$
$maxpool_n = {maxpool}^{3*3, 1}(dw_n)$
$pw_n = {PW}^1(maxpool_n)$
$A_{n} = softmax(pw_n)$
$\hat F_n = (A_n \otimes F_n) \oplus F_n$
$\hat F = concat([\hat F_1,\hat F_2,...,\hat F_g])$
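The formulas above can be sketched in NumPy for a single (C, H, W) feature map. This is only a rough illustration: the helper names (`maxpool3x3`, `subspace_attention`, `ulsam`) and the random weights are assumptions made here, and the BatchNorm, ReLU and bias terms of a real implementation are left out.

```python
import numpy as np

def maxpool3x3(x):
    """3x3 max pooling with stride 1 and padding 1 on a (G, H, W) array."""
    _, H, W = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)), constant_values=-np.inf)
    windows = [padded[:, i:i + H, j:j + W] for i in range(3) for j in range(3)]
    return np.max(np.stack(windows), axis=0)

def subspace_attention(F_n, dw_weight, pw_weight):
    """F_n: (G, H, W) subspace; dw_weight, pw_weight: (G,) hypothetical conv weights."""
    dw = F_n * dw_weight[:, None, None]        # dw_n: a 1x1 depth-wise conv is a per-channel scale
    mp = maxpool3x3(dw)                        # maxpool_n
    pw = np.tensordot(pw_weight, mp, axes=1)   # pw_n: 1x1 point-wise conv with one kernel -> (H, W)
    e = np.exp(pw - pw.max())
    A = e / e.sum()                            # A_n: softmax over all H*W positions
    return A[None] * F_n + F_n                 # (A_n * F_n) + F_n

def ulsam(F, g, rng):
    """F: (C, H, W). Split the channels into g subspaces, attend, concatenate."""
    G = F.shape[0] // g
    outs = []
    for F_n in np.split(F, g, axis=0):
        # Random weights stand in for learned parameters.
        outs.append(subspace_attention(F_n, rng.standard_normal(G), rng.standard_normal(G)))
    return np.concatenate(outs, axis=0)

rng = np.random.default_rng(0)
F = rng.standard_normal((8, 5, 5))
out = ulsam(F, g=4, rng=rng)
print(out.shape)  # (8, 5, 5)
```

Each subspace yields a single H x W attention map that is shared by all G channels of that subspace, which is where the parameter saving comes from.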

Source code

import torch
import torch.nn as nn
# Device placement is left to the caller, e.g. SubSpace(nin).to("cuda")
class SubSpace(nn.Module):
    """
    Subspace class.
    ...
    Attributes
    ----------
    nin : int
        number of channels in the input feature volume (one subspace).
    Methods
    -------
    __init__(nin)
        initialize method.
    forward(x)
        forward pass.
    """
    def __init__(self, nin):
        super(SubSpace, self).__init__()
        self.conv_dws = nn.Conv2d(
            nin, nin, kernel_size=1, stride=1, padding=0, groups=nin
        )
        self.bn_dws = nn.BatchNorm2d(nin, momentum=0.9)
        self.relu_dws = nn.ReLU(inplace=False)
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)
        self.conv_point = nn.Conv2d(
            nin, 1, kernel_size=1, stride=1, padding=0, groups=1
        )
        self.bn_point = nn.BatchNorm2d(1, momentum=0.9)
        self.relu_point = nn.ReLU(inplace=False)
        self.softmax = nn.Softmax(dim=2)
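    def forward(self, x):
        # Depth-wise 1x1 conv -> BN -> ReLU
        out = self.conv_dws(x)
        out = self.bn_dws(out)
        out = self.relu_dws(out)
        # Max-pool the depth-wise output, matching maxpool_n = maxpool(dw_n)
        out = self.maxpool(out)
        # Point-wise conv down to a single attention channel -> BN -> ReLU
        out = self.conv_point(out)
        out = self.bn_point(out)
        out = self.relu_point(out)
        # Softmax over the flattened H*W positions
        m, n, p, q = out.shape
        out = self.softmax(out.view(m, n, -1))
        out = out.view(m, n, p, q)
        # Broadcast the attention map over all channels, then out = x * A + x
        out = out.expand(x.shape[0], x.shape[1], x.shape[2], x.shape[3])
        out = torch.mul(out, x)
        out = out + x
        return out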

Grad-CAM++ heatmaps

After ULSAM is added to MobileNet v1 and v2, the models focus better on the relevant regions

Thoughts

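Although FLOPs and params decrease or stay almost unchanged, the module introduces many element-wise operations, so it will probably run slower
SENet produces its weights with sigmoid, whereas ULSAM uses a softmax over the HW dimensions; softmax weights sum to 1 over all positions, so each one is small, which is why a residual connection is needed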
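A small numeric check of the softmax-versus-sigmoid point, as a sketch only: the 14 x 14 map size and the random scores are arbitrary assumptions.

```python
import numpy as np

H, W = 14, 14
rng = np.random.default_rng(0)
scores = rng.standard_normal(H * W)      # hypothetical pre-activation attention scores

# Softmax over all H*W positions: the weights sum to 1, so the mean weight is 1 / (H * W).
softmax_w = np.exp(scores - scores.max())
softmax_w /= softmax_w.sum()

# Sigmoid (SENet style): every weight lies in (0, 1) independently of the others.
sigmoid_w = 1.0 / (1.0 + np.exp(-scores))

x = np.ones(H * W)
attended = x * softmax_w                 # without the residual the signal is strongly attenuated
with_residual = x + x * softmax_w        # the residual keeps the original scale

print(softmax_w.mean(), sigmoid_w.mean())
print(attended.max(), with_residual.min())
```

With softmax alone, the output magnitude shrinks as the spatial size grows; adding `x` back restores the feature scale and lets the attention act as a refinement.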

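Network performance

Controlled experiments verify how the number of subspaces g and the replacement position pos affect model performance

Comparison experiments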

Selective Kernel Networks

Posted on 2021-08-01 Edited on 2025-12-11 In CNN Architecture Design

URL

https://arxiv.org/pdf/1903.06586.pdf

TL;DR

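SKNet assigns weights to the feature channels of N branches with different receptive fields, combining attention to channel with kernel selection

SKNet network structure

Mathematical formulation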

$X\in\mathbb R^{H'\times W' \times C'} \overset{\tilde F}{\longrightarrow} \tilde U \in \mathbb R^{H\times W\times C}$
$X\in\mathbb R^{H'\times W' \times C'} \overset{\hat F}{\longrightarrow} \hat U \in \mathbb R^{H\times W\times C}$
$U=\tilde U + \hat U$
$s_c = F_{gp}(U_c) = \frac{1}{H\times W}\sum_{i=1}^H\sum_{j=1}^W U_c(i, j)$
$z = F_{fc}(s) = \delta(\beta (Ws)),\ \ \ \ W\in\mathbb R^{d\times C},\ \ \ \ d = max(\frac{C}{r}, L)$
$a_c = \frac{e^{A_cz}}{e^{A_cz} + e^{B_cz}},\ \ b_c = \frac{e^{B_cz}}{e^{A_cz} + e^{B_cz}},\ \ \ \ A_c,B_c\in\mathbb R^{1\times d}$
$V_c = a_c \cdot \tilde U_c + b_c \cdot \hat U_c,\ \ \ \ V_c\in\mathbb R^{H\times W}$
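The fuse-and-select steps above can be sketched in NumPy for two branches. This is a simplified illustration: `sk_select` is a name made up here, the batch normalization in $F_{fc}$ is omitted, the ReLU choice for $\delta$ follows the paper, and the sizes (C = 8, r = 4, L = 4) are small hypothetical values rather than the paper's defaults.

```python
import numpy as np

def sk_select(U_tilde, U_hat, W, A, B):
    """U_tilde, U_hat: (C, H, W) branch outputs; W: (d, C); A, B: (C, d)."""
    U = U_tilde + U_hat                      # fuse: U = U_tilde + U_hat
    s = U.mean(axis=(1, 2))                  # F_gp: global average pooling -> (C,)
    z = np.maximum(W @ s, 0.0)               # F_fc: FC + ReLU (batch norm omitted) -> (d,)
    logits = np.stack([A @ z, B @ z])        # per-channel scores for each branch -> (2, C)
    logits = logits - logits.max(axis=0)     # subtract the max for numerical stability
    e = np.exp(logits)
    a, b = e / e.sum(axis=0)                 # softmax across the two branches, so a_c + b_c = 1
    V = a[:, None, None] * U_tilde + b[:, None, None] * U_hat
    return V, a, b

C, H, Wd, r, L = 8, 6, 6, 4, 4
d = max(C // r, L)
rng = np.random.default_rng(0)
U_tilde = rng.standard_normal((C, H, Wd))
U_hat = rng.standard_normal((C, H, Wd))
W = rng.standard_normal((d, C))
A = rng.standard_normal((C, d))
B = rng.standard_normal((C, d))

V, a, b = sk_select(U_tilde, U_hat, W, A, B)
print(V.shape, a.shape)  # (8, 6, 6) (8,)
```

Because the softmax runs across branches for each channel, every channel independently chooses how much of each receptive field to keep.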

SKNet experimental results

ImageNet
other

Thoughts

SENet and SKNet are attention to channel, while ULSAM is attention to HW; could the two combined replace Non-local, that is, attention over THW?

Squeeze-and-Excitation Networks

Posted on 2021-08-01 Edited on 2025-12-11 In CNN Architecture Design

URL

https://arxiv.org/pdf/1709.01507.pdf

TL;DR

SENet assigns a weight to each channel: Attention to Channel

Algorithm

Mathematical formulation

$z_c = F_{sq}(u_c) = \frac{1}{H \times W}\sum_{i=1}^H\sum_{j=1}^W u_c(i, j),\ \ \ \ \ z \in \mathbb R^C$
$s = F_{ex}(z, W) = \sigma(g(z, W)) = \sigma(W_2\delta(W_1z)), \ \ \ W_1 \in \mathbb R^{\frac{C}{r}\times C},\ \ \ W_2 \in \mathbb R^{C \times \frac{C}{r}}$
$\tilde X_c = F_{scale}(u_c, s_c) = s_cu_c,\ \ \ \ X \in \mathbb R^C$
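A minimal NumPy sketch of the squeeze, excitation and scale steps for one (C, H, W) feature map. The function name `se_block`, the random weights and the sizes (C = 16, r = 4) are assumptions for illustration; biases are omitted.

```python
import numpy as np

def se_block(u, W1, W2):
    """u: (C, H, W); W1: (C // r, C); W2: (C, C // r)."""
    z = u.mean(axis=(1, 2))                       # squeeze: global average pooling -> (C,)
    hidden = np.maximum(W1 @ z, 0.0)              # delta: ReLU after the reduction FC
    s = 1.0 / (1.0 + np.exp(-(W2 @ hidden)))      # sigma: sigmoid gate per channel
    return s[:, None, None] * u, s                # scale: s_c * u_c

C, H, W, r = 16, 7, 7, 4
rng = np.random.default_rng(0)
u = rng.standard_normal((C, H, W))
W1 = rng.standard_normal((C // r, C))
W2 = rng.standard_normal((C, C // r))

x_tilde, s = se_block(u, W1, W2)
print(x_tilde.shape, s.shape)  # (16, 7, 7) (16,)
```

Unlike a softmax, the sigmoid gates are independent, so several channels can be emphasized at once.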

SENet experimental results

ImageNet
other

Thoughts

SENet and SKNet are attention to channel, while ULSAM is attention to HW; could the two combined replace Non-local, that is, attention over THW?

WeightNet: Revisiting the Design Space of weight networks

Posted on 2021-08-01 Edited on 2025-12-11 In CNN Architecture Design

URL

https://arxiv.org/pdf/2007.11823.pdf

TL;DR

A method for dynamically generating Conv weights that unifies dynamic-convolution algorithms such as SENet and CondConv

Dataset/Algorithm/Model/Experiment Detail

Structure

GAP + FC + Sigmoid + Group_FC + Reshape turn the $(N, C_{in}, H, W)$ input feature map into a $(C_{out}, C_{in}, K, K)$ kernel, which is then convolved with the feature map
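The shape flow of this pipeline can be traced with a NumPy sketch for a single sample. Everything concrete here is an assumption for illustration: the helper names, the hyperparameters `M` and `G` (taken to control the width of the FC output and the number of groups), the random weights, and the plain "valid" convolution at the end.

```python
import numpy as np

def grouped_fc(x, weights):
    """weights: (groups, out_per_group, in_per_group); x: (groups * in_per_group,)."""
    groups, _, in_pg = weights.shape
    return np.einsum("goi,gi->go", weights, x.reshape(groups, in_pg)).reshape(-1)

def weightnet_kernel(feat, fc_w, gfc_w, c_out, k):
    """feat: (C_in, H, W) -> generated kernel of shape (C_out, C_in, K, K)."""
    c_in = feat.shape[0]
    pooled = feat.mean(axis=(1, 2))                    # GAP -> (C_in,)
    act = 1.0 / (1.0 + np.exp(-(fc_w @ pooled)))       # FC + Sigmoid -> (M * C_out,)
    flat = grouped_fc(act, gfc_w)                      # Group_FC -> (C_out * C_in * K * K,)
    return flat.reshape(c_out, c_in, k, k)             # Reshape

def conv2d_valid(feat, kernel):
    """Plain cross-correlation without padding, used only to consume the kernel."""
    c_out, _, k, _ = kernel.shape
    _, H, W = feat.shape
    out = np.zeros((c_out, H - k + 1, W - k + 1))
    for i in range(H - k + 1):
        for j in range(W - k + 1):
            out[:, i, j] = np.tensordot(kernel, feat[:, i:i + k, j:j + k], axes=3)
    return out

c_in, c_out, k, M, G = 4, 6, 3, 2, 2
rng = np.random.default_rng(0)
feat = rng.standard_normal((c_in, 8, 8))
fc_w = rng.standard_normal((M * c_out, c_in))
groups = G * c_out
gfc_w = rng.standard_normal((groups, c_out * c_in * k * k // groups, M * c_out // groups))

kernel = weightnet_kernel(feat, fc_w, gfc_w, c_out, k)
out = conv2d_valid(feat, kernel)
print(kernel.shape, out.shape)  # (6, 4, 3, 3) (6, 6, 6)
```

The kernel depends on the input sample, so in a batch each sample would get its own kernel; the sketch handles one sample to keep the shapes readable.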

megengine implementation

WeightNet

Thoughts

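Generating the Weight with Group_FC makes sense, since the following Conv fuses information across channels anyway
Structurally it is considerably more aggressive than CondConv and SENet; in principle it moves the initialization of the Conv Weight in CondConv and SENet forward into the Group FC and removes the hand-crafted design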

DETR: End-to-End Object Detection with Transformers

Posted on 2021-07-26 Edited on 2025-12-11 In Transformer

URL

https://arxiv.org/pdf/2005.12872.pdf

Algorithm

Architecture

DETR inference

import torch
from torch import nn
from torchvision.models import resnet50
class DETR(nn.Module):
    def __init__(
        self, num_classes, hidden_dim, nheads, num_encoder_layers, num_decoder_layers
    ):
        super().__init__()
        # We take only convolutional layers from ResNet-50 model
        self.backbone = nn.Sequential(*list(resnet50(pretrained=True).children())[:-2])
        self.conv = nn.Conv2d(2048, hidden_dim, 1)
        self.transformer = nn.Transformer(
            hidden_dim, nheads, num_encoder_layers, num_decoder_layers
        )
        self.linear_class = nn.Linear(hidden_dim, num_classes + 1)
        self.linear_bbox = nn.Linear(hidden_dim, 4)
        self.query_pos = nn.Parameter(torch.rand(100, hidden_dim))
        self.row_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))
        self.col_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))
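    def forward(self, inputs):
        # Backbone features: [batch, 2048, H, W] at 1/32 of the input resolution
        x = self.backbone(inputs)
        # 1x1 conv projects the 2048 channels down to hidden_dim
        h = self.conv(x)
        H, W = h.shape[-2:]
        # Learned 2D positional encoding: concatenate column and row embeddings
        # for every location -> [H*W, 1, hidden_dim]
        pos = (
            torch.cat(
                [
                    self.col_embed[:W].unsqueeze(0).repeat(H, 1, 1),
                    self.row_embed[:H].unsqueeze(1).repeat(1, W, 1),
                ],
                dim=-1,
            )
            .flatten(0, 1)
            .unsqueeze(1)
        )
        # Encoder input: flattened features plus positions; decoder input: object queries
        h = self.transformer(
            pos + h.flatten(2).permute(2, 0, 1), self.query_pos.unsqueeze(1)
        )
        # Class logits (last index = "no object") and boxes squashed to [0, 1] by sigmoid
        return self.linear_class(h), self.linear_bbox(h).sigmoid()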
detr = DETR(
    num_classes=91, hidden_dim=256, nheads=8, num_encoder_layers=6, num_decoder_layers=6
)
detr.eval()
inputs = torch.randn(1, 3, 800, 1200)
logits, bboxes = detr(inputs)
print(logits.shape)     # [100, 1, 92]: [num_query, batch, classes]
print(bboxes.shape)     # [100, 1, 4]: [num_query, batch, box]
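To turn these raw outputs into detections, one common recipe is to softmax the logits, drop the trailing "no object" class, keep confident queries, and rescale the boxes. The NumPy sketch below assumes the boxes are normalized (cx, cy, w, h), as in DETR; the helper names, the 0.7 threshold and the toy inputs are illustrative assumptions, not part of the listing above.

```python
import numpy as np

def cxcywh_to_xyxy(boxes):
    """Convert (cx, cy, w, h) rows to (x_min, y_min, x_max, y_max)."""
    cx, cy, w, h = boxes.T
    return np.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], axis=1)

def postprocess(logits, boxes, img_w, img_h, threshold=0.7):
    """logits: (num_query, classes + 1); boxes: (num_query, 4), normalized to [0, 1]."""
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs = (e / e.sum(axis=-1, keepdims=True))[:, :-1]   # drop the "no object" column
    keep = probs.max(axis=-1) > threshold                 # confident queries only
    scale = np.array([img_w, img_h, img_w, img_h], dtype=float)
    xyxy = cxcywh_to_xyxy(boxes[keep]) * scale            # back to pixel coordinates
    return probs[keep].argmax(axis=-1), probs[keep].max(axis=-1), xyxy

# Toy example: 3 queries, 4 classes + "no object".
toy_logits = np.array([
    [10.0, 0.0, 0.0, 0.0, 0.0],   # confident class 0
    [0.0, 0.0, 0.0, 0.0, 10.0],   # confident "no object"
    [0.0, 0.0, 0.0, 0.0, 0.0],    # uniform, below the threshold
])
toy_boxes = np.array([
    [0.5, 0.5, 0.2, 0.4],
    [0.1, 0.1, 0.1, 0.1],
    [0.9, 0.9, 0.1, 0.1],
])
labels, scores, xyxy = postprocess(toy_logits, toy_boxes, img_w=1200, img_h=800)
print(labels, xyxy)
```

For the tensors printed above, the batch axis would first be removed, for example `logits[:, 0]` and `bboxes[:, 0]` converted to NumPy.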

Model topology diagram

Linformer: Self-Attention with Linear Complexity

Posted on 2021-07-26 Edited on 2025-12-11 In Transformer

URL

https://arxiv.org/pdf/2006.04768.pdf

TL;DR

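This method, Linformer, exploits the low rank of the attention matrix to reduce the time and space complexity of the Multi-HEAD Attention computation in the original Transformer
Complexity of different Transformer architectures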

Algorithm

`Multi-HEAD Attention` in the original `Transformer`

$head_i = Attention(QW_i^Q,KW_i^K,VW_i^V)=softmax[\frac{QW_i^Q(KW_i^K)^T}{\sqrt{d_k}}]VW_i^V$
where $K,Q,V \in \mathbb R^{n\times d_m} \ \ \ W_i^Q,W_i^K\in \mathbb R^{d_m\times d_k}$
Therefore $softmax[\frac{QW_i^Q(KW_i^K)^T}{\sqrt{d_k}}] \in \mathbb {R} ^{n\times n}$, where n is the sequence length, so the time and space complexity of Multi-HEAD Attention in the original Transformer is $O(n^2)$

`Linformer`'s modification to `Multi-HEAD Attention`

Project $KW_i^K, VW_i^V\in \mathbb R^{n\times d_k}$ to $E_iKW_i^K, F_iVW_i^V\in \mathbb R^{k\times d_k}$, where k is a constant; the attention matrix becomes $n\times k$, so the time and space complexity becomes $O(n)$. E and F are learnable projection matrices, $E,F \in \mathbb R^{k\times n}$
$\bar{head_i} = Attention(QW_i^Q,E_iKW_i^K, F_iVW_i^V)=softmax[\frac{QW_i^Q(E_iKW_i^K)^T}{\sqrt{d_k}}]F_iVW_i^V$
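The difference between the two heads is easiest to see in the shapes. The NumPy sketch below computes one standard head and one Linformer head on random data; the sizes (n = 64, d_m = 32, d_k = 8, k = 16) and the random matrices standing in for learned parameters are assumptions for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

n, d_m, d_k, k = 64, 32, 8, 16
rng = np.random.default_rng(0)
Q = K = V = rng.standard_normal((n, d_m))                  # self-attention: same input
W_q, W_k, W_v = (rng.standard_normal((d_m, d_k)) for _ in range(3))
E, F = (rng.standard_normal((k, n)) for _ in range(2))     # projections along the sequence axis

# Standard head: the attention matrix is n x n.
P = softmax((Q @ W_q) @ (K @ W_k).T / np.sqrt(d_k))
head = P @ (V @ W_v)

# Linformer head: keys and values are first compressed from n rows to k rows,
# so the attention matrix is only n x k.
P_bar = softmax((Q @ W_q) @ (E @ K @ W_k).T / np.sqrt(d_k))
head_bar = P_bar @ (F @ V @ W_v)

print(P.shape, P_bar.shape)        # (64, 64) (64, 16)
print(head.shape, head_bar.shape)  # (64, 8) (64, 8)
```

The output shape of the head is unchanged; only the intermediate attention matrix shrinks, and with k fixed its size grows linearly in n.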
The projection matrices E and F can share parameters in three ways:
- Headwise sharing: $E_i=E,\ \ F_i=F, \ \ for\ each\ layer$
- Key-value sharing: $E_i=E = F_i, \ \ for\ each\ layer$
- Layerwise sharing: a single $E$ shared by all layers, all heads, and both key and value
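To make the sharing modes concrete, the short sketch below counts the distinct projection matrices each one needs. The function name is made up here, the `"none"` case (a separate $E_i$ and $F_i$ for every head in every layer) is derived from the definitions, and layerwise sharing is taken to mean one matrix for the whole model, which is how the Linformer paper describes it.

```python
def num_projection_matrices(layers, heads, sharing):
    """Number of distinct k x n projection matrices under each sharing mode."""
    if sharing == "none":
        return 2 * layers * heads      # E_i and F_i for every head in every layer
    if sharing == "headwise":
        return 2 * layers              # one E and one F per layer
    if sharing == "key-value":
        return layers                  # one matrix per layer, used for both keys and values
    if sharing == "layerwise":
        return 1                       # one matrix for the whole model
    raise ValueError(f"unknown sharing mode: {sharing}")

for mode in ("none", "headwise", "key-value", "layerwise"):
    print(mode, num_projection_matrices(layers=12, heads=12, sharing=mode))
```

For a 12-layer, 12-head model this gives 288, 24, 12 and 1 matrices respectively.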

Theoretical basis and results

Long-tail distribution of the eigenvalues
Results (compared with BERT-base)

Thoughts

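The paper says it does not use singular value decomposition to obtain the low-rank matrix because SVD would add extra computation and its result could not share parameters
The code is packaged as the linformer pip package and can be used directly with torch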

Training data-efficient image transformers & distillation through attention

Posted on 2021-07-21 Edited on 2025-12-11 In Transformer

URL

https://arxiv.org/pdf/2012.12877.pdf

TL;DR

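Built on ViT, but removes ViT's dependence on pre-training with very large datasets; trained on ImageNet just like a ConvNet, it reaches SOTA
Proposes a distillation method based on a distillation token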

Algorithm

transformer (ViT)

Multi-head self attention layer (MSA):
$head_i=Attention(QW_i^Q,KW_i^K,VW_i^V)=softmax[\frac{QW_i^Q(KW_i^K)^T}{\sqrt{d_k}}]VW_i^V$
Transformer block: FFN (2 × FC + bottleneck + GeLU + LayerNorm) + MSA + Residual
Class token: concatenate one extra P x P entry (P is patch_size) along the patch dimension and take that entry as the output, discarding the other patch entries
Interpolate position embedding: when the input resolution changes, interpolate the embedding directly

distillation

Common soft distillation:
$L=(1-\lambda)L_{CE}(\psi(Z_s),y)+\lambda\tau^2KL(\psi(Z_s/\tau),\psi(Z_t/\tau))$
where $y$ is the GT label, $Z_s\ ,\ Z_t$ are the logits of the student model and teacher model, $\psi$ is softmax, $\tau$ is the distillation temperature, and $L_{CE}, KL$ denote cross-entropy and KL divergence respectively
Hard distillation:
$L=\frac{1}{2}L_{CE}(\psi(Z_s),y)+\frac{1}{2}L_{CE}(\psi(Z_s),y_t),\ \ \ y_t = argmax(Z_t)$
Hard distillation + label smoothing addresses the mismatch between image and label caused by cropping in data augmentation
Distillation token: like the Class token, concatenate one extra P x P entry along the patch dimension; it is output together with the Class token and used for both the loss and inference
Joint classifier: $pred=argmax(\psi(C_s)+\psi(D_s))$, where $C_s\ ,\ D_s$ are the logits of the class token and the distillation token respectively
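The two losses and the joint classifier can be written out for a single sample in NumPy. This is a sketch under assumptions: the function names and the default $\lambda$ = 0.1, $\tau$ = 3.0 are arbitrary choices made here, and the KL term is computed literally in the argument order of the formula above, so treat its direction as an assumption.

```python
import numpy as np

def softmax(z, tau=1.0):
    z = np.asarray(z, dtype=float) / tau
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(p, label):
    return float(-np.log(p[label]))

def kl(p, q):
    return float(np.sum(p * (np.log(p) - np.log(q))))

def soft_distillation(z_s, z_t, y, lam=0.1, tau=3.0):
    ce = cross_entropy(softmax(z_s), y)
    return (1 - lam) * ce + lam * tau ** 2 * kl(softmax(z_s, tau), softmax(z_t, tau))

def hard_distillation(z_s, z_t, y):
    y_t = int(np.argmax(z_t))                          # the teacher's hard decision
    p_s = softmax(z_s)
    return 0.5 * cross_entropy(p_s, y) + 0.5 * cross_entropy(p_s, y_t)

def joint_predict(c_s, d_s):
    # Add the softmax outputs of the class-token and distillation-token heads.
    return int(np.argmax(softmax(c_s) + softmax(d_s)))

z_s = np.array([2.0, 1.0, 0.0])    # student logits
z_t = np.array([0.5, 3.0, 0.0])    # teacher logits
print(soft_distillation(z_s, z_t, y=0), hard_distillation(z_s, z_t, y=0))
print(joint_predict([2.0, 1.0, 0.0], [0.0, 3.0, 0.0]))
```

When the teacher and the ground truth disagree, as in this toy example, the hard-distillation loss pulls the student toward both labels with equal weight.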

other

A great many training tricks

Thoughts

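Training tricks matter a lot. Judging from the source code, the main reason this paper can remove ViT's dependence on pre-training with very large datasets is its strong training recipe
The paper's understanding of distillation can serve as guidance for distilling Transformers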

Late Temporal Modeling in 3D CNN Architectures with BERT for Action Recognition

Posted on 2021-07-21 Edited on 2025-12-11 In Transformer

URL

https://arxiv.org/pdf/2008.01232.pdf

TL;DR

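This paper uses the BERT architecture for the temporal information fusion at the end of a multi-frame action recognition network; at the time of writing it is still SOTA on the two Action Recognition datasets HMDB51 and UCF101

Algorithm

One-sentence summary of the paper's main work: SOTA - TGAP + BERT = NEW SOTA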

Network structures previously common in Action Recognition

1. 3D Conv + TGAP

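Feed several consecutive video frames into the network together, and use 3D Conv or C(2 + 1)D to reduce the temporal and spatial dimensions while increasing the Channel dimension
Use TGAP (temporal global average pooling) (torch.nn.AdaptiveAvgPool3d) to globally average-pool time and space together down to a scalar per channel, then apply an FC classifier over the Channel dimension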

2. 3D Conv + GAP + LSTM

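The backbone is similar to 1
Apply GAP to the spatio-temporal feature map while keeping the temporal dimension, process the time series with an LSTM or a similar structure, and classify the output with an FC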
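The two pooling heads described above differ in what they keep. The NumPy sketch below uses a random (C, T, H, W) feature map with made-up sizes to show that TGAP returns one vector per clip and is unchanged when the frames are reordered, while spatial-only GAP keeps a (T, C) sequence that a temporal model can still read in order.

```python
import numpy as np

rng = np.random.default_rng(0)
C, T, H, W = 16, 8, 7, 7
feat = rng.standard_normal((C, T, H, W))      # hypothetical backbone output for one clip

tgap = feat.mean(axis=(1, 2, 3))              # TGAP: pool time and space together -> (C,)
gap_seq = feat.mean(axis=(2, 3)).T            # spatial GAP only -> (T, C) sequence

reversed_feat = feat[:, ::-1]                 # play the clip backwards
tgap_rev = reversed_feat.mean(axis=(1, 2, 3))
gap_seq_rev = reversed_feat.mean(axis=(2, 3)).T

print(tgap.shape, gap_seq.shape)              # (16,) (8, 16)
print(np.allclose(tgap, tgap_rev))            # True: TGAP cannot see frame order
print(np.allclose(gap_seq, gap_seq_rev))      # False: the sequence still encodes order
```

A head placed on top of `gap_seq` (LSTM, BERT, and so on) therefore has access to ordering information that a classifier on `tgap` never sees.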

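3. Based on 2D Conv + temporal modeling, etc.

Network structure of this paper

The paper argues that TGAP discards much of the temporal information, and that GAP + LSTM does not work well either
Using GAP + BERT at the end is a better choice, with supervision applied only to the Transformer's ClassToken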

An interesting interpretation of the Transformer

Mathematical expression of the Transformer: $y_i=PFFN(\frac{1}{N(x)}\sum_{\forall{j}}g(x_j)f(x_i,x_j))$
where:

PFFN: Position-wise Feed-forward Network
$f(x_i,x_j)=softmax_j(\theta(x_i)^T\phi(x_j))$, where $g,\phi,\theta$ are all projection functions (FC)
If $g,\phi,\theta$ all become 1 × 1 × 1 Conv, the Transformer turns into non-local, so using BERT to process an image sequence is quite reasonable…
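The expression can be checked with a single-head NumPy sketch. The assumptions are labeled in the code: the softmax already normalizes over j, so N(x) is taken as 1; the PFFN is reduced to two linear layers with a ReLU; residual connections, LayerNorm and position embeddings are omitted; all weights are random stand-ins.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def transformer_layer(x, W_theta, W_phi, W_g, W1, W2):
    """x: (N, d) sequence of N tokens, e.g. one feature vector per frame."""
    theta, phi, g = x @ W_theta, x @ W_phi, x @ W_g     # FC projections theta, phi, g
    f = softmax(theta @ phi.T)                          # f[i, j] = softmax_j(theta(x_i)^T phi(x_j))
    agg = f @ g                                         # sum_j f(x_i, x_j) g(x_j), with N(x) = 1
    return np.maximum(agg @ W1, 0.0) @ W2               # simplified PFFN applied per position

rng = np.random.default_rng(0)
N, d = 8, 16
x = rng.standard_normal((N, d))
W_theta, W_phi, W_g = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
W1 = rng.standard_normal((d, 4 * d)) * 0.1
W2 = rng.standard_normal((4 * d, d)) * 0.1

y = transformer_layer(x, W_theta, W_phi, W_g, W1, W2)

# Without position embeddings the layer is permutation-equivariant:
perm = rng.permutation(N)
y_perm = transformer_layer(x[perm], W_theta, W_phi, W_g, W1, W2)
print(y.shape, np.allclose(y_perm, y[perm]))  # (8, 16) True
```

Replacing the three matrix products with 1 × 1 × 1 convolutions over a (T, H, W) grid gives the same computation on flattened positions, which is the non-local view mentioned above.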

baseline

The baselines chosen in this paper are the classic Action Recognition architectures R(2 + 1)D and SlowFastNet

Improvement to the R(2 + 1)D network

R(2 + 1)D - TGAP + 1-layer BERT = R(2 + 1)D_BERT, SOTA on HMDB51 and UCF101 at the time of writing

Improvement to SlowFastNet

Late-fusion implementation of BERT: passing each of SlowFastNet's two streams through its own BERT and then Concat works better than Concat followed by BERT…

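Comparison experiments

The authors ran very thorough comparison experiments, including whether to use optical flow, whether to reduce dimensionality at the end of the backbone, and how many Transformer layers and heads to use; see the paper for details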

Thoughts

The relationship between BERT and Non-local is quite interesting
