Training data-efficient image transformers & distillation through attention

URL

Multi-head self-attention layer (MSA):
$head_i=Attention(QW_i^Q,KW_i^K,VW_i^V)=softmax[\frac{QW_i^Q(KW_i^K)^T}{\sqrt{d_k}}]VW_i^V$
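The head formula above can be sketched in a few lines of dependency-free Python. This is a minimal illustration, not the authors' implementation: matrices are plain lists of rows, and the helper names (`softmax`, `matmul`, `transpose`, `attention_head`) are assumptions introduced here.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    """Swap rows and columns of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]

def attention_head(q, k, v, w_q, w_k, w_v):
    """One head: softmax(Q W^Q (K W^K)^T / sqrt(d_k)) V W^V."""
    qp, kp, vp = matmul(q, w_q), matmul(k, w_k), matmul(v, w_v)
    d_k = len(kp[0])  # dimension of the projected keys
    scores = matmul(qp, transpose(kp))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, vp), weights

# Two tokens of dimension 2, with identity projections (illustrative values).
identity = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0]]
out, weights = attention_head(tokens, tokens, tokens, identity, identity, identity)
```

In self-attention, `q`, `k`, and `v` are all the same token matrix; a multi-head layer would run several such heads with different projections and concatenate their outputs.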
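Transformer block: FFN (two FC layers separated by GELU, expanding the width from D to 4D and projecting back to D) + MSA, each preceded by LayerNorm and wrapped in a residual connection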
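Class token: an extra trainable token concatenated along the patch (sequence) dimension, the same size as one P x P patch token (P is patch_size); only the output at this position is used, and the outputs at the other patch positions are discarded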
Interpolate position embedding: when the input resolution changes, interpolate the position embeddings directly
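A minimal sketch of resizing a 2D grid of position embeddings when the patch grid changes size. The note does not say which interpolation mode is used, so bilinear interpolation with corner alignment is an assumption here, as is the function name `resize_pos_embed`. Only the patch-grid embeddings are resized; any embedding for an extra token would be kept as-is.

```python
def resize_pos_embed(grid, new_h, new_w):
    """Bilinearly resize an H x W grid of D-dim position embeddings.

    grid[i][j] is the embedding (a list of D floats) for patch row i, column j.
    Bilinear mode with aligned corners is an assumption for illustration.
    """
    old_h, old_w = len(grid), len(grid[0])
    dim = len(grid[0][0])
    out = []
    for i in range(new_h):
        # Map the new row index to a fractional row in the old grid.
        y = i * (old_h - 1) / (new_h - 1) if new_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, old_h - 1)
        fy = y - y0
        row = []
        for j in range(new_w):
            # Same mapping for the column index.
            x = j * (old_w - 1) / (new_w - 1) if new_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, old_w - 1)
            fx = x - x0
            vec = [
                (1 - fy) * ((1 - fx) * grid[y0][x0][d] + fx * grid[y0][x1][d])
                + fy * ((1 - fx) * grid[y1][x0][d] + fx * grid[y1][x1][d])
                for d in range(dim)
            ]
            row.append(vec)
        out.append(row)
    return out

# A 2 x 2 grid of 1-dim embeddings, upsampled to 3 x 3 (illustrative values).
small = [[[0.0], [1.0]], [[2.0], [3.0]]]
large = resize_pos_embed(small, 3, 3)
```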

Soft distillation (the commonly used form):
$L=(1-\lambda)L_{CE}(\psi(Z_s),y)+\lambda\tau^2KL(\psi(Z_s/\tau),\psi(Z_t/\tau))$
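where $y$ is the ground-truth label, $Z_s$ and $Z_t$ are the logits of the student and teacher models, $\psi$ is the softmax function, $\tau$ is the distillation temperature, $\lambda$ balances the two terms, and $L_{CE}$ and $KL$ denote cross-entropy and KL divergence, respectively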
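The soft distillation loss can be sketched for a single example as follows. Two points are assumptions rather than something the note states: the KL term is computed as KL(teacher ‖ student), which is the usual convention in distillation code, since the argument order in the formula does not fix the direction; and the default values of `lam` and `tau` are illustrative only.

```python
import math

def softmax(logits, tau=1.0):
    """Softmax of logits / tau, computed in a numerically stable way."""
    scaled = [z / tau for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label):
    """Cross-entropy against a one-hot target at index `label`."""
    return -math.log(probs[label])

def kl_divergence(p, q):
    """KL(p || q) = sum_c p_c * log(p_c / q_c)."""
    return sum(pc * math.log(pc / qc) for pc, qc in zip(p, q) if pc > 0)

def soft_distillation_loss(z_s, z_t, y, lam=0.1, tau=3.0):
    """(1 - lam) * CE(psi(z_s), y) + lam * tau^2 * KL at temperature tau."""
    ce = cross_entropy(softmax(z_s), y)
    # Assumption: teacher distribution first, i.e. KL(teacher || student).
    kl = kl_divergence(softmax(z_t, tau), softmax(z_s, tau))
    return (1 - lam) * ce + lam * tau ** 2 * kl

student_logits = [2.0, 0.5, -1.0]
teacher_logits = [0.1, 3.0, 0.2]
loss = soft_distillation_loss(student_logits, teacher_logits, y=0)
```

When the student and teacher logits coincide, the KL term vanishes and the loss reduces to $(1-\lambda)L_{CE}$.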
Hard distillation:
$L=\frac{1}{2}L_{CE}(\psi(Z_s),y)+\frac{1}{2}L_{CE}(\psi(Z_s),y_t),\ \ \ y_t = argmax(Z_t)$
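Hard distillation combined with label smoothing addresses the image/label mismatch that cropping in data augmentation can cause: a crop may no longer show the annotated object, whereas the teacher's hard label is computed on the cropped image itself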
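A minimal per-example sketch of hard distillation with optional label smoothing. The smoothing scheme used here (probability `1 - eps` on the target class, with `eps` shared equally among the remaining classes) and the function names are assumptions for illustration; with `eps=0` the code reduces to the hard distillation formula above.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def smoothed_cross_entropy(probs, label, eps=0.0):
    """Cross-entropy against a label-smoothed target distribution."""
    k = len(probs)
    loss = 0.0
    for c, p in enumerate(probs):
        # Assumed scheme: 1 - eps on the label, eps / (k - 1) elsewhere.
        target = 1.0 - eps if c == label else eps / (k - 1)
        if target > 0:
            loss -= target * math.log(p)
    return loss

def hard_distillation_loss(z_s, z_t, y, eps=0.0):
    """0.5 * CE(psi(z_s), y) + 0.5 * CE(psi(z_s), y_t), with y_t = argmax(z_t)."""
    y_t = max(range(len(z_t)), key=lambda c: z_t[c])  # teacher's hard label
    probs = softmax(z_s)
    return 0.5 * smoothed_cross_entropy(probs, y, eps) + \
        0.5 * smoothed_cross_entropy(probs, y_t, eps)

student_logits = [2.0, 0.5, -1.0]
teacher_logits = [0.1, 3.0, 0.2]   # teacher predicts class 1
loss_plain = hard_distillation_loss(student_logits, teacher_logits, y=0)
loss_smooth = hard_distillation_loss(student_logits, teacher_logits, y=0, eps=0.1)
```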
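Distillation token: like the class token, an extra token concatenated along the patch dimension, the same size as one P x P patch token; its output is used together with the class token's output, both to compute the loss and at inference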
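A small sketch of how the two extra tokens might sit in the input sequence and be read back after encoding. The ordering (class token first, distillation token second, then the patches) and the function names are assumptions for illustration; the encoder itself is omitted.

```python
def add_special_tokens(patch_tokens, cls_token, dist_token):
    """Concatenate the class and distillation tokens with the patch tokens."""
    return [cls_token, dist_token] + patch_tokens

def read_heads(encoded):
    """Keep the outputs at the two special positions; drop the patch outputs."""
    return encoded[0], encoded[1]

# Two 1-dim patch tokens plus the two extra tokens (illustrative values).
sequence = add_special_tokens([[1.0], [2.0]], cls_token=[0.0], dist_token=[9.0])
cls_out, dist_out = read_heads(sequence)  # a real model would encode `sequence` first
```

During training, the class output would be compared with the ground-truth label and the distillation output with the teacher's label.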
Joint classifier: $pred=argmax(\psi(C_s)+\psi(D_s))$, where $C_s$ and $D_s$ are the logits of the class token and the distillation token, respectively
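The joint classifier can be sketched as a late fusion of the two softmax outputs. The function names are assumptions for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def joint_predict(c_s, d_s):
    """pred = argmax(psi(C_s) + psi(D_s))."""
    fused = [a + b for a, b in zip(softmax(c_s), softmax(d_s))]
    return max(range(len(fused)), key=lambda c: fused[c])

# The class head mildly prefers class 0; the distillation head strongly prefers class 1.
pred = joint_predict([2.0, 1.0, 0.0], [0.0, 3.0, 0.0])
```

Because probabilities rather than logits are summed, a confident head can outvote a less confident one, as in the example where the fused prediction follows the distillation head.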