
Activate or Not: Learning Customized Activation

URL

https://arxiv.org/pdf/2009.04759.pdf

TL;DR

  • This paper groups common activation functions into two families: those based on Maxout and those based on the smooth maximum
  • The Maxout-based family is mainly the XXXReLU family
  • The smooth-maximum-based family is what this paper names the "activate or not" (ACON) family; the well-known Swish is just ACON-A when $\beta = 1$ (see the quick check below)
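A quick numerical check of that last point (my own sketch, not code from the paper; torch.nn.functional.silu is the fixed $\beta = 1$ Swish):

import torch
import torch.nn.functional as F

x = torch.linspace(-3, 3, 101)
beta = 1.0
acon_a = x * torch.sigmoid(beta * x)  # ACON-A: S_beta(x, 0) = x * sigmoid(beta * x)
swish = F.silu(x)                     # Swish/SiLU: x * sigmoid(x)
print(torch.allclose(acon_a, swish))  # True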

Dataset/Algorithm/Model/Experiment Detail

Smooth maximum

  • $S_\beta(x_1,...,x_n) = \frac{\sum_{i=1}^n x_i \, e^{\beta x_i}}{\sum_{i=1}^n e^{\beta x_i}}$; as $\beta \rightarrow \infty$, $S_\beta \rightarrow \max$, and as $\beta \rightarrow 0$, $S_\beta \rightarrow$ mean

  • When $n = 2$: $S_\beta(\eta_a(x), \eta_b(x)) = (\eta_a(x)-\eta_b(x)) \times \sigma[\beta(\eta_a(x)-\eta_b(x))] + \eta_b(x)$, where $\sigma$ is the Sigmoid function and $\eta_a, \eta_b$ are per-channel linear functions (a quick numerical check follows the figure below)

acon1.png
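
A small numerical check of the two identities above (my own sketch, not code from the paper):

import torch

def smooth_max(x, beta):
    # S_beta(x_1, ..., x_n) = sum_i x_i * e^(beta * x_i) / sum_j e^(beta * x_j)
    w = torch.softmax(beta * x, dim=0)
    return (w * x).sum()

x = torch.tensor([1.0, -2.0, 0.5])
print(smooth_max(x, beta=100.0))  # ~1.0: approaches max(x) as beta -> inf
print(smooth_max(x, beta=1e-6))   # ~-0.167: approaches mean(x) as beta -> 0

# n = 2 case equals the sigmoid form: (a - b) * sigmoid(beta * (a - b)) + b
a, b, beta = torch.tensor(0.7), torch.tensor(-0.3), torch.tensor(2.0)
lhs = smooth_max(torch.stack([a, b]), beta)
rhs = (a - b) * torch.sigmoid(beta * (a - b)) + b
print(torch.allclose(lhs, rhs))   # True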

Meta-ACON

  • In Meta-ACON, the $\beta$ in ACON goes from being a learnable parameter to being generated by a small network (ACON -> Meta-ACON); this network has the same two-fc-layer structure as the channel-scaling branch in SENet

Code (ACON-C and Meta-ACON-C; a short usage sketch follows the listing)

import torch
from torch import nn


class AconC(nn.Module):
    r"""ACON activation (activate or not).
    # AconC: (p1*x - p2*x) * sigmoid(beta*(p1*x - p2*x)) + p2*x, beta is a learnable parameter
    # according to "Activate or Not: Learning Customized Activation" <https://arxiv.org/pdf/2009.04759.pdf>.
    """

    def __init__(self, width):
        super().__init__()
        # per-channel parameters p1, p2 and the switching factor beta
        self.p1 = nn.Parameter(torch.randn(1, width, 1, 1))
        self.p2 = nn.Parameter(torch.randn(1, width, 1, 1))
        self.beta = nn.Parameter(torch.ones(1, width, 1, 1))

    def forward(self, x):
        return (self.p1 * x - self.p2 * x) * torch.sigmoid(
            self.beta * (self.p1 * x - self.p2 * x)
        ) + self.p2 * x


class MetaAconC(nn.Module):
    r"""ACON activation (activate or not).
    # MetaAconC: (p1*x - p2*x) * sigmoid(beta*(p1*x - p2*x)) + p2*x, beta is generated by a small network
    # according to "Activate or Not: Learning Customized Activation" <https://arxiv.org/pdf/2009.04759.pdf>.
    """

    def __init__(self, width, r=16):
        super().__init__()
        # SENet-style bottleneck: 1x1 conv down to max(r, width // r) channels, then back up to width
        self.fc1 = nn.Conv2d(
            width, max(r, width // r), kernel_size=1, stride=1, bias=True
        )
        self.bn1 = nn.BatchNorm2d(max(r, width // r))
        self.fc2 = nn.Conv2d(
            max(r, width // r), width, kernel_size=1, stride=1, bias=True
        )
        self.bn2 = nn.BatchNorm2d(width)

        self.p1 = nn.Parameter(torch.randn(1, width, 1, 1))
        self.p2 = nn.Parameter(torch.randn(1, width, 1, 1))

    def forward(self, x):
        # global average pooling over the spatial dims, then the two-layer
        # fc (1x1 conv) network produces a per-channel beta
        pooled = x.mean(dim=2, keepdim=True).mean(dim=3, keepdim=True)
        beta = torch.sigmoid(self.bn2(self.fc2(self.bn1(self.fc1(pooled)))))
        return (self.p1 * x - self.p2 * x) * torch.sigmoid(
            beta * (self.p1 * x - self.p2 * x)
        ) + self.p2 * x

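A minimal usage sketch (my own example, continuing from the listing above; layer sizes are arbitrary): AconC / MetaAconC are per-channel activations, so `width` must equal the channel count of the preceding layer, and they drop in where nn.ReLU() would normally sit.

import torch
from torch import nn

block = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),
    nn.BatchNorm2d(64),
    AconC(64),  # drop-in replacement for nn.ReLU(); MetaAconC(64) works the same way
)
x = torch.randn(2, 3, 32, 32)
print(block(x).shape)  # torch.Size([2, 64, 32, 32])
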
Results

acon2.png

Thoughts

  • ACON is an activation function with dynamic (learnable) upper and lower bounds, and it gives a fairly general explanation of what the smooth maximum mechanism does