
DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients

URL

https://arxiv.org/pdf/1606.06160.pdf

TL;DR

  • DoReFa-Net is a neural network quantization method that quantizes weights, activations, and gradients

  • Importance of bit-width in a quantized model: gradient_bits > activation_bits > weight_bits

Algorithm

Understanding the algorithm from the implementation perspective:

Overall pipeline

[Figure: dorefa, the overall DoReFa-Net quantization pipeline]

quantization of weights (v1) ($f_\omega^W$ in the figure above)

  • STE (straight-through estimator)

    A method for manually specifying the derivative of an inherently non-differentiable function (e.g., the rounding function round() or the sign function sign()); quantized networks cannot do without STE

  • When w_bits == 1:

    $Forward: r_o = sign(r_i) \times E_F(|r_i|)$

    $Backward: \frac{\partial c}{\partial r_i} = \frac{\partial c}{\partial r_o}$

    where $E_F(|r_i|)$ is the mean of the absolute values of the layer's weights

  • When w_bits > 1:

    $Forward\_v1: r_o = 2\, quantize_k\left(\frac{tanh(r_i)}{2\, max(|tanh(r_i)|)} + \frac{1}{2}\right) - 1$

    $Backward: \frac{\partial c}{\partial r_i} = \frac{\partial c}{\partial r_o} \frac{\partial r_o}{\partial r_i}$

    where quantize_k() is a quantization function that uses a rounding-like method to cluster the $2^{32}$ representable values in $[0, 1]$ down to $2^{w\_bits}$ levels in $[0, 1]$

    • quantize_k():

      $Forward: r_o = \frac{1}{2^k-1} round((2^k-1)\, r_i)$

      $Backward: \frac{\partial c}{\partial r_i} = \frac{\partial c}{\partial r_o}$

  • quantization of weights: source code

    import torch
    import torch.nn as nn


    def uniform_quantize(k):
      class qfn(torch.autograd.Function):

        @staticmethod
        def forward(ctx, input):
          if k == 32:
            # full precision: pass through unchanged
            out = input
          elif k == 1:
            # binary case: keep only the sign
            out = torch.sign(input)
          else:
            # k-bit case: round to the nearest of 2^k uniform levels in [0, 1]
            n = float(2 ** k - 1)
            out = torch.round(input * n) / n
          return out

        @staticmethod
        def backward(ctx, grad_output):
          # STE (do nothing in backward)
          grad_input = grad_output.clone()
          return grad_input

      return qfn.apply


    class weight_quantize_fn(nn.Module):
      def __init__(self, w_bit):
        super(weight_quantize_fn, self).__init__()
        assert w_bit <= 8 or w_bit == 32
        self.w_bit = w_bit
        self.uniform_q = uniform_quantize(k=w_bit)

      def forward(self, x):
        if self.w_bit == 32:
          # full precision: no quantization
          weight_q = x
        elif self.w_bit == 1:
          # binary weights: sign(x) scaled by E, the mean absolute value
          E = torch.mean(torch.abs(x)).detach()
          weight_q = self.uniform_q(x / E) * E
        else:
          # multi-bit weights: squash with tanh, map into [0, 1],
          # quantize, then map back to [-max_w, max_w]
          weight = torch.tanh(x)
          max_w = torch.max(torch.abs(weight)).detach()
          weight = weight / 2 / max_w + 0.5
          weight_q = max_w * (2 * self.uniform_q(weight) - 1)
        return weight_q
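
    To make the code above concrete, here is a small, hypothetical usage example (the tensor values and the choice w_bit=2 are illustrative only). With w_bit = 2, quantize_k snaps values in $[0, 1]$ onto the 4 levels {0, 1/3, 2/3, 1}, so the quantized weights take at most $2^2 = 4$ distinct values, while the STE backward still delivers gradients to the latent full-precision weights:

    import torch

    w = torch.randn(4, 4, requires_grad=True)  # latent full-precision weights
    quant = weight_quantize_fn(w_bit=2)

    w_q = quant(w)
    print(torch.unique(w_q))   # at most 4 levels: max_w * {-1, -1/3, 1/3, 1}

    # thanks to the STE, round() is treated as identity in backward,
    # so the latent weights still receive gradients
    w_q.sum().backward()
    print(w.grad.shape)        # torch.Size([4, 4])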

quantization of activations ($f_\alpha^A$ in the figure above)

  • Quantize the output of each layer (a sketch follows below):

    $f^A_\alpha(r) = quantize_k(r)$
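
The post gives no source code for this step, so the following is a minimal sketch, reusing the uniform_quantize STE helper defined above and assuming (as an implementation choice) that activations are first clipped to $[0, 1]$ so that quantize_k receives bounded inputs:

    import torch
    import torch.nn as nn

    class activation_quantize_fn(nn.Module):
      def __init__(self, a_bit):
        super(activation_quantize_fn, self).__init__()
        assert a_bit <= 8 or a_bit == 32
        self.a_bit = a_bit
        self.uniform_q = uniform_quantize(k=a_bit)  # STE helper from the weight code

      def forward(self, x):
        if self.a_bit == 32:
          return x
        # clip to [0, 1] before quantizing; quantize_k assumes bounded inputs
        return self.uniform_q(torch.clamp(x, 0.0, 1.0))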

quantization of gradients ($f_\gamma^G$ in the figure above)

  • Quantize the gradients (a sketch follows after this list):

    $f^k_\gamma(dr) = 2\, max_0(|dr|) \left[ quantize_k\left( \frac{dr}{2\, max_0(|dr|)} + \frac{1}{2} + N(k) \right) - \frac{1}{2} \right]$

    where $dr$ is the gradient propagated back from the upstream layers, $k$ is gradient_bits, and $N(k)$ is random uniform noise: $N(k) = \frac{\sigma}{2^k-1},\ \ \sigma \sim Uniform(-0.5, 0.5)$

  • Since gradient information is not needed at the on-device inference stage, most models are never trained on-device anyway, and training with low-bitwidth gradients hurts model accuracy, gradient quantization is rarely used in practice
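
Although rarely deployed, the formula above can be sketched as a torch.autograd.Function, as below. Two details are implementation assumptions rather than facts from the post: $max_0$ is taken per sample over all non-batch axes, and the input of quantize_k is clamped to $[0, 1]$ for numerical safety.

    import torch

    def grad_quantize(k):
      class gqfn(torch.autograd.Function):

        @staticmethod
        def forward(ctx, x):
          # identity in forward: this op only reshapes the backward signal
          return x

        @staticmethod
        def backward(ctx, dr):
          n = float(2 ** k - 1)
          # max_0(|dr|): maximum over all axes except the mini-batch axis
          max0 = dr.abs().flatten(1).max(dim=1).values
          max0 = max0.view(-1, *([1] * (dr.dim() - 1)))
          # N(k) = sigma / (2^k - 1), sigma ~ Uniform(-0.5, 0.5)
          noise = (torch.rand_like(dr) - 0.5) / n
          x = dr / (2 * max0) + 0.5 + noise
          x = torch.clamp(x, 0.0, 1.0)   # keep quantize_k input in [0, 1] (assumption)
          x = torch.round(x * n) / n     # quantize_k
          return 2 * max0 * (x - 0.5)

      return gqfn.apply

A layer output y would be wrapped as y = grad_quantize(k=6)(y): the forward value is untouched and only the backward gradient is quantized (k=6 matches the W1-A2-G6 configuration reported in the paper).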

Other notes

  • Since the output distribution of the last layer differs from the distributions inside the model, the model's output layer is not quantized, in order to preserve accuracy (steps 5 and 6 in the figure above)

  • Since the first layer's input distribution differs from that of the intermediate layers, and its input channel count is small so leaving the weights unquantized costs little, the first layer's weights are not quantized (see the sketch after this list)

  • Fusing multiple steps so that intermediate results are never stored reduces the memory the model needs at run time
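
Putting the layer-selection rules together, a hypothetical quantized convolution layer might look like the sketch below (Conv2d_Q is an assumed name, reusing weight_quantize_fn from the source code above); the first and last layers of the network would simply keep using nn.Conv2d:

    import torch.nn as nn
    import torch.nn.functional as F

    class Conv2d_Q(nn.Conv2d):
      def __init__(self, in_channels, out_channels, kernel_size, w_bit, **kwargs):
        super(Conv2d_Q, self).__init__(in_channels, out_channels, kernel_size, **kwargs)
        self.quantize_w = weight_quantize_fn(w_bit)  # defined in the source code above

      def forward(self, x):
        # re-quantize the latent full-precision weights on every forward pass;
        # gradients still reach self.weight through the STE
        w_q = self.quantize_w(self.weight)
        return F.conv2d(x, w_q, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)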