
Two-Step Quantization for Low-bit Neural Networks

URL

http://openaccess.thecvf.com/content_cvpr_2018/papers/Wang_Two-Step_Quantization_for_CVPR_2018_paper.pdf

TL;DR

  • If the quantization of weights and the quantization of activations are learned jointly, the model is hard to converge, so the process is split into two stages: code learning and transformation function learning
  • code learning: keep the weights at full precision and quantize the activations
  • transformation function learning: quantize the weights and learn the mapping $A_{l-1} \to A_l$
  • Final result: TSQ with 2-bit activations + ternary weights is only 0.5 percentage points less accurate than the official full-precision model

Algorithm

Conventional quantized networks

  • Optimization problem

    \begin{aligned}
    \underset{\{W_l\}}{\operatorname{minimize}}\quad & \mathcal L(Z_L, y) \\
    \text{subject to}\quad & \hat W_l = Q_W(W_l) \\
    & \hat Z_l = \hat W_l \hat A_{l-1} \\
    & A_l = \psi(\hat Z_l) \\
    & \hat A_l = Q_A(A_l), \quad \text{for } l=1,2,\dots,L
    \end{aligned}

  • Reasons convergence is difficult (a small sketch of both issues follows this list)

    • Because of $Q_W(\cdot)$, a gradient step $\mu\frac{\partial L}{\partial W}$ on the latent weights rarely changes $\hat W$, so $W$ is effectively updated very slowly
    • The STE used for $Q_A(\cdot)$ introduces high variance into the gradients
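
To make these two issues concrete, below is a minimal NumPy sketch of the quantizers in the conventional pipeline above. The ternary weight quantizer, the uniform activation quantizer, and the clipping range are illustrative assumptions of this sketch, not the exact choices of the paper.

```python
import numpy as np

def quantize_weights(w):
    # Illustrative ternary weight quantizer Q_W (not the paper's exact choice).
    # A small gradient step on the latent full-precision w usually does not
    # move any entry across the +/- 0.5*alpha threshold, so Q_W(w) -- and
    # therefore the loss -- often does not change at all: W updates slowly.
    alpha = np.abs(w).mean()
    return alpha * np.sign(w) * (np.abs(w) > 0.5 * alpha)

def quantize_activations(a, n_bits=2, a_max=1.0):
    # Illustrative uniform activation quantizer Q_A.
    levels = 2 ** n_bits - 1
    return np.round(np.clip(a, 0.0, a_max) / a_max * levels) * a_max / levels

def ste_grad(grad_out, a, a_max=1.0):
    # Straight-through estimator for Q_A: pretend dQ_A/da = 1 on [0, a_max].
    # The gap between this surrogate and the true (zero almost everywhere)
    # derivative is the source of the high-variance gradients mentioned above.
    return grad_out * ((a >= 0.0) & (a <= a_max))

# Tiny forward pass: Z_hat = Q_W(W) @ Q_A(A_prev)
rng = np.random.default_rng(0)
W, A_prev = rng.normal(size=(4, 8)), rng.uniform(size=(8, 3))
Z_hat = quantize_weights(W) @ quantize_activations(A_prev)
```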

Two-Step Quantization (TSQ)

  • step 1: code learning

    Based on HWGQ, with the following differences:

    • weights are kept at full precision

    • a hyperparameter is introduced: the sparsity threshold $\epsilon \ge 0$ (in the open-source code, $\epsilon = 0.32$, $\delta = 0.6487$)

      $ Q_{\epsilon}(x)=\left\{\begin{array}{ll}{q_{i}^{\prime}} & {x \in\left(t_{i}^{\prime}, t_{i+1}^{\prime}\right]} \\ {0} & {x \leq \epsilon}\end{array}\right. $

    • The purpose of this hyperparameter is to make the network focus more on the high activations (a minimal sketch of such a quantizer is given right after this step)
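
A minimal sketch of this sparsity-aware quantizer, assuming HWGQ-style bin edges $t'_i$ and levels $q'_i$ are already given; the concrete edges and levels below are placeholders, only $\epsilon = 0.32$ comes from the post.

```python
import numpy as np

def q_eps(x, thresholds, levels, eps=0.32):
    """Sparse activation quantizer Q_eps (sketch).

    `thresholds` are the bin edges t'_i and `levels` the quantized values q'_i
    of an HWGQ-style quantizer; everything at or below the sparsity threshold
    `eps` is forced to 0 so that the network focuses on the high activations.
    """
    x = np.asarray(x, dtype=np.float64)
    # right=True matches the half-open intervals (t'_i, t'_{i+1}] in the formula.
    idx = np.digitize(x, thresholds, right=True)      # 0 .. len(thresholds)
    q = np.concatenate(([0.0], levels))[idx]          # bin index -> level
    q[x <= eps] = 0.0                                 # sparsity threshold
    return q

# Placeholder 2-bit levels: zero plus three positive levels.
ts = np.array([0.32, 0.9, 1.6])   # t'_i (placeholders)
qs = np.array([0.65, 1.2, 2.0])   # q'_i (placeholders)
print(q_eps([0.1, 0.5, 1.0, 3.0], ts, qs))   # -> [0.   0.65 1.2  2.  ]
```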

  • step 2: transformation function learning

    My understanding of this step: the network with full-precision weights is used to distill the network with low-bit weights

    \begin{aligned} \underset{\Lambda, \hat{W}}{\operatorname{minimize}} \left\|Y-Q_{\epsilon}(\Lambda \hat{W} X)\right\|_{F}^{2} = \operatorname{minimize}_{\left\{\alpha_{i}\right\},\left\{\hat{w}_{i}^{T}\right\}} \sum_{i}\left\|y_{i}^{T}-Q_{\epsilon}\left(\alpha_{i} \hat{w}_{i}^{T} X\right)\right\|_{2}^{2} \end{aligned}

    where $\alpha_i$ is the scaling factor of each convolution kernel, and $X$ and $Y$ denote $\hat A_{l-1}$ and $\hat A_l$ respectively (the network with full-precision weights and quantized activations distills the network with quantized weights and quantized activations)

    An auxiliary variable $z$ is introduced to decompose the transformation function learning:

    \underset{\alpha, w, z}{\operatorname{minimize}} \quad\left\|y-Q_{\epsilon}(z)\right\|_{2}^{2}+\lambda\left\|z-\alpha X^{T} \hat{w}\right\|_{2}^{2}

    • Solving $\alpha$ and $\hat{w}$ with $z$ fixed:

      \underset{\alpha, \hat{w}}{\operatorname{minimize}} \quad J(\alpha, \hat{w})=\left\|z-\alpha X^{T} \hat{w}\right\|_{2}^{2}

      J(\alpha, \hat{w})=z^{T} z-2 \alpha z^{T} X^{T} \hat{w}+\alpha^{2} \hat{w}^{T} X X^{T} \hat{w}

      \alpha^{*}=\frac{z^{T} X^{T} \hat{w}}{\hat{w}^{T} X X^{T} \hat{w}}

      \hat{w}^{*}=\underset{\hat{w}}{\operatorname{argmax}} \frac{\left(z^{T} X^{T} \hat{w}\right)^{2}}{\hat{w}^{T} X X^{T} \hat{w}}

    • Solving $z$ with $\alpha$ and $\hat{w}$ fixed:

      \underset{z_{i}}{\operatorname{minimize}} \quad\left(y_{i}-Q_{\epsilon}\left(z_{i}\right)\right)^{2}+\lambda\left(z_{i}-v_{i}\right)^{2}

      where $v_i$ is the $i$-th element of $\alpha X^T \hat w$; the candidate solutions below (one per region of $Q_{\epsilon}$, with $M$ the quantizer's upper bound) are evaluated and the one with the smallest objective value is kept:

      z_{i}^{(0)}=\min \left(0, v_{i}\right)

      z_{i}^{(1)}=\min \left(M, \max \left(0, \frac{\lambda v_{i}+y_{i}}{1+\lambda}\right)\right)

      z_{i}^{(2)}=\max \left(M, v_{i}\right)

      • Use Optimal Ternary Weights Approximation (OTWA) to initialize $\alpha$ and $\hat W$

        \min_{\alpha, \hat{w}} \|w-\alpha \hat{w}\|_{2}^{2} \quad \text{subject to} \quad \alpha>0, \ \hat{w} \in\{-1,0,+1\}^{m}

        \alpha^{*} =\frac{w^{T} \hat{w}}{\hat{w}^{T} \hat{w}}

        \hat{w}^{*} =\underset{\hat{w}}{\operatorname{argmax}} \frac{\left(w^{T} \hat{w}\right)^{2}}{\hat{w}^{T} \hat{w}}

        \hat{w}_{j}=\left\{\begin{array}{ll}{\operatorname{sign}\left(w_{j}\right)} & {\operatorname{abs}\left(w_{j}\right) \text { in top } r \text { of } \operatorname{abs}(w)} \\ {0} & {\text { others }}\end{array}\right.

      • Computation of the initial values of $\alpha$ and $\hat{w}$ (OTWA); a NumPy sketch of the whole step-2 procedure follows this list

        (figure in the original post: derivation of the initial values)
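
Putting step 2 together, here is a minimal NumPy sketch of the pieces derived above for a single output channel: OTWA initialization, the closed-form $\alpha^*$, and the element-wise $z$ update that evaluates the three candidate solutions and keeps the best one. The quantizer `q_eps` (e.g. the sketch from step 1), its upper bound `M`, and the alternating loop in the trailing comments are assumptions of this sketch; in particular, the ternary $\hat w$ update with $z$ fixed is only solved approximately in the paper and is left out here.

```python
import numpy as np

def otwa_init(w):
    """Optimal Ternary Weights Approximation: min ||w - alpha * w_hat||^2.

    For every sparsity level r, keep the r largest-magnitude entries of w with
    their signs and score the result by (w^T w_hat)^2 / (w_hat^T w_hat); the
    best r gives w_hat*, and alpha* follows in closed form.
    """
    order = np.argsort(-np.abs(w))
    best_score, best_w_hat = -np.inf, None
    for r in range(1, w.size + 1):
        w_hat = np.zeros_like(w)
        w_hat[order[:r]] = np.sign(w[order[:r]])
        score = (w @ w_hat) ** 2 / (w_hat @ w_hat)
        if score > best_score:
            best_score, best_w_hat = score, w_hat
    alpha = (w @ best_w_hat) / (best_w_hat @ best_w_hat)
    return alpha, best_w_hat

def solve_alpha(z, X, w_hat):
    """Closed-form alpha* = (z^T X^T w_hat) / (w_hat^T X X^T w_hat)."""
    Xw = X.T @ w_hat                       # X: (m, N) inputs, z: (N,) targets
    return (z @ Xw) / (Xw @ Xw)

def solve_z(y, v, lam, q_eps, M):
    """Element-wise z update with alpha and w_hat fixed.

    v = alpha * X^T w_hat; the three candidates from the post are evaluated
    and, per element, the one minimizing (y - Q_eps(z))^2 + lam*(z - v)^2 is
    kept.  `q_eps` is a vectorized quantizer and `M` its upper bound (both
    assumptions of this sketch).
    """
    cand = np.stack([
        np.minimum(0.0, v),
        np.clip((lam * v + y) / (1.0 + lam), 0.0, M),
        np.maximum(M, v),
    ])
    obj = (y - q_eps(cand)) ** 2 + lam * (cand - v) ** 2
    return cand[np.argmin(obj, axis=0), np.arange(v.size)]

# Alternating optimization for one output channel (sketch):
#   q_eps: vectorized quantizer (e.g. a functools.partial of the step-1 sketch)
#   alpha, w_hat = otwa_init(w_full_precision)
#   for _ in range(num_iters):
#       v = alpha * (X.T @ w_hat)
#       z = solve_z(y, v, lam, q_eps, M)
#       alpha = solve_alpha(z, X, w_hat)
#       # the ternary w_hat update given z is solved approximately in the
#       # paper and is omitted from this sketch
```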

Thoughts

  • Splitting the quantization of weights from the quantization of activations is a natural way to simplify the quantization problem
  • Turning weight quantization into a kind of self-distillation is similar in spirit to quantization bit-width decay