ALiBi: Train short, test long: Attention with linear biases enables input length extrapolation

URL

TL;DR

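This paper proposes ALiBi (Attention with Linear Biases), a position embedding method that is simpler than T5 bias and works well in practice.
It can be trained on short sequences and tested on long ones, i.e. it extrapolates to longer inputs.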

Algorithm

T5 bias

First, a look at how T5 bias implements position embedding. There are three main steps:

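Compute the n * n relative position matrix between query and key positions (entry [i][j] is key position j minus query position i), of the form: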

[[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9],
 [-1,  0,  1,  2,  3,  4,  5,  6,  7,  8],
 [-2, -1,  0,  1,  2,  3,  4,  5,  6,  7],
 [-3, -2, -1,  0,  1,  2,  3,  4,  5,  6],
 [-4, -3, -2, -1,  0,  1,  2,  3,  4,  5],
 [-5, -4, -3, -2, -1,  0,  1,  2,  3,  4],
 [-6, -5, -4, -3, -2, -1,  0,  1,  2,  3],
 [-7, -6, -5, -4, -3, -2, -1,  0,  1,  2],
 [-8, -7, -6, -5, -4, -3, -2, -1,  0,  1],
 [-9, -8, -7, -6, -5, -4, -3, -2, -1,  0]]
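
The matrix above can be rebuilt in a few lines. This is a minimal sketch in plain Python; the function name `relative_position_matrix` is made up for illustration and is not taken from any library.

```python
def relative_position_matrix(n):
    """Entry [i][j] is key position j minus query position i."""
    return [[j - i for j in range(n)] for i in range(n)]


rel = relative_position_matrix(10)
for row in rel:
    print(row)
```

Row i is just row 0 shifted down by i, so the matrix depends only on the offset j - i, not on absolute positions.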

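Bucket the relative position matrix (small distances each get their own bucket, larger distances share logarithmically spaced buckets, and distances beyond the maximum saturate at the last bucket):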

[[ 0, 17, 18, 19, 20, 21, 22, 23, 24, 24],
 [ 1,  0, 17, 18, 19, 20, 21, 22, 23, 24],
 [ 2,  1,  0, 17, 18, 19, 20, 21, 22, 23],
 [ 3,  2,  1,  0, 17, 18, 19, 20, 21, 22],
 [ 4,  3,  2,  1,  0, 17, 18, 19, 20, 21],
 [ 5,  4,  3,  2,  1,  0, 17, 18, 19, 20],
 [ 6,  5,  4,  3,  2,  1,  0, 17, 18, 19],
 [ 7,  6,  5,  4,  3,  2,  1,  0, 17, 18],
 [ 8,  7,  6,  5,  4,  3,  2,  1,  0, 17],
 [ 8,  8,  7,  6,  5,  4,  3,  2,  1,  0]]

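Both the upper and lower triangles carry non-zero values here because the encoder is bidirectional (bidirectional=True). For a decoder, future positions all fall into bucket 0, and the result is: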

[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [2, 1, 0, 0, 0, 0, 0, 0, 0, 0],
 [3, 2, 1, 0, 0, 0, 0, 0, 0, 0],
 [4, 3, 2, 1, 0, 0, 0, 0, 0, 0],
 [5, 4, 3, 2, 1, 0, 0, 0, 0, 0],
 [6, 5, 4, 3, 2, 1, 0, 0, 0, 0],
 [7, 6, 5, 4, 3, 2, 1, 0, 0, 0],
 [8, 7, 6, 5, 4, 3, 2, 1, 0, 0],
 [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]]
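
Both bucket matrices above can be reproduced with the sketch below. It is modeled on the T5 bucketing logic as I understand it, not copied from any implementation; the function name and the defaults `num_buckets=32` and `max_distance=128` are assumptions, chosen because they give the same numbers as the matrices above.

```python
import math


def relative_position_bucket(rel_pos, bidirectional=True,
                             num_buckets=32, max_distance=128):
    """Map one relative position (key index minus query index) to a bucket id."""
    offset = 0
    if bidirectional:
        # Half of the buckets for keys before the query, half for keys after.
        num_buckets //= 2
        if rel_pos > 0:
            offset = num_buckets
        dist = abs(rel_pos)
    else:
        # Causal case: future keys (rel_pos > 0) collapse to distance 0.
        dist = max(-rel_pos, 0)

    max_exact = num_buckets // 2
    if dist < max_exact:
        # Small distances each get their own bucket.
        return offset + dist

    # Larger distances share log-spaced buckets up to max_distance,
    # and everything beyond that saturates at the last bucket.
    log_bucket = max_exact + int(
        math.log(dist / max_exact)
        / math.log(max_distance / max_exact)
        * (num_buckets - max_exact)
    )
    return offset + min(log_bucket, num_buckets - 1)


n = 10
encoder_buckets = [[relative_position_bucket(j - i, bidirectional=True)
                    for j in range(n)] for i in range(n)]
decoder_buckets = [[relative_position_bucket(j - i, bidirectional=False)
                    for j in range(n)] for i in range(n)]
```

With n = 10 the decoder case never leaves the exact range (distances below 16), which is why its matrix looks like a plain clamped distance matrix.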

Finally, a learnable embedding maps the n * n relative position buckets to an n * n * num_heads tensor, which is added to each head's attention scores (before softmax). A row-wise softmax then yields the attention weights.
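
The third step can be sketched as follows, in plain Python with nested lists standing in for tensors. Everything here is illustrative: `t5_bias_attention` is a hypothetical name, the embedding table is filled with random numbers instead of learned values, and the causal mask is left out to keep the example short.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    top = max(row)
    exps = [math.exp(x - top) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def t5_bias_attention(scores, buckets, table):
    """scores:  [num_heads][n][n] attention scores before softmax
    buckets: [n][n] relative position bucket ids
    table:   [num_buckets][num_heads] embedding (learned in a real model)
    """
    weights = []
    for h, head_scores in enumerate(scores):
        head_weights = []
        for i, row in enumerate(head_scores):
            # Look up this head's scalar bias for every (query, key) pair.
            biased = [s + table[buckets[i][j]][h] for j, s in enumerate(row)]
            head_weights.append(softmax(biased))
        weights.append(head_weights)
    return weights


random.seed(0)
n, num_heads, num_buckets = 4, 2, 32
# Decoder-style buckets; for small n they equal the clamped distance.
buckets = [[max(i - j, 0) for j in range(n)] for i in range(n)]
table = [[random.gauss(0, 1) for _ in range(num_heads)]
         for _ in range(num_buckets)]
scores = [[[random.gauss(0, 1) for _ in range(n)] for _ in range(n)]
          for _ in range(num_heads)]
weights = t5_bias_attention(scores, buckets, table)
```

Each head reads a different column of the table, so the same bucket can bias different heads in different ways.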

ALiBi

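In mathematical form: $\mathrm{softmax}(q_iK^T+m\cdot[-(i-1),\ldots,-2,-1,0])$
ALiBi's computation is almost identical to the first two steps of T5 bias.
In the third step, instead of mapping to each head through a learnable embedding, the values of the distance matrix are multiplied by a per-head, non-learnable constant m and then added to the attention scores.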
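$m_h = \frac{b}{(2^{(8/H)} \cdot b)^h}$
- b is a base; with b = 1 the formula reduces to $m_h = 2^{-8h/H}$
- H is the number of attention heads
- h is the index of the attention head (from 1 to H, so that with b = 1 the slopes form the paper's geometric sequence $2^{-8/H}, \ldots, 2^{-8}$)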
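
A small sketch of the slope formula and the resulting bias vector. It assumes b = 1 and head indices counted from 1 to H, which for H = 8 gives the slopes 1/2, 1/4, ..., 1/256 described in the ALiBi paper. The function names are hypothetical.

```python
def alibi_slopes(num_heads, b=1.0):
    """m_h = b / (2^(8/H) * b)^h, with h counted from 1 to H (an assumption)."""
    ratio = 2 ** (8 / num_heads) * b
    return [b / ratio ** h for h in range(1, num_heads + 1)]


def alibi_bias_row(i, slope):
    """Bias for the i-th query (1-indexed): slope * [-(i-1), ..., -2, -1, 0]."""
    return [slope * (j - i) for j in range(1, i + 1)]


slopes = alibi_slopes(8)
print(slopes)                         # [0.5, 0.25, ..., 0.00390625]
print(alibi_bias_row(4, slopes[0]))   # [-1.5, -1.0, -0.5, 0.0]
```

The bias row is added to $q_iK^T$ before the softmax. Nothing here is learned, and the penalty keeps growing linearly with distance no matter how long the sequence gets.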

Thought

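The standard attention $pe \in \mathbb{R}^{n\times d}$ is gradually being phased out. RoPE, T5 Bias, and ALiBi have all moved toward a $pe \in \mathbb{R}^{n\times n}$ that acts directly on the attention scores.
ALiBi's extrapolation essentially comes from forcibly saturating long-distance attention, which feels a bit too crude…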