T5: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

URL

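A prompt prepends a description of the task to the input.
For example, a call to an english_to_french("Hello, world!") API is turned into the plain-text prompt "translate English to French: Hello, world!" before it is fed to the model.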
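
A minimal sketch of this prefixing step. The function name english_to_french is illustrative only; it is not part of any library.

```python
def english_to_french(text: str) -> str:
    """Turn a translation request into a plain-text T5-style prompt."""
    task_prefix = "translate English to French: "
    return task_prefix + text


prompt = english_to_french("Hello, world!")
print(prompt)
```

The model only ever sees the resulting string, so changing the task means changing the prefix, not the model interface.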

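T5 tokenizes with the unigram algorithm (via SentencePiece); the vocabulary can be found at https://huggingface.co/google-t5/t5-base/blob/main/tokenizer.json
"translate English to French: Hello, world!" is tokenized to: [13959, 1566, 12, 2379, 10, 8774, 6, 296, 55, 1] (the trailing 1 is the end-of-sequence token </s>)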

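Unlike the GPT series, which adds a learned absolute position embedding to the token embedding, T5 adds nothing position-dependent to the token embedding.
T5 uses relative position encoding rather than absolute position encoding: a learned scalar bias, indexed by the (bucketed) offset between the query position and the key position, is added to the attention logits.
Unlike the GPT series, which injects the position embedding only at the input of the first causal decoder layer, T5 applies its position bias in the self attention of every encoder and decoder layer (the bias parameters are shared across layers).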
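
A simplified sketch of a relative position bias for one attention head. It clips the offset to a fixed range instead of using T5's logarithmic bucketing, and the names relative_position_bias, bias_table and max_distance are chosen here for illustration.

```python
import random


def relative_position_bias(query_len, key_len, bias_table, max_distance):
    """Return bias[i][j], which depends only on the clipped offset j - i."""
    bias = []
    for i in range(query_len):
        row = []
        for j in range(key_len):
            offset = max(-max_distance, min(max_distance, j - i))
            row.append(bias_table[offset + max_distance])
        bias.append(row)
    return bias


max_distance = 2
rng = random.Random(0)
# One scalar per clipped offset in [-max_distance, max_distance];
# in a real model these would be learned parameters.
bias_table = [rng.uniform(-1.0, 1.0) for _ in range(2 * max_distance + 1)]
bias = relative_position_bias(4, 4, bias_table, max_distance)
# Inside attention: logits[i][j] = dot(q_i, k_j) + bias[i][j], then softmax.
```

Because the bias depends only on j - i, the same table works for any sequence length, which an absolute position embedding table cannot do beyond its trained length.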

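Unlike the GPT series, which feeds token embedding + position embedding directly into the decoder as the hidden state, T5 has an encoder.
The T5 encoder follows the standard transformer encoder design: every token can attend to every other token (bidirectional attention).
The encoder has 12 layers (in t5-base); each layer consists of the following, in order: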
- self attention block
  - layer norm
  - self attention
  - dropout
- FFN
  - layer norm
  - MLP
  - dropout
The encoder finally outputs an encoder hidden state of shape = (batch, input_token_len, encoder_dim)
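
The layer layout above can be sketched as follows. This assumes the usual pre-norm arrangement with a residual connection around each block (the residual add is not listed above), and uses a scale-only norm without the learned weight. self_attention and mlp are passed in as plain functions, so this is a structural sketch, not the actual T5 implementation.

```python
import math


def rms_norm(x, eps=1e-6):
    """Rescale a vector by its root mean square (no mean subtraction, no bias)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def sublayer_block(hidden, sublayer):
    """Apply norm -> sublayer -> (dropout) -> residual add, token by token."""
    normed = [rms_norm(tok) for tok in hidden]
    out = sublayer(normed)
    # Dropout is the identity at inference time, so it is omitted here.
    return [[h + o for h, o in zip(h_tok, o_tok)]
            for h_tok, o_tok in zip(hidden, out)]


def encoder_layer(hidden, self_attention, mlp):
    """One encoder layer: self attention block followed by FFN block."""
    hidden = sublayer_block(hidden, self_attention)
    hidden = sublayer_block(hidden, mlp)
    return hidden


def identity(tokens):
    """Placeholder sublayer that returns its input unchanged."""
    return tokens


hidden = [[1.0, 2.0], [3.0, 4.0]]  # (input_token_len=2, encoder_dim=2)
out = encoder_layer(hidden, identity, identity)
```

Each block preserves the (input_token_len, encoder_dim) shape, which is why the layers can be stacked 12 times.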

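Another difference from the GPT series is that T5 needs a separate decoder input at the decoder stage.
Decoding usually starts from a special token such as <BOS> or <S> that marks the start of the sequence, so the initial decoder input is only 1 token long (T5 itself reuses its padding token, id 0, as this start token).
The decoder input goes through the same token embedding step as the encoder input (the embedding table is shared), and position is again handled by the relative position bias inside attention.
The embedded decoder input is the initial decoder hidden state, with shape = (batch, 1, decoder_dim)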

The decoder also has 12 layers (in t5-base); each layer consists of the following, in order:
- self attention block
  - layer norm
  - self attention
  - dropout
- cross attention block
  - layer norm
  - cross attention
  - dropout
- FFN
  - layer norm
  - MLP
  - dropout
The input to self attention is the decoder hidden state (not the encoder hidden state); as in GPT, this self attention is unidirectional (causal).
The decoder hidden state and the encoder hidden state are both fed into cross attention. Cross attention and self attention differ in only one respect:
- in self attention, query / key / value are all computed from the same hidden state, hence "self"
- in cross attention, key / value are computed from one hidden state and query from another, hence "cross"
- in an encoder-decoder transformer, the decoder's cross attention usually computes key / value from the encoder output hidden state and query from the decoder hidden state
- in cross attention, each decoder hidden state can attend to all encoder hidden states
After all 12 layers have run, the output is a decoder output hidden state of shape = (batch, 1, decoder_dim)
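
The self attention / cross attention distinction can be sketched with a single function whose only variable is where query and key / value come from. This is plain scaled dot-product attention with one head and identity Q / K / V projections, a simplification for illustration and not T5's exact variant.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query_states, kv_states, causal=False):
    """Single-head attention; query from query_states, key / value from kv_states."""
    dim = len(query_states[0])
    outputs = []
    for i, q in enumerate(query_states):
        scores = []
        for j, k in enumerate(kv_states):
            if causal and j > i:
                scores.append(float("-inf"))  # position i cannot see later positions
            else:
                scores.append(sum(a * b for a, b in zip(q, k)) / math.sqrt(dim))
        weights = softmax(scores)
        outputs.append([sum(w * v[d] for w, v in zip(weights, kv_states))
                        for d in range(dim)])
    return outputs


encoder_hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 input tokens
decoder_hidden = [[0.5, 0.5], [1.0, 0.0]]              # 2 decoder tokens

# Self attention: query, key and value all come from the decoder hidden state.
self_out = attention(decoder_hidden, decoder_hidden, causal=True)
# Cross attention: query from the decoder, key / value from the encoder.
cross_out = attention(decoder_hidden, encoder_hidden)
```

The output always has one row per query, so cross attention keeps the decoder's sequence length no matter how long the encoder input is.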

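The decoder output hidden state is then projected into vocabulary space by a linear layer (the LM head), and the most likely token is selected.
The word for this token is the first word of the model's output.
If this token is the end-of-sequence token in the vocabulary, generation stops. Otherwise the token is appended to the decoder input (with a key / value cache, only the new token has to be fed in), and the three decoder-side steps (decoder input embedding, the decoder stack, and projecting the decoder output hidden state to a token) are repeated until the maximum output length is reached or the end-of-sequence token appears.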
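
A minimal sketch of this generation loop with greedy selection. toy_logits is a hypothetical stand-in for the decoder stack plus the vocabulary projection, and the sketch feeds the whole decoder input again at every step (no key / value cache).

```python
def greedy_decode(next_token_logits, start_id, eos_id, max_len):
    """Generate token ids one at a time, always picking the highest-scoring token."""
    decoder_input = [start_id]
    generated = []
    for _ in range(max_len):
        logits = next_token_logits(decoder_input)
        next_id = max(range(len(logits)), key=lambda t: logits[t])
        generated.append(next_id)
        if next_id == eos_id:
            break  # end-of-sequence token: stop generating
        decoder_input.append(next_id)
    return generated


def toy_logits(decoder_input):
    """Hypothetical model over a 5-token vocabulary that follows a fixed script."""
    script = [3, 4, 2, 1]  # token favoured at each step; id 1 plays the role of EOS
    step = len(decoder_input) - 1
    target = script[min(step, len(script) - 1)]
    return [1.0 if t == target else 0.0 for t in range(5)]


result = greedy_decode(toy_logits, start_id=0, eos_id=1, max_len=10)
print(result)
```

The loop has two exits, matching the two stopping conditions: the end-of-sequence token and the max_len limit.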

Behavior of a standard transformer

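The encoder takes the whole input text, uses bidirectional attention, and produces the encoder hidden state
The decoder input is initialized to <BOS>, with length 1
Each decoder layer contains:
1. self attention, unidirectional (causal)
2. cross attention, unmasked: the decoder hidden state provides the query, the encoder hidden states provide key and value
3. FFN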