Zhangzhe's Blog

The projection of my life.


LoRA: Low-Rank Adaptation of Large Language Models

URL

TL;DR

  • This paper proposes a large-model fine-tuning technique called Low-Rank Adaptation (LoRA), which sharply cuts the number of trainable parameters (by up to 10,000×) and the GPU memory footprint (by about 3×) during fine-tuning; a back-of-the-envelope check on the per-layer saving follows this list
  • Concretely, small trainable low-rank decomposition matrices are inserted into the Linear and Embedding operators; the original weights are frozen and only the low-rank matrices are trained
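
As a quick back-of-the-envelope check on the parameter saving for a single layer (the 10,000× figure refers to an entire GPT-3-scale model), here is a minimal sketch; the hidden size 4096 and rank 8 are assumed values, not taken from this post:

# Trainable parameters for one d x d Linear layer:
# full fine-tuning updates d * d weights, LoRA only trains the two
# low-rank factors, d * r parameters each.
d = 4096   # assumed hidden size
r = 8      # assumed LoRA rank

full_params = d * d       # 16,777,216
lora_params = 2 * d * r   # 65,536

print(full_params // lora_params)  # 256 -> 256x fewer trainable params for this layer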

Algorithm

Overall workflow

Figure: lora_1.png (LoRA architecture diagram)

  • This figure captures almost everything there is to know about LoRA:
    1. On top of the original Linear weight, LoRA adds the low-rank decomposition matrices $A\in\mathbb{R}^{d\times r},\ B\in\mathbb{R}^{r\times d}$
    2. $r \ll d$, which is where the "low-rank" in the name comes from
    3. The original weights are frozen; only A and B are trained
    4. Matrix A is initialized from a zero-mean normal distribution
    5. Matrix B is initialized to all zeros (a small sketch of this init scheme follows the list)
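
A minimal sketch of the initialization described in points 4 and 5, using the figure's shape convention; the width 512, rank 8, and the 0.01 scale are assumed values:

import torch

d, r = 512, 8                    # assumed layer width and LoRA rank
A = torch.randn(d, r) * 0.01     # zero-mean Gaussian init (the scale is an assumed choice)
B = torch.zeros(r, d)            # all-zero init

delta_W = A @ B                  # the low-rank update, shape (d, d)
print(bool(torch.all(delta_W == 0)))  # True

Because B starts at zero, the low-rank branch contributes nothing at step 0, so fine-tuning starts from exactly the behavior of the frozen pretrained model.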

Corresponding code

import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, r):
        super(LoRALinear, self).__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.r = r
        # Original linear weights, frozen during fine-tuning
        self.weight = nn.Parameter(torch.randn(out_features, in_features), requires_grad=False)
        self.bias = nn.Parameter(torch.zeros(out_features), requires_grad=False)
        # LoRA low-rank factors: A is Gaussian-initialized, B starts at zero,
        # so the update B @ A is zero at the beginning of training
        self.A = nn.Parameter(torch.randn(r, out_features))
        self.B = nn.Parameter(torch.zeros(in_features, r))

    def forward(self, x):
        # Frozen linear layer
        output = torch.matmul(x, self.weight.t()) + self.bias
        # LoRA branch (skipped once the factors have been merged into the weight)
        if hasattr(self, "A"):
            output = output + torch.matmul(x, torch.matmul(self.B, self.A))
        return output

    def convert_to_standard_linear(self):
        # Fold the LoRA update into the original weight (re-parameterization)
        with torch.no_grad():
            merged = self.weight + torch.matmul(self.B, self.A).t()
        self.weight = nn.Parameter(merged)
        # Drop the low-rank factors; the layer is now a plain linear layer
        del self.A
        del self.B
        return self


class LoRATransformerLayer(nn.Module):
    def __init__(self, d_model, r):
        super(LoRATransformerLayer, self).__init__()
        self.d_model = d_model
        self.r = r
        # Self-attention projections: q / k / v carry LoRA, the output projection does not
        self.Wq = LoRALinear(d_model, d_model, r)
        self.Wk = LoRALinear(d_model, d_model, r)
        self.Wv = LoRALinear(d_model, d_model, r)
        self.Wo = nn.Linear(d_model, d_model)

    def forward(self, x):
        # Query / key / value projections
        q = self.Wq(x)
        k = self.Wk(x)
        v = self.Wv(x)
        # Scaled dot-product attention
        attn_scores = torch.matmul(q, k.transpose(-2, -1)) / (self.d_model**0.5)
        attn_weights = torch.softmax(attn_scores, dim=-1)
        attn_output = torch.matmul(attn_weights, v)
        # Output projection
        output = self.Wo(attn_output)
        return output

    def convert_to_standard_transformer(self):
        # Merge all LoRA parameters back into standard linear layers
        self.Wq = self.Wq.convert_to_standard_linear()
        self.Wk = self.Wk.convert_to_standard_linear()
        self.Wv = self.Wv.convert_to_standard_linear()
        return self


# Example usage
d_model = 512
r = 8
layer = LoRATransformerLayer(d_model, r)
input_tensor = torch.randn(10, 32, d_model)
output_tensor = layer(input_tensor)
print(output_tensor.shape)  # torch.Size([10, 32, 512])
# Merge LoRA back into a standard Transformer layer
standard_layer = layer.convert_to_standard_transformer()
print(standard_layer)
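
To see that the re-parameterized merge is lossless, one can compare a layer's output before and after conversion. This check builds on the classes above; the non-zero values written into B only simulate a trained adapter (with B still at its zero init the comparison would be trivially true):

lin = LoRALinear(in_features=512, out_features=512, r=8)
with torch.no_grad():
    lin.B.normal_(0.0, 0.02)      # pretend the adapter has been trained

x = torch.randn(4, 512)
with torch.no_grad():
    y_before = lin(x)                              # frozen W plus separate LoRA branch
    y_after = lin.convert_to_standard_linear()(x)  # LoRA folded into W, branch removed

print(bool(torch.allclose(y_before, y_after, atol=1e-4)))  # True: merging is exact
print([name for name, _ in lin.named_parameters()])        # ['weight', 'bias'] -- A and B are gone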

Practical usage

  1. In practice, LoRA_Layer can replace the Linear and Embedding operators in every Transformer layer of a large model, while the MLPs in the FFN are left untouched (see the sketch after this list)
  2. A separate LoRA adapter can be fine-tuned for each downstream task
  3. At deployment time, the LoRA weights can be merged into the original model by re-parameterization, so inference incurs no extra cost
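
As a rough sketch of point 1, an existing model's attention projections could be swapped for the LoRALinear defined above while copying in the pretrained weights. Nothing here is from the original post: the helper name add_lora and the module names "q_proj", "k_proj", "v_proj" (a LLaMA-style convention) are assumptions and must be adapted to the actual model:

def add_lora(model, r=8, target_names=("q_proj", "k_proj", "v_proj")):
    # Replace the selected nn.Linear modules with LoRALinear, keeping the
    # pretrained weights (frozen) inside the new module.
    for parent in model.modules():
        for name, child in list(parent.named_children()):
            if isinstance(child, nn.Linear) and name in target_names:
                lora = LoRALinear(child.in_features, child.out_features, r)
                with torch.no_grad():
                    lora.weight.copy_(child.weight)
                    if child.bias is not None:
                        lora.bias.copy_(child.bias)
                setattr(parent, name, lora)
    return model

# Only A and B require grad, so the optimizer only ever sees the adapter weights:
# optimizer = torch.optim.AdamW(
#     (p for p in model.parameters() if p.requires_grad), lr=1e-4)

For point 2, the per-task artifact is just the collection of (A, B) pairs, which can be saved and swapped over the same frozen base model; for point 3, calling convert_to_standard_linear on every adapted layer before export removes the extra matmul at inference time.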

Thought

  • It makes sense and is quite handy, but word has it that nobody doing serious large-model pre-training / fine-tuning actually uses it