
CUDA Basics: Optimizing Matrix Multiplication

Posted on 2022-11-21, edited on 2025-03-21, in CUDA

CUDA (Compute Unified Device Architecture) references:

GPU Architecture

Physical model
- A typical GPU contains a set of streaming multiprocessors (SMs). Each SM has many cores, and in hardware the cores of one SM can share memory (shared memory)
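
The quantities in the physical model can be read back from the CUDA runtime. The following is a minimal sketch, assuming device 0 is the GPU of interest; `report_device` is a hypothetical helper name, while `cudaGetDeviceProperties` and the `cudaDeviceProp` fields are part of the CUDA runtime API.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Prints the SM count and shared-memory size of one device.
// Returns the number of SMs, or -1 if the device cannot be queried.
// (report_device is a hypothetical helper, not part of CUDA.)
int report_device(int device = 0) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        return -1;
    }
    std::printf("device: %s\n", prop.name);
    std::printf("streaming multiprocessors: %d\n", prop.multiProcessorCount);
    std::printf("shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    std::printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    return prop.multiProcessorCount;
}

int main() {
    return report_device() > 0 ? 0 : 1;
}
```

The shared-memory figure matters later: it bounds how much data one block can stage at a time.
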
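Logical model
- The logical model introduces three levels, Grid / Block / Thread, which correspond to the physical model as follows: a thread runs on a single core, a block is scheduled onto one SM, and a grid spans the whole device.

  Consequently, the threads of one block can share that SM's shared memory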
Memory Hierarchy

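Shared memory is almost as fast as the L1 cache and much faster than both local memory and global memory (physically, local memory and global memory live in the same DRAM).
When programming a GPU, you create a set of thread blocks; each thread maps to a single core, and each block maps to a streaming multiprocessor (SM).
Each thread is identified by threadIdx and blockIdx; in practice these indices can be multi-dimensional.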
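
As an illustration of multi-dimensional indexing, the sketch below has every thread derive its global (row, col) position from `blockIdx`, `blockDim`, and `threadIdx`. The kernel and helper names (`write_global_index`, `count_correct_indices`) and the 8 x 8 block shape are assumptions chosen for this example.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each thread computes its global (row, col) from its block and thread
// indices and stores the flattened index at that position.
__global__ void write_global_index(int *out, int width, int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < height && col < width) {  // guard the partially filled edge blocks
        out[row * width + col] = row * width + col;
    }
}

// Launches one thread per element of a width x height array and returns how
// many elements hold their own flattened index (-1 if allocation fails).
int count_correct_indices(int width, int height) {
    int n = width * height;
    int *d_out = nullptr;
    if (cudaMalloc(&d_out, n * sizeof(int)) != cudaSuccess) return -1;
    cudaMemset(d_out, 0xFF, n * sizeof(int));  // every int becomes -1

    dim3 block(8, 8);  // 64 threads per block, indexed in 2D
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);  // round up to cover the array
    write_global_index<<<grid, block>>>(d_out, width, height);

    std::vector<int> host(n);
    cudaMemcpy(host.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);

    int ok = 0;
    for (int i = 0; i < n; ++i) ok += (host[i] == i);
    return ok;
}

int main() {
    int ok = count_correct_indices(100, 60);
    std::printf("%d of %d indices correct\n", ok, 100 * 60);
    return ok == 100 * 60 ? 0 : 1;
}
```

The array size (100 x 60) is deliberately not a multiple of the block shape, so the bounds check in the kernel is exercised.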

Shared Memory Optimization

Take matrix multiplication as the example, with $A \in \mathbb{R}^{1024 \times 1024}$ and $B \in \mathbb{R}^{1024 \times 1024}$.
- The threads of one block can share memory, so the data used by the threads of a block can be reorganized so that as little data as possible has to be cached in shared memory
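- Before optimization:
  - Each thread computes an 8 * 8 tile of the output matrix and has to read $8 * 8 * 1024 * 2 = 2^{17}$ elements from global memory, where A and B reside
  - The 8 * 8 threads of a block share no data with each other, so the block reads $8 * 8 * 8 * 8 * 1024 * 2 = 2^{23}$ matrix elements from global memory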
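- After optimization:
  - To compute a 64 * 64 tile of the output matrix, a block needs at least $64 * 1024 * 2 = 2^{17}$ elements (64 rows of A and 64 columns of B), which can be loaded into shared memory ahead of the computation
  - Each thread then reads its operands from shared memory, $8 * 8 * 1024 * 2 = 2^{17}$ elements per thread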
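- Data read per block, before and after the optimization:
  - Before: $2^{23}$ matrix elements read from global memory
  - After: $2^{17}$ matrix elements read from global memory into shared memory, after which each thread reads its $2^{17}$ operands from shared memory. Global-memory traffic per block drops by a factor of $2^{23} / 2^{17} = 64$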
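
The two variants compared above can be sketched as a pair of kernels. This is one possible realization under stated assumptions, not the post's own code: the names `matmul_naive`, `matmul_shared`, and `max_abs_diff` are hypothetical, and the slice depth `BK = 8` is an assumed value. As 32-bit floats, the $2^{17}$ elements a block needs take 512 KB, which typically exceeds the shared memory available to one block, so the sketch stages them in slices along the K dimension. Each block still fetches $2^{17}$ elements from global memory in total.

```cuda
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int N = 1024;  // A, B, C are N x N, row-major
constexpr int BM = 64;   // each block computes a BM x BM tile of C
constexpr int TM = 8;    // each thread computes a TM x TM sub-tile
constexpr int BK = 8;    // K-slice depth staged per step (assumed value)

// Before: every thread reads all of its operands straight from global memory.
__global__ void matmul_naive(const float *A, const float *B, float *C) {
    int row0 = (blockIdx.y * blockDim.y + threadIdx.y) * TM;
    int col0 = (blockIdx.x * blockDim.x + threadIdx.x) * TM;
    for (int i = 0; i < TM; ++i)
        for (int j = 0; j < TM; ++j) {
            float sum = 0.0f;
            for (int k = 0; k < N; ++k)
                sum += A[(row0 + i) * N + k] * B[k * N + (col0 + j)];
            C[(row0 + i) * N + (col0 + j)] = sum;
        }
}

// After: the block stages slices of A and B in shared memory, so each
// element it needs is fetched from global memory once per block.
__global__ void matmul_shared(const float *A, const float *B, float *C) {
    __shared__ float As[BM][BK];
    __shared__ float Bs[BK][BM];
    int tx = threadIdx.x, ty = threadIdx.y;
    int tid = ty * blockDim.x + tx;
    int nthreads = blockDim.x * blockDim.y;
    int row0 = blockIdx.y * BM, col0 = blockIdx.x * BM;
    float acc[TM][TM] = {};

    for (int k0 = 0; k0 < N; k0 += BK) {
        // Cooperative load: the block's threads split the BM * BK elements.
        for (int i = tid; i < BM * BK; i += nthreads) {
            As[i / BK][i % BK] = A[(row0 + i / BK) * N + (k0 + i % BK)];
            Bs[i / BM][i % BM] = B[(k0 + i / BM) * N + (col0 + i % BM)];
        }
        __syncthreads();  // slices fully loaded before any thread reads them
        for (int k = 0; k < BK; ++k)
            for (int i = 0; i < TM; ++i)
                for (int j = 0; j < TM; ++j)
                    acc[i][j] += As[ty * TM + i][k] * Bs[k][tx * TM + j];
        __syncthreads();  // all threads done before the slices are overwritten
    }
    for (int i = 0; i < TM; ++i)
        for (int j = 0; j < TM; ++j)
            C[(row0 + ty * TM + i) * N + (col0 + tx * TM + j)] = acc[i][j];
}

// Runs both kernels on identical inputs and returns the largest absolute
// difference found (including one CPU spot check), or -1 if a CUDA call fails.
float max_abs_diff() {
    size_t bytes = size_t(N) * N * sizeof(float);
    std::vector<float> hA(N * N), hB(N * N), hC1(N * N), hC2(N * N);
    for (int i = 0; i < N * N; ++i) {
        hA[i] = (i % 7) * 0.25f;  // small multiples keep every float sum exact
        hB[i] = (i % 5) * 0.5f;
    }
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    if (cudaMalloc(&dA, bytes) != cudaSuccess ||
        cudaMalloc(&dB, bytes) != cudaSuccess ||
        cudaMalloc(&dC, bytes) != cudaSuccess) return -1.0f;
    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(BM / TM, BM / TM);  // 8 x 8 threads
    dim3 grid(N / BM, N / BM);     // 16 x 16 blocks
    matmul_naive<<<grid, block>>>(dA, dB, dC);
    cudaMemcpy(hC1.data(), dC, bytes, cudaMemcpyDeviceToHost);
    matmul_shared<<<grid, block>>>(dA, dB, dC);
    cudaMemcpy(hC2.data(), dC, bytes, cudaMemcpyDeviceToHost);
    bool failed = cudaGetLastError() != cudaSuccess ||
                  cudaDeviceSynchronize() != cudaSuccess;
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    if (failed) return -1.0f;

    // CPU spot check of one element guards against both kernels being wrong.
    float ref = 0.0f;
    for (int k = 0; k < N; ++k) ref += hA[513 * N + k] * hB[k * N + 77];
    float diff = std::fabs(hC1[513 * N + 77] - ref);
    for (int i = 0; i < N * N; ++i)
        diff = std::fmax(diff, std::fabs(hC1[i] - hC2[i]));
    return diff;
}

int main() {
    float diff = max_abs_diff();
    std::printf("max |naive - shared| = %g\n", diff);
    return (diff >= 0.0f && diff < 1e-3f) ? 0 : 1;
}
```

Both kernels use the same launch shape (16 x 16 blocks of 8 x 8 threads), so the only difference is where the operands come from. In `matmul_shared` each slice step loads $64 * 8 * 2 = 2^{10}$ elements per block, and $1024 / 8 = 128$ steps give the $2^{17}$ total from the comparison above.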