Miscellaneous
Extracting CUDA error information from a core dump
A cudaErrorIllegalAddress means the device encountered a load or store instruction on an invalid memory address. This leaves the process in an inconsistent state, and any further CUDA work will return the same error; to continue using CUDA, the process must be terminated and relaunched.
Moreover, when multiple threads use the CUDA API, cudaErrorIllegalAddress may first surface on a thread other than the one that launched the offending kernel. For that reason I do not trust the stack trace even with CUDA_LAUNCH_BLOCKING=1.
Generating core dumps
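The next section opens a dump at /tmp/cudacoredump.hostname.pid; a dump like that can be produced by enabling CUDA's exception coredumps before launching the program. A minimal sketch (these environment variables are the standard CUDA driver coredump controls; %h and %p expand to hostname and pid):

```shell
# Ask the CUDA driver to write a GPU core dump when a device
# exception (e.g. an illegal address) is hit.
export CUDA_ENABLE_COREDUMP_ON_EXCEPTION=1
# Optional: choose where the dump is written (%h = hostname, %p = pid).
export CUDA_COREDUMP_FILE=/tmp/cudacoredump.%h.%p
# Then run the failing program as usual, e.g.:
#   python train.py
```

Note that coredump generation adds overhead to the exception path, so it is best enabled only while reproducing the bug.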
Opening the core dump with cuda-gdb
$ /usr/local/cuda/bin/cuda-gdb
(cuda-gdb) target cudacore /tmp/cudacoredump.hostname.pid
Opening GPU coredump: /tmp/cudacoredump.hostname.pid
CUDA Exception: Warp Illegal Address
The exception was triggered at PC 0x7ff8b63ce440
[Current focus set to CUDA kernel 0, grid 132575338, block (1240,0,0), thread (0,1,0), device 0, sm 1, warp 62, lane 0]
#0 0x00007ff8b63ce570 in void (anonymous namespace)::softmax_warp_forward<c10::Half, c10::Half, float, 8, false, true>(c10::Half*, c10::Half const*, int, int, int, bool const*, int, bool)<<<(1824,1,1),(32,4,1)>>> ()
The PC of the instruction that triggered the Warp Illegal Address: "The exception was triggered at PC 0x7ff8b63ce440"
The name of the running kernel: softmax_warp_forward
The address at which execution stopped: 0x00007ff8b63ce570
Disassembling the kernel
(cuda-gdb) disas
...
0x00007ff8b63ce420 <+1056>: IADD3 R8, R6.reuse, 0xc0, RZ
0x00007ff8b63ce430 <+1072>: IADD3 R18, R6, 0xe0, RZ
0x00007ff8b63ce440 <+1088>: LDG.E.U8.SYS R19, [R2+0xe0]
0x00007ff8b63ce450 <+1104>: ISETP.GE.AND P3, PT, R8, R13, PT
...
The instruction at the faulting PC (0x7ff8b63ce440) is:
0x00007ff8b63ce440 <+1088>: LDG.E.U8.SYS R19, [R2+0xe0]
LDG.E.U8 is an 8-bit load from global memory, so the invalid address must come from R2 (plus the 0xe0 immediate offset).
Inspecting the registers
(cuda-gdb) info reg
R0 0xb8198 754072
R1 0xfffc80 16776320
R2 0xff800000 -8388608
R3 0xff800000 -8388608
R4 0xff800000 -8388608
R5 0x7ff8 32760
R6 0x0 0
R7 0x2 2
R8 0x407ce000 1081925632
...
Reading GPU memory
# read a void* from CUDA's global memory:
(cuda-gdb) print *(void * @global *)0x7ff841000000
# read an int from CUDA's global memory:
(cuda-gdb) print *(int @global *)0x7ff841000000
6
Recovering the arguments passed to the kernel
The disassembly reads one of the kernel's arguments out of constant memory:
0x00007ff8b63ce080 <+128>: IMAD R0, R3.reuse, c[0x0][0x174], R6
cuda-gdb can read the same location through the @parameter address space:
(cuda-gdb) print *(int @parameter *)0x174
152
__global__ void softmax_warp_forward(
output_t *dst,
const input_t *src,
int batch_size, int stride,
int element_count,
const bool *mask = nullptr,
const int head_chunk_size = -1, bool is_transformer_mask = false) {
...
struct Args { // offset
output_t *dst; // 0
const input_t *src; // 8
int batch_size; // 16
int stride; // 20
int element_count; // 24
// <4 bytes padding>
const bool *mask; // 32
const int head_chunk_size; // 40
bool is_transformer_mask; // 44
};
# stride: kernel parameters start at c[0x0][0x160], and stride sits at offset 20 in Args
(cuda-gdb) print *(int @parameter *) (0x160 + 20)
152