Debug往事（一）

Original zijie0 RandomGenerator 2022-07-20

收录于合集

最近工作中遇到了一个非常诡异的随机 segfault 问题，连着排查了几天都没有找到问题的根源，不过倒是又学习到了很多新知识。回想起之前查过的一些 bug，感觉也可以总结分享一些出来，顺便也向高手们学习一下更先进的 debug 技术。今天先来一篇比较简单的跟 lightgbm 相关的排错故事。

问题背景

在某个算法项目中，我们使用了大家耳熟能详的 lightgbm 来进行建模。由于当时公司的机器比较强劲，加上需要训练的模型数量比较多，我们使用了 Python 中的 multiprocessing 来进行多个模型的并发训练。

一开始一切都运行完美，模型 pipeline 用上并发之后效率提升了不少。直到有一天，我们开始构建多组数据下的回测场景，代码大致是这个样子：

for param in [param1, param2]: run_pipeline(param)

这个run_pipeline里大概逻辑是：

from multiprocessing import Pool

def run_pipeline(param): df = build_features(param) dfs = split(df) # 示意一下这里用了多进程来跑 lgb with Pool(4) as p: p.map(lgb_training, dfs)

比较奇怪的是，不管这个 param 怎么变化，甚至两次传相同的 param，都会导致这个run_pipeline第一次能执行成功，而第二次就会 hang 住，没有任何错误提示。

问题分析

经验判断

我们拿到一个问题，首先会用一些过往经验来“检索”类似的问题。比如之前碰到过使用 Python 的 multiprocessing 时，如果 child process 因为 segfault，OOM，open file 限制等问题异常退出，会导致主进程 hang 住（例如这个 issue^[1]），感觉就跟这个情况有点类似。

带着这个假设，我们先做了一些常规的检测，例如执行ulimit -c unlimited打开 core file 的生成限制，查看各类系统资源和/var/log/messages。但是复现了几次之后没有发现任何异常，也没有 core file 生成，暂时排除了子进程 crash 的可能。

查看 callstack

对于这种进程 hang 的问题，最常规的方法就是查看当前的 callstack 来判断程序卡在什么位置，尤其对于各种 IO 请求能很快定位到问题。我们执行pstack <pid>来查看进程状态，得到如下的信息：

Thread 21 (Thread 0x7f1dd77f6700 (LWP 386196)):#0 0x00007f1f1e8a5965 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

#1  0x00007f1e6ec9b4cb in __gthread_cond_wait (__mutex=<optimized out>, __cond=<optimized out>) at /xxx/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/build/build-cc-gcc-final/x86_64-conda_cos6-linux-gnu/libstdc++-v3/include/x86_64-conda_cos6-linux-gnu/bits/gthr-default.h:878

#2  std::condition_variable::wait (this=<optimized out>, __lock=...) at /xxx/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libstdc++-v3/src/c++11/condition_variable.cc:53

#3  0x00007f1eebbd48d2 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::{lambda()#1}> > >::_M_run() () from /data/miniconda3/envs/yyy/lib/python3.7/site-packages/pyarrow/libarrow.so.15

#4  0x00007f1eebfecc0f in execute_native_thread_routine () from /data/miniconda3/envs/yyy/lib/python3.7/site-packages/pyarrow/libarrow.so.15

#5 0x00007f1f1e8a1dd5 in start_thread () from /lib64/libpthread.so.0#6 0x00007f1f1e5cb02d in clone () from /lib64/libc.so.6

我们主要关注跟我们应用相关的调用（也就是 Python 环境），这里比较可疑的是pyarrow（在这个库上踩过不少坑……），我们利用它来读取各类 parquet 文件。所以这里就又有了一个假设，是不是pyarrow导致了我们的进程 hang 住呢？

查看一下pandas的文档，发现我们可以在读取写入 parquet 时选用不同的底层引擎。默认是pyarrow，可以选择切换到fastparquet。不过一个一个调用来改比较花费时间，这里我们用了一个小 trick，直接改site-packages下的pandas代码，将默认引擎换掉，就可以快速验证这个想法了。

不过在换成fastparquet以后，卡顿问题仍然存在。此时再看 callstack，也没有了pyarrow的身影，可见问题也不是出在它身上。

动态追踪

Callstack 是一个相对“静止”的排查方法，有时候需要人工获取多个 callstack 来比较和判断程序是不是状态在持续变化，会比较麻烦。所以我们又转向了动态追踪技术，先用strace来看下父进程和子进程的状态。这里没有保存当时的文件，只记录了一下当时看到各个进程都在做 futex_wait，而且是在等待同一个地址。

[pid 2509] 09:41:30.587691 futex(0x7f1a0c0e3d28, FUTEX_WAKE_PRIVATE, 1) = 0 <0.000124>[pid 2509] 09:41:30.588039 futex(0x7f1a0c0e3d78, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=49998530} <unfinished ...>[pid 2491] 09:41:30.617915 <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out) <1.000234>[pid 2491] 09:41:30.618109 futex(0x7f1a0c090c28, FUTEX_WAKE_PRIVATE, 1) = 0 <0.000121>[pid 2491] 09:41:30.618450 futex(0x7f1a0c090c78, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=999998440} <unfinished ...>[pid 2509] 09:41:30.638360 <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out) <0.050213>

怎么查这个地址是什么呢？一种办法是使用类似systemtap之类的动态追踪工具，或者使用gdb来查看这个进程的各种状态，内存空间等。

Systemtap 脚本示例：

#! /usr/bin/env stap

# This script tries to identify contended user-space locks by hooking# into the futex system call.

global FUTEX_WAIT = 0 /*, FUTEX_WAKE = 1 */global FUTEX_PRIVATE_FLAG = 128 /* linux 2.6.22+ */global FUTEX_CLOCK_REALTIME = 256 /* linux 2.6.29+ */

global lock_waits # long-lived stats on (tid,lock) blockage elapsed timeglobal process_names # long-lived pid-to-execname mapping

global entry_times%, uaddrs%

probe syscall.futex { if ((op & ~(FUTEX_PRIVATE_FLAG|FUTEX_CLOCK_REALTIME)) != FUTEX_WAIT) next entry_times[tid()] = gettimeofday_us() uaddrs[tid()] = futex_uaddr}

probe syscall.futex.return { if (!(entry_times[tid()])) next elapsed = gettimeofday_us() - entry_times[tid()] lock_waits[pid(), uaddrs[tid()]] <<< elapsed delete entry_times[tid()] delete uaddrs[tid()] if (!(pid() in process_names)) process_names[pid()] = execname()}

probe end { foreach ([pid+, lock] in lock_waits) printf ("%s[%d] lock %p contended %d times, %d avg us\n", process_names[pid], pid, lock, @count(lock_waits[pid,lock]), @avg(lock_waits[pid,lock]))}

不过这两个方法都有不小的成本，例如需要各种对应的 devel，debug symbol 包的安装等。另外个人感觉 futex 相关操作比较底层，如果没有执行程序的源码好像很难判断进程到底在等待什么。

意外发现

在等待 kernel debug symbol 的下载过程中，我们突然注意到每次strace退出时，都会提示从 8 个 process detach 出来。转念一想，为啥每次都正好是 8 个呢？

<detached ...>strace: Process 17826 detachedstrace: Process 17827 detachedstrace: Process 17828 detachedstrace: Process 17829 detachedstrace: Process 17830 detachedstrace: Process 17831 detachedstrace: Process 17832 detachedstrace: Process 17833 detached

想了想程序中设定了多进程/线程来执行的地方，感觉最有可能的就是我们在 lgb 训练时设定了 num_threads 为 8，数量正好对上！这时候排查的思路就转化为 2 个方向：

搜索一下 lgb 训练 hang 的可能原因，比如 github issue，stackoverflow 等
进一步收缩排查范围，看是 lgb 多线程的问题，还是 Python 多进程的问题

经过一番简单的搜索，很快就定位到这个问题来源于 lgb 使用的OpenMP与 multiprocessing 中默认使用的 fork 之间的冲突，例如这个 stackoverflow 的问题^[2]，还有这个 git issue^[3]。

验证这个问题也很简单，把 num_threads 设置为 1 再执行一下相同的代码，发现就能顺利运行了！

解决方案

为了解决这个问题，我们也可以有几个思路：

编译/安装 lgb 时，可以选择不用 OpenMP: pip install lightgbm --install-option=--nomp ，不过看起来这样就成了单线程训练了，官方文档上也写着"strongly not recommended"！
编译 lgb 时，不用 OpenMP，改用 Intel 的 ICC，不明确具体操作方法，而且风险还是比较大……
修改 multiprocessing 启动子进程的方式，默认是 fork，另外有两种可以选的，spawn 和 forkserver，看起来都可以绕过这个问题。用法很简单：

from multiprocessing import get_context

with get_context('forkserver').Pool(2) as pool: res = pool.starmap(run, [1, 2])

这样就可以了！各种启动多进程的方式可以具体看 Python 标准库中的相关代码 multiprocessing/context.py 中。例如 fork 的实现：

def _launch(self, process_obj): code = 1 parent_r, child_w = os.pipe() self.pid = os.fork() if self.pid == 0: try: os.close(parent_r) code = process_obj._bootstrap() finally: os._exit(code) else: os.close(child_w) self.finalizer = util.Finalize(self, os.close, (parent_r,)) self.sentinel = parent_r

spawn 的实现：

def _launch(self, process_obj): from . import semaphore_tracker tracker_fd = semaphore_tracker.getfd() self._fds.append(tracker_fd) prep_data = spawn.get_preparation_data(process_obj._name) fp = io.BytesIO() set_spawning_popen(self) try: reduction.dump(prep_data, fp) reduction.dump(process_obj, fp) finally: set_spawning_popen(None)

parent_r = child_w = child_r = parent_w = None try: parent_r, child_w = os.pipe() child_r, parent_w = os.pipe() cmd = spawn.get_command_line(tracker_fd=tracker_fd, pipe_handle=child_r) self._fds.extend([child_r, child_w]) self.pid = util.spawnv_passfds(spawn.get_executable(), cmd, self._fds) self.sentinel = parent_r with open(parent_w, 'wb', closefd=False) as f: f.write(fp.getbuffer()) finally: if parent_r is not None: self.finalizer = util.Finalize(self, os.close, (parent_r,)) for fd in (child_r, child_w, parent_w): if fd is not None: os.close(fd)

可以看到 fork 的实现里直接调用了 os.fork() ，实现简洁，性能上也比较好。但为啥还要有 spawn 呢？因为 Windows 上没有 fork 啊 :)

不过呢，fork 也是有它的优势的，例如我们在 Python 中常用的“单例模式”，就只能在 fork 模式下生效。

from multiprocessing import get_context

class Const: a = None

def run(x): if Const.a is None: return 0 else: return x

def main(): Const.a = 5 # Change context below will lead to different results with get_context('forkserver').Pool(2) as pool: res = pool.map(run, [1, 2]) print(res)

if __name__ == '__main__': main()

以上述代码为例，如果改了不同的 context（fork, forkserver, spawn），会获取不同的结果。只有使用 fork 形式时，子进程能正确的获取到父进程里设置的 Const.a 的值为 5，其它模式都不行。

总结

所以在这个 debug case 中，我们学习到了 2 点：

Python 中的 multiprocessing 与 OpenMP 是有冲突的，用了 OpenMP 的库比如 lightgbm, pytorch 等在使用中都需要注意。
Multiprocessing 可以选择不同的运行模式，但只有 fork 可以很好的继承父进程的状态，用到全局变量或者单例模式，或者使用 Windows 这类没有 fork 实现的操作系统时需要额外注意。

今天先聊到这里，下次有机会再聊聊之前处理过的一个性能问题。

参考资料

[1]

这个 issue: https://stackoverflow.com/questions/24370930/multiprocessing-pool-hangs-if-child-causes-a-segmentation-fault

[2]

stackoverflow 的问题: https://stackoverflow.com/questions/49049388/understanding-openmp-shortcomings-regarding-fork

[3]

git issue: https://github.com/Microsoft/LightGBM/issues/1789

从来就不缺傻子！

女高管与男下属上班约会开房，男方妻子闹到单位！被开除后她辩称：一直保持0.46-1.22米“个人距离”

周一004 意甲帕尔马VS卡利亚里【全网最强分析】今日继续拿捏主任！跟上吃肉！昨日推荐早场全收，今日8000倍直接做胆！

错过这轮牛市，等于2000年错过楼市！

入不敷出的成武县，每年给退休公务员发8亿养老金

Debug往事（一）

问题背景

问题分析

经验判断

查看 callstack

动态追踪

意外发现

解决方案

总结

参考资料

您可能也对以下帖子感兴趣

从来就不缺傻子！

女高管与男下属上班约会开房，男方妻子闹到单位！被开除后她辩称：一直保持0.46-1.22米“个人距离”

周一004 意甲 帕尔马VS卡利亚里 【全网最强分析】今日继续拿捏主任！跟上吃肉！昨日推荐早场全收，今日8000倍直接做胆！

错过这轮牛市，等于2000年错过楼市！

入不敷出的成武县，每年给退休公务员发8亿养老金

生成图片，分享到微信朋友圈

Debug往事（一）

问题背景

问题分析

经验判断

查看 callstack

动态追踪

意外发现

解决方案

总结

参考资料

您可能也对以下帖子感兴趣

周一004 意甲帕尔马VS卡利亚里【全网最强分析】今日继续拿捏主任！跟上吃肉！昨日推荐早场全收，今日8000倍直接做胆！