一例 centos7.6 内核 hardlock 的解析

Original 云平台 OPPO数智技术 2021-10-05

收录于话题 #云化底层的那些事 14个内容

点击关注“OPPO互联网技术”，阅读更多技术干货

本文以一个在centos7.6内核发生的crash，目前在红帽系列的内核还未见合入，如果线上遇到了可以参考本案例进行规避，同时也描述一下常见hardlock导致panic的解bug流程。

1. 故障现象

机器出现复位，收集到了crash文件，从日志中显示为hardlock。

KERNEL: /usr/lib/debug/lib/modules/3.10.0-957.27.2.el7.x86_64/vmlinuxDUMPFILE: vmcore [PARTIAL DUMP]CPUS: 48DATE: Wed Dec 11 23:25:55 2019UPTIME: 80 days, 12:06:06-----------------------crash之前运行时间LOAD AVERAGE: 67.57, 65.21, 53.49-----------死之前load 较高TASKS: 3169NODENAME:RELEASE: 3.10.0-957.27.2.el7.x86_64VERSION: #1 SMP Mon Jul 29 17:46:05 UTC 2019MACHINE: x86_64 (2300 Mhz)MEMORY: 382.5 GBPANIC: "Kernel panic - not syncing: Hard LOCKUP"--------配置了hardlock_panicPID: 0COMMAND: "swapper/34"TASK: ffff9f1a36998000 (1 of 48) [THREAD_INFO: ffff9f1a36994000]CPU: 34STATE: TASK_RUNNING (PANIC)

crash> btPID: 0 TASK: ffff9f1a36998000 CPU: 34 COMMAND: "swapper/34"#0 [ffff9f487e0489f0] machine_kexec at ffffffffae063b34#1 [ffff9f487e048a50] __crash_kexec at ffffffffae11e242#2 [ffff9f487e048b20] panic at ffffffffae75d85b#3 [ffff9f487e048ba0] nmi_panic at ffffffffae09859f#4 [ffff9f487e048bb0] watchdog_overflow_callback at ffffffffae14a881#5 [ffff9f487e048bc8] __perf_event_overflow at ffffffffae1a26b7#6 [ffff9f487e048c00] perf_event_overflow at ffffffffae1abd24#7 [ffff9f487e048c10] intel_pmu_handle_irq at ffffffffae00a850#8 [ffff9f487e048e38] perf_event_nmi_handler at ffffffffae76d031#9 [ffff9f487e048e58] nmi_handle at ffffffffae76e91c#10 [ffff9f487e048eb0] do_nmi at ffffffffae76eb3d#11 [ffff9f487e048ef0] end_repeat_nmi at ffffffffae76dd89 [exception RIP: tg_unthrottle_up+24]RIP: ffffffffae0dc4d8 RSP: ffff9f487e043e08 RFLAGS: 00000046RAX: ffff9f4a6a078400 RBX: ffff9f787be9ab80 RCX: ffff9f4a03ea9530RDX: 0000000000000005 RSI: ffff9f787be9ab80 RDI: ffff9f4a03ea9400------task_groupRBP: ffff9f487e043e08 R8: ffff9f782c78a100 R9: 0000000000000001R10: 0000000000004dcd R11: 0000000000000005 R12: ffff9f78790edc00R13: ffffffffae0dc4c0 R14: 0000000000000000 R15: ffff9f4a03ea9400ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0000--- <NMI exception stack> ---#12 [ffff9f487e043e08] tg_unthrottle_up at ffffffffae0dc4d8#13 [ffff9f487e043e10] walk_tg_tree_from at ffffffffae0d3b20#14 [ffff9f487e043e60] unthrottle_cfs_rq at ffffffffae0e46e7#15 [ffff9f487e043e98] distribute_cfs_runtime at ffffffffae0e496a#16 [ffff9f487e043ee8] sched_cfs_period_timer at ffffffffae0e4b67#17 [ffff9f487e043f20] __hrtimer_run_queues at ffffffffae0c71e3#18 [ffff9f487e043f78] hrtimer_interrupt at ffffffffae0c776f#19 [ffff9f487e043fc0] local_apic_timer_interrupt at ffffffffae05a61b#20 [ffff9f487e043fd8] smp_apic_timer_interrupt at ffffffffae77b6e3#21 [ffff9f487e043ff0] apic_timer_interrupt at ffffffffae777df2--- <IRQ stack> ---#22 [ffff9f1a36997db8] apic_timer_interrupt at ffffffffae777df2 [exception RIP: cpuidle_enter_state+87]RIP: ffffffffae5b06c7 RSP: ffff9f1a36997e60 RFLAGS: 00000202RAX: 0018b604313aebda RBX: ffff9f1a36997e38 RCX: 0000000000000018RDX: 0000000225c17d03 RSI: ffff9f1a36997fd8 RDI: 0018b604313aebdaRBP: ffff9f1a36997e88 R8: 0000000000005a0e R9: 0000000000000018R10: 0000000000004dcd R11: 0000000000000005 R12: ffffffffae7699bcR13: ffffffffae7699c8 R14: ffffffffae7699bc R15: ffffffffae7699c8ORIG_RAX: ffffffffffffff10 CS: 0010 SS: 0000#23 [ffff9f1a36997e90] cpuidle_idle_call at ffffffffae5b081e#24 [ffff9f1a36997ed0] arch_cpu_idle at ffffffffae0366de#25 [ffff9f1a36997ee0] cpu_startup_entry at ffffffffae0fd7ba#26 [ffff9f1a36997f28] start_secondary at ffffffffae0580d7#27 [ffff9f1a36997f50] start_cpu at ffffffffae0000d5

2. 分析过程

2.1 可能存在问题的原因

我们知道，hardlock一般有两种，一种是关中断时间过长，超过了阈值，系统通过NMI发送来收集信息。另外一种就是类似于这里的案例（https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/kernel/watchdog.c?id=7edaeb6841dfb27e362288ab8466ebdc4972e867），主要是watchdog本身的机制导致的bug，当然这种案例比较极端少见。

其中以第一种案例比较常见。而常见的关中断时间长，也有多种情况，比如长时间抢spinlock抢不到，比如只是代码bug导致运行时间过长。本文属于第二种情况。

2.2 具体原因的分析

/** High resolution timer interrupt* Called with interrupts disabled*/void hrtimer_interrupt(struct clock_event_device *dev)

既然是时间长，就要看一下这个阈值是多少：

crash> p watchdog_threshwatchdog_thresh = $1 = 60

默认是10s，说明这个环境上有人因为hardlock的问题加大过这个阈值。

反汇编对应的代码，分析如下：

crash> task_group ffff9f4a03ea9400struct task_group {css = {cgroup = 0xffff9f4a57564000,

crash> cgroup.sibling 0xffff9f4a57564000sibling = {next = 0xffff9f4a84df3c10,prev = 0xffff9f1b5b435a10}crash> list 0xffff9f4a84df3c10 |wc -l4460

说明该task_group有4459个兄弟，也就是同一等级的task_group 有这么多，4459比 4460少一个是因为要减去父cgroup的作为链表串接头的计数。

我们知道，cfs在支持组调度之后，每个task_group创建的两个定时器，一个是周期性定时器，也就是堆栈中的sched_cfs_period_timer ，另外一个是 slack_timer,用来归还时间给总池子的。

因为hrtimer是关中断运行的，所以需要解决堆栈中这个定时器为什么会运行这么长时间。

经过分析代码，有两种可能，一种如下：

static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer) {struct cfs_bandwidth *cfs_b = container_of(timer, struct cfs_bandwidth, period_timer);ktime_t now;int overrun;int idle = 0;

raw_spin_lock(&cfs_b->lock);for (;;) {-----------一直循环 now = hrtimer_cb_get_time(timer);//其实就是ktime_get，也就是当前时间 overrun = hrtimer_forward(timer, now, cfs_b->period);

if (!overrun)-------假设overrun一直非0则不退出 break;

idle = do_sched_cfs_period_timer(cfs_b, overrun);}raw_spin_unlock(&cfs_b->lock);

return idle ? HRTIMER_NORESTART : HRTIMER_RESTART; }

这是一种可能，还有一种可能如下：

static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun){

while (throttled && cfs_b->runtime > 0 && !cfs_b->distribute_running) {////当前还是被throttled并且还有时间没有分发完毕,并且没有并发分发

runtime = cfs_b->runtime;//剩余的时间继续分发

cfs_b->distribute_running = 1;//设置正在分发的标志 raw_spin_unlock(&cfs_b->lock); /* we can't nest cfs_b->lock while distributing bandwidth */ runtime = distribute_cfs_runtime(cfs_b, runtime, runtime_expires); raw_spin_lock(&cfs_b->lock);

cfs_b->distribute_running = 0; throttled = !list_empty(&cfs_b->throttled_cfs_rq);//判断当前是否被throttled

cfs_b->runtime -= min(runtime, cfs_b->runtime);//防止变成负数}

}

因为第二个循环受限于一种条件，也就是当前被限流的task_group不为空，同时还有时间没有分发完毕，同时没有在并发分发，否则则会break。

我们先来分析第二种可能，先看看当前被限流的task_group链表是否为空，这个是串接在父task_group中的。

crash> task_group.parent ffff9f4a03ea9400parent = 0xffff9f78790edc00

crash> task_group.cfs_bandwidth.throttled_cfs_rq 0xffff9f78790edc00cfs_bandwidth.throttled_cfs_rq = {next = 0xffff9f782c78c900,prev = 0xffff9f48362bd900},crash> list 0xffff9f782c78c900 |wc -l46

除去链表头，说明还有45个task_group被限流，满足一个条件，那runtime呢？

crash> task_group.cfs_bandwidth.runtime 0xffff9f78790edc00cfs_bandwidth.runtime = 0

初步看起来好像不符合循环的条件，那有没有可能是前面一直循环，但是到crash的时候，这个值被修改为了0呢？

那就需要查看一下调用 distribute_cfs_runtime 的时候的 runtime是多少，runtime是第二个参数，也就是rsi，我们需要获取栈中的数据。

crash> dis -l distribute_cfs_runtime/usr/src/debug/kernel-3.10.0-957.27.2.el7/linux-3.10.0-957.27.2.el7.x86_64/kernel/sched/fair.c: 34870xffffffffae0e4860 <distribute_cfs_runtime>: nopl 0x0(%rax,%rax,1) [FTRACE NOP]0xffffffffae0e4865 <distribute_cfs_runtime+5>: push %rbp0xffffffffae0e4866 <distribute_cfs_runtime+6>: mov %rsp,%rbp

0xffffffffae0e4869 <distribute_cfs_runtime+9>: push %r15 0xffffffffae0e486b <distribute_cfs_runtime+11>: push %r14 0xffffffffae0e486d <distribute_cfs_runtime+13>: push %r13 0xffffffffae0e486f <distribute_cfs_runtime+15>: push %r12

0xffffffffae0e4871 <distribute_cfs_runtime+17>: push %rbx0xffffffffae0e4872 <distribute_cfs_runtime+18>: sub $0x18,%rsp

0xffffffffae0e4876 <distribute_cfs_runtime+22>: mov %rsi,-0x40(%rbp)----将初始剩余时间压栈，也就是栈中的000000003b9aca00,也就是十进制的 1000000000

#15 [ffff9f487e043e98] distribute_cfs_runtime at ffffffffae0e496affff9f487e043ea0: 000000003b9aca00 ffff9f782c78a100 ----我们的rsi被压栈在此ffff9f487e043eb0: 51afefcdce629c68 ffff9f78790edd80ffff9f487e043ec0: ffff9f78790edd48 ffff9f78790ede40ffff9f487e043ed0: 0018a72ed3fa2963 0000000000000001ffff9f487e043ee0: ffff9f487e043f18 ffffffffae0e4b67#16 [ffff9f487e043ee8] sched_cfs_period_timer at ffffffffae0e4b67ffff9f487e043ef0: ffff9f78790edd80 ffff9f487e055960ffff9f487e043f00: ffff9f487e0559a0 ffffffffae0e4a90ffff9f487e043f10: ffff9f487e055a98 ffff9f487e043f70ffff9f487e043f20: ffffffffae0c71e3

crash> pd 0x000000003b9aca00$2 = 1000000000

而带宽设置的时候，参数如下：

crash> cfs_bandwidth.distribute_running,runtime,period,quota crash> ffff9f78790edd48distribute_running = trueruntime = 0period = {tv64 = 23148000}quota = 1000000000----------------每个period补充的时间

所以说，能确定进入 distribute_cfs_runtime的时候，是第一次循环，因为rsi入参的值就是quota配置的值。也就是说，我们推断的第二种可能性被推翻了，因为假设是第二次循环且有限流，则不可能入参为1000000000。

不是 do_sched_cfs_period_timer 的while 循环导致了hardlock的检测，当然我们也不能排除一次while循环耗时很长的情况。而 do_sched_cfs_period_timer 的执行时间长短取决于 throttled_cfs_rq 链表的长短，目前crash的时候是有 45 个task_group 被限流。

有没有一种可能，很多个限流的task_group，导致处理很长时间，直到crash的时刻还剩下45个未处理完毕了？通过走读代码，这种概率是存在的。

但是我们认为非常小，因为我们当前对应的内核版本3.10.0.957内核，已经合入了 id=c06f04c70489b9deea3212af8375e2f0c2f0b184 这个补丁的。

哪怕最极端的情况，当出现 4459个 task_group被限流，需要 distribute_cfs_runtime 来解除限流，对应的时间消耗也达不到60s之多。

所以我们需要回到第一种可能性去，就是 sched_cfs_period_timer 出现了多次循环。

第一种可能需要出现循环的条件是：hrtimer_forward 返回非0值，也就是出现了overrun。

overrun就是当前时间减去timer->node.expires 为 cfs_b->period 的整数倍的次数，也就是相当于本来应该在overrun就是当前时间减去timer->timer->node.expires 时刻回调的定时器没有及时执行。

period_timer = {node = {node = {__rb_parent_color = 18446637938508750208, rb_right = 0x0, rb_left = 0x0 }, expires = {tv64 = 6955566201684000}}

2.3 复现

需要制造这样一种情况的条件就是：cfs_b→period比较小，另外待解除限流的进程足够多就行。下面我们就模仿一下这种情况：通过复现，我们来验证一下猜测。

第一步，创建一个耗cpu的进程

# cat caq.c

\#include <stdio.h>

int main(int argc,char* argv[]){int i=0;while(1)i++;

}

然后

gcc -g -o caq.o caq.c

这个caq.o下面会用到。

#!/bin/bash

mkdir -p /sys/fs/cgroup/cpu/user.slice/1

echo 1000> /sys/fs/cgroup/cpu/user.slice/1/cpu.cfs_period_us

#因为我们有60个核，就设置54个吧

echo 54000> /sys/fs/cgroup/cpu/user.slice/1/cpu.cfs_quota_us

for i in {1..2000}domkdir -p /sys/fs/cgroup/cpu/user.slice/1/$itemp="_caq_$i"echo $temp./caq.o "$temp" &pid=$(ps -ef |grep -i $temp|grep -v grep |awk '{print $2}') echo $pid >/sys/fs/cgroup/cpu/user.slice/1/$i/cgroup.procsdone

在一个目录下创建2000个cg，很快我们就触发了crash。

如果你在线上执行了上述操作，你可以更新自己的简历了。

3. 结论

那么怎么解决这个问题呢？commitid=2e8e19226，看似把问题解决了，如下：

for (;;) {@@ -4899,6 +4902,28 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer) if (!overrun) break; + if (++count > 3) {+ u64 new, old = ktime_to_ns(cfs_b->period);++ new = (old * 147) / 128; /* ~115% */+ new = min(new, max_cfs_quota_period);++ cfs_b->period = ns_to_ktime(new);++ /* since max is 1s, this is limited to 1e9^2, which fits in u64 */+ cfs_b->quota *= new;+ cfs_b->quota /= old;++ pr_warn_ratelimited(+ "cfs_period_timer[cpu%d]: period too short, scaling up (new cfs_period_us %lld, cfs_quota_us = %lld)\n",+ smp_processor_id(),+ new/NSEC_PER_USEC,+ cfs_b->quota/NSEC_PER_USEC);++ /* reset count so we don't come right back in here */+ count = 0;+ }+

这个补丁通过放大period来尝试解决这个问题，但是如果仔细分析的话，也不一定能解决，比如你变态地创建更多cpu消耗型的cg，period升级也可能会触发crash，毕竟分发时间的时候，中间有一把spinlock的时间是不可控的。

当然比改之前确实概率降低很多，或者干脆touch_nmi_watchdog一下？这个就留给喜欢和社区打交道的同学了。

从业务的角度说，尽量不要在一个目录下创建那么多耗cpu的cg，同时将period尽量放大一些比较好。

作者 ：Anqing

招聘信息

OPPO互联网云平台团队招聘一大波岗位，涵盖Java、容器、Linux内核开发、产品经理、项目经理等多个方向，请在公众号后台回复关键词“云招聘”查看查详细信息。

你可能还喜欢

OPPO 实时数仓揭秘：从顶层设计实现离线与实时的平滑迁移

OPPO自研ESA DataFlow架构与实践

客户端IPV6迁移适配——连接竞速算法Happy Eyeballs探索实践（一）

OPPO异地多活实践——缓存篇

OPPO自研代码审查系统火眼Code Review实践

更多技术干货

扫码关注

OPPO互联网技术

我就知道你“在看”

: ， . Video Mini Program Like ，轻点两下取消赞 Wow ，轻点两下取消在看

《鱿鱼游戏2》今天下午四点开播，网友无心上班了，导演悄悄剧透

刘恺威近况曝光，父亲刘丹证实已分手，目前失业在家，没有资源

紧急通告！三高的“克星”终于被找到了！！不是吃素和控糖,而是多喝它....

话费充值活动来了：95元充值100元电话费！

跟着南通住建局学“朝令夕改”

一例 centos7.6 内核 hardlock 的解析

1. 故障现象

2. 分析过程

2.1 可能存在问题的原因

2.2 具体原因的分析

3. 结论

您可能也对以下帖子感兴趣

《鱿鱼游戏2》今天下午四点开播，网友无心上班了，导演悄悄剧透

刘恺威近况曝光，父亲刘丹证实已分手，目前失业在家，没有资源

紧急通告！三高的“克星”终于被找到了！！不是吃素和控糖,而是多喝它....

话费充值活动来了：95元充值100元电话费！

跟着南通住建局学“朝令夕改”

生成图片，分享到微信朋友圈

一例 centos7.6 内核 hardlock 的解析

1. 故障现象

2. 分析过程

2.1 可能存在问题的原因

2.2 具体原因的分析

3. 结论

您可能也对以下帖子感兴趣