
An analysis of a jbd2 deadlock

Background: this was reproduced on a CentOS 7 environment, kernel version 3.10.0-957.27.2.el7.

Below is how we tracked down and resolved the problem.

1. Failure symptoms

The OPPO cloud kernel team received a crash dump, collected by our ops colleagues from a test environment: the load was extremely high and the machine was too stuck to operate:

      KERNEL: /usr/lib/debug/lib/modules/3.10.0-957.27.2.el7.x86_64/vmlinux
    DUMPFILE: vmcore  [PARTIAL DUMP]
        CPUS: 40
        DATE: Mon Apr 19 15:46:07 2021    ---- date the crash dump was collected
      UPTIME: 312 days, 00:49:56
LOAD AVERAGE: 72886.67, 72873.82, 72735.93

After being granted access we logged into the OPPO cloud host; the earliest hung-task messages in the log read:

Apr 16 19:46:22 [ERR] : - [26652536.550311] INFO: task jbd2/dm-12-8:547025 blocked for more than 120 seconds.
Apr 16 19:46:22 [ERR] : - [26652536.550342] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 16 19:46:22 [WARNING] : - [26652536.550372] Call Trace:
Apr 16 19:46:22 [WARNING] : - [26652536.550380] [<ffffffff8a76ae39>] schedule_preempt_disabled+0x29/0x70
Apr 16 19:46:22 [WARNING] : - [26652536.550383] [<ffffffff8a768db7>] __mutex_lock_slowpath+0xc7/0x1d0
Apr 16 19:46:22 [WARNING] : - [26652536.550385] [<ffffffff8a76819f>] mutex_lock+0x1f/0x2f
Apr 16 19:46:22 [WARNING] : - [26652536.550405] [<ffffffffc03f17b8>] jbd2_update_log_tail+0x28/0x60 [jbd2]
Apr 16 19:46:22 [WARNING] : - [26652536.550410] [<ffffffffc03ea5dd>] jbd2_journal_commit_transaction+0x155d/0x19b0 [jbd2]
Apr 16 19:46:22 [WARNING] : - [26652536.550417] [<ffffffffc03efe89>] kjournald2+0xc9/0x260 [jbd2]    --- the journal thread's main function
Apr 16 19:46:22 [WARNING] : - [26652536.550423] [<ffffffff8a0c3f50>] ? wake_up_atomic_t+0x30/0x30
Apr 16 19:46:22 [WARNING] : - [26652536.550428] [<ffffffffc03efdc0>] ? commit_timeout+0x10/0x10 [jbd2]
Apr 16 19:46:22 [WARNING] : - [26652536.550430] [<ffffffff8a0c2e81>] kthread+0xd1/0xe0
Apr 16 19:46:22 [WARNING] : - [26652536.550433] [<ffffffff8a0c2db0>] ? insert_kthread_work+0x40/0x40
Apr 16 19:46:22 [WARNING] : - [26652536.550437] [<ffffffff8a776c1d>] ret_from_fork_nospec_begin+0x7/0x21
Apr 16 19:46:22 [WARNING] : - [26652536.550439] [<ffffffff8a0c2db0>] ? insert_kthread_work+0x40/0x40
Apr 16 19:46:22 [ERR] : - [26652536.550467] INFO: task java:12610 blocked for more than 120 seconds.

2. Analyzing the symptoms

Since the earliest process reported hung was jbd2/dm-12-8:547025, we start by looking at that process's stack.

crash> bt 547025
PID: 547025  TASK: ffff91792aff1040  CPU: 16  COMMAND: "jbd2/dm-12-8"
 #0 [ffff91a907247b68] __schedule at ffffffff8a769a72
 #1 [ffff91a907247bf8] schedule_preempt_disabled at ffffffff8a76ae39
 #2 [ffff91a907247c08] __mutex_lock_slowpath at ffffffff8a768db7
 #3 [ffff91a907247c60] mutex_lock at ffffffff8a76819f
 #4 [ffff91a907247c78] jbd2_update_log_tail at ffffffffc03f17b8 [jbd2]
 #5 [ffff91a907247ca8] jbd2_journal_commit_transaction at ffffffffc03ea5dd [jbd2]
 #6 [ffff91a907247e48] kjournald2 at ffffffffc03efe89 [jbd2]
 #7 [ffff91a907247ec8] kthread at ffffffff8a0c2e81
 #8 [ffff91a907247f50] ret_from_fork_nospec_begin at ffffffff8a776c1d

crash> ps -m 547025
[ 2 19:52:41.384] [UN]  PID: 547025  TASK: ffff91792aff1040  CPU: 16  COMMAND: "jbd2/dm-12-8"

From the ps -m output above, the process was last scheduled [ 2 19:52:41.384] ago, i.e. 2 days 19:52:41, roughly 67 hours and 52 minutes before the dump was taken.

Subtracting the hung-report time from the dump time shows that this process was never scheduled again after it was reported hung. Next we need to see where it is stuck. The code is as follows:

void jbd2_update_log_tail(journal_t *journal, tid_t tid, unsigned long block)
{
        mutex_lock(&journal->j_checkpoint_mutex);       /* caq: must take the lock here */
        if (tid_gt(tid, journal->j_tail_sequence))
                __jbd2_update_log_tail(journal, tid, block);
        mutex_unlock(&journal->j_checkpoint_mutex);
}

From the stack and the code we can confirm that it is waiting for journal->j_checkpoint_mutex.

Next, work out where the mutex pointer ends up relative to the stack:

crash> dis -l jbd2_update_log_tail | grep mutex_lock
0xffffffffc03f17b3 <jbd2_update_log_tail+35>:   callq  0xffffffff8a768180 <mutex_lock>
crash> dis -l jbd2_update_log_tail | grep mutex_lock -B 10
0xffffffffc03f1796 <jbd2_update_log_tail+6>:    mov    %rsp,%rbp
0xffffffffc03f1799 <jbd2_update_log_tail+9>:    push   %r14
0xffffffffc03f179b <jbd2_update_log_tail+11>:   mov    %rdx,%r14
0xffffffffc03f179e <jbd2_update_log_tail+14>:   push   %r13
0xffffffffc03f17a0 <jbd2_update_log_tail+16>:   mov    %esi,%r13d
0xffffffffc03f17a3 <jbd2_update_log_tail+19>:   push   %r12
0xffffffffc03f17a5 <jbd2_update_log_tail+21>:   lea    0xf0(%rdi),%r12
0xffffffffc03f17ac <jbd2_update_log_tail+28>:   push   %rbx
0xffffffffc03f17ad <jbd2_update_log_tail+29>:   mov    %rdi,%rbx
0xffffffffc03f17b0 <jbd2_update_log_tail+32>:   mov    %r12,%rdi   --- on x86-64 the first argument is passed in %rdi, so %r12 holds the mutex pointer
0xffffffffc03f17b3 <jbd2_update_log_tail+35>:   callq  0xffffffff8a768180 <mutex_lock>

Then check that nothing inside mutex_lock itself touches the %r12 register:

dis -l mutex_lock | grep r12    --- no output, so nothing in this function touches %r12

Continue further down the call chain:

crash> dis -l __mutex_lock_slowpath | head -12
/usr/src/debug/kernel-3.10.0-957.27.2.el7/linux-3.10.0-957.27.2.el7.x86_64/kernel/mutex.c: 771
0xffffffff8a768cf0 <__mutex_lock_slowpath>:     nopl   0x0(%rax,%rax,1) [FTRACE NOP]
0xffffffff8a768cf5 <__mutex_lock_slowpath+5>:   push   %rbp
0xffffffff8a768cf6 <__mutex_lock_slowpath+6>:   mov    %rsp,%rbp
0xffffffff8a768cf9 <__mutex_lock_slowpath+9>:   push   %r15
0xffffffff8a768cfb <__mutex_lock_slowpath+11>:  push   %r14
0xffffffff8a768cfd <__mutex_lock_slowpath+13>:  push   %r13
/usr/src/debug/kernel-3.10.0-957.27.2.el7/linux-3.10.0-957.27.2.el7.x86_64/arch/x86/include/asm/current.h: 14
0xffffffff8a768cff <__mutex_lock_slowpath+15>:  mov    %gs:0x10e80,%r13
/usr/src/debug/kernel-3.10.0-957.27.2.el7/linux-3.10.0-957.27.2.el7.x86_64/kernel/mutex.c: 771
0xffffffff8a768d08 <__mutex_lock_slowpath+24>:  push   %r12   ---- %r12 is pushed onto the stack here; we can read the mutex pointer back from this slot
0xffffffff8a768d0a <__mutex_lock_slowpath+26>:  push   %rbx

Based on the analysis above, the mutex recovered from the stack is:

crash> struct mutex ffff91793cbbe0f0    ---- taken from the stack; from the source context this is journal_t.j_checkpoint_mutex
struct mutex {
  count = {
    counter = -9
  },
  wait_lock = {
    {
      rlock = {
        raw_lock = {
          val = {
            counter = 0
          }
        }
      }
    }
  },
  wait_list = {
    next = 0xffff91a907247c10,
    prev = 0xffff91612cb139c0
  },
  owner = 0xffff9178e0ad1040,    ---- the owning task
  {
    osq = {
      tail = {
        counter = 0
      }
    },
    __UNIQUE_ID_rh_kabi_hide1 = {
      spin_mlock = 0x0
    },
    {<No data fields>}
  }
}

Next, following the mutex's owner, we look at why the owning process has held the lock for so long without releasing it.

crash> bt 0xffff9178e0ad1040    ---- the owning process is PID 13027
PID: 13027  TASK: ffff9178e0ad1040  CPU: 10  COMMAND: "java"
 #0 [ffff91791321f828] __schedule at ffffffff8a769a72
 #1 [ffff91791321f8b8] schedule at ffffffff8a769f19
 #2 [ffff91791321f8c8] jbd2_log_wait_commit at ffffffffc03ef7c5 [jbd2]
 #3 [ffff91791321f940] jbd2_log_do_checkpoint at ffffffffc03ec0bf [jbd2]
 #4 [ffff91791321f9a8] __jbd2_log_wait_for_space at ffffffffc03ec4bf [jbd2]   --- holds j_checkpoint_mutex
 #5 [ffff91791321f9f0] add_transaction_credits at ffffffffc03e63d3 [jbd2]
 #6 [ffff91791321fa50] start_this_handle at ffffffffc03e65e1 [jbd2]
 #7 [ffff91791321fae8] jbd2__journal_start at ffffffffc03e6a93 [jbd2]
 #8 [ffff91791321fb30] __ext4_journal_start_sb at ffffffffc05abc69 [ext4]
 #9 [ffff91791321fb70] ext4_da_write_begin at ffffffffc057e099 [ext4]
#10 [ffff91791321fbf8] generic_file_buffered_write at ffffffff8a1b73a4
#11 [ffff91791321fcc0] __generic_file_aio_write at ffffffff8a1b9ce2
#12 [ffff91791321fd40] generic_file_aio_write at ffffffff8a1b9f59
#13 [ffff91791321fd80] ext4_file_write at ffffffffc0573322 [ext4]
#14 [ffff91791321fdf0] do_sync_write at ffffffff8a241c13
#15 [ffff91791321fec8] vfs_write at ffffffff8a242700
#16 [ffff91791321ff08] sys_write at ffffffff8a24351f
#17 [ffff91791321ff50] system_call_fastpath at ffffffff8a776ddb
    RIP: 00007fd9b3b0569d  RSP: 00007fd8b1f6d680  RFLAGS: 00000293
    RAX: 0000000000000001  RBX: 00007fd8b1f6d820  RCX: 0000000000000010
    RDX: 0000000000000046  RSI: 00007fd8b1f6b7c0  RDI: 00000000000000ee
    RBP: 00007fd8b1f6b790   R8: 00007fd8b1f6b7c0   R9: 00000005c5741848
    R10: 0000000000216546  R11: 0000000000000293  R12: 0000000000000046
    R13: 00007fd8b1f6b7c0  R14: 00000000000000ee  R15: 0000000000000046
    ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b

PID 13027 is also waiting, in the UN state; its last scheduling time is:

crash> ps -m 13027
[ 2 19:52:46.519] [UN]  PID: 13027   TASK: ffff9178e0ad1040  CPU: 10  COMMAND: "java"

Compare with the jbd2 process that was reported hung:

crash> ps -m 547025
[ 2 19:52:41.384] [UN]  PID: 547025  TASK: ffff91792aff1040  CPU: 16  COMMAND: "jbd2/dm-12-8"

Notice that 13027 has actually been in the UN state about 5 seconds longer than 547025. So why wasn't it the one that triggered the hung-task detection? We will come back to this question later.

Next we need to work out why 13027 is holding journal_t.j_checkpoint_mutex without releasing it promptly.

Looking at the code of __jbd2_log_wait_for_space:

void __jbd2_log_wait_for_space(journal_t *journal)
{
        ......
        while (jbd2_log_space_left(journal) < nblocks) {        /* caq: not enough space */
                write_unlock(&journal->j_state_lock);
                mutex_lock(&journal->j_checkpoint_mutex);

                /*
                 * Test again, another process may have checkpointed while we
                 * were waiting for the checkpoint lock. If there are no
                 * transactions ready to be checkpointed, try to recover
                 * journal space by calling cleanup_journal_tail(), and if
                 * that doesn't work, by waiting for the currently committing
                 * transaction to complete. If there is absolutely no way
                 * to make progress, this is either a BUG or corrupted
                 * filesystem, so abort the journal and leave a stack
                 * trace for forensic evidence.
                 */
                write_lock(&journal->j_state_lock);
                if (journal->j_flags & JBD2_ABORT) {
                        mutex_unlock(&journal->j_checkpoint_mutex);
                        return;
                }
                spin_lock(&journal->j_list_lock);
                nblocks = jbd2_space_needed(journal);
                space_left = jbd2_log_space_left(journal);      /* caq: re-read the figures */
                if (space_left < nblocks) {                     /* caq: still not enough space */
                        int chkpt = journal->j_checkpoint_transactions != NULL;
                        tid_t tid = 0;

                        if (journal->j_committing_transaction)
                                tid = journal->j_committing_transaction->t_tid;
                        spin_unlock(&journal->j_list_lock);
                        write_unlock(&journal->j_state_lock);
                        if (chkpt) {
                                jbd2_log_do_checkpoint(journal);        /* caq: j_checkpoint_mutex is held here */
        ......
}

So with j_checkpoint_mutex held, __jbd2_log_wait_for_space enters jbd2_log_do_checkpoint and waits for the log commit to complete; the wait only finishes once the tid being committed is no longer greater than j_commit_sequence.

int jbd2_log_wait_commit(journal_t *journal, tid_t tid)
{
        int err = 0;

        read_lock(&journal->j_state_lock);
#ifdef CONFIG_JBD2_DEBUG
        if (!tid_geq(journal->j_commit_request, tid)) {
                printk(KERN_ERR
                       "%s: error: j_commit_request=%d, tid=%d\n",
                       __func__, journal->j_commit_request, tid);
        }
#endif
        while (tid_gt(tid, journal->j_commit_sequence)) {       /* --- as long as tid is greater than j_commit_sequence */
                jbd_debug(1, "JBD2: want %d, j_commit_sequence=%d\n",
                          tid, journal->j_commit_sequence);
                read_unlock(&journal->j_state_lock);
                wake_up(&journal->j_wait_commit);
                wait_event(journal->j_wait_done_commit,
                           !tid_gt(tid, journal->j_commit_sequence));   /* -- it sleeps until tid is no longer greater than j_commit_sequence */
                read_lock(&journal->j_state_lock);
        }
        read_unlock(&journal->j_state_lock);

        if (unlikely(is_journal_aborted(journal)))
                err = -EIO;
        return err;
}

From the detailed data on the stack:

crash> bt -f 13027
PID: 13027  TASK: ffff9178e0ad1040  CPU: 10  COMMAND: "java"
 #0 [ffff91791321f828] __schedule at ffffffff8a769a72
    ffff91791321f830: 0000000000000082 ffff91791321ffd8
    ffff91791321f840: ffff91791321ffd8 ffff91791321ffd8
    ffff91791321f850: 000000000001ab80 ffff917a3761e180
    ffff91791321f860: ffff91791321f870 ffffffff8a75e2cb
    ffff91791321f870: ffff91791321f8f8 0000000065353fc1
    ffff91791321f880: 0000000000000246 ffff91793cbbe000
    ffff91791321f890: 000000006f19bbe9 ffff91793cbbe090
    ffff91791321f8a0: ffff91793cbbe028 ffff91791321f8f8
    ffff91791321f8b0: ffff91791321f8c0 ffffffff8a769f19
 #1 [ffff91791321f8b8] schedule at ffffffff8a769f19
    ffff91791321f8c0: ffff91791321f938 ffffffffc03ef7c5
 #2 [ffff91791321f8c8] jbd2_log_wait_commit at ffffffffc03ef7c5 [jbd2]
    ffff91791321f8d0: ffff9178e0ad1040 ffff91793cbbe0a8
    ffff91791321f8e0: 0000000000000000 ffff9178e0ad1040
    ffff91791321f8f0: ffffffff8a0c3f50 ffff91793cbbe098
    ffff91791321f900: ffff91793cbbe098 0000000065353fc1
    ffff91791321f910: ffff91a925027700 ffff91793cbbe000   --- saved %r12 is the journal_t
    ffff91791321f920: ffff91793cbbe3a0 ffff9186a0a3a1a0
    ffff91791321f930: 000000006f19bbe9 ffff91791321f9a0   --- saved %r15 is 000000006f19bbe9
    ffff91791321f940: ffffffffc03ec0bf
 #3 [ffff91791321f940] jbd2_log_do_checkpoint at ffffffffc03ec0bf [jbd2]

The current tid value, taken from the stack, is:

crash> p 0x6f19bbe9
$2 = 1863957481
crash> journal_t.j_commit_sequence ffff91793cbbe000
  j_commit_sequence = 1863957480

Anyone familiar with the jbd2 code can see the problem at this point: for j_commit_sequence to keep advancing and the wait to complete, the commit must be performed by the device's jbd2 thread, and that thread happens to be waiting for the j_checkpoint_mutex held by our 13027. The two are clearly deadlocked.

3. Reproducing and fixing the fault

There are two paths by which __jbd2_log_wait_for_space ends up calling jbd2_log_wait_commit. One is jbd2_log_do_checkpoint --> __process_buffer --> jbd2_log_wait_commit, and on this path j_checkpoint_mutex is still held.

On the other path, the `else if (tid)` branch of __jbd2_log_wait_for_space itself, j_checkpoint_mutex is released first and only then is jbd2_log_wait_commit entered.

void __jbd2_log_wait_for_space(journal_t *journal)
{
        int nblocks, space_left;
        /* assert_spin_locked(&journal->j_state_lock); */

        nblocks = jbd2_space_needed(journal);
        while (jbd2_log_space_left(journal) < nblocks) {        /* caq: not enough space */
                write_unlock(&journal->j_state_lock);
                mutex_lock(&journal->j_checkpoint_mutex);       /* --- takes the lock */

                /*
                 * Test again, another process may have checkpointed while we
                 * were waiting for the checkpoint lock. If there are no
                 * transactions ready to be checkpointed, try to recover
                 * journal space by calling cleanup_journal_tail(), and if
                 * that doesn't work, by waiting for the currently committing
                 * transaction to complete. If there is absolutely no way
                 * to make progress, this is either a BUG or corrupted
                 * filesystem, so abort the journal and leave a stack
                 * trace for forensic evidence.
                 */
                write_lock(&journal->j_state_lock);
                if (journal->j_flags & JBD2_ABORT) {
                        mutex_unlock(&journal->j_checkpoint_mutex);
                        return;
                }
                spin_lock(&journal->j_list_lock);
                nblocks = jbd2_space_needed(journal);
                space_left = jbd2_log_space_left(journal);      /* caq: re-read the figures */
                if (space_left < nblocks) {                     /* caq: still not enough space */
                        int chkpt = journal->j_checkpoint_transactions != NULL;
                        tid_t tid = 0;

                        if (journal->j_committing_transaction)
                                tid = journal->j_committing_transaction->t_tid;
                        spin_unlock(&journal->j_list_lock);
                        write_unlock(&journal->j_state_lock);
                        if (chkpt) {    /* ---- this branch keeps holding j_checkpoint_mutex */
                                jbd2_log_do_checkpoint(journal);        /* caq: note that j_checkpoint_mutex is held here */
                        } else if (jbd2_cleanup_journal_tail(journal) == 0) {
                                /* We were able to recover space; yay! */
                                ;
                        } else if (tid) {       /* --- this branch may likewise call jbd2_log_wait_commit, but properly releases the lock */
                                /*
                                 * jbd2_journal_commit_transaction() may want
                                 * to take the checkpoint_mutex if JBD2_FLUSHED
                                 * is set. So we need to temporarily drop it.
                                 */
                                mutex_unlock(&journal->j_checkpoint_mutex);     /* caq: this branch releases the lock here */
                                jbd2_log_wait_commit(journal, tid);
                                write_lock(&journal->j_state_lock);
                                continue;
                        } else {
                                printk(KERN_ERR "%s: needed %d blocks and "
                                       "only had %d space available\n",
                                       __func__, nblocks, space_left);
                                printk(KERN_ERR "%s: no way to get more "
                                       "journal space in %s\n",
                                       __func__, journal->j_devname);
                                WARN_ON(1);
                                jbd2_journal_abort(journal, 0);
                        }
                        write_lock(&journal->j_state_lock);
                } else {
                        spin_unlock(&journal->j_list_lock);
                }
                mutex_unlock(&journal->j_checkpoint_mutex);
        }
}

In the mainline Linux kernel, a patch from 2014 had already removed the convoluted locking flow inside __process_buffer:

[root@custom-16-126 jbd2]# git log -p be1158cc615fd723552f0d9912087423c7cadda5
commit be1158cc615fd723552f0d9912087423c7cadda5
Author: Theodore Ts'o <tytso@mit.edu>
Date:   Mon Sep 1 21:19:01 2014 -0400

    jbd2: fold __process_buffer() into jbd2_log_do_checkpoint()

    __process_buffer() is only called by jbd2_log_do_checkpoint(), and it
    had a very complex locking protocol where it would be called with the
    j_list_lock, and sometimes exit with the lock held (if the return
    code was 0), or release the lock.

    This was confusing both to humans and to smatch (which erronously
    complained that the lock was taken twice).

    Folding __process_buffer() to the caller allows us to simplify the
    control flow, making the resulting function easier to read and reason
    about, and dropping the compiled size of fs/jbd2/checkpoint.c by 150
    bytes (over 4% of the text size).

    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    Reviewed-by: Jan Kara <jack@suse.cz>

Does that mean refactoring __process_buffer as in the patch above is enough to fix our problem?

Actually, no: that refactor does not change the locking logic. The real fix needs to follow the patch below.

commit 53cf978457325d8fb2cdecd7981b31a8229e446e
Author: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Date:   Thu Jan 31 23:42:11 2019 -0500

    jbd2: fix deadlock while checkpoint thread waits commit thread to finish

    This issue was found when I tried to put checkpoint work in a separate
    thread, the deadlock below happened:

             Thread1                                    |   Thread2
    __jbd2_log_wait_for_space                           |
    jbd2_log_do_checkpoint (hold j_checkpoint_mutex)    |
      if (jh->b_transaction != NULL)                    |
        ...                                             |
        jbd2_log_start_commit(journal, tid);            | jbd2_update_log_tail
                                                        |   will lock j_checkpoint_mutex,
                                                        |   but will be blocked here.
        jbd2_log_wait_commit(journal, tid);             |
        wait_event(journal->j_wait_done_commit,         |
         !tid_gt(tid, journal->j_commit_sequence));     |
                                                   ...  | wake_up(j_wait_done_commit)
      }                                                 |

    then deadlock occurs, Thread1 will never be waken up.

    To fix this issue, drop j_checkpoint_mutex in jbd2_log_do_checkpoint()
    when we are going to wait for transaction commit.

Back to the small question left open earlier: why was jbd2/dm-12-8:547025 reported hung first, even though its last scheduling time is later than 13027's? This comes down to khungtaskd's detection mechanism: see the source of check_hung_uninterruptible_tasks, which walks the tasks in list order, so which blocked task gets reported first depends on where the task is linked in that list.

4. Avoiding or resolving the fault

Our solution: backport the corresponding patch into the current Red Hat 3.10.0-957.27.2 kernel, or, the lazy route, simply upgrade to kernel-3.10.0-1127.el7 or later.

About the author:

Anqing, technical lead for kernel & virtualization, OPPO hybrid cloud


☆ END ☆


