Production Case | A Corrupted Index
Preface
This afternoon a colleague came to me and said: "There is a SQL statement that returns in 1 ms when executed directly in the database, but the same statement issued by the business load test just keeps running and never returns a result. The backend is not blocked on any lock, yet its CPU usage is high."
Let's walk through this interesting case together ~
Symptoms
Looking at pg_stat_activity, we can see that the backend stays in the active state and has no wait event:
postgres=# select * from pg_stat_activity where pid = '105430';
-[ RECORD 1 ]----+-----------------------------------------------------------
datid | 17048
datname | hmlxxx
pid | 105430
usesysid | 17065
usename | hmlxxx
application_name |
client_addr | 30.69.xx
client_hostname | [null]
client_port | 44895
backend_start | 2022-08-05 14:09:00.692644+08
xact_start | 2022-08-05 14:19:00.989568+08
query_start | 2022-08-05 14:19:44.01897+08
state_change | 2022-08-05 14:19:44.018971+08
wait_event_type | [null]
wait_event | [null]
state | active
backend_xid | [null]
backend_xmin | 820266061
query | select xxx ... where lexusdipa0_id = $1
backend_type | client backend
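As a side note, here is a minimal sketch of a query for spotting long-running active backends together with their wait events; the columns are standard pg_stat_activity columns, and the 10-minute threshold is just an arbitrary value for illustration:
SELECT pid,
       now() - query_start AS runtime,   -- how long the current statement has been running
       wait_event_type,
       wait_event,
       state,
       query
FROM pg_stat_activity
WHERE state = 'active'
  AND now() - query_start > interval '10 minutes'
ORDER BY runtime DESC;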
The top output shows that this SQL has already been running for 36 minutes (36:12 of accumulated CPU time):
[postgres@xiongcc ~]$ top -p 105430 -b1
top - 17:11:57 up 641 days, 6:31, 5 users, load average: 4.38, 7.12, 8.06
Tasks: 1 total, 1 running, 1 sleeping, 0 stopped, 0 zombie
%Cpu(s): 8.0 us, 4.5 sy, 0.0 ni,81.6 id, 5.1 wa, 0.0 hi, 0.8 si, 0.0 st
...
...
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
105430 postgres 20 0 11.148g 116336 110660 R 13.3 0.0 36:12.43 postgres
Yet when I ran this SQL by hand in the database it returned in 1 ms, so what is causing such a difference?
Let's attach GDB and grab a stack trace:
#0 0x000000000071f5e3 in pg_atomic_read_u32_impl (ptr=0x2b7667e03e04) at ../../../../src/include/port/atomics/generic.h:48
#1 pg_atomic_read_u32 (ptr=0x2b7667e03e04) at ../../../../src/include/port/atomics.h:247
#2 LWLockAttemptLock (lock=lock@entry=0x2b7667e03e00, mode=mode@entry=LW_SHARED) at lwlock.c:750
#3 0x00000000007201b7 in LWLockAcquire (lock=lock@entry=0x2b7667e03e00, mode=mode@entry=LW_SHARED) at lwlock.c:1187
The top of the stack shows an attempt to acquire a lightweight lock (LWLock). A database keeps a great deal of its data structures in shared memory, and different processes read and write these structures frequently; lightweight locks are what protect them. An LWLock is a read-write lock with two modes, shared and exclusive:
Exclusive mode: used for data modifications such as INSERT, UPDATE or DELETE; it ensures the same resource is never updated by several backends at once.
Shared mode: used for reads such as SELECT; multiple transactions may read the same data concurrently, but no other transaction may modify it in the meantime.
typedef enum LWLockMode
{
LW_EXCLUSIVE, /* exclusive mode */
LW_SHARED, /* shared mode */
LW_WAIT_UNTIL_FREE /* A special mode used in PGPROC->lwWaitMode,
* when waiting for lock to become free. Not
* to be used as LWLockAcquire argument */
} LWLockMode;
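When a backend actually sleeps on an LWLock instead of spinning, pg_stat_activity reports it with wait_event_type = 'LWLock'; in this incident that column stayed NULL, which is itself a clue. A minimal sketch of such a check:
SELECT pid, wait_event_type, wait_event, state, query
FROM pg_stat_activity
WHERE wait_event_type = 'LWLock';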
In early versions, LWLocks were built on top of SpinLocks. As later releases kept optimizing this code path, it became clear that protecting a lightweight lock directly with a spinlock degrades performance, so nowadays LWLocks are protected lock-free with atomic primitives (TAS/CAS) instead.
As for the spinlock itself, there are two implementations: a machine-dependent one built on the TAS instruction (defined in s_lock.h and s_lock.c), and a machine-independent one built on PostgreSQL's own PGSemaphore (defined in spin.c).
LWLockAttemptLock(LWLock *lock, LWLockMode mode)
{...
old_state = pg_atomic_read_u32(&lock->state); // read the current value of the lock state
while (true)
{...
desired_state = old_state; // desired_state starts out as a copy of the old value
if (mode == LW_EXCLUSIVE)
{
lock_free = (old_state & LW_LOCK_MASK) == 0;
if (lock_free)
desired_state += LW_VAL_EXCLUSIVE; // lock is free: add the exclusive-lock flag on top of the old value
}
else
{
lock_free = (old_state & LW_VAL_EXCLUSIVE) == 0;
if (lock_free)
desired_state += LW_VAL_SHARED; // lock is free: add the shared-lock flag on top of the old value
}
// desired_state is the value to be installed in the lock word; pg_atomic_compare_exchange_u32() performs the compare-and-swap (the way the flag is set differs between hardware platforms)
if (pg_atomic_compare_exchange_u32(&lock->state, &old_state, desired_state))
{
...
}
}
...
}
From this quick look at the stack we can draw one conclusion: the database is taking lightweight locks on certain structures in shared memory.
Continuing the analysis, let's capture the system calls with strace:
futex(0x2b7667dfa6b8,FUTEX_WAIT,0,NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x2b7667dfa6b8,FUTEX_WAIT,0,NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x2b7667dfbd38,FUTEX_WAKE,1) = 0
futex(0x2b7667dfa6b8,FUTEX_WAIT,0,NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x2b7667dfa6b8,FUTEX_WAIT,0,NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x2b7667dfa6b8,FUTEX_WAIT,0,NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x2b7667dfbd38,FUTEX_WAKE,1) = 0
We can see the process endlessly waiting and waking, waiting and waking, over and over, as if trapped in an infinite loop: the spinlock has effectively deadlocked on itself. A spinlock has no wait queue and no deadlock detection; it must be released explicitly by its holder. So, judging by the symptoms, we seem to have run into a deadlock.
// uaddr points to a word in memory and val is the value expected there; the caller only waits while *uaddr == val
int futex_wait(int *uaddr, int val);
// wake up n processes suspended on the lock word pointed to by uaddr
int futex_wake(int *uaddr, int n);
Profiling the process with perf shows that the hot functions are PinBuffer, LWLockAttemptLock, LWLockRelease and the like, all of which involve spinlocks.
So far, though, we have only observed the surface symptoms without knowing the root cause; even knowing the characteristics of the lock, we still cannot tell which step has gone wrong.
At this point it is worth trying the most primitive method of all: turn on full statement logging with log_statement = all. And the logs hold a striking discovery: among SQL statements of the same shape most execute just fine, but one particular parameter value never does: there is a BIND phase but no EXECUTE phase!
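For reference, a minimal sketch of switching full statement logging on without a restart and reverting it afterwards (ALTER SYSTEM requires superuser privileges and is only one of several ways to do this):
ALTER SYSTEM SET log_statement = 'all';  -- log every statement
SELECT pg_reload_conf();                 -- pick up the change without restarting
-- reproduce the problem, then fall back to the value in postgresql.conf:
ALTER SYSTEM RESET log_statement;
SELECT pg_reload_conf();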
2022-08-05 17:36:29.232 CST,"xxxx","xxxx",241835,"30.69.79.67:41914",62ece183.3b0ab,6098,"BIND",2022-08-05 17:23:15 CST,44/1713,0,LOG,00000,"""
2022-08-05 17:36:29.232 CST,"xxxx","xxxx",241835,"30.69.79.67:41914",62ece183.3b0ab,6099,"SELECT",2022-08-05 17:23:15 CST,44/1713,0,LOG,00000,"""
2022-08-05 17:36:29.233 CST,"xxxx","xxxx",241840,"30.69.79.67:46722",62ece183.3b0b0,2645,"BIND",2022-08-05 17:23:15 CST,45/809,820459524,LOG,00000,"""
2022-08-05 17:36:29.233 CST,"xxxx","xxxx",241840,"30.69.79.67:46722",62ece183.3b0b0,2646,"UPDATE",2022-08-05 17:23:15 CST,45/809,820459524,LOG,00000,"""
2022-08-05 17:36:29.233 CST,"xxxx","xxxx",241835,"30.69.79.67:41914",62ece183.3b0ab,6100,"BIND",2022-08-05 17:23:15 CST,44/1713,0,LOG,00000,"""
2022-08-05 17:36:29.233 CST,"xxxx","xxxx",241835,"30.69.79.67:41914",62ece183.3b0ab,6101,"SELECT",2022-08-05 17:23:15 CST,44/1713,0,LOG,00000,"""
2022-08-05 17:36:29.234 CST,"xxxx","xxxx",241835,"30.69.79.67:41914",62ece183.3b0ab,6102,"BIND",2022-08-05 17:23:15 CST,44/1713,0,LOG,00000,"""
2022-08-05 17:36:29.234 CST,"xxxx","xxxx",241835,"30.69.79.67:41914",62ece183.3b0ab,6103,"SELECT",2022-08-05 17:23:15 CST,44/1713,0,LOG,00000,"""
2022-08-05 17:36:29.234 CST,"xxxx","xxxx",241840,"30.69.79.67:46722",62ece183.3b0b0,2648,"UPDATE",2022-08-05 17:23:15 CST,45/809,820459524,LOG,00000,""
2022-08-05 17:36:29.235 CST,"xxxx","xxxx",241840,"30.69.79.67:46722",62ece183.3b0b0,2649,"BIND",2022-08-05 17:23:15 CST,45/809,820459524,LOG,00000,",""
2022-08-05 17:36:29.235 CST,"xxxx","xxxx",241835,"30.69.79.67:41914",62ece183.3b0ab,6104,"BIND",2022-08-05 17:23:15 CST,44/1713,0,LOG,00000,""" ---此处 👈
2022-08-05 17:36:29.235 CST,"xxxx","xxxx",241840,"30.69.79.67:46722",62ece183.3b0b0,2650,"UPDATE",",,,,,,,,""
2022-08-05 17:36:29.236 CST,"xxxx","xxxx",241840,"30.69.79.67:46722",62ece183.3b0b0,2651,"BIND",2022-08-05 17:23:15 CST,45/809,820459524,LOG,00000,"""
2022-08-05 17:36:29.236 CST,"xxxx","xxxx",241840,"30.69.79.67:46722",62ece183.3b0b0,2652,"UPDATE",2022-08-05 17:23:15 CST,45/809,820459524,LOG,00000,"'
2022-08-05 17:36:29.237 CST,"xxxx","xxxx",241840,"30.69.79.67:46722",62ece183.3b0b0,2653,"BIND",2022-08-05 17:23:15 CST,45/809,820459524,LOG,00000,"""
2022-08-05 17:36:29.237 CST,"xxxx","xxxx",241840,"30.69.79.67:46722",62ece183.3b0b0,2654,"UPDATE",2022-08-05 17:23:15 CST,45/809,820459524,LOG,00000,"""
2022-08-05 17:36:29.237 CST,"xxxx","xxxx",241840,"30.69.79.67:46722",62ece183.3b0b0,2655,"BIND",2022-08-05 17:23:15 CST,45/809,820459524,LOG,00000,"""
2022-08-05 17:36:29.238 CST,"xxxx","xxxx",241840,"30.69.79.67:46722",62ece183.3b0b0,2656,"UPDATE",2022-08-05 17:23:15 CST,45/809,820459524,LOG,00000,"""
2022-08-05 17:36:29.238 CST,"xxxx","xxxx",241840,"30.69.79.67:46722",62ece183.3b0b0,2657,"BIND",2022-08-05 17:23:15 CST,45/809,820459524,LOG,00000,"""
2022-08-05 17:36:29.238 CST,"xxxx","xxxx",241840,"30.69.79.67:46722",62ece183.3b0b0,2658,"UPDATE",2022-08-05 17:23:15 CST,45/809,820459524,LOG,00000,"""
We can see that after the last occurrence of this SQL at 17:36:29.235, process 241835 goes through BIND and then produces no further log entries at all! Other statements of the same kind keep executing normally. So I grabbed this parameter value and ran the statement in the database: every other parameter returns instantly, but this one SQL dies the moment it runs, and even Ctrl+C cannot stop it!
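To push the captured value through a bind/execute path closer to what the application driver does, one option is a server-side prepared statement. This is a hedged sketch only: the real statement is elided above, so suspect_stmt, some_table and the literal 12345 are purely illustrative placeholders.
-- hypothetical names; substitute the real statement and the parameter value taken from the log
PREPARE suspect_stmt(bigint) AS
    SELECT * FROM some_table WHERE lexusdipa0_id = $1;
EXECUTE suspect_stmt(12345);
DEALLOCATE suspect_stmt;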
The first run came back in 51.927 ms, while the second never finished. The execution plan shows an index scan, and based on past experience I had a vague feeling the index was at fault, so I checked it with the amcheck extension:
SELECT bt_index_check(index => c.oid, heapallindexed => i.indisunique),
c.relname,
c.relpages
FROM pg_index i
JOIN pg_opclass op ON i.indclass[0] = op.oid
JOIN pg_am am ON op.opcmethod = am.oid
JOIN pg_class c ON i.indexrelid = c.oid
JOIN pg_namespace n ON c.relnamespace = n.oid
WHERE am.amname = 'btree' AND n.nspname = 'public'
-- Don't check temp tables, which may be from another session:
AND c.relpersistence != 't'
AND c.relname = 'index_name'  -- placeholder for the name of the index being checked
-- Function may throw an error when this is omitted:
AND c.relkind = 'i' AND i.indisready AND i.indisvalid
ORDER BY c.relpages DESC LIMIT 10;
Sure enough... the index was corrupted, yet again!
Readers who are curious can take a look at the logic of this function:
/*
* Function performs the following checks on target page, or pages ancillary to
* target page:
*
* - That every "real" data item is less than or equal to the high key, which
* is an upper bound on the items on the page. Data items should be
* strictly less than the high key when the page is an internal page.
*
* - That within the page, every data item is strictly less than the item
* immediately to its right, if any (i.e., that the items are in order
* within the page, so that the binary searches performed by index scans are
* sane).
*
* - That the last data item stored on the page is strictly less than the
* first data item on the page to the right (when such a first item is
* available).
*
* - Various checks on the structure of tuples themselves. For example, check
* that non-pivot tuples have no truncated attributes.
*
* Furthermore, when state passed shows ShareLock held, function also checks:
*
* - That all child pages respect strict lower bound from parent's pivot
* tuple.
*
* - That downlink to block was encountered in parent where that's expected.
*
* - That high keys of child pages matches corresponding pivot keys in parent.
*
* This is also where heapallindexed callers use their Bloom filter to
* fingerprint IndexTuples for later table_index_build_scan() verification.
*
* Note: Memory allocated in this routine is expected to be released by caller
* resetting state->targetcontext.
*/
static void
bt_target_page_check(BtreeCheckState *state)
Summary
The symptom of this problem is quite puzzling: once a particular parameter value is passed in, the query runs forever and cannot be stopped; what it is actually doing is spinning, waiting to lock a data page.
Working backwards from the result: after the query reads the corrupted index page it tries to take the lock, but because the data is inconsistent the acquisition fails. The defining trait of a spinlock is "busy waiting": a process that needs a SpinLock keeps actively polling until it gets the lock, and only then does it continue with the logic that follows; once held, the lock stays with that process until it releases it voluntarily. That is why this backend never stops, and why every backend issued with the same parameter piles up behind it, waiting in an ever-growing queue. The fix, fortunately, is simple: rebuild the index with REINDEX.
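A hedged sketch of the fix and its verification: index_name is a placeholder, and REINDEX INDEX CONCURRENTLY is only available from PostgreSQL 12 onwards (older versions need a plain REINDEX, which blocks writes on the table).
-- rebuild the corrupted index without blocking writes (PostgreSQL 12+)
REINDEX INDEX CONCURRENTLY index_name;
-- then verify the rebuilt index with amcheck again
SELECT bt_index_check('index_name'::regclass);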
What a curious pitfall this turned out to be. So, should our routine inspection platform start checking index validity for such an extremely rare scenario? That is the question!