页剪枝与奇妙的BUG
1前言
今天在隔壁群里看到一个很有意思的问题:"为什么 autovacuum = off ,vacuum_defer_cleanup_age 参数不为 0 , dead tuple 依然会被回收",在我思考这个问题的时候突然想到我们自研数据库实现了闪回,那么这个问题是否会对闪回这个功能产生影响呢?一起分析一下!最后发现居然是 BUG,截至 15.2 最新版依然没有解决!
2现象
先按照他的步骤复现一下
直接 initdb 然后修改 vacuum_defer_cleanup_age 100w; autovacuum =off
然后就 craete table t, insert 1 一条;一个窗口 update \watch 0.01; 一个窗口begin; insert 不提交;dead tuple 就会被自动清理;然后把 vacuum_defer_cleanup_age 改成0 重启库,再试,就不会被清理了;正常了
为了保持一致,我也重新初始化,并且参数保持一致
postgres=# show autovacuum;
autovacuum
------------
off
(1 row)
postgres=# show vacuum_defer_cleanup_age ;
vacuum_defer_cleanup_age
--------------------------
1000000
(1 row)
postgres=# create table t1(id int);
CREATE TABLE
postgres=# insert into t1 values(1);
INSERT 0 1
postgres=# begin;
BEGIN
postgres=*# insert into t1 values(2);
INSERT 0 1
postgres=*# select txid_current();
txid_current
--------------
735
(1 row)
然后新开一个会话每隔1秒更新,同时观察 pg_stat_all_tables,为了更加观测清晰,使用 pageinspect 同步观察
为了看清效果,我特意录制了一个动图,可以看到 last_vacuum 和 last_autovacuum 一直没有数据,当死元祖到 203 的时候,n_dead_tup 被离奇地"清除"了,突然变成了1,特意确认了一遍,autovacuum 确实关闭了。那是哪里出了问题?最开始我以为该案例是到了年龄,即使关闭了 autovacuum,也会强制冻结元组,但是很明显年龄还没有达到默认值 2 亿。
后来我想到了我之前曾写过一篇案例,index only scan的误区续,里面提到如下:
在PostgreSQL里面,清理的动作并不仅仅是 vacuum。一共有两个时机
时机1:在查询操作访问到某个页面时,会清理这个页面 时机2:通过 vacuum 操作来清理
后面随着功力见长,我大概猜到了问题出在哪里,没错,正是页剪枝(page pruning),Internal DB 里面也有所提及
The pruning processing will be executed, if possible, when a SQL command is executed such as SELECT, UPDATE, INSERT and DELETE. The exact execution timing is not described in this chapter because it is very complicated. The details are described in the README.HOT file.
PostgreSQL removes dead tuples if possible, as in the pruning process, at an appropriate time. In the document of PostgreSQL, this processing is called defragmentation. Fig. 7.5 despicts the defragmentation by HOT.
同时在我翻译的 《postgresql_internals-14_parts1-4_en》 里面也提到了这个特性,
在读取或者更新堆页面时,会执行一些快速页面清理和剪枝。为了一探究竟,我抓了一下跑的过程中所调用的函数,然后根据 prune 来过滤
[postgres@xiongcc ~]$ cat stap_new.out | grep prune | uniq
postgres: heap_page_prune
postgres: heap_page_prune_execute
postgres: heap_page_prune_opt
postgres: heap_prune_chain
postgres: heap_prune_record_redirect
postgres: heap_prune_record_unused
postgres: heap_prune_satisfies_vacuum
很清晰,所以我很快就定位到了具体代码
/*
* We prune when a previous UPDATE failed to find enough space on the page
* for a new tuple version, or when free space falls below the relation's
* fill-factor target (but not less than 10%).
*
* Checking free space here is questionable since we aren't holding any
* lock on the buffer; in the worst case we could get a bogus answer. It's
* unlikely to be *seriously* wrong, though, since reading either pd_lower
* or pd_upper is probably atomic. Avoiding taking a lock seems more
* important than sometimes getting a wrong answer in what is after all
* just a heuristic estimate.
*/
minfree = RelationGetTargetPageFreeSpace(relation,
HEAP_DEFAULT_FILLFACTOR);
minfree = Max(minfree, BLCKSZ / 10);
if (PageIsFull(page) || PageGetHeapFreeSpace(page) < minfree)
{
/* OK, try to get exclusive buffer lock */
if (!ConditionalLockBufferForCleanup(buffer))
return;
/*
* Now that we have buffer lock, get accurate information about the
* page's free space, and recheck the heuristic about whether to
* prune. (We needn't recheck PageIsPrunable, since no one else could
* have pruned while we hold pin.)
*/
if (PageIsFull(page) || PageGetHeapFreeSpace(page) < minfree)
{
/* OK to prune */
(void) heap_page_prune(relation, buffer, vistest,
limited_xmin, limited_ts,
true, NULL);
}
/* And release buffer lock */
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
}
如注释,也如翻译中所说,有两种情况会进行剪枝,以删除在任何快照中不再可见的元组
之前的 UPDATE 操作没有找到足够的空间将新元组放入同一页面。 堆页中包含的数据多于 fillfactor 存储参数所允许的数据,比如 fillfactor 是 80,那么预留 20% 的空间用于更新操作,假如剩余的空间不足 20%,就会执行页剪枝,另外可以看到最低不能低于单个数据块大小的 10%
注意是删除在任何快照中不再可见的元祖,回顾一下操作,我最开始就获取了永久事务 ID,735,那么后续的更新操作的事务 ID 都会比 735 大(xmax > xmin),而 vacuum 清理死元祖的核心原理在于判断是否存在某个事务比删除元组的事务小
/*
* HeapTupleSatisfiesVacuum
*
* Determine the status of tuples for VACUUM purposes. Here, what
* we mainly want to know is if a tuple is potentially visible to *any*
* running transaction. If so, it can't be removed yet by VACUUM.
*
* OldestXmin is a cutoff XID (obtained from
* GetOldestNonRemovableTransactionId()). Tuples deleted by XIDs >=
* OldestXmin are deemed "recently dead"; they might still be visible to some
* open transaction, so we can't remove them, even if we see that the deleting
* transaction has committed.
*/
HTSV_Result
HeapTupleSatisfiesVacuum(HeapTuple htup, TransactionId OldestXmin,
Buffer buffer)
{
TransactionId dead_after = InvalidTransactionId;
HTSV_Result res;
res = HeapTupleSatisfiesVacuumHorizon(htup, buffer, &dead_after);
if (res == HEAPTUPLE_RECENTLY_DEAD)
{
Assert(TransactionIdIsValid(dead_after));
if (TransactionIdPrecedes(dead_after, OldestXmin))
res = HEAPTUPLE_DEAD;
...
case HEAPTUPLE_RECENTLY_DEAD:
/*
* If tuple is recently dead then we must not remove it from
* the relation. (We only remove items that are LP_DEAD from
* pruning.)
*/
recently_dead_tuples++;
prunestate->all_visible = false;
break;
case HEAPTUPLE_INSERT_IN_PROGRESS:
这个函数的逻辑是,如果 xmax > OldestXmin,则返回 HEAPTUPLE_RECENTLY_DEAD,意味着保留此元祖,不去进行删除。所以,如果系统中含有很久之前开启而未提交的事务,并且这个事务由于执行过更新,创建了事务 ID,那么计算出来的 OldestXmin 会非常小,vacuum 做上述这个判断时,结果通常为 true ,即返回 HEAPTUPLE_RECENTLY_DEAD ,这样就会保留此死元祖,导致回收无法完成,表膨胀也由此发生。
既然这样,应该不会删除啊,为什么还做了这个操作?让我们手动执行观测一下
postgres=# vacuum verbose t1;
INFO: vacuuming "public.t1"
INFO: table "t1": found 0 removable, 1 nonremovable row versions in 1 out of 1 pages
DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 4293973002
Skipped 0 pages due to buffer pins, 0 frozen pages.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
VACUUM
纳尼?我惊呆了,OldestXmin 怎么是 4293973002?怎么这么大一个数字!各位思考五分钟...
思考完毕,这个逻辑在 procarray.c 的 ComputeXidHorizons 里面
static void
ComputeXidHorizons(ComputeXidHorizonsResult *h)
{
ProcArrayStruct *arrayP = procArray;
TransactionId kaxmin;
bool in_recovery = RecoveryInProgress();
TransactionId *other_xids = ProcGlobal->xids;
LWLockAcquire(ProcArrayLock, LW_SHARED);
...
else
{
/*
* Compute the cutoff XID by subtracting vacuum_defer_cleanup_age.
*
* vacuum_defer_cleanup_age provides some additional "slop" for the
* benefit of hot standby queries on standby servers. This is quick
* and dirty, and perhaps not all that useful unless the primary has a
* predictable transaction rate, but it offers some protection when
* there's no walsender connection. Note that we are assuming
* vacuum_defer_cleanup_age isn't large enough to cause wraparound ---
* so guc.c should limit it to no more than the xidStopLimit threshold
* in varsup.c. Also note that we intentionally don't apply
* vacuum_defer_cleanup_age on standby servers.
*/
h->oldest_considered_running =
TransactionIdRetreatedBy(h->oldest_considered_running,
vacuum_defer_cleanup_age);
h->shared_oldest_nonremovable =
TransactionIdRetreatedBy(h->shared_oldest_nonremovable,
vacuum_defer_cleanup_age);
h->data_oldest_nonremovable =
TransactionIdRetreatedBy(h->data_oldest_nonremovable,
vacuum_defer_cleanup_age);
h->catalog_oldest_nonremovable =
TransactionIdRetreatedBy(h->catalog_oldest_nonremovable,
vacuum_defer_cleanup_age);
/* defer doesn't apply to temp relations */
}
然后会进入到 TransactionIdRetreatedBy,这一步骤会返回 TransactionId(无符号整型 32 位)
/* return transaction ID backed up by amount, handling wraparound correctly */
static inline TransactionId
TransactionIdRetreatedBy(TransactionId xid, uint32 amount)
{
xid -= amount;
while (xid < FirstNormalTransactionId)
xid--;
return xid;
}
注意这个信息在 15 里面有点差异,较前面的 14 版本会多一些附加信息
postgres=# vacuum verbose t1;
INFO: vacuuming "postgres.public.t1"
INFO: finished vacuuming "postgres.public.t1": index scans: 0
pages: 0 removed, 1 remain, 1 scanned (100.00% of total)
tuples: 24 removed, 1 remain, 0 are dead but not yet removable
removable cutoff: 4293968253, which was 1000000 XIDs old when operation ended ---👈🏻在这里
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed
avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
buffer usage: 20 hits, 4 misses, 5 dirtied
WAL usage: 2 records, 2 full page images, 1088 bytes
system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 90.54 s
VACUUM
👆🏻可以看到事务 ID(957)减去 vacuum_defer_cleanup_age,再转化为 unit32,就变成了这么大坨数字,957 - 1000000 = -999043,然后补码便是 4293968253(计算机中正数用原码表示,负数用补码表示),
原码:1000 0000 0000 1111 0011 1110 1000 0011 反码:1111 1111 1111 0000 1100 0001 0111 1100 补码:1111 1111 1111 0000 1100 0001 0111 1101
然后再代入到 TransactionIdPrecedes 进行比较两个事务大小
/*
* TransactionIdPrecedes --- is id1 logically < id2?
*/
bool
TransactionIdPrecedes(TransactionId id1, TransactionId id2)
{
/*
* If either ID is a permanent XID then we can just do unsigned
* comparison. If both are normal, do a modulo-2^32 comparison.
*/
int32 diff;
if (!TransactionIdIsNormal(id1) || !TransactionIdIsNormal(id2))
return (id1 < id2);
diff = (int32) (id1 - id2);
return (diff < 0);
}
这样两个普通事务相减,再转为 int32,就变成了负数(最高位为1),返回真,所以 957 这个事务比 OldestXmin 小,不满足 xmax > OldestXmin,所以被清除了。
为了验证想法,这次我将 vacuum_defer_cleanup_age 参数关掉
postgres=# show autovacuum;
autovacuum
------------
on
(1 row)
postgres=# show vacuum_defer_cleanup_age ;
vacuum_defer_cleanup_age
--------------------------
0
(1 row)
postgres=# create table t3(id int);
CREATE TABLE
postgres=# insert into t3 values(1);
INSERT 0 1
postgres=# begin;
BEGIN
postgres=*# select txid_current();
txid_current
--------------
990
(1 row)
然后新开一个会话删除
[postgres@xiongcc ~]$ psql
psql (15.1)
Type "help" for help.
postgres=# delete from t3;
DELETE 1
postgres=# vacuum verbose t3;
INFO: vacuuming "postgres.public.t3"
INFO: finished vacuuming "postgres.public.t3": index scans: 0
pages: 0 removed, 1 remain, 1 scanned (100.00% of total)
tuples: 0 removed, 1 remain, 1 are dead but not yet removable ---👈🏻保留
removable cutoff: 990, which was 2 XIDs old when operation ended
new relfrozenxid: 989, which is 1 XIDs ahead of previous value
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed
avg read rate: 87.047 MB/s, avg write rate: 65.286 MB/s
buffer usage: 7 hits, 4 misses, 3 dirtied
WAL usage: 1 records, 0 full page images, 188 bytes
system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
VACUUM
可以看到,这条元组没有被删除,算出来的 oldestXmin 也是正确的(txid_current - vacuum_defer_cleanup_age)具体代码细节在下面,此处就不再演示,感兴趣的童鞋自己追一下流程,什么时候变成 HEAPTUPLE_RECENTLY_DEAD(无法删除),什么时候变成 HEAPTUPLE_DEAD(可以删除)
/*
* HeapTupleSatisfiesVacuum
*
* Determine the status of tuples for VACUUM purposes. Here, what
* we mainly want to know is if a tuple is potentially visible to *any*
* running transaction. If so, it can't be removed yet by VACUUM.
*
* OldestXmin is a cutoff XID (obtained from
* GetOldestNonRemovableTransactionId()). Tuples deleted by XIDs >=
* OldestXmin are deemed "recently dead"; they might still be visible to some
* open transaction, so we can't remove them, even if we see that the deleting
* transaction has committed.
*/
HTSV_Result
HeapTupleSatisfiesVacuum(HeapTuple htup, TransactionId OldestXmin,
Buffer buffer)
{
TransactionId dead_after = InvalidTransactionId;
HTSV_Result res;
res = HeapTupleSatisfiesVacuumHorizon(htup, buffer, &dead_after);
if (res == HEAPTUPLE_RECENTLY_DEAD)
{
Assert(TransactionIdIsValid(dead_after));
if (TransactionIdPrecedes(dead_after, OldestXmin))
res = HEAPTUPLE_DEAD;
}
else
Assert(!TransactionIdIsValid(dead_after));
return res;
}
既然这个算法就是简单的相减,那么问题就来了!看下面实验👇🏻
3vacuum_defer_cleanup_age = 10000
postgres=# show autovacuum;
autovacuum
------------
on
(1 row)
postgres=# show vacuum_defer_cleanup_age ;
vacuum_defer_cleanup_age
--------------------------
10000
(1 row)
postgres=# create table t3(id int);
CREATE TABLE
postgres=# insert into t3 values(1);
INSERT 0 1
postgres=# begin;
BEGIN
postgres=*# select txid_current();
txid_current
--------------
1000
(1 row)
另开一个会话删除数据,可以看到 oldestXmin 这个值就不对了,oldestXmin = 4294958296
postgres=# delete from t3;
DELETE 1
postgres=# vacuum verbose t3;
INFO: vacuuming "postgres.public.t3"
INFO: table "t3": truncated 1 to 0 pages
INFO: finished vacuuming "postgres.public.t3": index scans: 0
pages: 1 removed, 0 remain, 1 scanned (100.00% of total)
tuples: 1 removed, 0 remain, 0 are dead but not yet removable
removable cutoff: 4294958296, which was 10003 XIDs old when operation ended
index scan not needed: 1 pages from table (100.00% of total) had 1 dead item identifiers removed
avg read rate: 14.782 MB/s, avg write rate: 14.782 MB/s
buffer usage: 9 hits, 4 misses, 4 dirtied
WAL usage: 6 records, 1 full page images, 8638 bytes
system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
VACUUM
4vacuum_defer_cleanup_age = 1000
postgres=# drop table t3;
DROP TABLE
postgres=# show autovacuum;
autovacuum
------------
on
(1 row)
postgres=# show vacuum_defer_cleanup_age ;
vacuum_defer_cleanup_age
--------------------------
1000
(1 row)
postgres=# create table t3(id int);
CREATE TABLE
postgres=# insert into t3 values(1);
INSERT 0 1
postgres=# begin;
BEGIN
postgres=*# select txid_current();
txid_current
--------------
1006
(1 row)
这次 oldestXmin 是 6,正确,txid_current - vacuum_defer_cleanup_age
postgres=# delete from t3;
DELETE 1
postgres=# vacuum verbose t3;
INFO: vacuuming "postgres.public.t3"
INFO: finished vacuuming "postgres.public.t3": index scans: 0
pages: 0 removed, 1 remain, 1 scanned (100.00% of total)
tuples: 0 removed, 1 remain, 1 are dead but not yet removable
removable cutoff: 6, which was 1002 XIDs old when operation ended
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed
avg read rate: 61.516 MB/s, avg write rate: 46.137 MB/s
buffer usage: 7 hits, 4 misses, 3 dirtied
WAL usage: 1 records, 0 full page images, 188 bytes
system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
VACUUM
5vacuum_defer_cleanup_age = 100
postgres=# drop table t3;
DROP TABLE
postgres=# show vacuum_defer_cleanup_age ;
vacuum_defer_cleanup_age
--------------------------
100
(1 row)
postgres=# create table t3(id int);
CREATE TABLE
postgres=# insert into t3 values(1);
INSERT 0 1
postgres=# begin;
BEGIN
postgres=*# select txid_current();
txid_current
--------------
1011
(1 row)
这次 oldestXmin 是 911,这次也正确,txid_current - vacuum_defer_cleanup_age
postgres=# delete from t3;
DELETE 1
postgres=# vacuum verbose t3;
INFO: vacuuming "postgres.public.t3"
INFO: finished vacuuming "postgres.public.t3": index scans: 0
pages: 0 removed, 1 remain, 1 scanned (100.00% of total)
tuples: 0 removed, 1 remain, 1 are dead but not yet removable
removable cutoff: 911, which was 102 XIDs old when operation ended
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed
avg read rate: 86.088 MB/s, avg write rate: 64.566 MB/s
buffer usage: 7 hits, 4 misses, 3 dirtied
WAL usage: 1 records, 0 full page images, 188 bytes
system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
VACUUM
各位其实应该能想明白,假如当前的事务 ID 大于 1000,确保相减出来的值是正数,那么就不会有问题!
postgres=# create table t3(id int);
CREATE TABLE
postgres=# insert into t3 values(1);
INSERT 0 1
postgres=# begin;
BEGIN
postgres=*# select txid_current();
txid_current
--------------
1016
(1 row)
postgres=*# show vacuum_defer_cleanup_age ;
vacuum_defer_cleanup_age
--------------------------
1100
(1 row)
postgres=# delete from t3;
DELETE 1
postgres=# vacuum verbose t3;
INFO: vacuuming "postgres.public.t3"
INFO: table "t3": truncated 1 to 0 pages
INFO: finished vacuuming "postgres.public.t3": index scans: 0
pages: 1 removed, 0 remain, 1 scanned (100.00% of total)
tuples: 1 removed, 0 remain, 0 are dead but not yet removable
removable cutoff: 4294967212, which was 1103 XIDs old when operation ended
index scan not needed: 1 pages from table (100.00% of total) had 1 dead item identifiers removed
avg read rate: 12.823 MB/s, avg write rate: 12.823 MB/s
buffer usage: 9 hits, 4 misses, 4 dirtied
WAL usage: 6 records, 1 full page images, 8638 bytes
system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
VACUUM
只要相减的值是负数,oldestXmin 就会出问题,这次变成了 4294967212。
因此,回到我们最开始的现场,重新初始化全部抹掉重来
postgres=# show autovacuum;
autovacuum
------------
off
(1 row)
postgres=# show vacuum_defer_cleanup_age ;
vacuum_defer_cleanup_age
--------------------------
10000
(1 row)
然后为了确保事务 ID 大于10000,消耗一些事务 ID 先
Wed 15 Feb 2023 11:30:05 AM CST (every 0.001s)
txid_current
--------------
14898
(1 row)
Wed 15 Feb 2023 11:30:05 AM CST (every 0.001s)
txid_current
--------------
14899
(1 row)
^C
postgres=# select txid_current();
txid_current
--------------
14900
(1 row)
确保大于 10000,然后老样子来操作一下
直接 initdb 然后修改 vacuum_defer_cleanup_age 100w; autovacuum =off
然后就 craete table t, insert 1 一条;一个窗口 update \watch 0.01; 一个窗口begin; insert 不提交;dead tuple 就会被自动清理;然后把 vacuum_defer_cleanup_age 改成0 重启库,再试,就不会被清理了;正常了
这次就不会被清理了,死元祖会一直增加!
Wed 15 Feb 2023 11:33:47 AM CST (every 1s)
relname | n_dead_tup | last_vacuum | last_autovacuum
---------+------------+-------------+-----------------
t1 | 280 | |
(1 row)
Wed 15 Feb 2023 11:33:48 AM CST (every 1s)
relname | n_dead_tup | last_vacuum | last_autovacuum
---------+------------+-------------+-----------------
t1 | 280 | |
(1 row)
Wed 15 Feb 2023 11:33:49 AM CST (every 1s)
relname | n_dead_tup | last_vacuum | last_autovacuum
---------+------------+-------------+-----------------
t1 | 282 | |
(1 row)
Wed 15 Feb 2023 11:33:50 AM CST (every 1s)
relname | n_dead_tup | last_vacuum | last_autovacuum
---------+------------+-------------+-----------------
t1 | 282 | |
(1 row)
Wed 15 Feb 2023 11:33:51 AM CST (every 1s)
relname | n_dead_tup | last_vacuum | last_autovacuum
---------+------------+-------------+-----------------
t1 | 284 | |
(1 row)
6小结
从分析,到追代码,到复现验证,真是一个有趣的案例!各位可以去提 bug 了!这个特性是在优化 14 版本的时候引入的,dc7420c2c9274a283779ec19718d2d16323640c0,其实和我之前写过的费解的分区表和表空间类似,官方在引入新特性的过程中,可能会带来一些额外的坑,就跟我们自己写代码一样,修复了一个bug,可能又引入一个其他的bug,导致最后越修越多....
基于此实现的闪回也要当心,不过闪回一般都要求关闭 prune,另外我看到的那个 patch 不是修复这个的。https://www.postgresql.org/message-id/attachment/142215/v2-0001-Fix-corruption-due-to-vacuum_defer_cleanup_age-un.patch,所以各位可以去提 BUG了,致谢名单里下一个就是你。