Production Case | A Strange Case of Table Bloat

xiongcc PostgreSQL学徒 2023-03-04

Preface

I have spent the past couple of days wrestling with a table bloat problem. The analysis was arduous but very rewarding. Please read patiently; I believe it will give you a much deeper understanding of vacuum and table bloat.

Symptoms

On February 19th, a colleague on the production side reported that disk usage kept climbing; shortly after an emergency expansion it was full again. My first thought was: is WAL piling up? Then, looking at pg_stat_activity, I found that a vacuum on one table had been running for more than 40 hours, and since that table was being updated continuously, it just kept bloating. The odd part is that, according to development and operations, this table had always been fine before; it only started misbehaving on the 18th, with disk usage rising steadily ever since.

So I logged on to the problematic instance (for production-security reasons everything is masked; the production table is referred to as test below).

A quick look at the OS level first: this process had been running for 2416 minutes, roughly 40 hours, and was still going. top -p also showed that the vacuum had a single CPU core maxed out.

[postgres@xiongcc ~]$ ps -aux | grep `echo $PGNAME` | grep -v idle
postgres 33579  61.8  13.3 70799040  52576364 ?    Ss    Feb19    2416:44    postgres:xxx:autovacuum worker test

Now let's see what pg_stat_all_tables records:

postgres=# select * from pg_stat_all_tables where relname = 'test';
-[ RECORD 1 ]-------+-------
relid               | 52668
schemaname          | public
relname             | test
seq_scan            | 328
seq_tup_read        | 9453687223
idx_scan            | 7319484949
idx_tup_fetch       | 66486320864
n_tup_ins           | 114580778
n_tup_upd           | 6439357491
n_tup_del           | 93271368
n_tup_hot_upd       | 4286447506
n_live_tup          | 21250167
n_dead_tup          | 1016968483
n_mod_since_analyze | 664473271
last_vacuum         | 
last_autovacuum     | 2022-02-19 04:41:58.811903+08
last_analyze        | 
last_autoanalyze    | 2022-02-19 04:43:20.897033+08
vacuum_count        | 0
autovacuum_count    | 232
analyze_count       | 0
autoanalyze_count   | 1289

The problem is apparent here: n_dead_tup is many times n_live_tup, so the bloat is severe. With the workload continuously running updates of the form

update test set xxx = $1 where xxx_code = $2

the test table grows by roughly 30GB per hour:

postgres=# select pg_size_pretty(pg_total_relation_size('test'));
 pg_size_pretty 
----------------
 954 GB
(1 row)
postgres=# select pg_size_pretty(pg_total_relation_size('test'));
 pg_size_pretty 
----------------
 987 GB
(1 row)
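Incidentally, the dead/live ratio can be watched straight from pg_stat_all_tables. A minimal sketch (test is the masked table name used throughout):

select relname,
       n_live_tup,
       n_dead_tup,
       round(100.0 * n_dead_tup / nullif(n_live_tup + n_dead_tup, 0), 1) as dead_pct,  -- share of dead tuples
       last_autovacuum
from pg_stat_all_tables
where relname = 'test';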

So why this odd behaviour? Let's grab a stack trace:

[postgres@xiongcc ~]$ pstack 33579
#0 0x00000000004c9d32 in _bt_checkpage()
#1 0x00000000004c75c5 in _bt_getstackbuf()
#3 0x00000000004caf24 in _bt_pagedel()
#4 0x00000000004ccd5f in btvacuumpage()
#5 0x00000000004ccfac in btvacuumscan()
#6 0x00000000004cd138 in btbulkdelete()
#7 0x00000000005e7111 in lazy_vacuum_index()
#8 0x00000000005e7e7c in lazy_vacuum_rel()
#9 0x00000000005e5a21 in vacuum_rel()
#10 0x00000000005e6a42 in vacuum()
#11 0x00000000006b4899 in do_autovacuum()
#12 0x00000000006b4dc7 in AutoVacWorkerMain.isra.6()
#13 0x00000000006b5679 in StartAutoVacWorker()
#14 0x00000000006c28a2 in sigusr1_handler()
#15 <signal handler called>
#16 0x00002b0433dcf783 in __select_nocancel() from /lib64/libc.so.6
#17 0x000000000047da6a in ServerLoop()
#18 0x00000000006c36c9 in PostmasterMain()
#19 0x000000000047ee2b in main()

Looking at the first stack: after entering vacuum(), the actual work is done by vacuum_rel(), and vacuum itself comes in two flavours

•lazy vacuum: it only finds dead tuples and marks their space as reusable; it does not compact the table or return space to the OS.
•full vacuum: on top of what lazy vacuum does, it also compacts the table by rewriting it, and therefore has to lock the table.

autovacuum invokes lazy vacuum.
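For reference, here is the difference in a nutshell, as a minimal sketch (test is the masked table name; the lock levels are the standard ones for these commands):

-- lazy vacuum: marks dead tuples' space as reusable, does not shrink the data files,
-- and only takes a SHARE UPDATE EXCLUSIVE lock, so normal reads and writes keep running
vacuum (verbose) test;

-- full vacuum: rewrites the table into a new file and returns space to the OS,
-- but holds an ACCESS EXCLUSIVE lock for the whole duration
vacuum full test;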

The btbulkdelete frame and those above it show that index entries are being deleted:

/*
 * Bulk deletion of all index entries pointing to a set of heap tuples.
 * The set of target tuples is specified via a callback routine that tells
 * whether any given heap tuple (identified by ItemPointer) is being deleted.
 *
 * Result: a palloc'd struct containing statistical info for VACUUM displays.
 */

IndexBulkDeleteResult *
btbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
             IndexBulkDeleteCallback callback, void *callback_state)
{
    Relation    rel = info->index;
    BTCycleId    cycleid;
    
...
/*
 * btvacuumscan --- scan the index for VACUUMing purposes
 *
 * This combines the functions of looking for leaf tuples that are deletable
 * according to the vacuum callback, looking for empty pages that can be
 * deleted, and looking for old deleted pages that can be recycled.  Both
 * btbulkdelete and btvacuumcleanup invoke this (the latter only if no
 * btbulkdelete call occurred and _bt_vacuum_needs_cleanup returned true).
 *
 * The caller is responsible for initially allocating/zeroing a stats struct
 * and for obtaining a vacuum cycle ID if necessary.
 */

static void
btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
             IndexBulkDeleteCallback callback, void *callback_state,
             BTCycleId cycleid)
{
    Relation    rel = info->index;
    BTVacState    vstate;
    BlockNumber num_pages;
    BlockNumber scanblkno;
    bool        needLock;

...

Grab the stack again:

[postgres@xiongcc ~]$ pstack 33579
#0 0x00002b0433dcf783 in __select_nocancel() from /lib64/libc.so.6
#1 0x00000000008683ea in pg_usleep()
#2 0x00000000005e7026 in vacuum_delay_point()
#3 0x00000000004cca84 in btvacuumpage()
#4 0x00000000004ccfac in btvacuumscan()
#5 0x00000000004cd138 in btbulkdelete()
#6 0x00000000005e7111 in lazy_vacuum_index()
#7 0x00000000005e7e7c in lazy_vacuum_rel()
#8 0x00000000005e5a21 in vacuum_rel()
#9 0x00000000005e6a42 in vacuum()
#10 0x00000000006b4988 in do_autovacuum()
#11 0x00000000006b4dc7 in AutoVacWorkerMain.isra.6()
#12 0x00000000006b5679 in StartAutoVacWorker()
#13 0x00000000006c28a2 in sigusr1_handler()
#14 <signal handler called>
#15 0x00002b0433dcf783 in __select_nocancel() from /lib64/libc.so.6
#16 0x000000000047da6a in ServerLoop()
#17 0x00000000006c36c9 in PostmasterMain()
#18 0x000000000047ee2b in main()

This one is largely the same, but note the extra vacuum_delay_point() frame. This is the cost-based vacuum delay: vacuum_delay_point is called while cleaning the indexes, the table, the VM file, and so on. Its job is to throttle autovacuum so that it does not eat so many resources that it hurts the business workload. The defaults, however, may be a very poor match for your hardware, in which case garbage collection cannot keep up with the rate at which garbage is generated.

While PostgreSQL is cleaning, once the accumulated cost reaches vacuum_cost_limit, the process sleeps for vacuum_cost_delay. If the limit is not set high enough, i.e. not matched to the system's TPS, the result is that vacuum keeps getting delayed. Also note that before version 12 the default autovacuum_vacuum_cost_delay was 20ms; only later versions lowered it to 2ms, and the old default throttles vacuum even further. This deserves special attention, and it is a trap newcomers fall into easily.
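A quick way to see what the instance is actually running with (the table-level override applied later in this post is shown further below):

select name, setting, unit
from pg_settings
where name in ('vacuum_cost_limit', 'vacuum_cost_delay',
               'autovacuum_vacuum_cost_limit', 'autovacuum_vacuum_cost_delay');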

So, from this quick look at the stack, vacuum is busy cleaning the indexes. Let's check the current state of the table:

postgres=# select count(*) from test;
 count 
---------
 21352305
(1 row)

Time: 531506.980 ms (08:51.507)

The first query took a whopping 8 minutes. There was no lock blocking it; the wait_event was DataFileRead, i.e. "Waiting for a read from a relation data file."
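The wait event comes straight out of pg_stat_activity, roughly like this (a sketch):

select pid, backend_type, state, wait_event_type, wait_event, left(query, 60) as query
from pg_stat_activity
where wait_event is not null;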

Query it again:

postgres=# select count(*) from test;
 count 
---------
 21352305
(1 row)

Time: 30377.980 ms (00:30.378)

Why is it back to "normal" this time? I put normal in quotes because, with this many dead tuples in the table, a 30-second scan is actually reasonable in this scenario: visibility has to be checked for every tuple.

So why was the first run so slow and the second one fast? A few candidate explanations:

1.The first and most obvious: buffer effects. The first query warmed part of the cache. But because of the ring buffer, this effect should be small. The ring buffer exists precisely to keep a one-off scan of a big table from polluting shared buffers and hurting other SQL: when the table being scanned is larger than a quarter of shared_buffers, a special ring buffer of only 256KB is used (see the excerpt below).
2.The first scan may have set the infomask (hint bits), so the second query no longer had to consult CLOG for visibility and could read it straight from each tuple's infomask (see the pageinspect sketch below). But for some 20 million rows, even calling TransactionIdIsInProgress / TransactionIdDidCommit / TransactionIdDidAbort for every tuple should not produce such a huge gap; otherwise PostgreSQL would fall over as soon as data volume grew.
3.shared buffers is full of dirty blocks (the table is updated constantly). Counting rows means reading blocks in; if no free buffer is left, a clean victim buffer can simply be dropped, but a dirty one has to be written out first, and that is synchronous I/O, unlike bgwriter's asynchronous writes, so the query blocks until the I/O finishes. And, as shown later in the article, the reclaim rate during the incident was only about 3MB/s, so this sounded plausible. But then point 1 cuts the other way: the ring buffer is only 256KB, i.e. 32 blocks, so at most 32 blocks would ever need to be flushed this way.
4.An I/O problem? Or a host problem? But the host team later checked and found nothing obviously wrong.
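For the hint-bit idea in point 2, the on-disk infomask can be inspected with the pageinspect extension. A sketch, assuming pageinspect is installed and you are superuser (HEAP_XMIN_COMMITTED is 0x0100):

create extension if not exists pageinspect;

-- look at the first heap page of test: if bit 0x0100 (256) is set in t_infomask,
-- xmin is already known to be committed and CLOG no longer needs to be consulted
select lp, t_xmin, t_xmax,
       (t_infomask & 256) <> 0 as xmin_committed_hint
from heap_page_items(get_raw_page('test', 0))
limit 10;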

Looked at this way, none of the explanations seems to fully hold up.

See http://www.interdb.jp/pg/pgsql08.html:

When reading or writing a huge table, PostgreSQL uses a ring buffer rather than the buffer pool. The ring buffer is a small and temporary buffer area. When any condition listed below is met, a ring buffer is allocated to shared memory:

1.Bulk-reading: when a relation whose size exceeds one-quarter of the buffer pool size (shared_buffers/4) is scanned. In this case, the ring buffer size is 256 KB.
2.Bulk-writing: when the SQL commands listed below are executed. In this case, the ring buffer size is 16 MB.
•COPY FROM command.
•CREATE TABLE AS command.
•CREATE MATERIALIZED VIEW or REFRESH MATERIALIZED VIEW command.
•ALTER TABLE command.
3.Vacuum-processing: when an autovacuum performs a vacuum processing. In this case, the ring buffer size is 256 KB.

I really should have looked at what the process was doing right then. The things I usually check:

•strace to see what it is doing
•lsof -p to see which files it has open, or look at /proc/pid/fd
•pstack for the call stack
•/proc/pid/wchan for the process's waiting channel; a pile of D-state processes sometimes means they are stuck on a kernel mutex (the user-space counterpart being a futex)
•/proc/pid/syscall for the system call currently being executed
•/proc/pid/statm for the process's memory state

Unfortunately, in the rush to deal with the problem I forgot to collect these, losing a rather crucial piece of evidence. Something to remember for the next troubleshooting session.

In hindsight, though, the wait events alone pointed towards buffers, i.e. memory. In our production environment we have hit, as far as I recall, three memory-related incidents in total, so I checked the OS logs (dmesg and /var/log/messages), and sure enough found something:

[13458593.036844] mce:[Hardware Error]: Machine check events logged
[13458593.036864] EDAC skx MC1: HANDLING MCE MEMORY ERROR
[13458593.036879] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#0_MC#1_Chan#2_DIMM#O (channel1:2 slot:0 page:0x2b4f9b0 offset:0x740 grain:32 syndrome:0x0-  err_code:0008:00c2 socket:0 imc:1 rank:0 bg:3 ba:0 row:1c2b1 co1:10)

EDAC stands for error detection and correction, a mechanism inside the Linux kernel. The log above clearly shows a memory read error. MC is the memory controller. CE means correctable error, the kind ECC memory can fix; its counterpart is UE (uncorrectable error).

Having grabbed this "lifeline", I quickly concluded it was a host memory fault and contacted the host team. After a flurry of operations, in the early hours of the 22nd the instance was switched over to the standby machine (VCS). At the same time, given the vacuum_delay_point finding above, I adjusted the table-level parameters so that vacuum would trigger more aggressively:

postgres=# alter table test set (autovacuum_vacuum_scale_factor = 0.02);
ALTER TABLE
postgres=# alter table test set (autovacuum_vacuum_cost_delay = 0);
ALTER TABLE

Just as I was quietly congratulating myself on cracking a tricky case, reality poured cold water on me: the table was still bloating.

One thing was certain, though: the tuning did help. The table was still growing, but much more slowly than before: previously it bloated by 30GB an hour, now by about 80GB over 7 hours.

[postgres:5432@pssdlxx] [02-22.10:04:42]=# \dt+ test
                                     List of relations
 Schema | Name | Type  |  Owner   |    Size     | Description 
--------+------+-------+----------+-------------+---------------
 public | test | table | postgres |   563 GB     | 
(1 row)

[postgres:5432@pssdlxx] [02-22.10:04:45]=# \dt+ test
                                     List of relations
 Schema | Name | Type  |  Owner   |    Size     | Description 
--------+------+-------+----------+-------------+---------------
 public | test | table | postgres |   564 GB     | 
(1 row)

[postgres:5432@pssdlxx] [02-22.15:56:37]=# \dt+ test
                                     List of relations
 Schema | Name | Type  |  Owner   |    Size     | Description 
--------+------+-------+----------+-------------+---------------
 public | test | table | postgres |   627 GB     | 
(1 row)

BUG?

While going through the vacuum logs I also noticed something interesting:

2022-02-15 17:39:32.904 CST,,,198827,,620b6c0f.308ab,1,,2022-02-15 17:02:07 CST,131/6646812,0,LOG,00000,"automatic vacuum of table ""pssdlxx.public.test"": index scans: 1
pages: 0 removed, 4619874 remain, 0 skipped due to pins, 4012541 skipped frozen
tuples: 117185 removed, 20900249 remain, 12892 are dead but not yet removable, oldest xmin: 2310656682
buffer usage: 129850164 hits, 8736003 misses, 1154743 dirtied
avg read rate: 30.412 MB/s, avg write rate: 4.020 MB/s
system usage: CPU: user: 284.80 s, system: 62.12 s, elapsed: 2244.17 s",,,,,,,,,""

2022-02-16 04:09:47.166 CST,,,191771,,620bb3ac.2ed1b,1,,2022-02-15 22:07:40 CST,8/1020774448,0,LOG,00000,"automatic vacuum of table ""pssdlxx.public.test"": index scans: 1
pages: 0 removed, 4619874 remain, 4 skipped due to pins, 2670142 skipped frozen
tuples: 50718 removed, 14888673 remain, 21735 are dead but not yet removable, oldest xmin: 2313233944
buffer usage: -1590880013 hits, 8844791 misses, 3804078 dirtied
avg read rate: 3.180 MB/s, avg write rate: 1.368 MB/s
system usage: CPU: user: 5468.69 s, system: 69.02 s, elapsed: 21726.74 s",,,,,,,,,""

2022-02-19 04:41:58.812 CST,,,19013,,620cf6a8.4a45,1,,2022-02-16 21:05:44 CST,4/961250052,0,LOG,00000,"automatic vacuum of table ""pssdlxx.public.test"": index scans: 1
pages: 0 removed, 12693865 remain, 0 skipped due to pins, 4415380 skipped frozen
tuples: 324542 removed, 10410321 remain, 11330 are dead but not yet removable, oldest xmin: 2325094920
buffer usage: -1621398590 hits, 40271476 misses, 22928440 dirtied
avg read rate: 1.572 MB/s, avg write rate: 0.895 MB/s
system usage: CPU: user: 99513.58 s, system: 528.61 s, elapsed: 200174.58 s",,,,,,,,,""

After the parameter change on the 22nd, vacuum was indeed triggered much more often:

2022-02-23 06:41:09.497 CST,,,40588,,6215556a.9e8c,1,,2022-02-23 05:28:10 CST,24/552963,0,LOG,00000,"automatic vacuum of table ""pssdlxx.public.test"": index scans: 9
pages: 0 removed, 101799826 remain, 6 skipped due to pins, 586769 skipped frozen
tuples: 5939560 removed, 11121307 remain, 7065 are dead but not yet removable, oldest xmin: 2372685394
buffer usage: 805741554 hits, 209730527 misses, 103007829 dirtied
avg read rate: 374.242 MB/s, avg write rate: 183.807 MB/s
system usage: CPU: user: 1898.01 s, system: 1264.11 s, elapsed: 4378.23 s",,,,,,,,,""

2022-02-23 06:43:03.480 CST,,,197361,,6215663c.302f1,1,,2022-02-23 06:42:11 CST,12/721238,0,LOG,00000,"automatic vacuum of table ""pssdlxx.public.test"": index scans: 9
pages: 0 removed, 101914142 remain, 1 skipped due to pins, 100953070 skipped frozen
tuples: 2186 removed, 24194695 remain, 104 are dead but not yet removable, oldest xmin: 2373087920
buffer usage: 6656945 hits, 1065669 misses, 223141 dirtied
avg read rate: 160.756 MB/s, avg write rate: 33.661 MB/s
system usage: CPU: user: 40.97 s, system: 6.68 s, elapsed: 51.78 s",,,,,,,,,""

2022-02-23 06:48:45.779 CST,,,209698,,6215682b.33322,1,,2022-02-23 06:48:11 CST,13/528492,0,LOG,00000,"automatic vacuum of table ""pssdlxx.public.test"": index scans: 9
pages: 0 removed, 101814142 remain, 3 skipped due to pins, 101058554 skipped frozen
tuples: 2120 removed, 24984714 remain, 1146 are dead but not yet removable, oldest xmin: 2373114115
buffer usage: 3764806 hits, 1024178 misses, 39803 dirtied
avg read rate: 235.054 MB/s, avg write rate: 9.135 MB/s
system usage: CPU: user: 27.97 s, system: 5.91 s, elapsed: 34.04 s",,,,,,,,,""

2022-02-25 00:04:31.614 CST,,,36689,,6218abd9.8f51,1,,2022-02-25 00:01:29 CST,35/1791376,0,LOG,00000,"automatic vacuum of table ""pssdlxx.public.test"": index scans: 9
pages: 0 removed, 101814142 remain, 3 skipped due to pins, 101058554 skipped frozen
tuples: 2120 removed, 24984714 remain, 1146 are dead but not yet removable, oldest xmin: 2373114115
buffer usage: 3764806 hits, 1024178 misses, 39803 dirtied
avg read rate: 235.054 MB/s, avg write rate: 9.135 MB/s
system usage: CPU: user: 27.97 s, system: 5.91 s, elapsed: 34.04 s",,,,,,,,,""

2022-02-28 01:34:17.711 CST,,,82404,,621bb604.141e4,1,,2022-02-28 01:33:56 CST,55/1152732,0,LOG,00000,"automatic vacuum of table ""pssdlxx.public.test"": index scans: 9
pages: 0 removed, 1375524 remain, 1 skipped due to pins, 14130 skipped frozen
tuples: 18517 removed, 21699991 remain, 0 are dead but not yet removable, oldest xmin: 2407000851
buffer usage: 416407 hits, 950871 misses, 52816 dirtied
avg read rate: 346.001 MB/s, avg write rate: 19.219 MB/s
system usage: CPU: user: 15.36 s, system: 2.87 s, elapsed: 21.47 s",,,,,,,,,""

You may have spotted it already: buffer usage is showing a negative number!

buffer usage: -1590880013 hits, 8844791 misses, 3804078 dirtied

I was stunned. Had I run into a bug? I went straight to the code:

            appendStringInfo(&buf, _("avg read rate: %.3f MB/s, avg write rate: %.3f MB/s\n"),
                             read_rate, write_rate);
            appendStringInfo(&buf, _("buffer usage: %lld hits, %lld misses, %lld dirtied\n"),
                             (long long) AnalyzePageHit,
                             (long long) AnalyzePageMiss,
                             (long long) AnalyzePageDirty);
            appendStringInfo(&buf, _("system usage: %s"), pg_rusage_show(&ru0));

The format string uses %lld, i.e. long long int, 64 bits. 2^64 is enormous; could the counter really have overflowed? Once a counter passes its maximum it wraps around and starts again from the beginning, like a clock or an odometer, but 2^64 is about 1.844674407371e+19. I asked Jianping what could cause this, and after some digging we realized the code I had been reading was from version 14; this spot had been patched at some point, while older versions still carried the pre-patch code. See the change below:

diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 8ce501151e913a27ab4be89c4ea1a2d68a714e5b..a23cdefbd0d3fc515174144cc5c3430db5fd7921 100644 (file)
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -614,10 +614,10 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
                             vacrelstats->new_dead_tuples,
                             OldestXmin);
            appendStringInfo(&buf,
-                            _("buffer usage: %d hits, %d misses, %d dirtied\n"),
-                            VacuumPageHit,
-                            VacuumPageMiss,
-                            VacuumPageDirty);
+                            _("buffer usage: %lld hits, %lld misses, %lld dirtied\n"),
+                            (long long) VacuumPageHit,
+                            (long long) VacuumPageMiss,
+                            (long long) VacuumPageDirty);
            appendStringInfo(&buf, _("avg read rate: %.3f MB/s, avg write rate: %.3f MB/s\n"),
                             read_rate, write_rate);
            appendStringInfo(&buf, _("system usage: %s"), pg_rusage_show(&ru0));

That seems to explain it. The negative number simply means the vacuum had been running for a very long time on a version (11) where this counter is a plain int: it kept accumulating and eventually overflowed into a negative value:

    /* At this point we do NOT hold any locks. */

    /* if it was already in the buffer pool, we're done */
    if (found)
    {
        if (!isExtend)
        {
            /* Just need to update stats before we exit */
            *hit = true;
            VacuumPageHit++;

            if (VacuumCostActive)
                VacuumCostBalance += VacuumCostPageHit;

            TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
                                              smgr->smgr_rnode.node.spcNode,
                                              smgr->smgr_rnode.node.dbNode,
                                              smgr->smgr_rnode.node.relNode,
                                              smgr->smgr_rnode.backend,
                                              isExtend,
                                              found);
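As a quick sanity check, assuming the signed 32-bit counter wrapped around exactly once, the real value behind the negative number can be reconstructed by adding 2^32 back:

select (-1590880013)::bigint + 4294967296 as approx_actual_hits;
--  approx_actual_hits
-- --------------------
--          2704087283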

The remaining question: why, while the table was bloating so rapidly, was the reclaim rate so slow?

avg read rate: 3.180 MB/s, avg write rate: 1.368 MB/s
system usage: CPU: user: 5468.69 s, system: 69.02 s, elapsed: 21726.74 s",,,,,,,,,""

avg read rate: 1.572 MB/s, avg write rate: 0.895 MB/s
system usage: CPU: user: 99513.58 s, system: 528.61 s, elapsed: 200174.58 s",,,,,,,,,""

Could it be an I/O problem? According to the host team, during the problem window:

Both machines showed no obvious problems on HW-MEM\SAN-SWITCH\SAN-HBA\SAN-LUN.

So we were at an impasse: the host team saw no clear anomaly in I/O, memory, CPU or other key resources; the only thing reported was that memory scrub error.

pg_stat_progress_vacuum

Having hit a wall at the OS level, let's return to the database level. PostgreSQL ships plenty of performance views. When watching a vacuum we usually also look at pg_stat_progress_vacuum, which reports vacuum progress; similar views include pg_stat_progress_analyze, pg_stat_progress_cluster, and so on.

postgres=# \d pg_stat_progress_vacuum 
           View "pg_catalog.pg_stat_progress_vacuum"
       Column       |  Type   | Collation | Nullable | Default 
--------------------+---------+-----------+----------+---------
 pid                | integer |           |          | 
 datid              | oid     |           |          | 
 datname            | name    |           |          | 
 relid              | oid     |           |          | 
 phase              | text    |           |          | 
 heap_blks_total    | bigint  |           |          | 
 heap_blks_scanned  | bigint  |           |          | 
 heap_blks_vacuumed | bigint  |           |          | 
 index_vacuum_count | bigint  |           |          | 
 max_dead_tuples    | bigint  |           |          | 
 num_dead_tuples    | bigint  |           |          | 

The meaning of each phase is described in the official documentation.

Reading that alone can be hard to digest, so let's run a small experiment:

postgres=# create table test_big2(id int4,name character varying(64),create_time timestamp(0) without time zone default clock_timestamp());
CREATE TABLE
postgres=# create index idx_test_big2_id on test_big2 using btree (id);
CREATE INDEX
postgres=# alter table test_big2 set (autovacuum_enabled = off);
ALTER TABLE
postgres=# insert into test_big2(id,name) select n, n||'_vacuum' from generate_series(1,50000000) n;
INSERT 0 50000000
postgres=# delete from test_big2 where id > 30000000;
DELETE 20000000
postgres=# vacuum verbose test_big2 ;
--- cleanup starts

In another window, watch the progress:

postgres=# select * from pg_stat_progress_vacuum ;\watch 3

Tue 01 Mar 2022 07:10:51 PM CST (every 3s)

-[ RECORD 1 ]------+--------------
pid                | 21270
datid              | 13890
datname            | postgres
relid              | 68397
phase              | scanning heap
heap_blks_total    | 367638
heap_blks_scanned  | 10988
heap_blks_vacuumed | 0
index_vacuum_count | 0
max_dead_tuples    | 11184809
num_dead_tuples    | 0

Tue 01 Mar 2022 07:10:54 PM CST (every 3s)

-[ RECORD 1 ]------+--------------
pid                | 21270
datid              | 13890
datname            | postgres
relid              | 68397
phase              | scanning heap
heap_blks_total    | 367638
heap_blks_scanned  | 41428
heap_blks_vacuumed | 0
index_vacuum_count | 0
max_dead_tuples    | 11184809
num_dead_tuples    | 0

Tue 01 Mar 2022 07:10:57 PM CST (every 3s)

-[ RECORD 1 ]------+--------------
pid                | 21270
datid              | 13890
datname            | postgres
relid              | 68397
phase              | scanning heap
heap_blks_total    | 367638
heap_blks_scanned  | 59348
heap_blks_vacuumed | 0
index_vacuum_count | 0
max_dead_tuples    | 11184809
num_dead_tuples    | 0

Tue 01 Mar 2022 07:11:00 PM CST (every 3s)

-[ RECORD 1 ]------+--------------
pid                | 21270
datid              | 13890
datname            | postgres
relid              | 68397
phase              | scanning heap
heap_blks_total    | 367638
heap_blks_scanned  | 72148
heap_blks_vacuumed | 0
index_vacuum_count | 0
max_dead_tuples    | 11184809
num_dead_tuples    | 0

...

Tue 01 Mar 2022 07:11:33 PM CST (every 3s)

-[ RECORD 1 ]------+--------------
pid                | 21270
datid              | 13890
datname            | postgres
relid              | 68397
phase              | scanning heap
heap_blks_total    | 367638
heap_blks_scanned  | 298964
heap_blks_vacuumed | 0
index_vacuum_count | 0
max_dead_tuples    | 11184809
num_dead_tuples    | 10660441

Tue 01 Mar 2022 07:11:36 PM CST (every 3s)

-[ RECORD 1 ]------+------------------
pid                | 21270
datid              | 13890
datname            | postgres
relid              | 68397
phase              | vacuuming indexes
heap_blks_total    | 367638
heap_blks_scanned  | 302818
heap_blks_vacuumed | 0
index_vacuum_count | 0
max_dead_tuples    | 11184809
num_dead_tuples    | 11184585

...

Tue 01 Mar 2022 07:11:51 PM CST (every 3s)

-[ RECORD 1 ]------+---------------
pid                | 21270
datid              | 13890
datname            | postgres
relid              | 68397
phase              | vacuuming heap
heap_blks_total    | 367638
heap_blks_scanned  | 302818
heap_blks_vacuumed | 292218
index_vacuum_count | 1
max_dead_tuples    | 11184809
num_dead_tuples    | 11184585

...

Tue 01 Mar 2022 07:11:54 PM CST (every 3s)

-[ RECORD 1 ]------+--------------
pid                | 21270
datid              | 13890
datname            | postgres
relid              | 68397
phase              | scanning heap
heap_blks_total    | 367638
heap_blks_scanned  | 308121
heap_blks_vacuumed | 302817
index_vacuum_count | 1
max_dead_tuples    | 11184809
num_dead_tuples    | 721208


...

Tue 01 Mar 2022 07:12:06 PM CST (every 3s)

-[ RECORD 1 ]------+------------------
pid                | 21270
datid              | 13890
datname            | postgres
relid              | 68397
phase              | vacuuming indexes
heap_blks_total    | 367638
heap_blks_scanned  | 367638
heap_blks_vacuumed | 302817
index_vacuum_count | 1
max_dead_tuples    | 11184809
num_dead_tuples    | 8815415


Tue 01 Mar 2022 07:12:18 PM CST (every 3s)

-[ RECORD 1 ]------+---------------
pid                | 21270
datid              | 13890
datname            | postgres
relid              | 68397
phase              | vacuuming heap
heap_blks_total    | 367638
heap_blks_scanned  | 367638
heap_blks_vacuumed | 364225
index_vacuum_count | 2
max_dead_tuples    | 11184809
num_dead_tuples    | 8815415

...

Tue 01 Mar 2022 07:12:21 PM CST (every 3s)

-[ RECORD 1 ]------+----------------
pid                | 21270
datid              | 13890
datname            | postgres
relid              | 68397
phase              | truncating heap
heap_blks_total    | 367638
heap_blks_scanned  | 367638
heap_blks_vacuumed | 367638
index_vacuum_count | 2
max_dead_tuples    | 11184809
num_dead_tuples    | 8815415

...

Tue 01 Mar 2022 07:12:24 PM CST (every 3s)

-[ RECORD 1 ]------+----------------
pid                | 21270
datid              | 13890
datname            | postgres
relid              | 68397
phase              | truncating heap
heap_blks_total    | 367638
heap_blks_scanned  | 367638
heap_blks_vacuumed | 367638
index_vacuum_count | 2
max_dead_tuples    | 11184809
num_dead_tuples    | 8815415

I have omitted some redundant snapshots in between. From this experiment we can draw the following conclusions:

•A normal vacuum goes through these phases: initializing -> scanning heap -> vacuuming indexes -> vacuuming heap -> cleaning up indexes -> truncating heap -> performing final cleanup.
•heap_blks_total is the estimated number of heap blocks that need to be scanned and cleaned; heap_blks_scanned keeps climbing during the scanning heap phase.
•The order observed in this experiment was scanning heap -> vacuuming indexes -> vacuuming heap; because heap_blks_scanned was still less than heap_blks_total ("If heap_blks_scanned is less than heap_blks_total, the system will return to scanning the heap after this phase is completed; otherwise, it will begin cleaning up indexes after this phase is completed."), it then went back to scanning heap -> vacuuming indexes -> vacuuming heap -> truncating heap.
•max_dead_tuples is the number of dead tuples that can be stored before an index-cleanup cycle has to run. It depends on maintenance_work_mem, which controls how much memory the vacuum process allocates for the TIDs of dead tuples. Vacuum scans the table and collects the TIDs of all dead tuples; it then scans the indexes and deletes every index entry pointing at those TIDs. If it runs out of TID memory before the whole table has been scanned, it pauses the table scan, scans the indexes to drain the accumulated TID list, and then resumes the table scan where it left off. If that happens again, it stops again to scan all indexes once more. For a big table, scanning the indexes multiple times is very expensive, especially when the table has many indexes; with maintenance_work_mem set too low, several index-scan cycles may be needed (see the sketch below). num_dead_tuples is the number of dead tuples collected since the last index-cleanup cycle. Here index_vacuum_count ends at 2, meaning two index-scan cycles, which matches what we observed: vacuuming indexes -> vacuuming heap -> scanning heap -> vacuuming indexes -> vacuuming heap -> truncating heap.
•heap_blks_vacuumed is the number of heap blocks already vacuumed.
•The final truncating heap phase truncates empty pages off the end of the table file.
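A note on the maintenance_work_mem point above: the max_dead_tuples of 11184809 seen in the experiment is roughly 64MB divided by the 6 bytes a dead-tuple TID occupies, i.e. the default maintenance_work_mem of 64MB. For a manual vacuum, raising it means fewer index-scan cycles; a sketch:

show maintenance_work_mem;          -- 64MB by default
set maintenance_work_mem = '1GB';   -- session level; autovacuum workers use autovacuum_work_mem, which falls back to this value
vacuum (verbose) test_big2;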

The output of the completed vacuum verbose below tells the same story. The flow was:

1.vacuuming indexes (scanned index "idx_test_big2_id" to remove 11184585 row versions; note the wording "to remove")
2.scanning heap (not printed by verbose)
3.vacuuming heap (INFO: table "test_big2": removed 11184585 dead item identifiers in 82240 pages; note "removed"); because heap_blks_scanned was still less than heap_blks_total, it went back to scanning the heap
4.scanning heap (not printed by verbose)
5.vacuuming indexes (scanned index "idx_test_big2_id" to remove 8815415 row versions)
6.vacuuming heap (table "test_big2": removed 8815415 dead item identifiers in 64820 pages)
7.index cleanup (INFO: index "idx_test_big2_id" now contains 30000000 row versions in 137099 pages, DETAIL: 20000000 index row versions were removed; note "removed")
8.truncating heap

postgres=# vacuum verbose test_big2 ;
INFO:  vacuuming "public.test_big2"
INFO:  scanned index "idx_test_big2_id" to remove 11184585 row versions
DETAIL:  CPU: user: 1.74 s, system: 0.73 s, elapsed: 16.19 s

INFO:  table "test_big2": removed 11184585 dead item identifiers in 82240 pages
DETAIL:  CPU: user: 0.26 s, system: 0.23 s, elapsed: 2.11 s

INFO:  scanned index "idx_test_big2_id" to remove 8815415 row versions
DETAIL:  CPU: user: 1.40 s, system: 0.58 s, elapsed: 11.70 s

INFO:  table "test_big2": removed 8815415 dead item identifiers in 64820 pages
DETAIL:  CPU: user: 0.24 s, system: 0.19 s, elapsed: 2.08 s

INFO:  index "idx_test_big2_id" now contains 30000000 row versions in 137099 pages  
DETAIL:  20000000 index row versions were removed.
54835 index pages were newly deleted.
54835 index pages are currently deleted, of which 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.22 s.

INFO:  table "test_big2": found 20000000 removable, 30000000 nonremovable row versions in 367638 out of 367638 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 1657169
Skipped 0 pages due to buffer pins, 0 frozen pages.
CPU: user: 7.02 s, system: 4.13 s, elapsed: 89.28 s.

INFO:  table "test_big2": suspending truncate due to conflicting lock request
INFO:  table "test_big2": truncated 367638 to 241984 pages
DETAIL:  CPU: user: 0.06 s, system: 0.35 s, elapsed: 1.66 s

INFO:  table "test_big2": suspending truncate due to conflicting lock request
INFO:  table "test_big2": truncated 241984 to 224448 pages
DETAIL:  CPU: user: 0.02 s, system: 0.07 s, elapsed: 3.05 s

INFO:  table "test_big2": truncated 224448 to 220579 pages
DETAIL:  CPU: user: 0.00 s, system: 0.02 s, elapsed: 1.78 s
VACUUM

Of course, for better observability you can also watch it with SQL along the lines of the sketch below. For completeness I recorded every snapshot; feel free to filter them yourself.
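The exact query behind that output is not shown here, but a rough reconstruction that yields the same columns looks like this (a sketch joining pg_stat_progress_vacuum with pg_stat_activity; the column derivations are my own assumptions):

select p.pid,
       clock_timestamp() - a.xact_start                                             as duration,
       coalesce(a.wait_event_type || '.' || a.wait_event, 'f')                      as waiting,
       case when a.backend_type = 'autovacuum worker'
            then 'autovacuum' else 'user' end                                       as mode,
       p.datname                                                                    as database,
       p.relid::regclass                                                            as "table",
       p.phase,
       pg_size_pretty(p.heap_blks_total * current_setting('block_size')::bigint)    as table_size,
       pg_size_pretty(pg_total_relation_size(p.relid))                              as total_size,
       pg_size_pretty(p.heap_blks_scanned * current_setting('block_size')::bigint)  as scanned,
       pg_size_pretty(p.heap_blks_vacuumed * current_setting('block_size')::bigint) as vacuumed,
       round(100.0 * p.heap_blks_scanned  / nullif(p.heap_blks_total, 0), 1)        as scanned_pct,
       round(100.0 * p.heap_blks_vacuumed / nullif(p.heap_blks_total, 0), 1)        as vacuumed_pct,
       p.index_vacuum_count,
       round(100.0 * p.num_dead_tuples / nullif(p.max_dead_tuples, 0), 1)           as dead_pct
from pg_stat_progress_vacuum p
join pg_stat_activity a using (pid);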

Tue 01 Mar 2022 07:10:50 PM CST (every 3s)

-[ RECORD 1 ]------+----------------
pid                | 21270
duration           | 00:00:00.327657
waiting            | IO.DataFileRead
mode               | user
database           | postgres
table              | test_big2
phase              | scanning heap
table_size         | 2872 MB
total_size         | 3944 MB
scanned            | 44 MB
vacuumed           | 0 bytes
scanned_pct        | 1.5
vacuumed_pct       | 0.0
index_vacuum_count | 0
dead_pct           | 0.0

Tue 01 Mar 2022 07:10:53 PM CST (every 3s)

-[ RECORD 1 ]------+----------------
pid                | 21270
duration           | 00:00:03.4984
waiting            | IO.DataFileRead
mode               | user
database           | postgres
table              | test_big2
phase              | scanning heap
table_size         | 2872 MB
total_size         | 3944 MB
scanned            | 312 MB
vacuumed           | 0 bytes
scanned_pct        | 10.9
vacuumed_pct       | 0.0
index_vacuum_count | 0
dead_pct           | 0.0

Tue 01 Mar 2022 07:10:56 PM CST (every 3s)

-[ RECORD 1 ]------+----------------
pid                | 21270
duration           | 00:00:06.501812
waiting            | IO.DataFileRead
mode               | user
database           | postgres
table              | test_big2
phase              | scanning heap
table_size         | 2872 MB
total_size         | 3944 MB
scanned            | 456 MB
vacuumed           | 0 bytes
scanned_pct        | 15.9
vacuumed_pct       | 0.0
index_vacuum_count | 0
dead_pct           | 0.0

Tue 01 Mar 2022 07:10:59 PM CST (every 3s)

-[ RECORD 1 ]------+----------------
pid                | 21270
duration           | 00:00:09.50529
waiting            | IO.DataFileRead
mode               | user
database           | postgres
table              | test_big2
phase              | scanning heap
table_size         | 2872 MB
total_size         | 3944 MB
scanned            | 552 MB
vacuumed           | 0 bytes
scanned_pct        | 19.2
vacuumed_pct       | 0.0
index_vacuum_count | 0
dead_pct           | 0.0

Tue 01 Mar 2022 07:11:02 PM CST (every 3s)

-[ RECORD 1 ]------+----------------
pid                | 21270
duration           | 00:00:12.509953
waiting            | IO.DataFileRead
mode               | user
database           | postgres
table              | test_big2
phase              | scanning heap
table_size         | 2872 MB
total_size         | 3944 MB
scanned            | 752 MB
vacuumed           | 0 bytes
scanned_pct        | 26.2
vacuumed_pct       | 0.0
index_vacuum_count | 0
dead_pct           | 0.0

Tue 01 Mar 2022 07:11:05 PM CST (every 3s)

-[ RECORD 1 ]------+----------------
pid                | 21270
duration           | 00:00:15.514417
waiting            | IO.DataFileRead
mode               | user
database           | postgres
table              | test_big2
phase              | scanning heap
table_size         | 2872 MB
total_size         | 3944 MB
scanned            | 904 MB
vacuumed           | 0 bytes
scanned_pct        | 31.5
vacuumed_pct       | 0.0
index_vacuum_count | 0
dead_pct           | 0.0

Tue 01 Mar 2022 07:11:08 PM CST (every 3s)

-[ RECORD 1 ]------+----------------
pid                | 21270
duration           | 00:00:18.518801
waiting            | IO.DataFileRead
mode               | user
database           | postgres
table              | test_big2
phase              | scanning heap
table_size         | 2872 MB
total_size         | 3944 MB
scanned            | 1008 MB
vacuumed           | 0 bytes
scanned_pct        | 35.1
vacuumed_pct       | 0.0
index_vacuum_count | 0
dead_pct           | 0.0

Tue 01 Mar 2022 07:11:11 PM CST (every 3s)

-[ RECORD 1 ]------+----------------
pid                | 21270
duration           | 00:00:21.5222
waiting            | IO.DataFileRead
mode               | user
database           | postgres
table              | test_big2
phase              | scanning heap
table_size         | 2872 MB
total_size         | 3944 MB
scanned            | 1224 MB
vacuumed           | 0 bytes
scanned_pct        | 42.6
vacuumed_pct       | 0.0
index_vacuum_count | 0
dead_pct           | 0.0

Tue 01 Mar 2022 07:11:14 PM CST (every 3s)

-[ RECORD 1 ]------+----------------
pid                | 21270
duration           | 00:00:24.525298
waiting            | IO.DataFileRead
mode               | user
database           | postgres
table              | test_big2
phase              | scanning heap
table_size         | 2872 MB
total_size         | 3944 MB
scanned            | 1408 MB
vacuumed           | 0 bytes
scanned_pct        | 49.0
vacuumed_pct       | 0.0
index_vacuum_count | 0
dead_pct           | 0.0

Tue 01 Mar 2022 07:11:17 PM CST (every 3s)

-[ RECORD 1 ]------+-----------------
pid                | 21270
duration           | 00:00:27.528915
waiting            | IO.DataFileWrite
mode               | user
database           | postgres
table              | test_big2
phase              | scanning heap
table_size         | 2872 MB
total_size         | 3944 MB
scanned            | 1492 MB
vacuumed           | 0 bytes
scanned_pct        | 51.9
vacuumed_pct       | 0.0
index_vacuum_count | 0
dead_pct           | 0.0

Tue 01 Mar 2022 07:11:20 PM CST (every 3s)

-[ RECORD 1 ]------+----------------
pid                | 21270
duration           | 00:00:30.531916
waiting            | IO.DataFileRead
mode               | user
database           | postgres
table              | test_big2
phase              | scanning heap
table_size         | 2872 MB
total_size         | 3944 MB
scanned            | 1679 MB
vacuumed           | 0 bytes
scanned_pct        | 58.4
vacuumed_pct       | 0.0
index_vacuum_count | 0
dead_pct           | 0.0

Tue 01 Mar 2022 07:11:23 PM CST (every 3s)

-[ RECORD 1 ]------+----------------
pid                | 21270
duration           | 00:00:33.53553
waiting            | IO.DataFileRead
mode               | user
database           | postgres
table              | test_big2
phase              | scanning heap
table_size         | 2872 MB
total_size         | 3944 MB
scanned            | 1836 MB
vacuumed           | 0 bytes
scanned_pct        | 63.9
vacuumed_pct       | 0.0
index_vacuum_count | 0
dead_pct           | 17.5

Tue 01 Mar 2022 07:11:26 PM CST (every 3s)

-[ RECORD 1 ]------+----------------
pid                | 21270
duration           | 00:00:36.537265
waiting            | IO.DataFileRead
mode               | user
database           | postgres
table              | test_big2
phase              | scanning heap
table_size         | 2872 MB
total_size         | 3944 MB
scanned            | 1955 MB
vacuumed           | 0 bytes
scanned_pct        | 68.1
vacuumed_pct       | 0.0
index_vacuum_count | 0
dead_pct           | 36.0

Tue 01 Mar 2022 07:11:29 PM CST (every 3s)

-[ RECORD 1 ]------+----------------
pid                | 21270
duration           | 00:00:39.540419
waiting            | IO.DataFileRead
mode               | user
database           | postgres
table              | test_big2
phase              | scanning heap
table_size         | 2872 MB
total_size         | 3944 MB
scanned            | 2160 MB
vacuumed           | 0 bytes
scanned_pct        | 75.2
vacuumed_pct       | 0.0
index_vacuum_count | 0
dead_pct           | 67.9

Tue 01 Mar 2022 07:11:32 PM CST (every 3s)

-[ RECORD 1 ]------+----------------
pid                | 21270
duration           | 00:00:42.543371
waiting            | IO.DataFileRead
mode               | user
database           | postgres
table              | test_big2
phase              | scanning heap
table_size         | 2872 MB
total_size         | 3944 MB
scanned            | 2328 MB
vacuumed           | 0 bytes
scanned_pct        | 81.0
vacuumed_pct       | 0.0
index_vacuum_count | 0
dead_pct           | 94.1

Tue 01 Mar 2022 07:11:35 PM CST (every 3s)

-[ RECORD 1 ]------+------------------
pid                | 21270
duration           | 00:00:45.547048
waiting            | IO.DataFileRead
mode               | user
database           | postgres
table              | test_big2
phase              | vacuuming indexes
table_size         | 2872 MB
total_size         | 3944 MB
scanned            | 2366 MB
vacuumed           | 0 bytes
scanned_pct        | 82.4
vacuumed_pct       | 0.0
index_vacuum_count | 0
dead_pct           | 100.0

Tue 01 Mar 2022 07:11:38 PM CST (every 3s)

-[ RECORD 1 ]------+------------------
pid                | 21270
duration           | 00:00:48.550786
waiting            | IO.DataFileRead
mode               | user
database           | postgres
table              | test_big2
phase              | vacuuming indexes
table_size         | 2872 MB
total_size         | 3944 MB
scanned            | 2366 MB
vacuumed           | 0 bytes
scanned_pct        | 82.4
vacuumed_pct       | 0.0
index_vacuum_count | 0
dead_pct           | 100.0

Tue 01 Mar 2022 07:11:41 PM CST (every 3s)

-[ RECORD 1 ]------+------------------
pid                | 21270
duration           | 00:00:51.553929
waiting            | IO.DataFileRead
mode               | user
database           | postgres
table              | test_big2
phase              | vacuuming indexes
table_size         | 2872 MB
total_size         | 3944 MB
scanned            | 2366 MB
vacuumed           | 0 bytes
scanned_pct        | 82.4
vacuumed_pct       | 0.0
index_vacuum_count | 0
dead_pct           | 100.0

Tue 01 Mar 2022 07:11:44 PM CST (every 3s)

-[ RECORD 1 ]------+------------------
pid                | 21270
duration           | 00:00:54.55678
waiting            | IO.DataFileRead
mode               | user
database           | postgres
table              | test_big2
phase              | vacuuming indexes
table_size         | 2872 MB
total_size         | 3944 MB
scanned            | 2366 MB
vacuumed           | 0 bytes
scanned_pct        | 82.4
vacuumed_pct       | 0.0
index_vacuum_count | 0
dead_pct           | 100.0

Tue 01 Mar 2022 07:11:47 PM CST (every 3s)

-[ RECORD 1 ]------+------------------
pid                | 21270
duration           | 00:00:57.559698
waiting            | IO.DataFileRead
mode               | user
database           | postgres
table              | test_big2
phase              | vacuuming indexes
table_size         | 2872 MB
total_size         | 3944 MB
scanned            | 2366 MB
vacuumed           | 0 bytes
scanned_pct        | 82.4
vacuumed_pct       | 0.0
index_vacuum_count | 0
dead_pct           | 100.0

Tue 01 Mar 2022 07:11:50 PM CST (every 3s)

-[ RECORD 1 ]------+----------------
pid                | 21270
duration           | 00:01:00.562735
waiting            | IO.WALWrite
mode               | user
database           | postgres
table              | test_big2
phase              | vacuuming heap
table_size         | 2872 MB
total_size         | 3944 MB
scanned            | 2366 MB
vacuumed           | 2256 MB
scanned_pct        | 82.4
vacuumed_pct       | 78.5
index_vacuum_count | 1
dead_pct           | 100.0

Tue 01 Mar 2022 07:11:53 PM CST (every 3s)

-[ RECORD 1 ]------+-----------------
pid                | 21270
duration           | 00:01:03.565758
waiting            | IO.DataFileWrite
mode               | user
database           | postgres
table              | test_big2
phase              | scanning heap
table_size         | 2872 MB
total_size         | 3944 MB
scanned            | 2405 MB
vacuumed           | 2366 MB
scanned_pct        | 83.7
vacuumed_pct       | 82.4
index_vacuum_count | 1
dead_pct           | 6.2

Tue 01 Mar 2022 07:11:56 PM CST (every 3s)

-[ RECORD 1 ]------+----------------
pid                | 21270
duration           | 00:01:06.570415
waiting            | IO.DataFileRead
mode               | user
database           | postgres
table              | test_big2
phase              | scanning heap
table_size         | 2872 MB
total_size         | 3944 MB
scanned            | 2497 MB
vacuumed           | 2366 MB
scanned_pct        | 87.0
vacuumed_pct       | 82.4
index_vacuum_count | 1
dead_pct           | 20.5

Tue 01 Mar 2022 07:11:59 PM CST (every 3s)

-[ RECORD 1 ]------+----------------
pid                | 21270
duration           | 00:01:09.57484
waiting            | IO.DataFileRead
mode               | user
database           | postgres
table              | test_big2
phase              | scanning heap
table_size         | 2872 MB
total_size         | 3944 MB
scanned            | 2585 MB
vacuumed           | 2366 MB
scanned_pct        | 90.0
vacuumed_pct       | 82.4
index_vacuum_count | 1
dead_pct           | 34.2

Tue 01 Mar 2022 07:12:02 PM CST (every 3s)

-[ RECORD 1 ]------+----------------
pid                | 21270
duration           | 00:01:12.577924
waiting            | IO.DataFileRead
mode               | user
database           | postgres
table              | test_big2
phase              | scanning heap
table_size         | 2872 MB
total_size         | 3944 MB
scanned            | 2689 MB
vacuumed           | 2366 MB
scanned_pct        | 93.6
vacuumed_pct       | 82.4
index_vacuum_count | 1
dead_pct           | 50.4

Tue 01 Mar 2022 07:12:05 PM CST (every 3s)

-[ RECORD 1 ]------+------------------
pid                | 21270
duration           | 00:01:15.580672
waiting            | IO.DataFileRead
mode               | user
database           | postgres
table              | test_big2
phase              | vacuuming indexes
table_size         | 2872 MB
total_size         | 3944 MB
scanned            | 2872 MB
vacuumed           | 2366 MB
scanned_pct        | 100.0
vacuumed_pct       | 82.4
index_vacuum_count | 1
dead_pct           | 78.8

Tue 01 Mar 2022 07:12:08 PM CST (every 3s)

-[ RECORD 1 ]------+------------------
pid                | 21270
duration           | 00:01:18.584531
waiting            | IO.DataFileRead
mode               | user
database           | postgres
table              | test_big2
phase              | vacuuming indexes
table_size         | 2872 MB
total_size         | 3944 MB
scanned            | 2872 MB
vacuumed           | 2366 MB
scanned_pct        | 100.0
vacuumed_pct       | 82.4
index_vacuum_count | 1
dead_pct           | 78.8

Tue 01 Mar 2022 07:12:11 PM CST (every 3s)

-[ RECORD 1 ]------+------------------
pid                | 21270
duration           | 00:01:21.587026
waiting            | IO.DataFileRead
mode               | user
database           | postgres
table              | test_big2
phase              | vacuuming indexes
table_size         | 2872 MB
total_size         | 3944 MB
scanned            | 2872 MB
vacuumed           | 2366 MB
scanned_pct        | 100.0
vacuumed_pct       | 82.4
index_vacuum_count | 1
dead_pct           | 78.8

Tue 01 Mar 2022 07:12:14 PM CST (every 3s)

-[ RECORD 1 ]------+------------------
pid                | 21270
duration           | 00:01:24.590357
waiting            | f
mode               | user
database           | postgres
table              | test_big2
phase              | vacuuming indexes
table_size         | 2872 MB
total_size         | 3944 MB
scanned            | 2872 MB
vacuumed           | 2366 MB
scanned_pct        | 100.0
vacuumed_pct       | 82.4
index_vacuum_count | 1
dead_pct           | 78.8

Tue 01 Mar 2022 07:12:17 PM CST (every 3s)

-[ RECORD 1 ]------+-----------------
pid                | 21270
duration           | 00:01:27.594302
waiting            | IO.DataFileWrite
mode               | user
database           | postgres
table              | test_big2
phase              | vacuuming heap
table_size         | 2872 MB
total_size         | 3944 MB
scanned            | 2872 MB
vacuumed           | 2818 MB
scanned_pct        | 100.0
vacuumed_pct       | 98.1
index_vacuum_count | 2
dead_pct           | 78.8

Tue 01 Mar 2022 07:12:20 PM CST (every 3s)

-[ RECORD 1 ]------+--------------------
pid                | 21270
duration           | 00:01:30.597643
waiting            | IO.DataFilePrefetch
mode               | user
database           | postgres
table              | test_big2
phase              | truncating heap
table_size         | 2872 MB
total_size         | 2962 MB
scanned            | 2872 MB
vacuumed           | 2872 MB
scanned_pct        | 100.0
vacuumed_pct       | 100.0
index_vacuum_count | 2
dead_pct           | 78.8

Tue 01 Mar 2022 07:12:24 PM CST (every 3s)

-[ RECORD 1 ]------+----------------
pid                | 21270
duration           | 00:01:33.961224
waiting            | IO.DataFileRead
mode               | user
database           | postgres
table              | test_big2
phase              | truncating heap
table_size         | 2872 MB
total_size         | 2825 MB
scanned            | 2872 MB
vacuumed           | 2872 MB
scanned_pct        | 100.0
vacuumed_pct       | 100.0
index_vacuum_count | 2
dead_pct           | 78.8

On my little cloud VM, the wait events that showed up include:

•IO.DataFileRead: Waiting for a read from a relation data file.
•IO.DataFilePrefetch: Waiting for an asynchronous prefetch from a relation data file.
•IO.DataFileWrite: Waiting for a write to a relation data file.
•IO.WALWrite: Waiting for a write to a WAL file.

With that understood, let's look at the state after the switchover to the standby on the 22nd:

[postgres:5432@postgres] [02-22.17:45:54]=# select * from pg_stat_progress_vacuum ;
-[ RECORD 1 ]------+--------------
pid                | 170537
datid              | 17620
datname            | postgres
relid              | 68397
phase              | vacuuming indexes
heap_blks_total    | 88944972
heap_blks_scanned  | 12096005
heap_blks_vacuumed | 0
index_vacuum_count | 0
max_dead_tuples    | 178956970
num_dead_tuples    | 178956683

[postgres:5432@postgres] [02-22.18:24:14]=# select * from pg_stat_progress_vacuum ;
-[ RECORD 1 ]------+--------------
pid                | 170537
datid              | 17620
datname            | postgres
relid              | 68397
phase              | vacuuming indexes
heap_blks_total    | 88944972
heap_blks_scanned  | 12096005
heap_blks_vacuumed | 0
index_vacuum_count | 0
max_dead_tuples    | 178956970
num_dead_tuples    | 178956683

Half an hour later, the numbers had not moved at all. From the earlier analysis we can conclude:

•heap_blks_scanned is less than heap_blks_total, so it will still have to go back to the scanning heap phase later
•index_vacuum_count is 0, so not even one index-scan cycle has completed
•heap_blks_vacuumed is also 0, so heap cleanup has not started (or has cleaned exactly 0 blocks)

So vacuum was stuck cleaning indexes, and during the incident it was doing so extremely slowly (3MB/s). Let's see how bloated the production table's indexes were:

postgres=# select * from pgstatindex('xxx');
-[ RECORD 1 ]------+-----------
version            | 3
tree_level         | 3
index_size         | 41122717696
root_block_no      | 92932
internal_pages     | 23888
leaf_pages         | 4581586
empty_pages        | 0
deleted_pages      | 414388
avg_leaf_density   | 67.21
leaf_fragmentation | 86.59

postgres=# select leaf_fragmentation from pgstatindex('xxx');
-[ RECORD 1 ]------+--
leaf_fragmentation | 73.23

This table has 7 indexes; I checked just two of them and the leaf fragmentation was already a staggering 86.59% and 73.23%. The others were presumably similar. Note that checking fragmentation this way requires a sequential scan of the index: the bigger the table and index, the slower it is, so be careful with it in production to avoid impacting the business.
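Before scanning every index with pgstatindex, a cheap first pass is simply to list the indexes and their on-disk sizes. A sketch (test stands in for the real table name):

select indexrelid::regclass as index_name,
       pg_size_pretty(pg_relation_size(indexrelid)) as index_size
from pg_index
where indrelid = 'test'::regclass
order by pg_relation_size(indexrelid) desc;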

You can also estimate table bloat directly with SQL:

SELECT
    CURRENT_CATALOG AS datname,
    nspname,
    relname,
    bs * tblpages AS size,
    CASE WHEN tblpages - est_tblpages_ff > 0 THEN
    (tblpages - est_tblpages_ff) / tblpages::float
ELSE
    0
    END AS ratio
FROM (
    SELECT
        ceil(reltuples / ((bs - page_hdr) * fillfactor / (tpl_size * 100))) + ceil(toasttuples / 4) AS est_tblpages_ff,
        tblpages,
        fillfactor,
        bs,
        tblid,
        nspname,
        relname,
        is_na
    FROM (
        SELECT
            (4 + tpl_hdr_size + tpl_data_size + (2 * ma) - CASE WHEN tpl_hdr_size % ma = 0 THEN
                    ma
                ELSE
                    tpl_hdr_size % ma
                END - CASE WHEN ceil(tpl_data_size)::int % ma = 0 THEN
                    ma
                ELSE
                    ceil(tpl_data_size)::int % ma
                END) AS tpl_size,
            (heappages + toastpages) AS tblpages,
            heappages,
            toastpages,
            reltuples,
            toasttuples,
            bs,
            page_hdr,
            tblid,
            nspname,
            relname,
            fillfactor,
            is_na
        FROM (
            SELECT
                tbl.oid AS tblid,
                ns.nspname,
                tbl.relname,
                tbl.reltuples,
                tbl.relpages AS heappages,
                coalesce(toast.relpages, 0) AS toastpages,
                coalesce(toast.reltuples, 0) AS toasttuples,
                coalesce(substring(array_to_string(tbl.reloptions, ' ')
                    FROM 'fillfactor=([0-9]+)')::smallint, 100) AS fillfactor,
                current_setting('block_size')::numeric AS bs,
                CASE WHEN version() ~ 'mingw32'
                    OR version() ~ '64-bit|x86_64|ppc64|ia64|amd64' THEN
                    8
                ELSE
                    4
                END AS ma,
                24 AS page_hdr,
                23 + CASE WHEN MAX(coalesce(s.null_frac, 0)) > 0 THEN
                (7 + count(s.attname)) / 8
            ELSE
                0::int
                END + CASE WHEN bool_or(att.attname = 'oid'
                    AND att.attnum < 0) THEN
                    4
                ELSE
                    0
                END AS tpl_hdr_size,
                sum((1 - coalesce(s.null_frac, 0)) * coalesce(s.avg_width, 0)) AS tpl_data_size,
                bool_or(att.atttypid = 'pg_catalog.name'::regtype)
                OR sum(
                    CASE WHEN att.attnum > 0 THEN
                        1
                    ELSE
                        0
                    END) <> count(s.attname) AS is_na
            FROM
                pg_attribute AS att
                JOIN pg_class AS tbl ON att.attrelid = tbl.oid
                JOIN pg_namespace AS ns ON ns.oid = tbl.relnamespace
                LEFT JOIN pg_stats AS s ON s.schemaname = ns.nspname
                    AND s.tablename = tbl.relname
                    AND s.inherited = FALSE
                    AND s.attname = att.attname
            LEFT JOIN pg_class AS toast ON tbl.reltoastrelid = toast.oid
        WHERE
            NOT att.attisdropped
            AND tbl.relkind = 'r'
            AND nspname NOT IN ('pg_catalog', 'information_schema')
        GROUP BY
            1,
            2,
            3,
            4,
            5,
            6,
            7,
            8,
            9,
            10) AS s) AS s2) AS s3
WHERE
    NOT is_na;
 datname  | nspname |     relname      |   size               |         ratio         
----------+---------+------------------+--------------------+-----------------------
 xxx      | xxx        | xxx                 |     406752624640     |     0.977963007643938

The bloat ratio is a staggering 97%.

There was no time to lose: I immediately asked operations to rebuild the indexes with pg_repack. A few things to know about pg_repack:

1.pg_repack needs to take the heavyweight AccessExclusive lock several times, although only briefly.
2.Under high concurrency it can be almost impossible to obtain the AccessExclusive lock.
3.If pg_repack cannot get the AccessExclusive lock within wait-time, by default it tries to kill the conflicting statements.
4.If the total wait exceeds twice wait-time, it may kill the sessions outright, which can lead to bad outcomes and unexpected interruptions.
5.pg_repack's defaults may not be appropriate for critical systems; use the --no-kill-backend option to make it gentler.
6.DDL against a table that pg_repack is working on is not allowed; any session that tries may be killed.

In production, remember to add --no-kill-backend.

In the end, after a 6-hour rebuild, the total index size dropped from 600GB to 20GB, and, even better, the table stopped bloating and has been stable since. Operations later rebuilt the table as well.

Summary

This case started from table bloat, something very common in PostgreSQL, and pulled in a whole chain of related topics along the way. I personally spent three or four days on the analysis. Although I never fully figured out why the reclaim rate during the incident was only 3MB/s, the cause of the bloat itself is clear: reclamation could not keep up with the rate at which garbage was being generated. My understanding of how vacuum works is also much deeper now; next time you read the vacuum phase descriptions in the official docs, they should no longer feel like a fog.

Rebuilding the indexes is what finally fixed it. Reasoning backwards from that result, the most likely story is that the indexes had bloated badly (a staggering ~87%), so vacuum spent far too much time on them (made even worse by the pitiful 3MB/s before the failover, whatever its cause; after switching over, the bloat slowed right down), all while the table kept being updated.

To sum up:

1.In production, B-tree indexes (and GIN, GiST) on frequently updated tables fragment and split more easily than heap pages, because index pages are recycled and reused differently from table pages; rebuild them periodically (see the example below).
2.The default parameters such as vacuum_cost_limit and vacuum_cost_delay are conservative; tune them to your actual workload, preferably at the table level.
3.When something goes wrong, OS tools such as pstack and strace, plus the rich set of performance views PostgreSQL provides, help trace the cause, and observability keeps improving with every release.
4.When you hit inexplicable output, some ability to read the source helps, and make sure the source you read matches the version you run; hence my false alarm over the negative hits.
5.Don't jump to conclusions; preconceptions make you bend the analysis towards them. That said, in this case a memory error is still about the only thing that explains the 3MB/s reclaim rate, since the host checks found nothing else and the rate returned to normal after the switch to the standby.
6.pg_stat_progress_vacuum contains no information about which index is being vacuumed or how many of its blocks have been processed.
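On point 1: from PostgreSQL 12 onwards an index can be rebuilt online with REINDEX CONCURRENTLY (a sketch, reusing the index from the experiment above); on older releases such as the version 11 instance in this case, pg_repack with --no-kill-backend is the usual alternative.

-- rebuild a single index without blocking concurrent reads and writes (PostgreSQL 12+)
reindex index concurrently idx_test_big2_id;

-- or rebuild every index on a table
reindex table concurrently test_big2;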

By now you should have a solid grasp of what table bloat is, how it happens, and why it hurts.

Finally, a question for the reader: while PostgreSQL is in the middle of a vacuum, can newly inserted rows reuse the space that has already been "vacuumed"?

References

https://blog.51cto.com/wujianwei/2115037

https://www.postgresql.org/docs/current/monitoring-stats.html

https://pgpedia.info/p/pg_stat_progress_vacuum.html

https://dataegret.com/2017/10/deep-dive-into-postgres-stats-pg_stat_progress_vacuum/

https://postgres.fun/20160527165605.html

https://stackoverflow.com/questions/62329361/query-on-statistics-shown-on-pg-stat-progress-vacuum

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=15d13e82911f7cc9f04f0bf419a96fd827fd1a9d

