A Deep Dive into VACUUM Internals (Part 1)
Preface
I have long wanted to write an in-depth analysis of vacuum. As it happens, today members of apprentice groups ① and ③ both asked questions about vacuum:
“Group ③: autovacuum_vacuum_scale_factor * table_size + autovacuum_vacuum_threshold triggers vacuum, and autovacuum_freeze_min_age triggers lazy freeze. What is the difference between the two?
Taking this opportunity, let me chat with you about vacuum, something at once familiar and strange. As this runs long, I plan to serialize it in three installments to help you understand vacuum thoroughly.
Background
I will not dwell on why vacuum is needed: it follows from PostgreSQL's distinctive MVCC implementation, which requires vacuum to clean up the dead tuples left behind and make their space reusable.
There is an apt analogy: PostgreSQL's MVCC implementation is like a credit card. Deletes, updates, and rollbacks leave behind "debt" that must be repaid, and repaying it is vacuum's job; fail to repay and you go bankrupt. The HOT optimization, by contrast, is like paying cash.
Here let us focus on what vacuum actually does. Roughly, it performs the following tasks:
- Remove the "dead tuples" left behind by UPDATE or DELETE operations
- Track the free space in table blocks and update the free space map (FSM)
- Update the visibility map (VM), which index-only scans and subsequent vacuums both rely on
- Freeze rows in the table to prevent transaction ID wraparound
- Work with ANALYZE to collect statistics periodically
- When possible, truncate trailing pages of the relation and return the space to the operating system
So what are the steps and flow of a normal vacuum? The pg_stat_progress_vacuum view gives a hint, but the official explanation leaves many people in a fog. What do these phases even mean?
initializing → scanning heap → vacuuming indexes → vacuuming heap → cleaning up indexes → truncating heap → performing final cleanup
For this we can refer to the walkthrough in The Internals of PostgreSQL:
(1) FOR each table
(2) Acquire a ShareUpdateExclusiveLock lock for the target table
/* The first block */
(3) Scan all pages to get all dead tuples, and freeze old tuples if necessary
(4) Remove the index tuples that point to the respective dead tuples, if any exist
/* The second block */
(5) FOR each page of the table
(6) Remove the dead tuples, and reallocate the live tuples in the page
(7) Update FSM and VM
END FOR
/* The third block */
(8) Clean up indexes
(9) Truncate the last page if possible
(10) Update both the statistics and system catalogs of the target table
Release the ShareUpdateExclusiveLock lock
END FOR
/* Post-processing */
(11) Update statistics and system catalogs
(12) Remove both unnecessary files and pages of the clog if possible
(1) Get each table from the specified tables.
(2) Acquire a ShareUpdateExclusiveLock lock for the table. This lock allows reading from other transactions.
(3) Scan all pages to get all dead tuples, and freeze old tuples if necessary.
(4) Remove the index tuples that point to the respective dead tuples, if any exist.
(5) Do the following tasks, step (6) and (7), for each page of the table.
(6) Remove the dead tuples and reallocate the live tuples in the page.
(7) Update both the respective FSM and VM of the target table.
(8) Clean up the indexes by the index_vacuum_cleanup()@indexam.c function.
(9) Truncate the last page if the last one does not have any tuple.
(10) Update both the statistics and the system catalogs related to vacuum processing for the target table.
(11) Update both the statistics and the system catalogs related to vacuum processing.
(12) Remove both unnecessary files and pages of the clog if possible.
The flow is clear:
1. Build the list of tables to be cleaned.
2. Take a level-4 lock (ShareUpdateExclusiveLock). This lock does not conflict with most queries, but while it is held you cannot create an index (not even with CIC), create a trigger, refresh a materialized view, run DDL, and so on.
3. Scan all the pages (some blocks may be skipped, see below) to collect dead tuples. Before that, HOT pruning is done where needed, and dead tuples are frozen as needed (so by this point, the question from our comrade in group ③ has already been answered).
4. Walk all the indexes and remove the dead tuples' entries from them.
5. Remove the dead tuples from the table using the TIDs recorded in the previous step, defragmenting the pages (page defragmentation) at the same time and vacuuming the FSM when necessary.
6. Update the FSM and VM (all-visible, all-frozen).
7. Index cleanup (see below); "index deletion" might be the more fitting name.
8. Truncate the pages at the end of the file.
9. Update the system catalog entries, e.g. relallvisible and relfrozenxid, and possibly catalog fields such as relhasindex and relhasrules.
10. Remove unnecessary files and CLOG.
Worth highlighting here is the first point, the list of tables to be cleaned. Some readers may ask: if tables in many databases have reached the threshold, how does PostgreSQL choose which database to clean? There is in fact an elaborate "scoring" mechanism. Every autovacuum_naptime the autovacuum launcher decides which databases need cleaning, and the interval between two autovacuum worker launches will never exceed autovacuum_naptime:
- Databases whose xid or multixact is close to wraparound are cleaned first
- The longer a database has gone without autovacuum, the higher its priority
In addition, much like the skip-checkpoint mechanism, databases with no recent activity are skipped.
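To make the question from group ③ concrete, here is a minimal Python sketch of the two independent decisions. This is my own illustration, not the server code: parameter names mirror the GUCs, and the simplified checks follow the per-table logic in relation_needs_vacanalyze() in autovacuum.c. The dead-tuple formula decides whether a table gets vacuumed at all, autovacuum_freeze_max_age can force an anti-wraparound vacuum regardless of dead tuples, and autovacuum_freeze_min_age only decides, inside an already-running vacuum, which tuples get frozen.

```python
# Hypothetical sketch of autovacuum's per-table trigger logic.
# Names mirror the GUCs; the real logic lives in
# relation_needs_vacanalyze() in autovacuum.c.

def needs_vacuum(n_dead_tup, reltuples,
                 scale_factor=0.2, threshold=50):
    # Dead-tuple trigger: the formula from the question above.
    return n_dead_tup > threshold + scale_factor * reltuples

def needs_wraparound_vacuum(relfrozenxid_age, freeze_max_age=200_000_000):
    # Anti-wraparound trigger: forces a vacuum even with zero dead tuples.
    return relfrozenxid_age > freeze_max_age

# autovacuum_freeze_min_age plays no part in *triggering* at all; it only
# controls which tuples a running vacuum freezes (those whose xmin is
# older than OldestXmin - freeze_min_age).

# A 1,000,000-row table at default settings needs ~200,050 dead tuples:
assert not needs_vacuum(200_000, 1_000_000)
assert needs_vacuum(200_051, 1_000_000)
assert needs_wraparound_vacuum(250_000_000)
```

So the two parameters in the question operate at different levels: one is a trigger condition, the other a freezing cutoff inside vacuum.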
Let us look at which options vacuum supports, taking the latest version, 16, as the reference:
postgres=# \h vacuum
Command: VACUUM
Description: garbage-collect and optionally analyze a database
Syntax:
VACUUM [ ( option [, ...] ) ] [ table_and_columns [, ...] ]
VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ ANALYZE ] [ table_and_columns [, ...] ]
where option can be one of:
FULL [ boolean ]
FREEZE [ boolean ]
VERBOSE [ boolean ]
ANALYZE [ boolean ]
DISABLE_PAGE_SKIPPING [ boolean ]
SKIP_LOCKED [ boolean ]
INDEX_CLEANUP { AUTO | ON | OFF }
PROCESS_MAIN [ boolean ]
PROCESS_TOAST [ boolean ]
TRUNCATE [ boolean ]
PARALLEL integer
SKIP_DATABASE_STATS [ boolean ]
ONLY_DATABASE_STATS [ boolean ]
BUFFER_USAGE_LIMIT [ size ]
and table_and_columns is:
table_name [ ( column_name [, ...] ) ]
URL: https://www.postgresql.org/docs/16/sql-vacuum.html
DISABLE_PAGE_SKIPPING
As the name implies, this is about skipping certain blocks. Vacuum skips all-visible blocks, and rightly so: scanning everything from the beginning each time, even blocks containing no dead tuples, would simply be stupid and add a lot of wasted work.
Of course, there is a threshold for skipping, hard-coded in the source: 32 blocks by default, i.e. only a run of at least 32 all-visible blocks gets skipped. Setting DISABLE_PAGE_SKIPPING forbids the skipping.
/*
* Before we consider skipping a page that's marked as clean in
* visibility map, we must've seen at least this many clean pages.
*/
#define SKIP_PAGES_THRESHOLD ((BlockNumber) 32)
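The heuristic can be sketched like this. This is a simplified Python illustration of my own (the real logic is in lazy_scan_heap() in vacuumlazy.c, and the all_visible list stands in for the visibility map): a run of all-visible pages is only skipped when it is at least SKIP_PAGES_THRESHOLD long, since skipping short runs would lose sequential read-ahead for little benefit.

```python
# Hypothetical sketch of the page-skipping heuristic in lazy_scan_heap()
# (vacuumlazy.c). all_visible stands in for the visibility map bits.

SKIP_PAGES_THRESHOLD = 32  # hard-coded in the server source

def blocks_to_scan(all_visible, disable_page_skipping=False):
    """Yield the block numbers vacuum would actually read."""
    rel_pages = len(all_visible)
    blkno = 0
    while blkno < rel_pages:
        # Measure the run of consecutive all-visible blocks ahead.
        run = 0
        while blkno + run < rel_pages and all_visible[blkno + run]:
            run += 1
        if run >= SKIP_PAGES_THRESHOLD and not disable_page_skipping:
            blkno += run          # long enough: skip the whole run
        else:
            for _ in range(max(run, 1)):
                if blkno < rel_pages:
                    yield blkno   # short run (or not all-visible): read it
                    blkno += 1

# 40 all-visible blocks followed by 2 others: the long run is skipped.
vm = [True] * 40 + [False] * 2
assert list(blocks_to_scan(vm)) == [40, 41]
# A run of only 10 all-visible blocks is still read in full.
vm2 = [True] * 10 + [False] * 2
assert list(blocks_to_scan(vm2)) == list(range(12))
```

With disable_page_skipping=True (the DISABLE_PAGE_SKIPPING option), every block is read, which is exactly what you want when you suspect visibility-map corruption.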
There is also a subtle question: if the table keeps taking writes and a vacuum is triggered in the meantime, does it need to consider the newly inserted blocks? The answer is no; the processing loop in the code has an upper bound:
for (blkno = 0; blkno < rel_pages; blkno++)
{
Buffer buf;
Page page;
bool all_visible_according_to_vm;
LVPagePruneState prunestate;
if (blkno == next_unskippable_block)
...
rel_pages can be understood as the number of blocks "at the moment of execution".
Next, when it comes to cleanup, recall the concept of a page being "pinned". Look at an example:
postgres=# create table t1(id int);
CREATE TABLE
postgres=# insert into t1 values(generate_series(1,10));
INSERT 0 10
postgres=# delete from t1 where id < 3;
DELETE 2
postgres=# begin;
BEGIN
postgres=*# declare mycur cursor for select * from t1;
DECLARE CURSOR
postgres=*# fetch mycur;
id
----
3
(1 row)
The session above keeps the page pinned; now run the cleanup from another session:
postgres=# vacuum verbose t1;
INFO: vacuuming "postgres.public.t1"
INFO: finished vacuuming "postgres.public.t1": index scans: 0
pages: 0 removed, 1 remain, 1 scanned (100.00% of total)
tuples: 0 removed, 10 remain, 0 are dead but not yet removable
tuples missed: 2 dead from 1 pages not removed due to cleanup lock contention
removable cutoff: 952, which was 0 XIDs old when operation ended
new relfrozenxid: 950, which is 1 XIDs ahead of previous value
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed
avg read rate: 0.000 MB/s, avg write rate: 64.744 MB/s
buffer usage: 22 hits, 0 misses, 3 dirtied
WAL usage: 1 records, 0 full page images, 188 bytes
system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
VACUUM
postgres=# SELECT * FROM heap_page_items(get_raw_page('t1', 0));
lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid | t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+------------
1 | 8160 | 1 | 28 | 950 | 951 | 0 | (0,1) | 8193 | 1280 | 24 | | | \x01000000
2 | 8128 | 1 | 28 | 950 | 951 | 0 | (0,2) | 8193 | 1280 | 24 | | | \x02000000
3 | 8096 | 1 | 28 | 950 | 0 | 0 | (0,3) | 1 | 2304 | 24 | | | \x03000000
4 | 8064 | 1 | 28 | 950 | 0 | 0 | (0,4) | 1 | 2304 | 24 | | | \x04000000
5 | 8032 | 1 | 28 | 950 | 0 | 0 | (0,5) | 1 | 2304 | 24 | | | \x05000000
6 | 8000 | 1 | 28 | 950 | 0 | 0 | (0,6) | 1 | 2304 | 24 | | | \x06000000
7 | 7968 | 1 | 28 | 950 | 0 | 0 | (0,7) | 1 | 2304 | 24 | | | \x07000000
8 | 7936 | 1 | 28 | 950 | 0 | 0 | (0,8) | 1 | 2304 | 24 | | | \x08000000
9 | 7904 | 1 | 28 | 950 | 0 | 0 | (0,9) | 1 | 2304 | 24 | | | \x09000000
10 | 7872 | 1 | 28 | 950 | 0 | 0 | (0,10) | 1 | 2304 | 24 | | | \x0a000000
(10 rows)
Note this line:
“tuples missed: 2 dead from 1 pages not removed due to cleanup lock contention
As you can see, the data is still there, because vacuum could not "pin" the block I wanted to clean (more precisely, it could not get the cleanup lock on it). This is another cause of bloat.
Of course, if an aggressive vacuum has been triggered and tuples must be frozen, the process waits for the lock on the page to become available, whereas a non-aggressive vacuum skips blocks whose lock it cannot acquire. For the code details, see LockBufferForCleanup.
“Every time a backend takes a reference to a buffer, it gets a "pin".
INDEX_CLEANUP
This is the cleanup operation that takes place after the dead tuples' corresponding entries in the index have been removed, namely "page deletion". Taking B-tree as an example: when every tuple on an index page has been deleted, i.e. the index page has become empty, vacuum deletes that page from the B-tree. The idea behind the deletion comes from "A Symmetric Concurrent B-Tree Algorithm"; for the details, see the lazy_cleanup_one_index() function.
/*
* lazy_cleanup_one_index() -- do post-vacuum cleanup for index relation.
*
* Calls index AM's amvacuumcleanup routine. reltuples is the number
* of heap tuples and estimated_count is true if reltuples is an
* estimated value. See indexam.sgml for more info.
*
* Returns bulk delete stats derived from input stats
*/
static IndexBulkDeleteResult *
lazy_cleanup_one_index(Relation indrel, IndexBulkDeleteResult *istat,
double reltuples, bool estimated_count,
LVRelState *vacrel)
The REINDEX chapter also describes this:
“B-tree index pages that have become completely empty are reclaimed for re-use. However, there is still a possibility of inefficient use of space: if all but a few index keys on a page have been deleted, the page remains allocated. Therefore, a usage pattern in which most, but not all, keys in each range are eventually deleted will see poor use of space. For such usage patterns, periodic reindexing is recommended.
TRUNCATE
I have mentioned this before: in its final stage, vacuum performs a truncation and returns data pages to the operating system. This is the only mechanism by which a plain vacuum can shrink a relation. Naturally it too has trigger thresholds, because the operation must take the heaviest lock of all, AccessExclusiveLock:
/*
* Timing parameters for truncate locking heuristics.
*
* These were not exposed as user tunable GUC values because it didn't seem
* that the potential for improvement was great enough to merit the cost of
* supporting them.
*/
#define VACUUM_TRUNCATE_LOCK_CHECK_INTERVAL 20 /* ms */
#define VACUUM_TRUNCATE_LOCK_WAIT_INTERVAL 50 /* ms */
#define VACUUM_TRUNCATE_LOCK_TIMEOUT 5000 /* ms */
/*
* Space/time tradeoff parameters: do these need to be user-tunable?
*
* To consider truncating the relation, we want there to be at least
* REL_TRUNCATE_MINIMUM or (relsize / REL_TRUNCATE_FRACTION) (whichever
* is less) potentially-freeable pages.
*/
#define REL_TRUNCATE_MINIMUM 1000
#define REL_TRUNCATE_FRACTION 16
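The eligibility check built from these constants can be sketched in Python. This mirrors should_attempt_truncation() in vacuumlazy.c in simplified form; the nonempty_pages parameter name follows the source, while the example numbers are mine.

```python
# Sketch of the heuristic in should_attempt_truncation() (vacuumlazy.c):
# truncation is only attempted when the tail of the file has "enough"
# potentially-freeable pages. (The real function also refuses when
# old_snapshot_threshold is set.)

REL_TRUNCATE_MINIMUM = 1000
REL_TRUNCATE_FRACTION = 16

def should_attempt_truncation(rel_pages, nonempty_pages):
    """rel_pages: relation size in blocks; nonempty_pages: the first
    block after which every page is empty and removable."""
    possibly_freeable = rel_pages - nonempty_pages
    return (possibly_freeable > 0 and
            (possibly_freeable >= REL_TRUNCATE_MINIMUM or
             possibly_freeable >= rel_pages // REL_TRUNCATE_FRACTION))

# 10,000-page table, 500 empty tail pages: 500 < 1000 and
# 500 < 10000/16 = 625, so no truncation is attempted.
assert not should_attempt_truncation(10_000, 9_500)
# 700 empty tail pages: 700 >= 625, so truncation is attempted.
assert should_attempt_truncation(10_000, 9_300)
# Small 100-page table, 10 empty tail pages: 10 >= 100/16 = 6.
assert should_attempt_truncation(100, 90)
```

Note the "whichever is less" semantics: small tables qualify via the fraction, large tables via the absolute minimum.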
By default, truncation is considered only when there are at least 1000 potentially-freeable blocks, or 1/16 of the heap size, whichever is less, and the lock acquisition is attempted for at most 5 seconds. Also note that once the old_snapshot_threshold parameter is enabled, this feature cannot be used. Building on this behavior, we can pull off a special trick:
WITH a AS (
DELETE FROM t
WHERE ctid = ANY (
SELECT
ctid
FROM
t
ORDER BY
ctid::text DESC
LIMIT 10000)
RETURNING
*)
INSERT INTO t
SELECT
*
FROM
a;
VACUUM t;
The core idea is to re-insert the rows at the end of the file, hopefully landing them in earlier blocks, and finally run vacuum, which will then release the pages at the end of the file and return part of the space. When the disk is about to burst, this could be the straw that saves your life! Remember this trick; at a critical moment it just might save you.
Summary
That is it for the first installment of this deep dive into vacuum internals. In the next one I will talk about the cleanup mechanisms for materialized views, TOAST, and other little-known low-level details. Stay tuned.
Last Saturday at the OSC meetup I once again moonlighted as host for a day 😁 and also gave a talk, 《云程发轫,万里可期,PostgreSQL与向量数据库》. I have uploaded the materials, which you can download at https://www.modb.pro/doc/118039
I will also be at the ACDU Chengdu stop to talk about how to get started with PostgreSQL quickly, from apprentice to expert. You are all welcome to come and cheer us on.