不为熟知的 FPI 之 Hint bits
1前言
昨天在一个交流群里看到一个不同于以往的 FPI 案例,
可以看到,FPI 全部集中在了第一行 XLOG
这个 RMGR
类型。正好今天上午同事也找到我问了一个相关的问题
主库有大量数据导入,产生的日志量较大,远程库延迟验证,但是传输速度实在太慢,5 天才传输了主库 14 个小时的日志量
于是我去看了一下,还是大量 FPI 导致的写放大问题,同时也可以看到大量的 XLOG 类型的 FPI。回忆一下,之前遇到的几次 FPI 案例主要是堆表索引等操作导致的,如下👇🏻
之前的案例就表过不提了,感兴趣的老铁自己翻一下
那么什么场景会产生这样的记录?简单分析一下。
2分析
既然是 pg_waldump utility,直接看看 pg_waldump.c 里面 RMGR 的介绍
static void
print_rmgr_list(void)
{
int i;
for (i = 0; i <= RM_MAX_BUILTIN_ID; i++)
{
printf("%s\n", GetRmgrDesc(i)->rm_name);
}
}
顺着代码找,很快就找到了这块的分类,对应的资源管理器定义结构如下:
/*
* List of resource manager entries. Note that order of entries defines the
* numerical values of each rmgr's ID, which is stored in WAL records. New
* entries should be added at the end, to avoid changing IDs of existing
* entries.
*
* Changes to this list possibly need an XLOG_PAGE_MAGIC bump.
*/
/* symbol name, textual name, redo, desc, identify, startup, cleanup */
PG_RMGR(RM_XLOG_ID, "XLOG", xlog_redo, xlog_desc, xlog_identify, NULL, NULL, NULL, xlog_decode)
PG_RMGR(RM_XACT_ID, "Transaction", xact_redo, xact_desc, xact_identify, NULL, NULL, NULL, xact_decode)
PG_RMGR(RM_SMGR_ID, "Storage", smgr_redo, smgr_desc, smgr_identify, NULL, NULL, NULL, NULL)
PG_RMGR(RM_CLOG_ID, "CLOG", clog_redo, clog_desc, clog_identify, NULL, NULL, NULL, NULL)
PG_RMGR(RM_DBASE_ID, "Database", dbase_redo, dbase_desc, dbase_identify, NULL, NULL, NULL, NULL)
PG_RMGR(RM_TBLSPC_ID, "Tablespace", tblspc_redo, tblspc_desc, tblspc_identify, NULL, NULL, NULL, NULL)
PG_RMGR(RM_MULTIXACT_ID, "MultiXact", multixact_redo, multixact_desc, multixact_identify, NULL, NULL, NULL, NULL)
PG_RMGR(RM_RELMAP_ID, "RelMap", relmap_redo, relmap_desc, relmap_identify, NULL, NULL, NULL, NULL)
PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, standby_identify, NULL, NULL, NULL, standby_decode)
PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, heap2_identify, NULL, NULL, heap_mask, heap2_decode)
PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL, heap_mask, heap_decode)
PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, btree_xlog_startup, btree_xlog_cleanup, btree_mask, NULL)
PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, hash_identify, NULL, NULL, hash_mask, NULL)
PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, gin_xlog_startup, gin_xlog_cleanup, gin_mask, NULL)
PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_startup, gist_xlog_cleanup, gist_mask, NULL)
PG_RMGR(RM_SEQ_ID, "Sequence", seq_redo, seq_desc, seq_identify, NULL, NULL, seq_mask, NULL)
PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_identify, spg_xlog_startup, spg_xlog_cleanup, spg_mask, NULL)
PG_RMGR(RM_BRIN_ID, "BRIN", brin_redo, brin_desc, brin_identify, NULL, NULL, brin_mask, NULL)
PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_identify, NULL, NULL, NULL, NULL)
PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL, NULL, NULL)
PG_RMGR(RM_GENERIC_ID, "Generic", generic_redo, generic_desc, generic_identify, NULL, NULL, generic_mask, NULL)
PG_RMGR(RM_LOGICALMSG_ID, "LogicalMessage", logicalmsg_redo, logicalmsg_desc, logicalmsg_identify, NULL, NULL, NULL, logicalmsg_decode)
当数据库崩溃要执行恢复时,便会根据每个 XLOG record 所属的资源管理器操作来执行对应的函数,值得注意的是,pg_waldump 解析出来的每条记录和语句不是一一对应的,而是和修改的物理对象一一对应,这也是物理回放的缘由。
而这个 RM_XLOG_ID 又分为了如下这么多子类型
/* XLOG info values for XLOG rmgr */
#define XLOG_CHECKPOINT_SHUTDOWN 0x00
#define XLOG_CHECKPOINT_ONLINE 0x10
#define XLOG_NOOP 0x20
#define XLOG_NEXTOID 0x30
#define XLOG_SWITCH 0x40
#define XLOG_BACKUP_END 0x50
#define XLOG_PARAMETER_CHANGE 0x60
#define XLOG_RESTORE_POINT 0x70
#define XLOG_FPW_CHANGE 0x80
#define XLOG_END_OF_RECOVERY 0x90
#define XLOG_FPI_FOR_HINT 0xA0
#define XLOG_FPI 0xB0
/* 0xC0 is used in Postgres 9.5-11 */
#define XLOG_OVERWRITE_CONTRECORD 0xD0
有了这些子类型,那么就好分析了,
XLOG_CHECKPOINT_SHUTDOWN:数据库停机的时候会执行一次 checkpoint,不过是 smart 和 fast 类型才会,immediate 不会执行,即意味着"暴力"停库不会妥善关闭,所以数据库启动后要执行 recovery
XLOG_CHECKPOINT_ONLINE:正常运行时执行的 checkpoint,可以在 pg_stat_bgwriter 系统表中查看,比如执行 pg_basebackup 时 / 达到了 checkpoint_timeout 或者 max_wal_size / 创建删除数据库的时候等
XLOG_NOOP:代码里暂时没找到产生此类型的记录
XLOG_NEXTOID:产生一个新的 OID 的时候
XLOG_SWITCH:切换一个 WAL 日志
XLOG_BACKUP_END:备份结束的时候会写入一条 WAL record,比如 pg_stop_backup
XLOG_PARAMETER_CHANGE:对备库很重要的 GUC 参数如果发生了改变就会产生这个类型,比如 max_prepared_xacts、max_connections,我之前曾写过类似的文章,为什么备库的这些参数需要大于等于主库,此处就不再赘述
XLOG_RESTORE_POINT:建立恢复点的时候,PITR 对应的操作 pg_create_restore_point
XLOG_FPW_CHANGE:full_page_writes 参数改变就会记录这个类型
XLOG_END_OF_RECOVERY:数据库完成恢复之后
XLOG_FPI_FOR_HINT:假设未开启 checksum,开启了 wal_log_hints,修改标记位,会记录整页到 WAL 中;假设开启了 checksum,不管 wal_log_hints 如何,checkpoint 后第一次修改的页面都会记录整页到 WAL 中,即使是 hint bits;假设未开启 wal_log_hints,第一次使页面变脏的操作是修改 hint bints,不会记录FPI,所以此类型对应标志位产生的 FPI
XLOG_FPI:简单看了眼代码,初始化一个索引页的时候会有此类 FPI
/*
* emit a completed btree page, and release the working storage.
*/
static void
_bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
{
/* XLOG stuff */
if (wstate->btws_use_wal)
{
/* We use the XLOG_FPI record type for this */
log_newpage(&wstate->index->rd_node, MAIN_FORKNUM, blkno, page, true);
}
...
/*
* Write a WAL record containing a full image of a page. Caller is responsible
* for writing the page to disk after calling this routine.
*
* Note: If you're using this function, you should be building pages in private
* memory and writing them directly to smgr. If you're using buffers, call
* log_newpage_buffer instead.
*
* If the page follows the standard page layout, with a PageHeader and unused
* space between pd_lower and pd_upper, set 'page_std' to true. That allows
* the unused space to be left out from the WAL record, making it smaller.
*/
XLogRecPtr
log_newpage(RelFileNode *rnode, ForkNumber forkNum, BlockNumber blkno,
Page page, bool page_std)
{
int flags;
XLogRecPtr recptr;
flags = REGBUF_FORCE_IMAGE;
if (page_std)
flags |= REGBUF_STANDARD;
XLogBeginInsert();
XLogRegisterBlock(0, rnode, forkNum, blkno, page, flags);
recptr = XLogInsert(RM_XLOG_ID, XLOG_FPI);
可以看到大致这么多场景会产生此类日志,那么 FPI 的话最简单就是 XLOG_FPI_FOR_HINT
了,针对标志位的修改导致 FPI。很好复现,打开 wal_log_hints
postgres=# create table test(id int);
CREATE TABLE
postgres=# insert into test values(generate_series(1,100000));
INSERT 0 100000
postgres=# select pg_switch_wal();
pg_switch_wal
---------------
6/41FF0900
(1 row)
然后修改标志位即可,回顾一下标志位的设置时机,不是在事务结束时设置,而是在后面的 DML 或者 DQL ,VACUUM 等 SQL 扫描到对应的元组时,触发置位的操作,所以和 Oracle 一样,查询也是会产生 WAL record 的,就是因为设置标记位。所以最简单的方式,只需要全部查询一遍即可,就会设置 HEAP_XMIN_COMMITTED
等标志位,如果设置了标志位,那么就不再需要去调用 TransactionIdDidCommit 和 TransactionIdDidAbort 去获取事务的状态,可以高效地检查每个元组 xmin 和 xmax 对应的事务状态。可以看到每条元组都添加了 HEAP_XMIN_COMMITTED
postgres=# explain analyze select * from test;
QUERY PLAN
------------------------------------------------------------------------------------------------------------
Seq Scan on test (cost=0.00..1572.65 rows=112965 width=4) (actual time=0.038..26.806 rows=100000 loops=1)
Planning Time: 0.087 ms
Execution Time: 32.723 ms
(3 rows)
postgres=# select lp, t_xmin, t_xmax, t_ctid,
infomask(t_infomask, 1) as infomask,
infomask(t_infomask2, 2) as infomask2
from heap_page_items(get_raw_page('test', 0));
lp | t_xmin | t_xmax | t_ctid | infomask | infomask2
-----+--------+--------+---------+-----------------------------+-----------
1 | 12007 | 0 | (0,1) | XMAX_INVALID|XMIN_COMMITTED |
2 | 12007 | 0 | (0,2) | XMAX_INVALID|XMIN_COMMITTED |
3 | 12007 | 0 | (0,3) | XMAX_INVALID|XMIN_COMMITTED |
4 | 12007 | 0 | (0,4) | XMAX_INVALID|XMIN_COMMITTED |
5 | 12007 | 0 | (0,5) | XMAX_INVALID|XMIN_COMMITTED |
...
...
再去查看,就可以成功复现出来了。
3小结
不难想象这个场景出现的原因,流复制场景下,经常使用的便是 pg_rewind 来处理日志分支的情况,其中要求之一便是开启 wal_log_hints。
好了,That's all!
4参考
https://zhuanlan.zhihu.com/p/200749711