Does a standby have its own statistics?
1 Introduction
A fellow member in chat group #2 raised this question today:
"Many of our reports have started reading from a replication standby, but on the standby the last vacuum and last analyze times in pg_stat_all_tables seem to be empty. Are these statistics, or the vacuum operation itself, also replayed on the standby?"
First, one thing is certain: vacuum is definitely replayed on the standby; otherwise, where would snapshot conflicts come from? So the following opinion from another member of group #2 is wrong:
"The standby does not vacuum; it only synchronizes tuple changes from the primary."
The problem is now clear: on the standby, the vacuum / analyze related columns of pg_stat_all_tables hold no data, so the values read from them are meaningless. I had just finished the TV series The Knockout (狂飙) yesterday, so it was time to pick the pen back up.
Enough small talk; let's analyze this problem together!
2 Reproduction
First, set up a simple primary/standby pair, then run a few operations on the primary:
postgres=# create table test(id int);
CREATE TABLE
postgres=# insert into test values(generate_series(1,100000));
INSERT 0 100000
postgres=# delete from test where id < 20000;
DELETE 19999
postgres=# vacuum analyze test;
VACUUM
postgres=# select relname,last_vacuum,last_autovacuum,last_analyze,last_autoanalyze from pg_stat_all_tables where relname = 'test';
-[ RECORD 1 ]----+------------------------------
relname | test
last_vacuum | 2023-02-02 20:04:17.648657+08
last_autovacuum | 2023-02-02 20:03:33.133273+08
last_analyze | 2023-02-02 20:04:17.686317+08
last_autoanalyze | 2023-02-02 20:03:33.197652+08
We can see that autovacuum fired on its own, and the manual vacuum left a record as well. Now let's take a look at the standby:
postgres=# select relname,last_vacuum,last_autovacuum,last_analyze,last_autoanalyze from pg_stat_all_tables where relname = 'test';
-[ RECORD 1 ]----+-----
relname | test
last_vacuum |
last_autovacuum |
last_analyze |
last_autoanalyze |
postgres=# select * from pg_stat_all_tables where relname = 'test';
-[ RECORD 1 ]-------+-------
relid | 16961
schemaname | public
relname | test
seq_scan | 0
seq_tup_read | 0
idx_scan |
idx_tup_fetch |
n_tup_ins | 0
n_tup_upd | 0
n_tup_del | 0
n_tup_hot_upd | 0
n_live_tup | 0
n_dead_tup | 0
n_mod_since_analyze | 0
n_ins_since_vacuum | 0
last_vacuum |
last_autovacuum |
last_analyze |
last_autoanalyze |
vacuum_count | 0
autovacuum_count | 0
analyze_count | 0
autoanalyze_count | 0
postgres=# vacuum analyze test;
ERROR: cannot execute VACUUM during recovery
Sure enough: not only the vacuum-related columns, but all the other counters are empty too, and the standby cannot run this kind of write operation at all. So where does this difference come from?
3 Analysis
Starting at the source immediately reveals the clue: pg_stat_all_tables is merely a view, just a view! It is not a table object; it is a set of counters over various per-table dimensions 👇🏻, and the collecting is done by the stats collector process.
SELECT c.oid AS relid,
n.nspname AS schemaname,
c.relname,
pg_stat_get_numscans(c.oid) AS seq_scan,
pg_stat_get_tuples_returned(c.oid) AS seq_tup_read,
sum(pg_stat_get_numscans(i.indexrelid))::bigint AS idx_scan,
sum(pg_stat_get_tuples_fetched(i.indexrelid))::bigint + pg_stat_get_tuples_fetched(c.oid) AS idx_tup_fetch,
pg_stat_get_tuples_inserted(c.oid) AS n_tup_ins,
pg_stat_get_tuples_updated(c.oid) AS n_tup_upd,
pg_stat_get_tuples_deleted(c.oid) AS n_tup_del,
pg_stat_get_tuples_hot_updated(c.oid) AS n_tup_hot_upd,
pg_stat_get_live_tuples(c.oid) AS n_live_tup,
pg_stat_get_dead_tuples(c.oid) AS n_dead_tup,
pg_stat_get_mod_since_analyze(c.oid) AS n_mod_since_analyze,
pg_stat_get_ins_since_vacuum(c.oid) AS n_ins_since_vacuum,
pg_stat_get_last_vacuum_time(c.oid) AS last_vacuum,
pg_stat_get_last_autovacuum_time(c.oid) AS last_autovacuum,
pg_stat_get_last_analyze_time(c.oid) AS last_analyze,
pg_stat_get_last_autoanalyze_time(c.oid) AS last_autoanalyze,
pg_stat_get_vacuum_count(c.oid) AS vacuum_count,
pg_stat_get_autovacuum_count(c.oid) AS autovacuum_count,
pg_stat_get_analyze_count(c.oid) AS analyze_count,
pg_stat_get_autoanalyze_count(c.oid) AS autoanalyze_count
FROM pg_class c
LEFT JOIN pg_index i ON c.oid = i.indrelid
LEFT JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = ANY (ARRAY['r'::"char", 't'::"char", 'm'::"char", 'p'::"char"])
GROUP BY c.oid, n.nspname, c.relname;
Now recall the basics of statistics collection that I have written about before 👇🏻
Every backend that executes DML calls pgstat_report_stat to report statistics to the stats collector process; table statistics are then sent via pgstat_send_tabstat, while the statistics for functions / WAL / SLRU are handled by the corresponding functions pgstat_send_funcstats / pgstat_send_wal / pgstat_send_slru.
At transaction commit/rollback, every backend sends a message to the stats collector; the collector aggregates this information and writes it to files, and the autovacuum launcher periodically reads those files to decide when to trigger autovacuum. The pg_stat_get_tuples_* style functions behind pg_stat_all_tables are exactly the ones that talk to the stats collector. But the files that hold these statistics (under pg_stat_tmp / pg_stat) are never replicated to the standby. What?! Not replicated? Why not?
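The counter-to-function mapping can be poked at directly: the pg_stat_get_* functions appearing in the view definition above are callable on their own. A minimal sketch (run in psql against the test table from the reproduction; on the primary these return the collector's values, while on the standby they come back NULL / 0):

```sql
-- Ask the stats collector directly, bypassing the pg_stat_all_tables view.
SELECT pg_stat_get_last_vacuum_time('test'::regclass) AS last_vacuum,
       pg_stat_get_vacuum_count('test'::regclass)     AS vacuum_count,
       pg_stat_get_dead_tuples('test'::regclass)      AS n_dead_tup;
```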
Let me keep you in suspense; take five minutes to think about it...
Recall how streaming replication works: it is essentially streaming based on WAL records. The walsender process on the primary sends WAL records to the standby, and the standby's startup process replays them. Files such as those under pg_stat_tmp are not ordinary relation data and never generate WAL records, so naturally they are not replicated to the standby.
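This asymmetry can be seen from the standby itself: the physical effects of vacuum (dead-tuple removal, relation truncation) are WAL-logged and arrive via replay, while the counters describing that work never do. A small sketch, assuming the test table from the reproduction above:

```sql
-- On the standby: physical state follows the primary, statistics do not.
SELECT count(*) FROM test;              -- same row count as on the primary
SELECT pg_relation_size('test');        -- reflects any truncation done by vacuum
SELECT n_dead_tup FROM pg_stat_all_tables
 WHERE relname = 'test';                -- still 0: stats are local, never replayed
```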
Beyond that, another detail confirms this from the side: we usually build streaming replication with pg_basebackup, and pg_basebackup skips certain files during the backup:
/*
 * The contents of these directories are removed or recreated during server
 * start so they are not included in backups. The directories themselves are
 * kept and included as empty to preserve access permissions.
 *
 * Note: this list should be kept in sync with the filter lists in pg_rewind's
 * filemap.c.
 */
static const char *const excludeDirContents[] =
{
    /*
     * Skip temporary statistics files. PG_STAT_TMP_DIR must be skipped even
     * when stats_temp_directory is set because PGSS_TEXT_FILE is always
     * created there.
     */
    PG_STAT_TMP_DIR,

    /*
     * It is generally not useful to backup the contents of this directory
     * even if the intention is to restore to another primary. See backup.sgml
     * for a more detailed description.
     */
    "pg_replslot",

    /* Contents removed on startup, see dsm_cleanup_for_mmap(). */
    PG_DYNSHMEM_DIR,

    /* Contents removed on startup, see AsyncShmemInit(). */
    "pg_notify",

    /*
     * Old contents are loaded for possible debugging but are not required for
     * normal operation, see SerialInit().
     */
    "pg_serial",

    /* Contents removed on startup, see DeleteAllExportedSnapshotFiles(). */
    "pg_snapshots",

    /* Contents zeroed on startup, see StartupSUBTRANS(). */
    "pg_subtrans",

    /* end of list */
    NULL
};

/* Scan for directories whose contents should be excluded */
excludeFound = false;
for (excludeIdx = 0; excludeDirContents[excludeIdx] != NULL; excludeIdx++)
{
    if (strcmp(de->d_name, excludeDirContents[excludeIdx]) == 0)
    {
        elog(DEBUG1, "contents of directory \"%s\" excluded from backup", de->d_name);
        size += _tarWriteDir(pathbuf, basepathlen, &statbuf, sizeonly);
        excludeFound = true;
        break;
    }
}
As you can see, PG_STAT_TMP_DIR is among the skipped entries. So at this moment the standby does not have these files (note that I said "at this moment"):
[postgres@xiongcc ~]$ ll 14data/pg_stat_tmp/
total 20
-rw------- 1 postgres postgres 2225 Feb 2 21:08 db_0.stat
-rw------- 1 postgres postgres 11290 Feb 2 21:08 db_13892.stat
-rw------- 1 postgres postgres 1335 Feb 2 21:08 global.stat
[postgres@xiongcc ~]$ ll 14data_bak/pg_stat_tmp/
total 0
Let's verify it. For this demonstration I am using version 14, because version 15 and later removed the stats collector process; there the pg_stat_tmp directory is kept mainly for extensions such as pg_stat_statements.
postgres=# select pg_backend_pid();
pg_backend_pid
----------------
19224
(1 row)
postgres=# select relname,last_vacuum,last_autovacuum,last_analyze,last_autoanalyze from pg_stat_all_tables where relname = 'test'; --- run the query
relname | last_vacuum | last_autovacuum | last_analyze | last_autoanalyze
---------+-------------------------------+------------------------------+-------------------------------+-------------------------------
test | 2023-02-02 21:05:11.320255+08 | 2023-01-16 15:32:56.88557+08 | 2023-02-02 21:05:28.816788+08 | 2023-01-16 15:32:58.050305+08
(1 row)
Attach strace from another terminal:
[postgres@xiongcc ~]$ strace -p 19224
strace: Process 19224 attached
epoll_wait(4, [{EPOLLIN, {u32=42797616, u64=42797616}}], 1, -1) = 1
recvfrom(10, "Q\0\0\0}select relname,last_vacuum,"..., 8192, 0, NULL, NULL) = 126
lseek(17, 0, SEEK_END) = 106496
lseek(18, 0, SEEK_END) = 32768
lseek(19, 0, SEEK_END) = 49152
lseek(20, 0, SEEK_END) = 40960
lseek(21, 0, SEEK_END) = 32768
lseek(22, 0, SEEK_END) = 16384
lseek(23, 0, SEEK_END) = 16384
lseek(24, 0, SEEK_END) = 8192
lseek(25, 0, SEEK_END) = 16384
lseek(26, 0, SEEK_END) = 16384
lseek(24, 0, SEEK_END) = 8192
open("pg_stat_tmp/global.stat", O_RDONLY) = 33
fstat(33, {st_mode=S_IFREG|0600, st_size=1335, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb56f356000
read(33, "\242\274\245\1\335|m\374\265\226\2\0o\r\0\0\0\0\0\0\3\0\0\0\0\0\0\0\244(\4\0"..., 4096) = 1335
close(33) = 0
munmap(0x7fb56f356000, 4096) = 0
sendto(9, "\1\0\0\0 \0\0\0X>\234\375\265\226\2\08\235\224\375\265\226\2\0D6\0\0\265\226\2\0", 32, 0, NULL, 0) = 32
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=10000}) = 0 (Timeout)
open("pg_stat_tmp/global.stat", O_RDONLY) = 33
fstat(33, {st_mode=S_IFREG|0600, st_size=1335, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb56f356000
read(33, "\242\274\245\1\344>\234\375\265\226\2\0o\r\0\0\0\0\0\0\3\0\0\0\0\0\0\0\244(\4\0"..., 4096) = 1335
close(33) = 0
munmap(0x7fb56f356000, 4096) = 0
open("pg_stat_tmp/global.stat", O_RDONLY) = 33
fstat(33, {st_mode=S_IFREG|0600, st_size=1335, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb56f356000
read(33, "\242\274\245\1\344>\234\375\265\226\2\0o\r\0\0\0\0\0\0\3\0\0\0\0\0\0\0\244(\4\0"..., 4096) = 1335
open("pg_stat_tmp/db_13892.stat", O_RDONLY) = 41 ---👈🏻这里
fstat(41, {st_mode=S_IFREG|0600, st_size=11290, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb56f355000
read(41, "\242\274\245\1T[\n\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
read(41, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
read(41, "\0\0\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\2\0\0\0\0\0\0\0\2\0\0\0\0\0\0"..., 4096) = 3098
close(41) = 0
munmap(0x7fb56f355000, 4096) = 0
open("pg_stat_tmp/db_0.stat", O_RDONLY) = 41
fstat(41, {st_mode=S_IFREG|0600, st_size=2225, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb56f355000
read(41, "\242\274\245\1T\275\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 2225
close(41) = 0
munmap(0x7fb56f355000, 4096) = 0
close(33) = 0
munmap(0x7fb56f356000, 4096) = 0
sendto(9, "\2\0\0\0\300\3\0\0D6\0\0\10\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 960, 0, NULL, 0) = 960
sendto(10, "T\0\0\0\242\0\5relname\0\0\0000C\0\3\0\0\0\23\0@\377\377\377\377\0"..., 329, 0, NULL, 0) = 329
recvfrom(10, 0x1066700, 8192, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
Crystal clear: the backend went and read db_13892.stat under the pg_stat_tmp directory.
One detail worth noting: the standby can perform checkpoints of its own (strictly speaking, restartpoints), so that restarting the standby does not require replaying all WAL since the last checkpoint (which is produced by the primary and arrives through WAL replay). A restartpoint is always based on a checkpoint record received from the primary; the standby never writes checkpoint WAL records of its own. So after the standby performs a checkpoint, the related statistics files do get written, but, same problem as before, what is flushed to storage is the standby's state "at this time", so the counters are still 0.
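This is easy to try out; a sketch of a standby session:

```sql
-- On the standby:
SELECT pg_is_in_recovery();  -- t
CHECKPOINT;                  -- accepted: triggers a restartpoint, not a new checkpoint
VACUUM test;                 -- ERROR: cannot execute VACUUM during recovery
```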
[postgres@xiongcc ~]$ ll 14data_bak/pg_stat_tmp/
total 12
-rw------- 1 postgres postgres 745 Feb 2 21:46 db_0.stat
-rw------- 1 postgres postgres 2040 Feb 2 21:46 db_13892.stat
-rw------- 1 postgres postgres 1335 Feb 2 21:46 global.stat
4 Summary
This little case also invites us to let our thoughts wander and dig a bit deeper:
The autovacuum launcher periodically reads the statistics files written by the stats collector to decide when to trigger autovacuum, and the standby does not carry those files. So suppose a table on the primary is just about to reach the threshold that triggers vacuum/analyze, and right at that moment a switchover or failover happens: the new primary has no statistics and needs fresh activity before the counters build up again, so the cleanup of that table is "delayed", possibly taking twice as long. There you go, a wave of odd knowledge acquired. Most of the time this causes no real problem, but if the system happens to be in an extreme situation and the delayed statistics on that very table lead to a performance issue, go buy a lottery ticket. In the same vein, free space map (FSM) updates are mostly not WAL-logged either, so if rows are inserted into a table after a switchover or failover, its reclaimable space may not be reused until the first vacuum; in extreme cases a manual vacuum is worthwhile.
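For that reason, a reasonable precaution right after a switchover/failover, sketched below, is to refresh statistics on the new primary yourself instead of waiting for enough new activity to push the autovacuum thresholds (hot_table is a hypothetical busy table):

```sql
-- On the freshly promoted primary:
ANALYZE;                      -- database-wide: rebuild planner statistics
VACUUM (ANALYZE) hot_table;   -- hypothetical table: reclaim space and re-gather stats
```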
On another note, I have noticed many people asking about PostgreSQL health checks lately, and I happen to have the same plan: develop an inspection tool in Go, partly to pick my rusty Go back up, partly to roll an inspection tool of my own. If I say I can, I will step up; stay tuned.
What an interesting case! Well, that's all; I am off to binge Three-Body (三体).