Analyzing the Statistics Collection Flow Through a Real Case
Preface
This morning, a colleague in charge of database testing came to me with an odd observation: "There is a table whose inserted/updated/deleted row count has reached the autovacuum threshold, yet autovacuum never runs on it." The test script is as follows 👇🏻
alter system set autovacuum_naptime = 5;
select pg_reload_conf();
create table walker(id int,name text);
select clock_timestamp() as t1 \gset
insert into walker select generate_series(1,100),'test';
select clock_timestamp() as t2 \gset
delete from walker where id < 80;
--wait some time to make autovacuum happen automatically
--during the first pg_sleep, regardless of how pg_sleep is set, autovacuum doesn't happen for table walker
select pg_sleep(100);
--select pg_sleep(10);
The script sets autovacuum_naptime, the parameter that controls the minimum delay between two autovacuum runs; if no idle autovacuum worker is available, the launcher sleeps for autovacuum_naptime. Here it is set to 5s so that autovacuum fires as quickly as possible. Pay attention to the comments at the end of the script:
"Wait some time to make autovacuum happen automatically. During the first pg_sleep, no matter how long pg_sleep is set to, autovacuum never runs on table walker."
At first glance this looks a bit alarming, so let's dig into this interesting case together!
Reproduction
The test case is simple and easy to reproduce. The script was originally written to test the flashback feature of our in-house database; since it runs unattended, it has to observe the database's vacuum behavior and its trigger thresholds. A quick reproduction 👇🏻
[postgres@xiongcc ~]$ cat test.sql
drop table if exists walker;
create table walker(id int,name text);
insert into walker select generate_series(1,100),'test';
delete from walker where id < 80;
select pg_sleep(100);
In PostgreSQL, autovacuum is triggered for a table once its number of deleted and updated rows reaches reltuples * autovacuum_vacuum_scale_factor + autovacuum_vacuum_threshold. With the default scale factor of 0.2 and threshold of 50, the trigger point for the table above is 100 * 0.2 + 50 = 70 rows, so in theory the DELETE in step 4 should already trigger a vacuum. Let's first run the steps by hand and see whether anything strange happens:
postgres=# show autovacuum_naptime ;
autovacuum_naptime
--------------------
5s
(1 row)
postgres=# drop table if exists walker;
DROP TABLE
postgres=# create table walker(id int,name text);
CREATE TABLE
postgres=# insert into walker select generate_series(1,100),'test';
INSERT 0 100
postgres=# select now();
now
-------------------------------
2022-12-27 16:39:56.981739+08
(1 row)
postgres=# delete from walker where id < 80;
DELETE 79
postgres=# select last_vacuum,last_autovacuum from pg_stat_all_tables where relname = 'walker';
-[ RECORD 1 ]---+------------------------------
last_vacuum |
last_autovacuum | 2022-12-27 16:40:02.060374+08
Unsurprisingly, everything works fine when done by hand. Now let's run the same statements from a script:
[postgres@xiongcc ~]$ psql -f test.sql
DROP TABLE
CREATE TABLE
INSERT 0 100
DELETE 79
-- sleeping for 100 seconds...
Meanwhile, open another session and watch pg_stat_all_tables (as for why we look at this view, I won't repeat it here; see the earlier article 小案例之鸡生蛋还是蛋生鸡? on the chicken-and-egg problem):
Tue 27 Dec 2022 04:51:38 PM CST (every 1s)
-[ RECORD 1 ]-------+-------
relid | 16458
schemaname | public
relname | walker
seq_scan | 0
seq_tup_read | 0
idx_scan |
idx_tup_fetch |
n_tup_ins | 0
n_tup_upd | 0
n_tup_del | 0
n_tup_hot_upd | 0
n_live_tup | 0
n_dead_tup | 0
n_mod_since_analyze | 0
n_ins_since_vacuum | 0
last_vacuum |
last_autovacuum |
last_analyze |
last_autoanalyze |
vacuum_count | 0
autovacuum_count | 0
analyze_count | 0
autoanalyze_count | 0
...
...
Tue 27 Dec 2022 04:52:44 PM CST (every 1s)
-[ RECORD 1 ]-------+-------
relid | 16458
schemaname | public
relname | walker
seq_scan | 0
seq_tup_read | 0
idx_scan |
idx_tup_fetch |
n_tup_ins | 0
n_tup_upd | 0
n_tup_del | 0
n_tup_hot_upd | 0
n_live_tup | 0
n_dead_tup | 0
n_mod_since_analyze | 0
n_ins_since_vacuum | 0
last_vacuum |
last_autovacuum |
last_analyze |
last_autoanalyze |
vacuum_count | 0
autovacuum_count | 0
analyze_count | 0
autoanalyze_count | 0
...
...
Tue 27 Dec 2022 04:53:12 PM CST (every 1s)
-[ RECORD 1 ]-------+-------
relid | 16458
schemaname | public
relname | walker
seq_scan | 1
seq_tup_read | 100
idx_scan |
idx_tup_fetch |
n_tup_ins | 100
n_tup_upd | 0
n_tup_del | 79
n_tup_hot_upd | 0
n_live_tup | 21
n_dead_tup | 79
n_mod_since_analyze | 179
n_ins_since_vacuum | 100
last_vacuum |
last_autovacuum |
last_analyze |
last_autoanalyze |
vacuum_count | 0
autovacuum_count | 0
analyze_count | 0
autoanalyze_count | 0
Tue 27 Dec 2022 04:53:13 PM CST (every 1s)
-[ RECORD 1 ]-------+------------------------------
relid | 16458
schemaname | public
relname | walker
seq_scan | 1
seq_tup_read | 100
idx_scan |
idx_tup_fetch |
n_tup_ins | 100
n_tup_upd | 0
n_tup_del | 79
n_tup_hot_upd | 0
n_live_tup | 21
n_dead_tup | 0
n_mod_since_analyze | 0
n_ins_since_vacuum | 0
last_vacuum |
last_autovacuum | 2022-12-27 16:53:12.950886+08
last_analyze |
last_autoanalyze | 2022-12-27 16:53:12.952087+08
vacuum_count | 0
autovacuum_count | 1
analyze_count | 0
autoanalyze_count | 1
Note the timestamps: during the entire sleep, the statistics for table walker stayed at zero, so it is hardly surprising that autovacuum did not fire. Only once the sleep finished did n_tup_del get a value, after which the autovacuum launcher spawned an autovacuum worker to do the cleanup. So why does this odd behavior show up in a script while manual execution is fine?
The Usual Trick
When we don't know which code paths are executed, the usual brute-force approach works: just capture everything. From past experience, the stats collector gathers the statistics, and the autovacuum launcher periodically reads those stats files and selectively spawns worker processes to do the cleanup. So let's trace the stats-related flow:
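The function list below was captured with a SystemTap probe along these lines (a sketch only; the binary path is an assumption, adjust it to your installation):

```
# trace every pgstat_* call made by the postgres binary
# (path /usr/local/pgsql/bin/postgres is an assumption)
probe process("/usr/local/pgsql/bin/postgres").function("pgstat_*").call
{
    printf("%s: %s\n", execname(), ppfunc())
}
```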
[postgres@xiongcc ~]$ cat stap.out | grep stat
postgres: pgstat_recv_inquiry
postgres: pgstat_recv_slru
postgres: pgstat_recv_tabstat
postgres: pgstat_recv_vacuum
postgres: pgstat_recv_wal
postgres: pgstat_report_activity
postgres: pgstat_report_analyze
postgres: pgstat_report_appname
postgres: pgstat_report_autovac
postgres: pgstat_report_disconnect
postgres: pgstat_report_query_id
postgres: pgstat_report_stat 👈🏻
postgres: pgstat_report_vacuum 👈🏻
postgres: pgstat_report_wait_end
postgres: pgstat_report_wait_start
postgres: pgstat_report_xact_timestamp
postgres: pgstat_reset_wait_event_storage
postgres: pgstat_send
postgres: pgstat_send_bgwriter
postgres: pgstat_send_funcstats
postgres: pgstat_send_inquiry
postgres: pgstat_send_slru
postgres: pgstat_send_tabstat
postgres: pgstat_send_wal
postgres: pgstat_setheader
postgres: pgstat_setup_memcxt
...
A few key pgstat_report_xxx functions stand out; as the names suggest, they report statistics. Let's look at that part of the code:
/* ----------
* pgstat_report_stat() -
*
* Must be called by processes that performs DML: tcop/postgres.c, logical
* receiver processes, SPI worker, etc. to send the so far collected
* per-table and function usage statistics to the collector. Note that this
* is called only when not within a transaction, so it is fair to use
* transaction stop time as an approximation of current time.
*
* "disconnect" is "true" only for the last call before the backend
* exits. This makes sure that no data is lost and that interrupted
* sessions are reported correctly.
* ----------
*/
void
pgstat_report_stat(bool disconnect)
{
/* we assume this inits to all zeroes: */
static const PgStat_TableCounts all_zeroes;
static TimestampTz last_report = 0;
TimestampTz now;
PgStat_MsgTabstat regular_msg;
PgStat_MsgTabstat shared_msg;
TabStatusArray *tsa;
int i;
/*
* Don't expend a clock check if nothing to do.
*
* To determine whether any WAL activity has occurred since last time, not
* only the number of generated WAL records but also the numbers of WAL
* writes and syncs need to be checked. Because even transaction that
* generates no WAL records can write or sync WAL data when flushing the
* data pages.
*/
if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
pgWalUsage.wal_records == prevWalUsage.wal_records &&
WalStats.m_wal_write == 0 && WalStats.m_wal_sync == 0 &&
!have_function_stats && !disconnect)
return;
/*
* Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
* msec since we last sent one, or the backend is about to exit.
*/
now = GetCurrentTransactionStopTimestamp();
if (!disconnect &&
!TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
return;
last_report = now;
...
/*
* Send partial messages. Make sure that any pending xact commit/abort
* and connection stats get counted, even if there are no table stats to
* send.
*/
if (regular_msg.m_nentries > 0 ||
pgStatXactCommit > 0 || pgStatXactRollback > 0 || disconnect)
pgstat_send_tabstat(&regular_msg, now);
if (shared_msg.m_nentries > 0)
pgstat_send_tabstat(&shared_msg, now);
/* Now, send function statistics */
pgstat_send_funcstats();
/* Send WAL statistics */
pgstat_send_wal(true);
/* Finally send SLRU statistics */
pgstat_send_slru();
}
The comment is clear: every process that performs DML calls pgstat_report_stat to report the statistics it has accumulated to the stats collector. Per-table statistics then go out via pgstat_send_tabstat, while function/WAL/SLRU statistics are handled by the corresponding functions pgstat_send_funcstats / pgstat_send_wal / pgstat_send_slru. Here we only care about tables, so we can focus on pgstat_send_tabstat. The call chain is simple: pgstat_report_stat -> pgstat_send_tabstat -> pgstat_send -> send delivers the message to the stats collector.
/*
* Subroutine for pgstat_report_stat: finish and send a tabstat message
*/
static void
pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg, TimestampTz now)
{
int n;
int len;
...
/* ----------
* pgstat_send() -
*
* Send out one statistics message to the collector
* ----------
*/
static void
pgstat_send(void *msg, int len)
{
int rc;
if (pgStatSock == PGINVALID_SOCKET)
return;
((PgStat_MsgHdr *) msg)->m_size = len;
/* We'll retry after EINTR, but ignore all other failures */
do
{
rc = send(pgStatSock, msg, len, 0);
} while (rc < 0 && errno == EINTR);
#ifdef USE_ASSERT_CHECKING
/* In debug builds, log send failures ... */
if (rc < 0)
elog(LOG, "could not send to statistics collector: %m");
#endif
}
At this point the logic is clear: at transaction commit/abort each backend sends a message to the stats collector; the stats collector aggregates this information and writes it to files; and the autovacuum launcher periodically reads those files to decide when to trigger autovacuum.
So where does it go wrong? It becomes obvious once you look at this piece of code 👇🏻
/*
* Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
* msec since we last sent one, or the backend is about to exit.
*/
now = GetCurrentTransactionStopTimestamp();
if (!disconnect &&
!TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
return;
last_report = now;
PGSTAT_STAT_INTERVAL is 500 milliseconds, so after a transaction finishes, a report is only sent to the stats collector if at least 500 ms have elapsed since the last one; otherwise the function simply returns. When the statements are run from a script, each SQL statement executes with essentially zero delay and the backend then goes straight into pg_sleep; only once the sleep finishes does it get a chance to send the message to the stats collector. If you sleep for 10,000 seconds, the cleanup happens only after those 10,000 seconds. You can reproduce the same behavior by copying the script's contents and pasting them into a terminal all at once. When we type the statements by hand, however, the gaps between them easily exceed 500 ms, so the stats reach the stats collector right away, and after an autovacuum_naptime of 5 seconds (in the ideal case, that is; autovacuum has its own scoring scheme and gives priority to databases whose xid or multixact age is close to the limit, and to databases that have gone longest without being vacuumed) a worker process is spawned to do the cleanup.
Tracing with GDB
postgres=# create table t1(id int);
CREATE TABLE
postgres=# select relfilenode from pg_class where relname = 't1';
relfilenode
-------------
16463
(1 row)
postgres=# select pg_backend_pid();
pg_backend_pid
----------------
22257
(1 row)
Set a couple of breakpoints:
(gdb) info b
Num Type Disp Enb Address What
1 breakpoint keep y 0x00000000008a62fd in pgstat_report_stat at pgstat.c:865
2 breakpoint keep y 0x00000000008a66b9 in pgstat_send_tabstat at pgstat.c:978
Insert a row:
postgres=# insert into t1 values(1);
-- blocks at the breakpoint
Step through the flow:
(gdb)
908 for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
(gdb)
910 for (i = 0; i < tsa->tsa_used; i++)
(gdb)
912 PgStat_TableStatus *entry = &tsa->tsa_entries[i];
(gdb)
917 Assert(entry->trans == NULL);
(gdb) p *entry --- 👈🏻 here is our table
$13 = {t_id = 16463, t_shared = false, trans = 0x0, t_counts = {t_numscans = 0, t_tuples_returned = 0,
t_tuples_fetched = 0, t_tuples_inserted = 1, t_tuples_updated = 0, t_tuples_deleted = 0, t_tuples_hot_updated = 0,
t_truncated = false, t_delta_live_tuples = 1, t_delta_dead_tuples = 0, t_changed_tuples = 1, t_blocks_fetched = 1,
t_blocks_hit = 1}}
Breakpoint 2, pgstat_send_tabstat (tsmsg=0x7ffe7542a6f0, now=725449139924852) at pgstat.c:978
978 if (pgStatSock == PGINVALID_SOCKET)
(gdb) n
985 if (OidIsValid(tsmsg->m_databaseid))
(gdb) p pgStatSock --- the socket is fd 9
$14 = 9
(gdb) p *tsmsg
$16 = {m_hdr = {m_type = 1967302512, m_size = 598334853}, m_databaseid = 13892, m_nentries = 1, m_xact_commit = 1,
m_xact_rollback = 0, m_block_read_time = 0, m_block_write_time = 0, m_session_time = 263254569,
m_active_time = 249639837, m_idle_in_xact_time = 0, m_entry = {{t_id = 16463 👈🏻 our table, t_counts = {t_numscans = 0,
t_tuples_returned = 0, t_tuples_fetched = 0, t_tuples_inserted = 1, t_tuples_updated = 0, t_tuples_deleted = 0,
t_tuples_hot_updated = 0, t_truncated = false, t_delta_live_tuples = 1, t_delta_dead_tuples = 0,
t_changed_tuples = 1, t_blocks_fetched = 1, t_blocks_hit = 1}}, {t_id = 39822192, t_counts = {t_numscans = 48,
t_tuples_returned = 140177974178456, t_tuples_fetched = 140177974178456, t_tuples_inserted = 15,
t_tuples_updated = 505637712536, t_tuples_deleted = 13523245, t_tuples_hot_updated = 9802459, t_truncated = 212,
t_delta_live_tuples = 40590560, t_delta_dead_tuples = 1, t_changed_tuples = 40590672, t_blocks_fetched = 0,
t_blocks_hit = 0}}, {t_id = 0, t_counts = {t_numscans = 1, t_tuples_returned = 0, t_tuples_fetched = 13523240,
t_tuples_inserted = 140730865723552, t_tuples_updated = 12058266, t_tuples_deleted = 140730865723600,
t_tuples_hot_updated = 13523240, t_truncated = 64, t_delta_live_tuples = 140730865723840,
t_delta_dead_tuples = 40590712, t_changed_tuples = 11772004, t_blocks_fetched = 140730865723850,
t_blocks_hit = 140730865723840}}, {t_id = 1967303167, t_counts = {t_numscans = 0,
t_tuples_returned = 139637976727552, t_tuples_fetched = 40590032, t_tuples_inserted = 140730865723808,
t_tuples_updated = 12058484, t_tuples_deleted = 872, t_tuples_hot_updated = 13523240, t_truncated = 64,
t_delta_live_tuples = 140730865723840, t_delta_dead_tuples = 206158430248, t_changed_tuples = 140730865723824,
t_blocks_fetched = 140730865723632, t_blocks_hit = 42989503920}}, {t_id = 2048, t_counts = {
t_numscans = 39830986, t_tuples_returned = 140730865723776, t_tuples_fetched = 13519316, t_tuples_inserted = 1,
t_tuples_updated = 0, t_tuples_deleted = 72057594037927936, t_tuples_hot_updated = 39712056, t_truncated = 176,
t_delta_live_tuples = 8053947, t_delta_dead_tuples = 0, t_changed_tuples = 140730865723851, t_blocks_fetched = 0,
t_blocks_hit = 11}}, {t_id = 1967303072, t_counts = {t_numscans = 8054706, t_tuples_returned = 13901050,
t_tuples_fetched = 11, t_tuples_inserted = 140730865723840, t_tuples_updated = 287762810112,
t_tuples_deleted = 39690904, t_tuples_hot_updated = 1080863910568927232, t_truncated = 32,
t_delta_live_tuples = 9902164, t_delta_dead_tuples = 10557237504, t_changed_tuples = 140730865723984,
t_blocks_fetched = 72, t_blocks_hit = 39710344}}, {t_id = 0, t_counts = {t_numscans = 39710416,
t_tuples_returned = 140730865724000, t_tuples_fetched = 11778787, t_tuples_inserted = 0,
t_tuples_updated = 39702288, t_tuples_deleted = 16, t_tuples_hot_updated = 16, t_truncated = 224,
t_delta_live_tuples = 7920, t_delta_dead_tuples = 13530524, t_changed_tuples = 1024, t_blocks_fetched = 56,
t_blocks_hit = 30064771072}}, {t_id = 1967303344, t_counts = {t_numscans = 11774795, t_tuples_returned = 1024,
t_tuples_fetched = 39702288, t_tuples_inserted = 140730865724112, t_tuples_updated = 30076544661,
t_tuples_deleted = 39702552, t_tuples_hot_updated = 1024, t_truncated = 16, t_delta_live_tuples = 39702552,
t_delta_dead_tuples = 0, t_changed_tuples = 7928, t_blocks_fetched = 88, t_blocks_hit = 39702512}}}}
We can see t_tuples_inserted is 1, which confirms our hypothesis.
Checking from the OS side, the stats collector process indeed has fd 9 open as a socket 👇🏻, matching the pgStatSock value we saw in the backend and further confirming that the msg is sent to the stats collector.
[root@xiongcc ~]# ls -l /proc/20359/fd
total 0
lr-x------ 1 postgres postgres 64 Dec 27 17:41 0 -> /dev/null
l-wx------ 1 postgres postgres 64 Dec 27 17:41 1 -> pipe:[76482]
lrwx------ 1 postgres postgres 64 Dec 27 17:41 10 -> anon_inode:[signalfd]
lrwx------ 1 postgres postgres 64 Dec 27 17:41 11 -> anon_inode:[eventpoll]
l-wx------ 1 postgres postgres 64 Dec 27 17:41 2 -> pipe:[76482]
lr-x------ 1 postgres postgres 64 Dec 27 17:41 3 -> pipe:[76481]
lrwx------ 1 postgres postgres 64 Dec 27 17:41 4 -> anon_inode:[eventpoll]
lrwx------ 1 postgres postgres 64 Dec 27 17:41 9 -> socket:[76492]
[root@xiongcc ~]# ps -ef | grep stats
postgres 20359 20352 0 15:31 ? 00:00:03 postgres: stats collector
root 22896 22761 0 17:45 pts/7 00:00:00 grep --color=auto stats
Summary
Through this case, I hope the chicken-and-egg relationship between statistics collection and autovacuum is now clear to everyone. That's all.