Analyzing the Statistics Collection Flow Through a Real Case
Preface
This morning, a colleague in charge of database testing came to me with an odd observation: "There is a table whose inserted/updated/deleted row count has reached the autovacuum threshold, yet autovacuum never runs on it." The test script is as follows 👇🏻
alter system set autovacuum_naptime = 5;
select pg_reload_conf();
create table walker(id int,name text);
select clock_timestamp() as t1 \gset
insert into walker select generate_series(1,100),'test';
select clock_timestamp() as t2 \gset
delete from walker where id < 80;
--wait some time to make autovacuum happen automatically
--during the first pg_sleep, regardless of how pg_sleep is set, autovacuum doesn't happen for table walker
select pg_sleep(100);
--select pg_sleep(10);
The script sets autovacuum_naptime, the parameter that controls the minimum delay between two autovacuum runs; if no idle autovacuum worker is available, the launcher sleeps for autovacuum_naptime. Here it is set to 5s so that autovacuum fires as quickly as possible. Pay attention to the comments at the end of the script:
"Wait some time to make autovacuum happen automatically. During the first pg_sleep, no matter how long pg_sleep is set to, autovacuum never runs on table walker."
At first glance this looks a bit alarming, so let's dig into this interesting case together!
Reproduction
The test case is simple and easy to reproduce. The script was originally written to test the flashback feature of our in-house database; since it runs unattended, it has to observe the database's vacuum behavior and its trigger thresholds. A quick reproduction 👇🏻
[postgres@xiongcc ~]$ cat test.sql
drop table if exists walker;
create table walker(id int,name text);
insert into walker select generate_series(1,100),'test';
delete from walker where id < 80;
select pg_sleep(100);
In PostgreSQL, autovacuum is triggered for a table once its number of deleted and updated rows reaches reltuples * autovacuum_vacuum_scale_factor + autovacuum_vacuum_threshold. With the default scale factor of 0.2 and threshold of 50, the trigger point for the table above is 100 * 0.2 + 50 = 70 rows, so in theory the DELETE in step 4 should already trigger a vacuum. Let's first run the steps by hand and see whether anything strange happens:
postgres=# show autovacuum_naptime ;
autovacuum_naptime
--------------------
5s
(1 row)
postgres=# drop table if exists walker;
DROP TABLE
postgres=# create table walker(id int,name text);
CREATE TABLE
postgres=# insert into walker select generate_series(1,100),'test';
INSERT 0 100
postgres=# select now();
now
-------------------------------
2022-12-27 16:39:56.981739+08
(1 row)
postgres=# delete from walker where id < 80;
DELETE 79
postgres=# select last_vacuum,last_autovacuum from pg_stat_all_tables where relname = 'walker';
-[ RECORD 1 ]---+------------------------------
last_vacuum |
last_autovacuum | 2022-12-27 16:40:02.060374+08
Unsurprisingly, everything works fine when done by hand. Now let's run the same statements from a script:
[postgres@xiongcc ~]$ psql -f test.sql
DROP TABLE
CREATE TABLE
INSERT 0 100
DELETE 79
-- sleeping for 100 seconds...
Meanwhile, open another session and watch pg_stat_all_tables (as for why we look at this view, I won't repeat it here; see the earlier article 小案例之鸡生蛋还是蛋生鸡? on the chicken-and-egg problem):
Tue 27 Dec 2022 04:51:38 PM CST (every 1s)
-[ RECORD 1 ]-------+-------
relid | 16458
schemaname | public
relname | walker
seq_scan | 0
seq_tup_read | 0
idx_scan |
idx_tup_fetch |
n_tup_ins | 0
n_tup_upd | 0
n_tup_del | 0
n_tup_hot_upd | 0
n_live_tup | 0
n_dead_tup | 0
n_mod_since_analyze | 0
n_ins_since_vacuum | 0
last_vacuum |
last_autovacuum |
last_analyze |
last_autoanalyze |
vacuum_count | 0
autovacuum_count | 0
analyze_count | 0
autoanalyze_count | 0
...
...
Tue 27 Dec 2022 04:52:44 PM CST (every 1s)
-[ RECORD 1 ]-------+-------
relid | 16458
schemaname | public
relname | walker
seq_scan | 0
seq_tup_read | 0
idx_scan |
idx_tup_fetch |
n_tup_ins | 0
n_tup_upd | 0
n_tup_del | 0
n_tup_hot_upd | 0
n_live_tup | 0
n_dead_tup | 0
n_mod_since_analyze | 0
n_ins_since_vacuum | 0
last_vacuum |
last_autovacuum |
last_analyze |
last_autoanalyze |
vacuum_count | 0
autovacuum_count | 0
analyze_count | 0
autoanalyze_count | 0
...
...
Tue 27 Dec 2022 04:53:12 PM CST (every 1s)
-[ RECORD 1 ]-------+-------
relid | 16458
schemaname | public
relname | walker
seq_scan | 1
seq_tup_read | 100
idx_scan |
idx_tup_fetch |
n_tup_ins | 100
n_tup_upd | 0
n_tup_del | 79
n_tup_hot_upd | 0
n_live_tup | 21
n_dead_tup | 79
n_mod_since_analyze | 179
n_ins_since_vacuum | 100
last_vacuum |
last_autovacuum |
last_analyze |
last_autoanalyze |
vacuum_count | 0
autovacuum_count | 0
analyze_count | 0
autoanalyze_count | 0
Tue 27 Dec 2022 04:53:13 PM CST (every 1s)
-[ RECORD 1 ]-------+------------------------------
relid | 16458
schemaname | public
relname | walker
seq_scan | 1
seq_tup_read | 100
idx_scan |
idx_tup_fetch |
n_tup_ins | 100
n_tup_upd | 0
n_tup_del | 79
n_tup_hot_upd | 0
n_live_tup | 21
n_dead_tup | 0
n_mod_since_analyze | 0
n_ins_since_vacuum | 0
last_vacuum |
last_autovacuum | 2022-12-27 16:53:12.950886+08
last_analyze |
last_autoanalyze | 2022-12-27 16:53:12.952087+08
vacuum_count | 0
autovacuum_count | 1
analyze_count | 0
autoanalyze_count | 1
Note the timestamps: during the entire sleep, the statistics for table walker stayed at zero, so it is hardly surprising that autovacuum did not fire. Only once the sleep finished did n_tup_del get a value, after which the autovacuum launcher spawned an autovacuum worker to do the cleanup. So why does this odd behavior show up in a script while manual execution is fine?
The Usual Trick
When we don't know which code paths are executed, the usual brute-force approach works: just capture everything. From past experience, the stats collector gathers the statistics, and the autovacuum launcher periodically reads those stats files and selectively spawns worker processes to do the cleanup. So let's trace the stats-related flow:
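The function list below was captured with a SystemTap probe along these lines (a sketch only; the binary path is an assumption, adjust it to your installation):

```
# trace every pgstat_* call made by the postgres binary
# (path /usr/local/pgsql/bin/postgres is an assumption)
probe process("/usr/local/pgsql/bin/postgres").function("pgstat_*").call
{
    printf("%s: %s\n", execname(), ppfunc())
}
```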
[postgres@xiongcc ~]$ cat stap.out | grep stat
postgres: pgstat_recv_inquiry
postgres: pgstat_recv_slru
postgres: pgstat_recv_tabstat
postgres: pgstat_recv_vacuum
postgres: pgstat_recv_wal
postgres: pgstat_report_activity
postgres: pgstat_report_analyze
postgres: pgstat_report_appname
postgres: pgstat_report_autovac
postgres: pgstat_report_disconnect
postgres: pgstat_report_query_id
postgres: pgstat_report_stat 👈🏻
postgres: pgstat_report_vacuum 👈🏻
postgres: pgstat_report_wait_end
postgres: pgstat_report_wait_start
postgres: pgstat_report_xact_timestamp
postgres: pgstat_reset_wait_event_storage
postgres: pgstat_send
postgres: pgstat_send_bgwriter
postgres: pgstat_send_funcstats
postgres: pgstat_send_inquiry
postgres: pgstat_send_slru
postgres: pgstat_send_tabstat
postgres: pgstat_send_wal
postgres: pgstat_setheader
postgres: pgstat_setup_memcxt
...
A few key pgstat_report_xxx functions stand out; as the names suggest, they report statistics. Let's look at that part of the code:
/* ----------
* pgstat_report_stat() -
*
* Must be called by processes that performs DML: tcop/postgres.c, logical
* receiver processes, SPI worker, etc. to send the so far collected
* per-table and function usage statistics to the collector. Note that this
* is called only when not within a transaction, so it is fair to use
* transaction stop time as an approximation of current time.
*
* "disconnect" is "true" only for the last call before the backend
* exits. This makes sure that no data is lost and that interrupted
* sessions are reported correctly.
* ----------
*/
void
pgstat_report_stat(bool disconnect)
{
/* we assume this inits to all zeroes: */
static const PgStat_TableCounts all_zeroes;
static TimestampTz last_report = 0;
TimestampTz now;
PgStat_MsgTabstat regular_msg;
PgStat_MsgTabstat shared_msg;
TabStatusArray *tsa;
int i;
/*
* Don't expend a clock check if nothing to do.
*
* To determine whether any WAL activity has occurred since last time, not
* only the number of generated WAL records but also the numbers of WAL
* writes and syncs need to be checked. Because even transaction that
* generates no WAL records can write or sync WAL data when flushing the
* data pages.
*/
if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
pgWalUsage.wal_records == prevWalUsage.wal_records &&
WalStats.m_wal_write == 0 && WalStats.m_wal_sync == 0 &&
!have_function_stats && !disconnect)
return;
/*
* Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
* msec since we last sent one, or the backend is about to exit.
*/
now = GetCurrentTransactionStopTimestamp();
if (!disconnect &&
!TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
return;
last_report = now;
...
/*
* Send partial messages. Make sure that any pending xact commit/abort
* and connection stats get counted, even if there are no table stats to
* send.
*/
if (regular_msg.m_nentries > 0 ||
pgStatXactCommit > 0 || pgStatXactRollback > 0 || disconnect)
pgstat_send_tabstat(&regular_msg, now);
if (shared_msg.m_nentries > 0)
pgstat_send_tabstat(&shared_msg, now);
/* Now, send function statistics */
pgstat_send_funcstats();
/* Send WAL statistics */
pgstat_send_wal(true);
/* Finally send SLRU statistics */
pgstat_send_slru();
}
The comment is clear: every process that performs DML calls pgstat_report_stat to report the statistics it has accumulated to the stats collector. Per-table statistics then go out via pgstat_send_tabstat, while function/WAL/SLRU statistics are handled by the corresponding functions pgstat_send_funcstats / pgstat_send_wal / pgstat_send_slru. Here we only care about tables, so we can focus on pgstat_send_tabstat. The call chain is simple: pgstat_report_stat -> pgstat_send_tabstat -> pgstat_send -> send delivers the message to the stats collector.
/*
* Subroutine for pgstat_report_stat: finish and send a tabstat message
*/
static void
pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg, TimestampTz now)
{
int n;
int len;
...
/* ----------
* pgstat_send() -
*
* Send out one statistics message to the collector
* ----------
*/
static void
pgstat_send(void *msg, int len)
{
int rc;
if (pgStatSock == PGINVALID_SOCKET)
return;
((PgStat_MsgHdr *) msg)->m_size = len;
/* We'll retry after EINTR, but ignore all other failures */
do
{
rc = send(pgStatSock, msg, len, 0);
} while (rc < 0 && errno == EINTR);
#ifdef USE_ASSERT_CHECKING
/* In debug builds, log send failures ... */
if (rc < 0)
elog(LOG, "could not send to statistics collector: %m");
#endif
}
At this point the logic is clear: at transaction commit/abort each backend sends a message to the stats collector; the stats collector aggregates this information and writes it to files; and the autovacuum launcher periodically reads those files to decide when to trigger autovacuum.
So where does it go wrong? It becomes obvious once you look at this piece of code 👇🏻
/*
* Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
* msec since we last sent one, or the backend is about to exit.
*/
now = GetCurrentTransactionStopTimestamp();
if (!disconnect &&
!TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
return;
last_report = now;
PGSTAT_STAT_INTERVAL is 500 milliseconds, so after a transaction finishes, a report is only sent to the stats collector if at least 500 ms have elapsed since the last one; otherwise the function simply returns. When the statements are run from a script, each SQL statement executes with essentially zero delay and the backend then goes straight into pg_sleep; only once the sleep finishes does it get a chance to send the message to the stats collector. If you sleep for 10,000 seconds, the cleanup happens only after those 10,000 seconds. You can reproduce the same behavior by copying the script's contents and pasting them into a terminal all at once. When we type the statements by hand, however, the gaps between them easily exceed 500 ms, so the stats reach the stats collector right away, and after an autovacuum_naptime of 5 seconds (in the ideal case, that is; autovacuum has its own scoring scheme and gives priority to databases whose xid or multixact age is close to the limit, and to databases that have gone longest without being vacuumed) a worker process is spawned to do the cleanup.
Tracing with GDB
postgres=# create table t1(id int);
CREATE TABLE
postgres=# select relfilenode from pg_class where relname = 't1';
relfilenode
-------------
16463
(1 row)
postgres=# select pg_backend_pid();
pg_backend_pid
----------------
22257
(1 row)
Set a couple of breakpoints:
(gdb) info b
Num Type Disp Enb Address What
1 breakpoint keep y 0x00000000008a62fd in pgstat_report_stat at pgstat.c:865
2 breakpoint keep y 0x00000000008a66b9 in pgstat_send_tabstat at pgstat.c:978
Insert a row:
postgres=# insert into t1 values(1);
-- blocks at the breakpoint
Step through the flow:
(gdb)
908 for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
(gdb)
910 for (i = 0; i < tsa->tsa_used; i++)
(gdb)
912 PgStat_TableStatus *entry = &tsa->tsa_entries[i];
(gdb)
917 Assert(entry->trans == NULL);
(gdb) p *entry --- 👈🏻 here is our table
$13 = {t_id = 16463, t_shared = false, trans = 0x0, t_counts = {t_numscans = 0, t_tuples_returned = 0,
t_tuples_fetched = 0, t_tuples_inserted = 1, t_tuples_updated = 0, t_tuples_deleted = 0, t_tuples_hot_updated = 0,
t_truncated = false, t_delta_live_tuples = 1, t_delta_dead_tuples = 0, t_changed_tuples = 1, t_blocks_fetched = 1,
t_blocks_hit = 1}}
Breakpoint 2, pgstat_send_tabstat (tsmsg=0x7ffe7542a6f0, now=725449139924852) at pgstat.c:978
978 if (pgStatSock == PGINVALID_SOCKET)
(gdb) n
985 if (OidIsValid(tsmsg->m_databaseid))
(gdb) p pgStatSock --- the socket is fd 9
$14 = 9
(gdb) p *tsmsg
$16 = {m_hdr = {m_type = 1967302512, m_size = 598334853}, m_databaseid = 13892, m_nentries = 1, m_xact_commit = 1,
m_xact_rollback = 0, m_block_read_time = 0, m_block_write_time = 0, m_session_time = 263254569,
m_active_time = 249639837, m_idle_in_xact_time = 0, m_entry = {{t_id = 16463 👈🏻 our table, t_counts = {t_numscans = 0,
t_tuples_returned = 0, t_tuples_fetched = 0, t_tuples_inserted = 1, t_tuples_updated = 0, t_tuples_deleted = 0,
t_tuples_hot_updated = 0, t_truncated = false, t_delta_live_tuples = 1, t_delta_dead_tuples = 0,
t_changed_tuples = 1, t_blocks_fetched = 1, t_blocks_hit = 1}}, {t_id = 39822192, t_counts = {t_numscans = 48,
t_tuples_returned = 140177974178456, t_tuples_fetched = 140177974178456, t_tuples_inserted = 15,
t_tuples_updated = 505637712536, t_tuples_deleted = 13523245, t_tuples_hot_updated = 9802459, t_truncated = 212,
t_delta_live_tuples = 40590560, t_delta_dead_tuples = 1, t_changed_tuples = 40590672, t_blocks_fetched = 0,
t_blocks_hit = 0}}, {t_id = 0, t_counts = {t_numscans = 1, t_tuples_returned = 0, t_tuples_fetched = 13523240,
t_tuples_inserted = 140730865723552, t_tuples_updated = 12058266, t_tuples_deleted = 140730865723600,
t_tuples_hot_updated = 13523240, t_truncated = 64, t_delta_live_tuples = 140730865723840,
t_delta_dead_tuples = 40590712, t_changed_tuples = 11772004, t_blocks_fetched = 140730865723850,
t_blocks_hit = 140730865723840}}, {t_id = 1967303167, t_counts = {t_numscans = 0,
t_tuples_returned = 139637976727552, t_tuples_fetched = 40590032, t_tuples_inserted = 140730865723808,
t_tuples_updated = 12058484, t_tuples_deleted = 872, t_tuples_hot_updated = 13523240, t_truncated = 64,
t_delta_live_tuples = 140730865723840, t_delta_dead_tuples = 206158430248, t_changed_tuples = 140730865723824,
t_blocks_fetched = 140730865723632, t_blocks_hit = 42989503920}}, {t_id = 2048, t_counts = {
t_numscans = 39830986, t_tuples_returned = 140730865723776, t_tuples_fetched = 13519316, t_tuples_inserted = 1,
t_tuples_updated = 0, t_tuples_deleted = 72057594037927936, t_tuples_hot_updated = 39712056, t_truncated = 176,
t_delta_live_tuples = 8053947, t_delta_dead_tuples = 0, t_changed_tuples = 140730865723851, t_blocks_fetched = 0,
t_blocks_hit = 11}}, {t_id = 1967303072, t_counts = {t_numscans = 8054706, t_tuples_returned = 13901050,
t_tuples_fetched = 11, t_tuples_inserted = 140730865723840, t_tuples_updated = 287762810112,
t_tuples_deleted = 39690904, t_tuples_hot_updated = 1080863910568927232, t_truncated = 32,
t_delta_live_tuples = 9902164, t_delta_dead_tuples = 10557237504, t_changed_tuples = 140730865723984,
t_blocks_fetched = 72, t_blocks_hit = 39710344}}, {t_id = 0, t_counts = {t_numscans = 39710416,
t_tuples_returned = 140730865724000, t_tuples_fetched = 11778787, t_tuples_inserted = 0,
t_tuples_updated = 39702288, t_tuples_deleted = 16, t_tuples_hot_updated = 16, t_truncated = 224,
t_delta_live_tuples = 7920, t_delta_dead_tuples = 13530524, t_changed_tuples = 1024, t_blocks_fetched = 56,
t_blocks_hit = 30064771072}}, {t_id = 1967303344, t_counts = {t_numscans = 11774795, t_tuples_returned = 1024,
t_tuples_fetched = 39702288, t_tuples_inserted = 140730865724112, t_tuples_updated = 30076544661,
t_tuples_deleted = 39702552, t_tuples_hot_updated = 1024, t_truncated = 16, t_delta_live_tuples = 39702552,
t_delta_dead_tuples = 0, t_changed_tuples = 7928, t_blocks_fetched = 88, t_blocks_hit = 39702512}}}}
We can see t_tuples_inserted is 1, which confirms our hypothesis.
Checking from the OS side, the stats collector process indeed has fd 9 open as a socket 👇🏻, matching the pgStatSock value we saw in the backend and further confirming that the msg is sent to the stats collector.
[root@xiongcc ~]# ls -l /proc/20359/fd
total 0
lr-x------ 1 postgres postgres 64 Dec 27 17:41 0 -> /dev/null
l-wx------ 1 postgres postgres 64 Dec 27 17:41 1 -> pipe:[76482]
lrwx------ 1 postgres postgres 64 Dec 27 17:41 10 -> anon_inode:[signalfd]
lrwx------ 1 postgres postgres 64 Dec 27 17:41 11 -> anon_inode:[eventpoll]
l-wx------ 1 postgres postgres 64 Dec 27 17:41 2 -> pipe:[76482]
lr-x------ 1 postgres postgres 64 Dec 27 17:41 3 -> pipe:[76481]
lrwx------ 1 postgres postgres 64 Dec 27 17:41 4 -> anon_inode:[eventpoll]
lrwx------ 1 postgres postgres 64 Dec 27 17:41 9 -> socket:[76492]
[root@xiongcc ~]# ps -ef | grep stats
postgres 20359 20352 0 15:31 ? 00:00:03 postgres: stats collector
root 22896 22761 0 17:45 pts/7 00:00:00 grep --color=auto stats
Summary
Through this case, I hope the chicken-and-egg relationship between statistics collection and autovacuum is now clear to everyone. That's all.