使用动态追踪Dtrace分析问题

Original xiongcc PostgreSQL学徒

2024-09-29

前言

之前在群里有位小伙伴问了这么一个问题："各位专家，PG删除了表但是数据文件任然存在的可能性是什么呀"，乍一听有点违背常识。按理删除了表那么对应的文件也会被删除，然后释放所占存储，不然磁盘快爆了结果删除表又不能释放空间就扯淡了。但是事实真是如此吗？让我们一起来分析一下这个有趣的案例 ~

现象

先复现一下，随便建个表然后删除

postgres=# create table t1(id int primary key,info text);
CREATE TABLE
postgres=# insert into t1 select n,md5(random()::text) from generate_series(1,100000) as n;
INSERT 0 100000
postgres=# checkpoint ;
CHECKPOINT
postgres=# select pg_relation_filepath('t1');
 pg_relation_filepath 
----------------------
 base/5/18424
(1 row)

postgres=# select pg_relation_filepath('t1_pkey');
 pg_relation_filepath 
----------------------
 base/5/18429
(1 row)

postgres=# \! ls -l 15data/base/5/18424
-rw------- 1 postgres postgres 6832128 Sep  4 16:14 15data/base/5/18424
postgres=# \! ls -l 15data/base/5/18429
-rw------- 1 postgres postgres 2260992 Sep  4 16:14 15data/base/5/18429

postgres=# create sequence myseq;
CREATE SEQUENCE
postgres=# select pg_relation_filepath('myseq');
 pg_relation_filepath 
----------------------
 base/5/18431
(1 row)

postgres=# \! ls -l 15data/base/5/18431
-rw------- 1 postgres postgres 8192 Sep  4 16:15 15data/base/5/18431

postgres=# create table t1 as select * from generate_series(1,100000);
SELECT 100000
postgres=# create materialized view myview as select * from t1;
SELECT 100000
postgres=# select pg_relation_filepath('myview');
 pg_relation_filepath 
----------------------
 base/5/18435
(1 row)

postgres=# \! ls -l 15data/base/5/18435
-rw------- 1 postgres postgres 3629056 Sep  4 16:26 15data/base/5/18435

可以看到虽然表被删除了但是对应的文件还在文件系统上，不过是个空文件，序列和物化视图行为相同 👇🏻

postgres=# drop table t1;
DROP TABLE
postgres=# \! ls -l 15data/base/5/18424
-rw------- 1 postgres postgres 0 Sep  4 16:15 15data/base/5/18424
postgres=# \! ls -l 15data/base/5/18429
-rw------- 1 postgres postgres 0 Sep  4 16:15 15data/base/5/18429

postgres=# drop sequence myseq ;   ---序列也是类似
DROP SEQUENCE
postgres=# \! ls -l 15data/base/5/18431
-rw------- 1 postgres postgres 0 Sep  4 16:16 15data/base/5/18431

postgres=# drop materialized view myview ;   ---物化视图亦是
DROP MATERIALIZED VIEW
postgres=# \! ls -l 15data/base/5/18435
-rw------- 1 postgres postgres 0 Sep  4 16:27 15data/base/5/18435

但是临时表的行为有点不同，此例事务提交之后临时表就会消失，可以看到表文件也随之删除了，其他临时表也是类似，会话断了之后随之消失。（临时表的文件命名方式即 t_xxx 此处不再赘述，感兴趣的童鞋可以翻一下之前的文章）

postgres=# begin;
BEGIN
postgres=*# create temp table t2 on commit drop as select * from generate_series(1,100);  ---事务提交后临时表就消失
SELECT 100
postgres=*# select pg_relation_filepath('t2');
 pg_relation_filepath 
----------------------
 base/5/t3_18418
(1 row)

postgres=*# \! ls -l 15data/base/5/t3_18418
-rw------- 1 postgres postgres 8192 Sep  4 16:07 15data/base/5/t3_18418
postgres=*# commit ;
COMMIT
postgres=# \! ls -l 15data/base/5/t3_18418
ls: cannot access 15data/base/5/t3_18418: No such file or directory

不难想象，假如有糟糕的业务设计，大量删除对象之后，文件系统上会留有大量的空文件。我们知道在 Linux 中，一切皆文件，文件数据都存在数据块中，那么很显然，我们还必须找到一个地方储存文件的元信息，比如文件的创建者、文件的创建日期、文件的大小等等。这种储存文件元信息的区域就叫做 inode ，中文译名为"索引节点"。每一个文件都有对应的 inode ，里面包含了与该文件有关的一些信息，因此假如小文件太多，会导致所在磁盘的所有元数据空间爆满，inode 耗尽，并且 inode 自身也是会占据一定大小的、

[root@xiongcc ~]# dumpe2fs -h /dev/vda2 | grep "Inode size"
dumpe2fs 1.42.9 (28-Dec-2013)
Inode size:               256

那么 PostgreSQL 是如何处理这类空文件的，即何时删除这些空文件？

分析

先在网上搜一圈，在 mail list 里，找到一个类似的问题，drop table but file still exists

It's protecting against some obscure race condition involving reassignment of the relfilenode number to a new table. I forget the details but you could troll the archives for it if you really care --- I think we put that in about two or so years ago.

不在表删除时立马删除对应的文件是为了解决 relfilenode 重分配时的一些问题，这些空文件会在下次 checkpoint 时删除。

让我们试验一下：

postgres=# select pg_relation_filepath('t1');
 pg_relation_filepath 
----------------------
 base/5/18432
(1 row)

postgres=# drop table t1;
DROP TABLE
postgres=# \! ls -l 15data/base/5/18432
-rw------- 1 postgres postgres 0 Sep  4 16:49 15data/base/5/18432
postgres=# checkpoint ;
CHECKPOINT
postgres=# \! ls -l 15data/base/5/18432
ls: cannot access 15data/base/5/18432: No such file or directory

可以看到确实如此。但是这位小伙伴反馈执行了 checkpoint 表文件依然存在 👇🏻

难道有类似 vacuum 清除死元祖由于 database horizon 而无法删除的问题？既然常规分析不了，我们只能瞅瞅代码逻辑了。

源码分析

但是新的问题又来了，以往分析源码的前提是我们知道大致的错误位置然后找到相关函数，打断点分析，但是此例我们只知道是在 checkpoint 时移除，checkpoint.c 里面大1000行，直接闷着头看代码是不现实的，这个时候我们就要借助另外一项技术了——Dynamic Tracing 动态追踪。PostgreSQL 自身也提供了很多监测点，https://www.postgresql.org/docs/current/dynamic-trace.html，编译的时候指定 --enable-dtrace即可。

PostgreSQL provides facilities to support dynamic tracing of the database server. This allows an external utility to be called at specific points in the code and thereby trace execution.
A number of probes or trace points are already inserted into the source code. These probes are intended to be used by database developers and administrators. By default the probes are not compiled into PostgreSQL; the user needs to explicitly tell the configure script to make the probes available.
Currently, the DTrace utility is supported, which, at the time of this writing, is available on Solaris, macOS, FreeBSD, NetBSD, and Oracle Linux. The SystemTap project for Linux provides a DTrace equivalent and can also be used. Supporting other dynamic tracing utilities is theoretically possible by changing the definitions for the macros in src/include/utils/probes.h.

在 Linux 上，对应的是SystemTap工具。SystemTap是一个Linux非常有用的调试（跟踪/探测）工具，常用于Linux内核或者应用程序的信息采集，比如：获取一个函数里面运行时的变量、调用堆栈，甚至可以直接修改变量的值（之前吕大师演示的 full_page_writes 块折断就是借助的SystemTap），对诊断性能或功能问题非常有帮助。

SystemTap常用在这些场景：

定位（内核）函数位置
查看函数被调用时的调用堆栈、局部变量、参数
查看函数指针变量实际指的是哪个函数
查看代码的执行轨迹（哪些行被执行了）
查看内核或者进程的执行流程
调试内存泄露或者内存重复释放
统计函数调用次数
...

SystemTap的语法类似于C语言，具体的语法这里不做过多阐述，网上的教程和书籍挺多。

让我们小试牛刀一下，由于SystemTap静态分析放置探针的位置时离不开调试信息，一些用户空间事件需要给定PID或可执行文件的路径，此处我们直接指定全路径

process("PATH").function("function")：进入可执行文件PATH的用户空间函数function。允许使用通配符和.return后缀。
process("PATH").statement("statement")：代码中第一次执行statement的地方。

常用的可打印变量

tid(), 当前线程id
pid(), 当前进程id
uid(), 当前用户id
execname(), 当前进程名称
cpu(), 当前cpu编号
gettimeofday_s(), 秒时间戳
get_cycles(), 硬件周期计数器快照
pp(), 探测点事件名称
ppfunc(), 探测点触发的函数名称
thread_indent(n)，补充空格
带s会返回堆栈字符串
print_ubacktrace(), 打印用户空间栈

简单写一个，因为我们不知道具体的函数名称，只能一把梭写个通配符 * 将执行过程所有的函数都打印出来，并且打印对应调用的函数名称。

probe process("/usr/pgsql-15/bin/postgres").function("*") {
        printf("%s: %s\n", execname(), ppfunc());
}

现在就可以复现了，看下执行 drop table 后调用了哪些函数。

根据一些过往经验，md.c 就是涉及到和磁盘打交道的主文件，看下我们的脚本抓到了与 md 相关的有哪些内容

[postgres@xiongcc ~]$ cat systmptap.out | sort | uniq  | grep md
postgres: mdclose
postgres: mdexists
postgres: mdnblocks
postgres: _mdnblocks
postgres: mdopen
postgres: mdopenfork
postgres: mdunlink    👈🏻
postgres: mdunlinkfork  👈🏻

这里有两个函数，根据函数名就可猜到大概 mdunlink，unlink解除关系。

阅读一下注释

/*
 * mdunlink() -- Unlink a relation.
 *
 * Note that we're passed a RelFileNodeBackend --- by the time this is called,
 * there won't be an SMgrRelation hashtable entry anymore.
 *
 * forkNum can be a fork number to delete a specific fork, or InvalidForkNumber
 * to delete all forks.
 *
 * For regular relations, we don't unlink the first segment file of the rel,
 * but just truncate it to zero length, and record a request to unlink it after
 * the next checkpoint.  Additional segments can be unlinked immediately,
 * however.  Leaving the empty file in place prevents that relfilenode
 * number from being reused.  The scenario this protects us from is:
 * 1. We delete a relation (and commit, and actually remove its file).
 * 2. We create a new relation, which by chance gets the same relfilenode as
 *   the just-deleted one (OIDs must've wrapped around for that to happen).
 * 3. We crash before another checkpoint occurs.
 * During replay, we would delete the file and then recreate it, which is fine
 * if the contents of the file were repopulated by subsequent WAL entries.
 * But if we didn't WAL-log insertions, but instead relied on fsyncing the
 * file after populating it (as we do at wal_level=minimal), the contents of
 * the file would be lost forever.  By leaving the empty file until after the
 * next checkpoint, we prevent reassignment of the relfilenode number until
 * it's safe, because relfilenode assignment skips over any existing file.
 *
 * We do not need to go through this dance for temp relations, though, because
 * we never make WAL entries for temp rels, and so a temp rel poses no threat
 * to the health of a regular rel that has taken over its relfilenode number.
 * The fact that temp rels and regular rels have different file naming
 * patterns provides additional safety.
 *
 * All the above applies only to the relation's main fork; other forks can
 * just be removed immediately, since they are not needed to prevent the
 * relfilenode number from being recycled.  Also, we do not carefully
 * track whether other forks have been created or not, but just attempt to
 * unlink them unconditionally; so we should never complain about ENOENT.
 *
 * If isRedo is true, it's unsurprising for the relation to be already gone.
 * Also, we should remove the file immediately instead of queuing a request
 * for later, since during redo there's no possibility of creating a
 * conflicting relation.
 *
 * Note: any failure should be reported as WARNING not ERROR, because
 * we are usually not in a transaction anymore when this is called.
 */

十分清晰，对于普通的表，第一个 segment 段文件不会被unlink，仅仅是 truncate 为空文件，其余的 segment 段文件则会被立即删除，对于第一个 segment 段文件则添加请求至下一次checkpoint中，这么做是为了防止注释中所说的新建的表又正好获得了此删除文件的 oid，那么在 wal_level = minimal 情况下可能会发生数据丢失。而对于 free space map 和 visibility map 文件则会被立即删除。

static void
mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum, bool isRedo)
{
 char    *path;
 int   ret;

 path = relpath(rnode, forkNum);

 /*
  * Delete or truncate the first segment.
  */
 if (isRedo || forkNum != MAIN_FORKNUM || RelFileNodeBackendIsTemp(rnode))
 {
  if (!RelFileNodeBackendIsTemp(rnode))
  {
   /* Prevent other backends' fds from holding on to the disk space */
   ret = do_truncate(path);

   /* Forget any pending sync requests for the first segment */
   register_forget_request(rnode, forkNum, 0 /* first seg */ );
  }
  else
   ret = 0;

  /* Next unlink the file, unless it was already found to be missing */
  if (ret == 0 || errno != ENOENT)
  {
   ret = unlink(path);
   if (ret < 0 && errno != ENOENT)
    ereport(WARNING,
      (errcode_for_file_access(),
       errmsg("could not remove file \"%s\": %m", path)));
  }
 }
 else
 {
  /* Prevent other backends' fds from holding on to the disk space */
  ret = do_truncate(path);

  /* Register request to unlink first segment later */
  register_unlink_segment(rnode, forkNum, 0 /* first seg */ );
 }

 /*
  * Delete any additional segments.
  */
 if (ret >= 0)
 {
  char    *segpath = (char *) palloc(strlen(path) + 12);
  BlockNumber segno;

  /*
   * Note that because we loop until getting ENOENT, we will correctly
   * remove all inactive segments as well as active ones.
   */
  for (segno = 1;; segno++)
  {
   sprintf(segpath, "%s.%u", path, segno);

   if (!RelFileNodeBackendIsTemp(rnode))
   {
    /*
     * Prevent other backends' fds from holding on to the disk
     * space.
     */
    if (do_truncate(segpath) < 0 && errno == ENOENT)
     break;

    /*
     * Forget any pending sync requests for this segment before we
     * try to unlink.
     */
    register_forget_request(rnode, forkNum, segno);
   }

   if (unlink(segpath) < 0)
   {
    /* ENOENT is expected after the last segment... */
    if (errno != ENOENT)
     ereport(WARNING,
       (errcode_for_file_access(),
        errmsg("could not remove file \"%s\": %m", segpath)));
    break;
   }
  }
  pfree(segpath);
 }

 pfree(path);
}

让我们简单追踪一下，先创建一个大于 1GB 的表文件

postgres=# create table t1(id int primary key,info text);
CREATE TABLE
postgres=# insert into t1 select n,md5(random()::text) from generate_series(1,20000000) as n;
INSERT 0 20000000
postgres=# select pg_size_pretty(pg_relation_size('t1'));
 pg_size_pretty 
----------------
 1302 MB
(1 row)

postgres=# select pg_relation_filepath('t1');
 pg_relation_filepath 
----------------------
 base/13892/16434
(1 row)

现在有两个 segment 段文件

[postgres@xiongcc 13892]$ ls -lrth 16434*
-rw------- 1 postgres postgres 1.0G Sep  4 19:18 16434
-rw------- 1 postgres postgres 344K Sep  4 19:26 16434_fsm
-rw------- 1 postgres postgres 279M Sep  4 19:26 16434.1
-rw------- 1 postgres postgres  48K Sep  4 19:26 16434_vm

打个断点

Breakpoint 1, mdunlinkfork (rnode=..., forkNum=MAIN_FORKNUM, isRedo=false) at md.c:319
319             path = relpath(rnode, forkNum);
(gdb) c
Continuing.

Breakpoint 1, mdunlinkfork (rnode=..., forkNum=FSM_FORKNUM, isRedo=false) at md.c:319
319             path = relpath(rnode, forkNum);
(gdb) c
Continuing.

Breakpoint 1, mdunlinkfork (rnode=..., forkNum=VISIBILITYMAP_FORKNUM, isRedo=false) at md.c:319
319             path = relpath(rnode, forkNum);
(gdb) c
Continuing.

Breakpoint 1, mdunlinkfork (rnode=..., forkNum=INIT_FORKNUM, isRedo=false) at md.c:319
319             path = relpath(rnode, forkNum);
(gdb) c
Continuing.

可以看到第一次是主分支 MAIN_FORKNUM，接着是挨个处理其他分支，具体的流程这里就不追踪了，看看结果

[postgres@xiongcc 13892]$ ls -lrth 16434*
-rw------- 1 postgres postgres 0 Sep  4 19:29 16434

如前面分析，只保留了主分支的第一个 segment，其余的 FSM、VM都被移除了。那么什么情况下会移除失败？

if (unlink(segpath) < 0)
{
  /* ENOENT is expected after the last segment... */
  if (errno != ENOENT)
    ereport(WARNING,
       (errcode_for_file_access(),
        errmsg("could not remove file \"%s\": %m", segpath)));
  break;
}

可以看到 👆🏻，PostgreSQL使用的是标准C函数 unlink，成功返回0，失败返回 -1。不过值得注意的是执行unlink函数并不一定会真正的删除文件，它先会检查文件系统中此文件的连接数是否为1，如果不是1说明此文件还有其他链接对象，因此只对此文件的连接数进行减1操作。若连接数为1，并且在此时没有任何进程打开该文件，此内容才会真正地被删除掉。在有进程打开此文件的情况下，则暂时不会删除，直到所有打开该文件的进程都结束时文件就会被删除。

下面是一些关于文件操作的errcode，比如不是superuser，没有权限，只读的文件系统，IO失败，磁盘空间满了等等。当发生这些情况时，都会出现删除失败的情况。因此这位小伙伴出现的问题原因也就呼之欲出了，大概率是出现了 IO 错误。

/*
 * errcode_for_file_access --- add SQLSTATE error code to the current error
 *
 * The SQLSTATE code is chosen based on the saved errno value.  We assume
 * that the failing operation was some type of disk file access.
 *
 * NOTE: the primary error message string should generally include %m
 * when this is used.
 */
int
errcode_for_file_access(void)
{
 ErrorData  *edata = &errordata[errordata_stack_depth];

 /* we don't bother incrementing recursion_depth */
 CHECK_STACK_DEPTH();

 switch (edata->saved_errno)
 {
   /* Permission-denied failures */
  case EPERM:    /* Not super-user */
  case EACCES:   /* Permission denied */
#ifdef EROFS
  case EROFS:    /* Read only file system */
#endif
   edata->sqlerrcode = ERRCODE_INSUFFICIENT_PRIVILEGE;
   break;

   /* File not found */
  case ENOENT:   /* No such file or directory */
   edata->sqlerrcode = ERRCODE_UNDEFINED_FILE;
   break;

   /* Duplicate file */
  case EEXIST:   /* File exists */
   edata->sqlerrcode = ERRCODE_DUPLICATE_FILE;
   break;

   /* Wrong object type or state */
  case ENOTDIR:   /* Not a directory */
  case EISDIR:   /* Is a directory */
#if defined(ENOTEMPTY) && (ENOTEMPTY != EEXIST) /* same code on AIX */
  case ENOTEMPTY:   /* Directory not empty */
#endif
   edata->sqlerrcode = ERRCODE_WRONG_OBJECT_TYPE;
   break;

   /* Insufficient resources */
  case ENOSPC:   /* No space left on device */
   edata->sqlerrcode = ERRCODE_DISK_FULL;
   break;

  case ENFILE:   /* File table overflow */
  case EMFILE:   /* Too many open files */
   edata->sqlerrcode = ERRCODE_INSUFFICIENT_RESOURCES;
   break;

   /* Hardware failure */
  case EIO:    /* I/O error */
   edata->sqlerrcode = ERRCODE_IO_ERROR;
   break;

   /* All else is classified as internal errors */
  default:
   edata->sqlerrcode = ERRCODE_INTERNAL_ERROR;
   break;
 }

 return 0;     /* return value does not matter */
}

自己尝试复现了一下，好家伙直接 segment fault 了~

postgres=# drop table t1;
WARNING:  AbortTransaction while in COMMIT state
ERROR:  could not open file "base/13892/24584": Permission denied
PANIC:  cannot abort transaction 774, it was already committed
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

小结

小结一下，

PostgreSQL使用的标准C函数 unlink 移除文件，假如遇到了不是superuser，没有权限，只读的文件系统，IO失败，磁盘空间满了等情况，就可能出现此案例类似的情况，删除了表但是表文件还在，自己手动移除一下即可
删除表时，会立即删除0号 segment 段文件以外的其他文件，包括 fsm、vm、超过1GB的其他段文件，而第一个 segment 段文件会被截断，大小为0，但是文件还在，在下次checkpoint时彻底移除，这么做是为了防止注释中所说的新建的表又正好获得了此删除文件的 oid，那么在 wal_level = minimal 情况下可能会发生数据丢失
临时表会立马删除表文件

另外 Dtrace 是一项利器，DBA们可以掌握一下，比如利用 Dtrace 分析带有索引和不带有索引表的插入性能对比，insert 和 copy 的性能对比等等，下面是一个对比不同的 pending list 和 work_mem 对于GIN索引的性能对比 👇🏻

GIN索引为PostgreSQL数据库多值类型的倒排索引，一条记录可能涉及到多个GIN索引中的KEY，所以如果写入时实时合并索引，会导致IO急剧增加，写入RT必然增加。为了提高写入吞吐，PG允许用户开启GIN索引的延迟合并技术，开启后，数据会先写入pending list，并不是直接写入索引页，当pending list达到一定大小，或者autovacuum对应表时，会触发pending list合并到索引的动作。

gin_diagram

#pragma D option quiet
BEGIN {printf("%20s | %s\n", "WHAT", "TIME");}
pid$target::ginInsertCleanup:entry {
 flush = timestamp;
}
pid$target::ginInsertCleanup:return {
 printf("%20s | %d\n",
 "Flushing pending list", timestamp - flush);
}
pid$target::standard_ExecutorRun:entry {
 insert = timestamp;
}
pid$target::standard_ExecutorRun:return {
 printf("%20s | %d\n", "exec",
 timestamp - insert);
}

参考

https://spacewander.gitbooks.io/systemtapbeginnersguide_zh/content/3_2_SystemTapScripts.html

https://github.com/digoal/blog/blob/master/201810/20181025_01.md

https://zhuanlan.zhihu.com/p/479896736

https://wiki.postgresql.org/wiki/DTrace

继续滑动看下一个

PostgreSQL学徒

向上滑动看下一个

《鱿鱼游戏2》今天下午四点开播，网友无心上班了，导演悄悄剧透

人民日报征集“中美友好合作故事”，令人感奋

刘恺威近况曝光，父亲刘丹证实已分手，目前失业在家，没有资源

紧急通告！三高的“克星”终于被找到了！！不是吃素和控糖,而是多喝它....

话费充值活动来了：95元充值100元电话费！

使用动态追踪Dtrace分析问题

前言

现象

分析

源码分析

小结

参考

您可能也对以下帖子感兴趣

《鱿鱼游戏2》今天下午四点开播，网友无心上班了，导演悄悄剧透

人民日报征集“中美友好合作故事”，令人感奋

刘恺威近况曝光，父亲刘丹证实已分手，目前失业在家，没有资源

紧急通告！三高的“克星”终于被找到了！！不是吃素和控糖,而是多喝它....

话费充值活动来了：95元充值100元电话费！

生成图片，分享到微信朋友圈

使用动态追踪Dtrace分析问题

前言

现象

分析

源码分析

小结

参考

您可能也对以下帖子感兴趣