聊一聊基础备份与FPI

原创 xiongcc PostgreSQL学徒 2023-02-26

收录于合集 #备份恢复 9个

1前言

昨天早上一位同事找到我说："有个库 WAL 日志量很大，远程库重建总是导致磁盘告警，除了扩容还有其他解决办法吗？最后紧急将备份进程给kill了。" 现象很明了，在使用 pg_basebackup 重建远程库时，导致主库磁盘告警了，那有无优化手段以处理这种问题呢？一起分析一下！

2分析

pg_basebackup 是基于 pg_backup_start 和 pg_backup_stop 封装的工具，以非排它式的方式进行备份，在 15 以前，还是叫 pg_start_backup 和 pg_stop_backup，由于有 backup_label 的存在会导致一些问题（参照之前的文章），所以从 15 以后移除了排它式的备份。在 15 里面新增了不少特性，比如并行压缩，支持指定备份目标等，这里就不过多叨叨了。

pg_basebackup 支持在线热备，说白了就是不停机备份，那么它是如何保证一致性的呢？首先开始的时候会强制做一个检查点，将内存中的脏页全部刷盘，确保数据库是一致的状态，其次由于是热备，数据库还在持续地读写中，数据库的页面默认8KB，而操作系统一般是4KB

所以如果在内存数据刷盘得时候正好又去拷贝数据，很有可能只拷贝到一半的数据 👇🏻

为了解决这个块折断的问题，PostgreSQL 引入了全页写机制，每次检查点之后对页面的首次变更会记录整个页面到 WAL 日志中，redo 操作时通过 checksum 发现“部分写”的数据页，并将 WAL 中保存的这个完整数据页覆盖当前损坏的数据页，然后再继续 redo 就可以恢复整个数据库了。

通过全页写和检查点，以此推到一个一致性的状态，至于热备过程中产生的变更操作已经都记录在了WAL日志中，因此通过基础备份 + 备份期间产生的 WAL 就可以恢复到指定时间点了，只要确保备份期间产生的 WAL 归档号或者传输过去即可。所以 pg_basebackup 提供了可选的方式来处理 WAL

[postgres@xiongcc ~]$ pg_basebackup --help | grep -A 1 wal-method
  -X, --wal-method=none|fetch|stream
                         include required WAL files with specified method

none，不处理，由 DBA 自行处理，假如指定为none，pg_basebackup 会很贴心的打印一条提醒，如下👇🏻 要开启归档，否则需要确保备份期间的 WAL 都归档好了。
[postgres@xiongcc ~]$ pg_basebackup -Fp -Xn -D pgdata_bak -P -v --checkpoint=fast pg_basebackup: initiating base backup, waiting for checkpoint to complete pg_basebackup: checkpoint completed NOTICE: WAL archiving is not enabled; you must ensure that all required WAL segments are copied through other means to complete the backup ...
fetch，在备份结束之后再传输 WAL，因此需要把 wal_keep_size 设置的足够大以防止需要的 WAL 被移除，否则备份会失败
stream，默认的方式，会再额外启用一个 walsender 进程流式发送 WAL，这样可以解决 fetch模式带来的问题

因此按照我们的分析，在备份期间会保留这些 WAL 日志流式发送（当然随着正常的推进，目标端接收到了，那么 restart_lsn 也会往前推进，保留的 WAL 也会随之往前推进），假如一个库比较大并且很繁忙的话，备份就会耗时挺久，同事这个库就是1.5TB，其实也还好。

另外值得注意的是，pg_basebackup 并不会无脑全部扒拉下来，代码里会做判断，跳过一些不需要的文件，比如临时表、无日志表、统计信息临时文件等等

static int64
sendDir(const char *path, int basepathlen, bool sizeonly, List *tablespaces,
  bool sendtblspclinks, backup_manifest_info *manifest,
  const char *spcoid)
{
 DIR     *dir;
 struct dirent *de;
  ...
    
   /* Exclude all forks for unlogged tables except the init fork */
  if (isDbDir &&
   parse_filename_for_nontemp_relation(de->d_name, &relOidChars,
            &relForkNum))
  {
   /* Never exclude init forks */
   if (relForkNum != INIT_FORKNUM)
        
   
  /* Exclude temporary relations */
  if (isDbDir && looks_like_temp_rel_name(de->d_name))
  {
   elog(DEBUG2,
     "temporary relation file \"%s\" excluded from backup",
     de->d_name);

   continue;
  }
      
   
  /*
   * Exclude contents of directory specified by statrelpath if not set
   * to the default (pg_stat_tmp) which is caught in the loop above.
   */
  if (statrelpath != NULL && strcmp(pathbuf, statrelpath) == 0)
  {
   elog(DEBUG1, "contents of directory \"%s\" excluded from backup", statrelpath);
   size += _tarWriteDir(pathbuf, basepathlen, &statbuf, sizeonly);
   continue;
  }

那统计一下备份期间生成的 WAL，使用如下 SQL 查询

with tmp_file as (
  select t1.file,
      t1.file_ls,
      (pg_stat_file(t1.file)).size as size,
      (pg_stat_file(t1.file)).access as access,
      (pg_stat_file(t1.file)).modification as last_update_time,
      (pg_stat_file(t1.file)).change as change,
      (pg_stat_file(t1.file)).creation as creation,
      (pg_stat_file(t1.file)).isdir as isdir
   from (select dir||'/'||pg_ls_dir(t0.dir) as file,
          pg_ls_dir(t0.dir) as file_ls
       from ( select 'xxx'::text as dir
           --需要修改这个物理路径
           --select '/mnt/nas_dbbackup/archivelog'::text as dir
           --select setting as dir from pg_settings where name='log_directory'
          ) t0
      ) t1 
   where 1=1
   order by (pg_stat_file(file)).modification desc
) 
select to_char(date_trunc('day',tf0.last_update_time),'yyyymmdd') as day_id,
    sum(case when date_part('hour',tf0.last_update_time) >=0 and date_part('hour',tf0.last_update_time) <24 then 1 else 0 end) as wal_num_all,
    sum(case when date_part('hour',tf0.last_update_time) >=0 and date_part('hour',tf0.last_update_time) <1 then 1 else 0 end) as wal_num_00_01,
    sum(case when date_part('hour',tf0.last_update_time) >=1 and date_part('hour',tf0.last_update_time) <2 then 1 else 0 end) as wal_num_01_02,
    sum(case when date_part('hour',tf0.last_update_time) >=2 and date_part('hour',tf0.last_update_time) <3 then 1 else 0 end) as wal_num_02_03,
    sum(case when date_part('hour',tf0.last_update_time) >=3 and date_part('hour',tf0.last_update_time) <4 then 1 else 0 end) as wal_num_03_04,
    sum(case when date_part('hour',tf0.last_update_time) >=4 and date_part('hour',tf0.last_update_time) <5 then 1 else 0 end) as wal_num_04_05,
    sum(case when date_part('hour',tf0.last_update_time) >=5 and date_part('hour',tf0.last_update_time) <6 then 1 else 0 end) as wal_num_05_06,
    sum(case when date_part('hour',tf0.last_update_time) >=6 and date_part('hour',tf0.last_update_time) <7 then 1 else 0 end) as wal_num_06_07,
    sum(case when date_part('hour',tf0.last_update_time) >=7 and date_part('hour',tf0.last_update_time) <8 then 1 else 0 end) as wal_num_07_08,
    sum(case when date_part('hour',tf0.last_update_time) >=8 and date_part('hour',tf0.last_update_time) <9 then 1 else 0 end) as wal_num_08_09,
    sum(case when date_part('hour',tf0.last_update_time) >=9 and date_part('hour',tf0.last_update_time) <10 then 1 else 0 end) as wal_num_09_10,
    sum(case when date_part('hour',tf0.last_update_time) >=10 and date_part('hour',tf0.last_update_time) <11 then 1 else 0 end) as wal_num_10_11,
    sum(case when date_part('hour',tf0.last_update_time) >=11 and date_part('hour',tf0.last_update_time) <12 then 1 else 0 end) as wal_num_11_12,
    sum(case when date_part('hour',tf0.last_update_time) >=12 and date_part('hour',tf0.last_update_time) <13 then 1 else 0 end) as wal_num_12_13,
    sum(case when date_part('hour',tf0.last_update_time) >=13 and date_part('hour',tf0.last_update_time) <14 then 1 else 0 end) as wal_num_13_14,
    sum(case when date_part('hour',tf0.last_update_time) >=14 and date_part('hour',tf0.last_update_time) <15 then 1 else 0 end) as wal_num_14_15,
    sum(case when date_part('hour',tf0.last_update_time) >=15 and date_part('hour',tf0.last_update_time) <16 then 1 else 0 end) as wal_num_15_16,
    sum(case when date_part('hour',tf0.last_update_time) >=16 and date_part('hour',tf0.last_update_time) <17 then 1 else 0 end) as wal_num_16_17,
    sum(case when date_part('hour',tf0.last_update_time) >=17 and date_part('hour',tf0.last_update_time) <18 then 1 else 0 end) as wal_num_17_18,
    sum(case when date_part('hour',tf0.last_update_time) >=18 and date_part('hour',tf0.last_update_time) <19 then 1 else 0 end) as wal_num_18_19,
    sum(case when date_part('hour',tf0.last_update_time) >=19 and date_part('hour',tf0.last_update_time) <20 then 1 else 0 end) as wal_num_19_20,
    sum(case when date_part('hour',tf0.last_update_time) >=20 and date_part('hour',tf0.last_update_time) <21 then 1 else 0 end) as wal_num_20_21,
    sum(case when date_part('hour',tf0.last_update_time) >=21 and date_part('hour',tf0.last_update_time) <22 then 1 else 0 end) as wal_num_21_22,
    sum(case when date_part('hour',tf0.last_update_time) >=22 and date_part('hour',tf0.last_update_time) <23 then 1 else 0 end) as wal_num_22_23, 
    sum(case when date_part('hour',tf0.last_update_time) >=23 and date_part('hour',tf0.last_update_time) <24 then 1 else 0 end) as wal_num_23_24
from tmp_file tf0
where 1=1
 and tf0.file_ls not in ('archive_status')
group by to_char(date_trunc('day',tf0.last_update_time),'yyyymmdd')
order by to_char(date_trunc('day',tf0.last_update_time),'yyyymmdd') desc
;

当然这个 SQL 有一些误差，因为是统计现有的，假如被回收了的话会有误差。不过可以用来作为参考，可以看到，WAL 产生的频率挺快 👇🏻

wal_num_07_08 | 253
wal_num_08_09 | 101
wal_num_09_10 | 126
...

在我们生产中，将 WAL 的大小改为了 512 MB，因此一个小时生成了十分可观的 WAL 日志。兜兜绕绕又回到了之前的话题，WAL堆积 / WAL暴涨的因素其实就那么几个：

老样子使用 pg_waldump -z 检查一下 FPI 的比例，随便检查了一个可以看到接近 60%多，看了一下参数， max_wal_size 是 2GB，太小了，以 9 点到 10 来算的话，每分钟大概 2 个，将checkpoint_timeout 调整到 20min 的话，那么就是 20 * 2 = 40 个，总共 40 * 512 / 1024 = 20GB，所以调整一下 checkpoint_timeout=20min 和 max_wal_size = 25GB，调整了之后下午再看的时候 FPI 只有 7% 的比例了，可以一定程度上降低 WAL 的量。

除了调整 checkpoint_timeout 之外，在之前的文章中有聊到 FPI，毕竟调整检查点周期是有代价的，首先是索引列最好不用 UUID，用序列，👇🏻

而 UUID 的话，因为无序性，随机插入，就会涉及到叶子节点的分裂与合并，导致 FPI

另外再考虑一个很常见的场景：表上有 N 多个索引，那么这些索引也会导致 FPI，看个例子，假设目前的状态如下，有两个索引

然后假如是使用了 HOT 技术的话，那么就会导致一次 FPI（堆表数据块更新，索引未更新）

但是假如是普通插入的话，堆表数据没变，但是索引变了。让我们回顾一下 FPI 的原理

PostgreSQL 在 checkpoint 之后在对数据页面的第一次写的时候会将整个数据页面写到 WAL 里面。当出现主机断电或者OS崩溃时，redo 操作时通过 checksum 发现“部分写”的数据页，并将 WAL 中保存的这个完整数据页覆盖当前损坏的数据页，然后再继续 redo 就可以恢复整个数据库了。

因此这种情况下普通的插入在上一次的基础上没有导致一次额外的 FPI，但是索引更新了，索引数据块写入了新的数据，所以又新增了一次 FPI。那么为什么会推荐使用 HOT呢？再看个例子，假如是非 HOT 的话，那么会导致一次堆表 FPI + 索引 FPI，因此优化 FPI 的一个方式便是使用 HOT。

3小结

既然分析清楚了 pg_basebackup 的原理，可以看到其实对于大库我们能做的手段十分有限，只能提升带宽，提升读写速度，降低备份期间 WAL 的生成量这些迂回手段了，要么就使用存储级别或者文件系统级别的快照，但是需要自己妥善处理备份期间的 WAL 日志（backup_label可以帮助我们）。因此这也说明了在 PostgreSQL 里面一个很常见的坑，假如大库备库断了，假如没有归档就是一件比较麻烦的事情，要么就使用复制槽一劳永逸（监控必须得带上），要么就得在业务低谷期间进行重建了。

《鱿鱼游戏2》今天下午四点开播，网友无心上班了，导演悄悄剧透

人民日报征集“中美友好合作故事”，令人感奋

刘恺威近况曝光，父亲刘丹证实已分手，目前失业在家，没有资源

紧急通告！三高的“克星”终于被找到了！！不是吃素和控糖,而是多喝它....

话费充值活动来了：95元充值100元电话费！

聊一聊基础备份与FPI

1前言

2分析

3小结

您可能也对以下帖子感兴趣

《鱿鱼游戏2》今天下午四点开播，网友无心上班了，导演悄悄剧透

人民日报征集“中美友好合作故事”，令人感奋

刘恺威近况曝光，父亲刘丹证实已分手，目前失业在家，没有资源

紧急通告！三高的“克星”终于被找到了！！不是吃素和控糖,而是多喝它....

话费充值活动来了：95元充值100元电话费！

生成图片，分享到微信朋友圈

聊一聊基础备份与FPI

1前言

2分析

3小结

您可能也对以下帖子感兴趣