活久见——不同用户不同执行计划
1前言
这几天被一个生产问题折磨得死去活来,CPU愣是给我干烧了,在上周某天晚上,业务反馈业务响应速度骤降,但是DBA上去执行慢SQL却特别快,经过排查,居然是不同的用户,生成的执行计划不一样!DBA表示惊呆了,虽然我很早之前也分享过一篇类似的案例(11的原生分区) 👉🏻 不同用户的执行计划居然会不一样?,但是当时没有深究,结果好巧不巧,前阵子某位同事踩了一次(13的原生分区),没过几天另一位同事又踩了一次,并且都是分区表。接二连三掉到坑里,看样子躲是躲不过了,看样子只有解决了才能睡个安稳觉了。
2现象
现象很明了,但是成因很费解,此处以 test 表作为替代脱敏
可以看到仅仅是切换了一下用户,生成的执行计划居然有差异!根据过往经验(经验不足),不同用户执行计划不一样,可能是用户级配置了某些优化器参数
postgres=# alter user u1 set enable_seqscan to off;
ALTER ROLE
postgres=# select * from pg_user;
usename | usesysid | usecreatedb | usesuper | userepl | usebypassrls | passwd | valuntil | useconfig
----------+----------+-------------+----------+---------+--------------+----------+----------+----------------------
postgres | 10 | t | t | t | t | ******** | |
u1 | 16452 | f | f | t | f | ******** | | {enable_seqscan=off}
u2 | 25165 | f | f | f | f | ******** | |
(3 rows)
但是排查了一下,并未发现有任何特定的设置。至此常规的思维又卡住了,莫非又是什么离奇的BUG?
根据前面的现象,可以看到超级用户预估的行数和实际执行的行数接近,扫描6月7月均使用的顺序扫描(都是240W),但是业务用户扫描6月分区表却选择了索引扫描,的确若按照优化器预估的行数 (rows=1)来选择, 走索引扫描是最快的,但是真实情况却返回了240W行,因此优化器不得不频繁回表,导致大量的离散IO,那天业务雪崩变慢的原因也就说得通了。
至此,原因明了了,但是原因呢?为什么不同的用户执行计划会不一样?假如不同用户看到的统计信息是不一样的,这样就乱套了,每个用户跑出来的SQL都可能不一样,岂不是还要看脸决定跑的孰快孰慢。
3SQL执行逻辑
遇事不决,只能看代码了。由于涉及到执行计划的生成,又要回顾一下SQL的执行原理了,具体细节这里就不再赘述。友情提示:下方涉及到大量优化器代码,可能会枯燥,没有一定基础的同学看起来会比较费力。
一条SQL进来之后,会经过Parser → Analyzer → Rewriter → Planner → Executor这一系列步骤。若是DDL语句,无需进行优化,到utility模块处理,对于DML则需要按照完整的流程。在 Planner 阶段,会经过逻辑优化与物理优化,最终生成一颗最优的计划树。
查询规划整体流程参照下图 (巨人肩膀:https://www.cnblogs.com/flying-tiger/p/6063709.html)
standard_planner 作为优化器的入口,由于流程过于复杂冗长,这里我们只关心其中几个核心函数:
/* primary planning entry point (may recurse for subqueries) */
root = subquery_planner(glob, parse, NULL,
false, tuple_fraction);
/* Select best Path and turn it into a Plan */
final_rel = fetch_upper_rel(root, UPPERREL_FINAL, NULL);
best_path = get_cheapest_fractional_path(final_rel, tuple_fraction);
top_plan = create_plan(root, best_path);
subquery_planner会做一些预处理,比如提升子链接 (pull_up_sublinks)、提升子查询(pull_up_subqueries)、函数内联 (inline_set_returning_functions) 等等,接着选择一条成本最低的路劲,接着基于此路径构建执行计划 subquery_planner → get_cheapest_fractional_path → create_plan。
其中 subquery_planner 做完预处理的步骤之后,会进入到 grouping_planner 做进一步的优化,参照下图
/*
* Do the main planning. If we have an inherited target relation, that
* needs special processing, else go straight to grouping_planner.
*/
if (parse->resultRelation &&
rt_fetch(parse->resultRelation, parse->rtable)->inh)
inheritance_planner(root);
else
grouping_planner(root, false, tuple_fraction);
然后进入到 query_planner,生成一条成本最低的路径
/*
* Generate the best unsorted and presorted paths for the scan/join
* portion of this Query, ie the processing represented by the
* FROM/WHERE clauses. (Note there may not be any presorted paths.)
* We also generate (in standard_qp_callback) pathkey representations
* of the query's sort clause, distinct clause, etc.
*/
current_rel = query_planner(root, tlist,
standard_qp_callback, &qp_extra);
/*
* query_planner
* Generate a path (that is, a simplified plan) for a basic query,
* which may involve joins but not any fancier features.
*
* Since query_planner does not handle the toplevel processing (grouping,
* sorting, etc) it cannot select the best path by itself. Instead, it
* returns the RelOptInfo for the top level of joining, and the caller
* (grouping_planner) can choose among the surviving paths for the rel.
*
* root describes the query to plan
* tlist is the target list the query should produce
* (this is NOT necessarily root->parse->targetList!)
* qp_callback is a function to compute query_pathkeys once it's safe to do so
* qp_extra is optional extra data to pass to qp_callback
*
* Note: the PlannerInfo node also includes a query_pathkeys field, which
* tells query_planner the sort order that is desired in the final output
* plan. This value is *not* available at call time, but is computed by
* qp_callback once we have completed merging the query's equivalence classes.
* (We cannot construct canonical pathkeys until that's done.)
*/
RelOptInfo *
query_planner(PlannerInfo *root, List *tlist,
query_pathkeys_callback qp_callback, void *qp_extra)
{
Query *parse = root->parse;
List *joinlist;
RelOptInfo *final_rel;
Index rti;
double total_pages;
/*
* If the query has an empty join tree, then it's something easy like
* "SELECT 2+2;" or "INSERT ... VALUES()". Fall through quickly.
*/
if (parse->jointree->fromlist == NIL)
{
/* We need a dummy joinrel to describe the empty set of baserels */
final_rel = build_empty_join_rel(root);
/*
* If query allows parallelism in general, check whether the quals are
* parallel-restricted. (We need not check final_rel->reltarget
* because it's empty at this point. Anything parallel-restricted in
* the query tlist will be dealt with later.)
*/
if (root->glob->parallelModeOK)
final_rel->consider_parallel =
is_parallel_safe(root, parse->jointree->quals);
/* The only path for it is a trivial Result path */
add_path(final_rel, (Path *)
create_result_path(root, final_rel,
final_rel->reltarget,
(List *) parse->jointree->quals));
/* Select cheapest path (pretty easy in this case...) */
set_cheapest(final_rel);
...
然后最终传递给 make_one_rel,返回一个RelOptInfo节点,然后用于创建最终计划。
/*
* make_one_rel
* Finds all possible access paths for executing a query, returning a
* single rel that represents the join of all base rels in the query.
*/
RelOptInfo *
make_one_rel(PlannerInfo *root, List *joinlist)
{
RelOptInfo *rel;
Index rti;
/*
* Construct the all_baserels Relids set.
*/
/* Mark base rels as to whether we care about fast-start plans */
set_base_rel_consider_startup(root);
/*
* Compute size estimates and consider_parallel flags for each base rel,
* then generate access paths.
*/
set_base_rel_sizes(root);
set_base_rel_pathlists(root);
SQL语句说白了就是从数据库中获取元组,之后再进行增删改查的操作。因此对于一个查询计划来说,重要的是告诉查询执行模块如何获取到要操作的元组。而这些元组要么来自于某张表,要么来自于一些基本表连接而成的"连接表"。对于这些连接表来说,会存在多种不同的连接方式,从而形成多种连接树(逻辑结构)。这样的每一棵树在postgresql中都成为一条路径。查询规划的目的莫过于从这些路径中选取一条最优的路径并生成对应的查询计划。
第一步主要涉及表大小的估算,也就是我们在 explain 中看到的成本和行数,接着 set_base_rel_pathlists → set_plain_rel_pathlist 这里进入到索引相关路径的构建,由于代码过于繁琐,这里就直接贴一下流程:create_index_paths → get_index_paths → build_index_paths → create_index_path → cost_index
/*
* set_plain_rel_pathlist
* Build access paths for a plain relation (no subquery, no inheritance)
*/
static void
set_plain_rel_pathlist(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte)
{
…
/* Consider sequential scan */
add_path(rel, create_seqscan_path(root, rel, required_outer, 0));
/* If appropriate, consider parallel sequential scan */
if (rel->consider_parallel && required_outer == NULL)
create_plain_partial_paths(root, rel);
/* Consider index scans */
create_index_paths(root, rel);
/* Consider TID scans */
create_tidscan_paths(root, rel);
}
/*
* build_index_paths
* Given an index and a set of index clauses for it, construct zero
* or more IndexPaths. It also constructs zero or more partial IndexPaths.
给定一个索引和一组索引子句,构造零个或多个IndexPath。它还能构造零个或多个部分IndexPath。
*
* We return a list of paths because (1) this routine checks some cases
* that should cause us to not generate any IndexPath, and (2) in some
* cases we want to consider both a forward and a backward scan, so as
* to obtain both sort orders. Note that the paths are just returned
* to the caller and not immediately fed to add_path().
最终进入到 cost_index 中,进入到索引的代码估算。
/*
* cost_index
* Determines and returns the cost of scanning a relation using an index.
*
* 'path' describes the indexscan under consideration, and is complete
* except for the fields to be set by this routine
* 'loop_count' is the number of repetitions of the indexscan to factor into
* estimates of caching behavior
...
...
/*
* In the perfectly correlated case, the number of pages touched by
* each scan is selectivity * table_size, and we can use the Mackert
* and Lohman formula at the page level to estimate how much work is
* saved by caching across scans. We still assume all the fetches are
* random, though, which is an overestimate that's hard to correct for
* without double-counting the cache effects. (But in most cases
* where such a plan is actually interesting, only one page would get
* fetched per scan anyway, so it shouldn't matter much.)
*/
pages_fetched = ceil(indexSelectivity * (double) baserel->pages);
pages_fetched = index_pages_fetched(pages_fetched * loop_count,
baserel->pages,
(double) index->pages,
root);
索引预估返回的页数可以看到由 indexSelectivity(索引选择率) 乘上页数,每一类索引都有自己的花费基本值估算函数,比如最常见的 Btree 估算函数是 btcostestimate 函数
/*
* If index is unique and we found an '=' clause for each column, we can
* just assume numIndexTuples = 1 and skip the expensive
* clauselist_selectivity calculations. However, a ScalarArrayOp or
* NullTest invalidates that theory, even though it sets eqQualHere.
*/
if (index->unique &&
indexcol == index->nkeycolumns - 1 &&
eqQualHere &&
!found_saop &&
!found_is_null_op)
numIndexTuples = 1.0;
else
{
List *selectivityQuals;
Selectivity btreeSelectivity;
/*
* If the index is partial, AND the index predicate with the
* index-bound quals to produce a more accurate idea of the number of
* rows covered by the bound conditions.
*/
selectivityQuals = add_predicate_to_quals(index, indexBoundQuals);
btreeSelectivity = clauselist_selectivity(root, selectivityQuals, ---👈🏻在这
index->rel->relid,
JOIN_INNER,
NULL);
numIndexTuples = btreeSelectivity * index->rel->tuples;
这里就进入最关键的流程了, btreeSelectivity = clauselist_selectivity,这一块我在执行计划篇章有过详细的源码剖析,clauselist_selectivity → clause_selectivity 。当我调试了几十次之后,发现每次经过 restriction_selectivity 这个步骤之后,不同用户产生不同执行计划的分叉点就产生了
/*
* clause_selectivity -
* Compute the selectivity of a general boolean expression clause.
*
* The clause can be either a RestrictInfo or a plain expression. If it's
* a RestrictInfo, we try to cache the selectivity for possible re-use,
* so passing RestrictInfos is preferred.
*
* varRelid is either 0 or a rangetable index
*/
...
...
else if (is_opclause(clause) || IsA(clause, DistinctExpr))
{
OpExpr *opclause = (OpExpr *) clause;
Oid opno = opclause->opno;
if (treat_as_join_clause(clause, rinfo, varRelid, sjinfo))
{
/* Estimate selectivity for a join clause. */
s1 = join_selectivity(root, opno,
opclause->args,
opclause->inputcollid,
jointype,
sjinfo);
}
else
{
/* Estimate selectivity for a restriction clause. */
s1 = restriction_selectivity(root, opno,
opclause->args,
opclause->inputcollid,
varRelid);
}
超级用户(正常),s1是选择率
业务用户
可以看到超级用户计算出来的选择率是1(<=now() 的选择率),而业务用户计算出来的选择率是0.5,回顾一下我在执行计划篇章写过的内容,对于多列选择率,默认情况下,PostgreSQL会将各列的选择率相乘,但是优化器并没有这么stupid,他也有自己的一系列算法。
举个栗子,下面这几条SQL如何估算返回的行数?
where a = xx or b =xx where a > xx or a < xx 同一个变量的范围查询 where a not in (xx) where a is not null where a > xxx and b < xx
对于第二种情况,同一个变量的范围查询,即 where a > xxx and a < xxx,他的选择率是 rqlist->hibound + rqlist->lobound - 1.0,按照我们刚刚调试的结果,第一个谓词 (date_start_time >='2023-06-16 22:49:46' )的选择率是 0.49724336722683249,第二个谓词 (date_start_time <= now() )的选择率是 0.5,那么 0.5 + 0.49724336722683249 - 1 = - 0.00275663277,大于 -0.01
// 我们还识别"范围查询",例如"x > 34 AND x < 42"。如果子句是其运算符具有 scalarltsel
// 或相关函数作为其约束选择性估计器的约束 opclause,则子句被视为可能的范围查询成分。我们
// 将引用相同变量的这种形式的子句配对。这种不可配对的子句以正常方式简单地乘以选择性乘积。但
// 是当我们找到一对时,我们知道选择性代表了列范围内下限和上限的相对位置,因此我们可以将其计
// 算为 hisel + osel - 1,而不是将其计算为 hisel * lostl。(为了可视化这一点,假设
// hisel 是范围低于上限的比率,而 lossl 是高于下限的比率;因此 hisel 可以直接解释为0..1
// 值,但我们需要将 lossl 转换为 1- lossl 在将其解释为值之前。那么可用范围是 1-losel +
// hisel。但是,这个计算双重排除了空值,所以我们真的需要 hisel + lostl + null_frac- 1
* We also recognize "range queries", such as "x > 34 AND x < 42". Clauses
* are recognized as possible range query components if they are restriction
* opclauses whose operators have scalarltsel or a related function as their
* restriction selectivity estimator. We pair up clauses of this form that
* refer to the same variable. An unpairable clause of this kind is simply
* multiplied into the selectivity product in the normal way. But when we
* find a pair, we know that the selectivities represent the relative
* positions of the low and high bounds within the column's range, so instead
* of figuring the selectivity as hisel * losel, we can figure it as hisel +
* losel - 1. (To visualize this, see that hisel is the fraction of the range
* below the high bound, while losel is the fraction above the low bound; so
* hisel can be interpreted directly as a 0..1 value but we need to convert
* losel to 1-losel before interpreting it as a value. Then the available
* range is 1-losel to hisel. However, this calculation double-excludes
* nulls, so really we need hisel + losel + null_frac - 1.)
else
{
s2 = rqlist->hibound + rqlist->lobound - 1.0;
/* Adjust for double-exclusion of NULLs */
s2 += nulltestsel(root, IS_NULL, rqlist->var,
varRelid, jointype, sjinfo);
/*
* A zero or slightly negative s2 should be converted into a
* small positive value; we probably are dealing with a very
* tight range and got a bogus result due to roundoff errors.
* However, if s2 is very negative, then we probably have
* default selectivity estimates on one or both sides of the
* range that we failed to recognize above for some reason.
*/
if (s2 <= 0.0)
{
if (s2 < -0.01) ---👈🏻代码流程
{
/*
* No data available --- use a default estimate that
* is small, but not real small.
*/
s2 = DEFAULT_RANGE_INEQ_SEL;
}
else
{
/*
* It's just roundoff error; use a small positive
* value
*/
s2 = 1.0e-10;
}
}
}
所以他会走到else的流程中,那么选择率就会被计算成 s2 = 1.0e-10,也就是 0.000000001,那么乘以元组数,所以预估出来的 rows = 1 就是这么计算出来的。
至此,优化器预估的 1 行的原因找到了,但是另外一个问题抛出来了——为什么超级用户计算出来的选择率是1,而业务用户是0.5?看样子需要深入分析 restriction_selectivity 这个函数做了什么了。
4restriction_selectivity
在此之前,先让我们了解一下什么是 restriction,顾名思义——限制,
/*
* Restriction clause info.
*
* We create one of these for each AND sub-clause of a restriction condition
* (WHERE or JOIN/ON clause). Since the restriction clauses are logically
* ANDed, we can use any one of them or any subset of them to filter out
* tuples, without having to evaluate the rest. The RestrictInfo node itself
* stores data used by the optimizer while choosing the best query plan.
*
* If a restriction clause references a single base relation, it will appear
* in the baserestrictinfo list of the RelOptInfo for that base rel.
...
*/
typedef struct RestrictInfo
{
NodeTag type;
Expr *clause; /* the represented clause of WHERE or JOIN */
bool is_pushed_down; /* true if clause was pushed down in level */
bool outerjoin_delayed; /* true if delayed by lower outer join */
bool can_join; /* see comment above */
bool pseudoconstant; /* see comment above */
bool leakproof; /* true if known to contain no leaked Vars */
Index security_level; /* see comment above */
/* The set of relids (varnos) actually referenced in the clause: */
Relids clause_relids;
/* The set of relids required to evaluate the clause: */
Relids required_relids;
/* If an outer-join clause, the outer-side relations, else NULL: */
Relids outer_relids;
/* The relids used in the clause that are nullable by lower outer joins: */
Relids nullable_relids;
/* These fields are set for any binary opclause: */
Relids left_relids; /* relids in left side of clause */
Relids right_relids; /* relids in right side of clause */
/* This field is NULL unless clause is an OR clause: */
Expr *orclause; /* modified clause with RestrictInfos */
/* This field is NULL unless clause is potentially redundant: */
EquivalenceClass *parent_ec; /* generating EquivalenceClass */
/* cache space for cost and selectivity */
QualCost eval_cost; /* eval cost of clause; -1 if not yet set */
Selectivity norm_selec; /* selectivity for "normal" (JOIN_INNER)
* semantics; -1 if not yet set; >1 means a
* redundant clause */
Selectivity outer_selec; /* selectivity for outer join semantics; -1 if
* not yet set */
/* valid if clause is mergejoinable, else NIL */
List *mergeopfamilies; /* opfamilies containing clause operator */
/* cache space for mergeclause processing; NULL if not yet set */
EquivalenceClass *left_ec; /* EquivalenceClass containing lefthand */
EquivalenceClass *right_ec; /* EquivalenceClass containing righthand */
EquivalenceMember *left_em; /* EquivalenceMember for lefthand */
EquivalenceMember *right_em; /* EquivalenceMember for righthand */
List *scansel_cache; /* list of MergeScanSelCache structs */
/* transient workspace for use while considering a specific join path */
bool outer_is_left; /* T = outer var on left, F = on right */
/* valid if clause is hashjoinable, else InvalidOid: */
Oid hashjoinoperator; /* copy of clause operator */
/* cache space for hashclause processing; -1 if not yet set */
Selectivity left_bucketsize; /* avg bucketsize of left side */
Selectivity right_bucketsize; /* avg bucketsize of right side */
Selectivity left_mcvfreq; /* left side's most common val's freq */
Selectivity right_mcvfreq; /* right side's most common val's freq */
} RestrictInfo;
我们可以简单理解成——对于查询中每个表,Postgres 都会生成两个"限制"子句列表:
基本限制子句:WHERE 子句的一部分,并且是仅涉及表本身的表达式 Join 子句:通常是 JOIN 子句的一部分,以及涉及多个表的表达式,比如 on test.id = test2.id
之所以称这些"限制"子句是因为它们会"限制"(或过滤)从表返回的数据量。
/*
* restriction_selectivity
*
* Returns the selectivity of a specified restriction operator clause.
* This code executes registered procedures stored in the
* operator relation, by calling the function manager.
*
* See clause_selectivity() for the meaning of the additional parameters.
*/
Selectivity
restriction_selectivity(PlannerInfo *root,
Oid operatorid,
List *args,
Oid inputcollid,
int varRelid)
{
RegProcedure oprrest = get_oprrest(operatorid);
float8 result;
/*
* if the oprrest procedure is missing for whatever reason, use a
* selectivity of 0.5
*/
if (!oprrest)
return (Selectivity) 0.5;
result = DatumGetFloat8(OidFunctionCall4Coll(oprrest,
inputcollid,
PointerGetDatum(root),
ObjectIdGetDatum(operatorid),
PointerGetDatum(args),
Int32GetDatum(varRelid)));
if (result < 0.0 || result > 1.0)
elog(ERROR, "invalid restriction selectivity: %f", result);
return (Selectivity) result;
}
根据注释,我们可以了解到 restriction_selectivity 返回某个操作符字符的选择率,系统表 pg_operator 中记录了每个操作符对应的函数
oprrest regproc (references pg_proc.oid):Restriction selectivity estimation function for this operator (zero if none)
oprjoin regproc (references pg_proc.oid):Join selectivity estimation function for this operator (zero if none)
经过查询,用到的函数是scalarlesel,其实我在执行计划篇章里面也有写过
#define F_SCALARLTSEL 103
#define F_SCALARLESEL 336
#define F_SCALARGTSEL 104
#define F_SCALARGESEL 337
postgres=# select oid,proname from pg_proc where oid in ('103','104','336','337');
oid | proname
-----+-------------
103 | scalarltsel ---less than,小于
104 | scalargtsel ---greater than,大于
336 | scalarlesel ---less equal,小于等于
337 | scalargesel ---greate equal,大于等于
(4 rows)
查询引擎在查询语法树的WHERE子句中识别出比较条件,再到pg_operator中根据操作符和数据类型找到oprrest为scalarlesel,这是通用的标量数据类型的小于等于操作符的代价估算函数,然后在ineq_histogram_selectivity进行直方图的估算和高频值的估算mcv_selectivity
/*
* scalarineqsel - Selectivity of "<", "<=", ">", ">=" for scalars.
*
* This is the guts of scalarltsel/scalarlesel/scalargtsel/scalargesel.
* The isgt and iseq flags distinguish which of the four cases apply.
*
* The caller has commuted the clause, if necessary, so that we can treat
* the variable as being on the left. The caller must also make sure that
* the other side of the clause is a non-null Const, and dissect that into
* a value and datatype. (This definition simplifies some callers that
* want to estimate against a computed value instead of a Const node.)
*
* This routine works for any datatype (or pair of datatypes) known to
* convert_to_scalar(). If it is applied to some other datatype,
* it will return an approximate estimate based on assuming that the constant
* value falls in the middle of the bin identified by binary search.
*/
static double
scalarineqsel(PlannerInfo *root, Oid operator, bool isgt, bool iseq,
VariableStatData *vardata, Datum constval, Oid consttype)
{
Form_pg_statistic stats;
FmgrInfo opproc;
double mcv_selec,
hist_selec,
sumcommon;
double selec;
if (!HeapTupleIsValid(vardata->statsTuple))
{
/* no stats available, so default result */
return DEFAULT_INEQ_SEL;
}
stats = (Form_pg_statistic) GETSTRUCT(vardata->statsTuple);
fmgr_info(get_opcode(operator), &opproc);
/*
* If we have most-common-values info, add up the fractions of the MCV
* entries that satisfy MCV OP CONST. These fractions contribute directly
* to the result selectivity. Also add up the total fraction represented
* by MCV entries.
*/
mcv_selec = mcv_selectivity(vardata, &opproc, constval, true,
&sumcommon);
/*
* If there is a histogram, determine which bin the constant falls in, and
* compute the resulting contribution to selectivity.
*/
hist_selec = ineq_histogram_selectivity(root, vardata,
&opproc, isgt, iseq,
constval, consttype);
...
而这两个函数里面都有一个关键的地方
/*
* ineq_histogram_selectivity - Examine the histogram for scalarineqsel
*
* Determine the fraction of the variable's histogram population that
* satisfies the inequality condition, ie, VAR < (or <=, >, >=) CONST.
* The isgt and iseq flags distinguish which of the four cases apply.
*
* Returns -1 if there is no histogram (valid results will always be >= 0).
*
* Note that the result disregards both the most-common-values (if any) and
* null entries. The caller is expected to combine this result with
* statistics for those portions of the column population.
*/
static double
ineq_histogram_selectivity(PlannerInfo *root,
VariableStatData *vardata,
FmgrInfo *opproc, bool isgt, bool iseq,
Datum constval, Oid consttype)
{
double hist_selec;
AttStatsSlot sslot;
hist_selec = -1.0;
/*
* Someday, ANALYZE might store more than one histogram per rel/att,
* corresponding to more than one possible sort ordering defined for the
* column type. However, to make that work we will need to figure out
* which staop to search for --- it's not necessarily the one we have at
* hand! (For example, we might have a '<=' operator rather than the '<'
* operator that will appear in staop.) For now, assume that whatever
* appears in pg_statistic is sorted the same way our operator sorts, or
* the reverse way if isgt is true.
*/
if (HeapTupleIsValid(vardata->statsTuple) &&
statistic_proc_security_check(vardata, opproc->fn_oid) &&
get_attstatsslot(&sslot, vardata->statsTuple,
STATISTIC_KIND_HISTOGRAM, InvalidOid,
ATTSTATSSLOT_VALUES))
没错——statistic_proc_security_check,安全检查!让我们看看这个函数里面做了什么
/*
* Check whether it is permitted to call func_oid passing some of the
* pg_statistic data in vardata. We allow this either if the user has SELECT
* privileges on the table or column underlying the pg_statistic data or if
* the function is marked leak-proof.
检查是否允许调用func_oid并传递vardata中的一些pg_statistic数据。如果用户对pg_statistic数据下的表或列具有SELECT权限,或者函数被标记为防泄漏,我们就允许这样做。
*/
bool
statistic_proc_security_check(VariableStatData *vardata, Oid func_oid)
{
if (vardata->acl_ok)
return true;
if (!OidIsValid(func_oid))
return false;
if (get_func_leakproof(func_oid))
return true;
ereport(DEBUG2,
(errmsg_internal("not using statistics because function \"%s\" is not leak-proof",
get_func_name(func_oid))));
return false;
}
这个oid经过打印,发现是2521,即 timestamp_le_timestamptz,根据名字来判断的话,判断 timestamp 类型是否小于等于 timestamp with time zone 的比较函数,确实查看表结构,date_start_time是timestamp without time zone,而我传递的now是timestamp。
postgres=# select * from pg_proc where oid = 2521;
-[ RECORD 1 ]---+-------------------------
oid | 2521
proname | timestamp_le_timestamptz
pronamespace | 11
proowner | 10
prolang | 12
procost | 1
prorows | 0
provariadic | 0
prosupport | -
prokind | f
prosecdef | f
proleakproof | f
proisstrict | t
proretset | f
provolatile | s
proparallel | s
pronargs | 2
pronargdefaults | 0
prorettype | 16
proargtypes | 1114 1184
proallargtypes |
proargmodes |
proargnames |
proargdefaults |
protrftypes |
prosrc | timestamp_le_timestamptz
probin |
prosqlbody |
proconfig |
proacl |
难道这个用户没有权限?打印一下
果然!普通用户返回的是 false!而超级用户是 true!
vardata是个VariableStatData类型的结构体,其中包括RelOptInfo,表的基本信息
/* Return data from examine_variable and friends */
typedef struct VariableStatData
{
Node *var; /* the Var or expression tree */
RelOptInfo *rel; /* Relation, or NULL if not identifiable */
HeapTuple statsTuple; /* pg_statistic tuple, or NULL if none */
/* NB: if statsTuple!=NULL, it must be freed when caller is done */
void (*freefunc) (HeapTuple tuple); /* how to free statsTuple */
Oid vartype; /* exposed type of expression */
Oid atttype; /* actual type (after stripping relabel) */
int32 atttypmod; /* actual typmod (after stripping relabel) */
bool isunique; /* matches unique index or DISTINCT clause */
bool acl_ok; /* result of ACL check on table or column */
} VariableStatData;
让我们打印出来确认一下是什么表没有权限
可以很清晰的看到,pages = 335017,tuples = 4912245,没错,正是6月子分区这个表!
postgres=# select reltuples,relpages from pg_class where relname = 'xxx';
-[ RECORD 1 ]----
reltuples | 4.91224e+06
relpages | 335017
于是我手动使用set将 vardata ->acl_ok设为true之后,再去打印执行计划,这次执行计划就变成正确的了。
为了清晰可见,我特意将对比图打印了出来:第一个执行计划是我手动调试生成的,第二个是默认情况生成的。可以看到,当有了表的查询权限之后,执行计划就正确了!
那让我们手动授予权限试一下,grant select on 子表名 to 业务用户。
可以很清晰的看到,执行计划终于正确了。预估行数和超级用户一样。另外细心的同学可能发现了,7月的执行计划预估行数和超级用户的也不一样,其实原理也是一样的,授予了select权限即可。
5小结
至此这个案例的成因已经明了,分区表在预估行数的时候,代码逻辑可能会去判断子表的查询权限,如果没有,那么可能就会生成一个莫名其妙的选择率出来,正常来说,PostgreSQL会在Analyzer阶段去检查表的权限,业务用户本身也不需要子表的查询权限,查询父表即可,但是实际情况是:PostgreSQL需要检测业务用户对于子表的查询权限(pg_statistic对应数据,参照statistic_proc_security_check函数)。
当然这个案例十分罕见,我玩了这么久也仅仅遇到三次,在学徒①群里也有小伙伴反馈遇到过这个问题
所以为了稳妥起见,建议给所有的子表都授予查询权限,至少可以预防这种百年一见的奇怪案例。
6参考
https://www.postgresql.org/docs/14/index-cost-estimation.html
https://www.cnblogs.com/flying-tiger/p/6087184.html
https://blog.csdn.net/cuichao1900/article/details/100394716