被很多人忽视的NULL值对NOT IN子查询结果的影响问题

JennyYu 老叶茶馆 2024-07-08

* GreatSQL社区原创内容未经授权不得随意使用，转载请联系小编并注明来源。

前言
开发人员写的 SQL 语句中经常会用到 in，exists，not in，not exists 这类子查询，通常，含 in、exists 的子查询称为半连接（semijoin），含 not in、 not exists 的子查询被称之为反连接。
经常会有技术人员来评论 in 与 exists 效率孰高孰低的问题。
我在 SQL 优化工作中也经常对这类子查询做优化改写，比如半连接改为内连接，反连接改为外连接等，哪个效率高是要根据执行计划做出判断的。
不过本文不是为了讨论效率问题，是要提醒一点：not in 子查询的结果集含 NULL 值时，会导致整个语句结果集返回空，这可能造成与 SQL 语句书写初衷不符。
实验
创建实验表 t1，t2
`greatsql> create table t1( c1 int primary key, c2 varchar(10), key idx_c1(c2)); greatsql> create table t2( c1 int primary key, c2 varchar(10), key idx_c1(c2)); greatsql> insert into t1 values(1,'a'),(2,'b'); greatsql> insert into t2 values(1,'a'),(2,'c');`
观察下面两条语句：
`select * from t1 where   t1.c2 not in (   select t2.c2 from t2); select * from t1 where   not exists (   select 1 from t2 where   t2.c2=t1.c2);`
这两个语句，从表达的含义来看是等价的，都是查询 t1 表中 c2 列值在 t2 表的 c2 列值中不存在的记录。
从子查询类型来看，第一条语句属于非关联查询，第二条语句属于关联子查询。
所谓非关联子查询就是子查询中内查询可以独立执行，与外查询没有关系，互不影响。
而关联子查询的执行依赖于外部查询，通常情况下都是因为子查询中的表用到了外部的表，并进行了条件关联，因此每执行一次外部查询，子查询都要重新计算一次。
从连接类型来看，使用 not in 与 not exists 子查询构造的语句都属于反连接。
为了控制连接顺序与连接方式，这种反连接经常被改写为外连接，t1 与 t2 使用左外连接，条件加上右表 t2 的连接列 is null，也就是左外连接时没有关联上右表的数据，表达了这个含义：t1 表中 c2 列值在 t2 表的 c2 列值中不存在的记录。
反连接改写为外连接，不会导致关联结果集放大，因为没有关联上的 t1 表数据只显示 1 条，半连接改为内连接时要注意去重。
外连接语句如下所示：
`greatsql> select t1.* from t1 left join   t2 on t1.c2=t2.c2 where   t2.c2 is null;`
所以本质表达含义上，上面的三条语句都等价。
下面看一下三条语句的执行结果：
greatsql> select * from t1 where   t1.c2 not in (   select t2.c2 from t2); +----+------+ | c1 | c2   | +----+------+ |  2 | b    | +----+------+ 1 row in set (0.00 sec) greatsql> select * from t1 where   not exists (   select 1 from t2 where   t2.c2=t1.c2); +----+------+ | c1 | c2   | +----+------+ |  2 | b    | +----+------+ 1 row in set (0.01 sec) greatsql> select t1.* from t1 left join   t2 on t1.c2=t2.c2 where   t2.c2 is null; +----+------+ | c1 | c2   | +----+------+ |  2 | b    | +----+------+ 1 row in set (0.00 sec)
可以看出就目前的数据，三条语句执行结果是相同的。
下面向子查询的 t2 中插入一条 c2 列为 null 的记录。
`greatsql> insert into t2 values(3,null);`
再观察一下三条语句的执行结果：
greatsql> select * from t1 where   t1.c2 not in (   select t2.c2 from t2); Empty set (0.00 sec) greatsql> select * from t1 where   not exists (   select 1 from t2 where   t2.c2=t1.c2); +----+------+ | c1 | c2   | +----+------+ |  2 | b    | +----+------+ 1 row in set (0.00 sec) greatsql> select t1.* from t1 left join   t2 on t1.c2=t2.c2 where   t2.c2 is null; +----+------+ | c1 | c2   | +----+------+ |  2 | b    | +----+------+ 1 row in set (0.00 sec)
可以看出，not exists 表示的关联子查询与外连接方式表达的两条语句结果相同，而 not in 表示的非关联子查询的结果集为空。
这是因为子查询`select t2.c2 from t2` 查询结果含有 NULL 值导致的。NULL 属于未知值，无法与其他值进行比较，无从判断，返回最终结果集为空。
这一点在 MySQL 与 Oracle 中返回结果都是一致的。
如果想表达最初的含义，需要将子查询中 NULL 值去除。
`greatsql> select * from t1 where   t1.c2 not in (   select t2.c2 from t2 where   t2.c2 is not null); +----+------+ | c1 | c2   | +----+------+ |  2 | b    | +----+------+ 1 row in set (0.02 sec)`
微调下，这时如果 t1 表的 c2 列也插入一条 NULL 值的记录后，结果集会怎样呢，两个表都存在 c2 列为 NULL 的值数据，那么 t1 表这条 NULL 值数据能否出现在最终结果集中呢？
greatsql> insert into t1 values(3,null); Query OK, 1 row affected (0.07 sec) greatsql> select * from t1 where   t1.c2 not in (   select t2.c2 from t2 where   t2.c2 is not null); +----+------+ | c1 | c2   | +----+------+ |  2 | b    | +----+------+ 1 row in set (0.00 sec) greatsql> select * from t1 where   not exists (   select 1 from t2 where   t2.c2=t1.c2); +----+------+ | c1 | c2   | +----+------+ |  3 | NULL | |  2 | b    | +----+------+ 2 rows in set (0.00 sec) greatsql> select t1.* from t1 left join   t2 on t1.c2=t2.c2 where   t2.c2 is null; +----+------+ | c1 | c2   | +----+------+ |  3 | NULL | |  2 | b    | +----+------+ 2 rows in set (0.00 sec)
从执行结果来看，使用 not in 非关联子查询，其执行结果与其他两条语句的执行结果还是不同，因为 t1.c2 使用 not in 在参与比较时就隐含了 t1.c2 is not null 的含义，所以最终结果集中不含(3, NULL)这条数据。
而 not exists 关联子查询，在将外查询的 NULL 值传递给内查询时执行子查询 `select * from t2 where t2.c2=NULL`，子查询中找不到记录，所以条件返回 false，表示 not exists 为 true，则最终结果集中含(3, NULL)这条记录。
左外 left join 与 not exists 相同，左表的 NULL 值在右表中关联不上数据，所以要返回（3, NULL)这条数据。这里要注意 NULL 不等于 NULL。
`greatsql> select NULL=NULL; +-----------+ | NULL=NULL | +-----------+ |      NULL | +-----------+ 1 row in set (0.01 sec)`
说到这里，GreatSQL 支持`<=>`安全等于这个符号，用来判断 NULL 值：当两个操作数均为 NULL 时，其返回值为 1 而不为 NULL；而当一个操作数为 NULL 时，其返回值为 0 而不为 NULL。
`greatsql> select NULL<=>NULL; +-------------+ | NULL<=>NULL | +-------------+ |           1 | +-------------+ 1 row in set (0.00 sec) greatsql> select 1<=>NULL; +----------+ | 1<=>NULL | +----------+ |        0 | +----------+ 1 row in set (0.00 sec)`
所以 not exists 子查询中的`=` 换成 `<=>` 时，最终结果集中去除了(3,NULL)这条数据。
`greatsql> select * from t1 where   not exists (   select 1 from t2 where   t2.c2<=>t1.c2); +----+------+ | c1 | c2   | +----+------+ |  2 | b    | +----+------+ 1 row in set (0.00 sec)`
注意，一般表关联时不使用`<=>`安全等于这个符号，想象一下，如果关联的两个表在关联字段上都存在很多 NULL 记录，关联后的结果集对 NULL 记录的关联是以笛卡尔积的形式体现的，严重影响效率，严格来说关联字段都为 NULL 值不能算作能匹配上。
结论
使用 not in 的非关联子查询注意 NULL 值对结果集的影响，为避免出现空结果集，需要子查询中查询列加 `is not null`条件将 NULL 值去除。
实际使用时注意：需求表达的含义是否要将外查询关联字段值为 NULL 的数据输出，not in 隐含了不输出。
一般认为 not exists 关联子查询与外连接语句是等价的，可以进行相互改写。
`select * from t1 where not exists (   select 1 from t2 where   t2.c2=t1.c2); select t1.* from t1 left join   t2 on t1.c2=t2.c2 where   t2.c2 is null;`
如果不需要输出外查询中关联字段为 NULL 值的数据，还需再加条件 t1.c2 is not null。
`select * from t1 where not exists (   select 1 from t2 where   t2.c2=t1.c2)   and t1.c2 is not null; select t1.* from t1 left join   t2 on t1.c2=t2.c2 where   t2.c2 is null   and t1.c2 is not null;`
这样写就与下面的SQL等价了：
`select * from t1 where   t1.c2 not in (   select t2.c2 from t2 where   t2.c2 is not null);`
全文完。

Enjoy GreatSQL :)

《深入浅出MGR》视频课程

戳此小程序即可直达B站

https://www.bilibili.com/medialist/play/1363850082?business=space_collection&business_id=343928&desc=0

文章推荐：

重现一条简单SQL的优化过程

题图由阿里通义万相生成
提示词：意外踩坑

想看更多技术好文，点个“在看”吧！

继续滑动看下一个

老叶茶馆

向上滑动看下一个

二湘：朱令去世一周年，清华学子控诉清华在朱令案中的冷血和无耻

为了这部描述从“反右”到“文革”的禁片，田壮壮付出了十年不能拍片的代价

关于字节基建

李宜雪的良知卖了2万元，真正需要声援的是罗灿宏啊

中国紫砂博物馆出品！2025紫砂五福杯：五福云集，宜藏宜礼！

被很多人忽视的NULL值对NOT IN子查询结果的影响问题

前言

实验

结论

您可能也对以下帖子感兴趣

二湘：朱令去世一周年，清华学子控诉清华在朱令案中的冷血和无耻

为了这部描述从“反右”到“文革”的禁片，田壮壮付出了十年不能拍片的代价

关于字节基建

李宜雪的良知卖了2万元，真正需要声援的是罗灿宏啊

中国紫砂博物馆出品！2025紫砂五福杯：五福云集，宜藏宜礼！

生成图片，分享到微信朋友圈

被很多人忽视的NULL值对NOT IN子查询结果的影响问题

前言

实验

结论

您可能也对以下帖子感兴趣