PostgreSQL SQL优化 数据严重倾斜连接场景优化
1 背景知识
本文介绍对于两个大表进行连接的场景。当 WHERE 条件
的所过滤字段值的数据量存在高度倾斜,为了减少资源消耗,则可以考虑根据反向滤值操作。
2 数据准备
2.1 t01_state 表
1、状态表 t01_state 里面记录t01 中的多种状态
DROP TABLE t01_state;
CREATE TABLE t01_state(id int PRIMARY KEY,name varchar(40),issuc varchar(2));
INSERT INTO t01_state VALUEStext),'Y';
SELECT count(*) FROM t01_state;
UPDATE t01_state SET issuc = 'N' WHERE id <=100;
CREATE INDEX ON public.t01_state (issuc);
SELECT issuc,count(*) FROM t01_state GROUP BY issuc ORDER BY issuc;
2.2 t01表
DROP TABLE t01;
CREATE TABLE t01(id int PRIMARY KEY,name varchar(40));
INSERT INTO t01 valuestext);
SELECT count(*) FROM t01;
3 优化前的情况
3.1 原始SQL
下面SQL 中的过滤条件为 issuc='Y'
。查看执行计划。
EXPLAIN SELECT * FROM t01 WHERE id IN (SELECT id FROM t01_state WHERE issuc = 'Y');
3.2 执行计划
执行计划解析:t01_state
数据千万级,两张表关联必然要消耗大量的资源。从下面执行计划可以看到,主要的时间消耗在hash join
上。
---------------------------------------------------------------------------------
Hash Semi Join (cost=373103.12..1002454.73 rows=10000046 width=37)
Hash Cond: (t01.id = t01_state.id)
-> Seq Scan on t01 (cost=0.00..184034.46 rows=10000046 width=37)
-> Hash (cost=209037.06..209037.06 rows=10000165 width=4)
-> Seq Scan on t01_state (cost=0.00..209037.06 rows=10000165 width=4)
Filter: text = 'Y'::text
4 优化方案1:使用NOT IN 代替 IN
4.1 优化思路
因为 where
条件 id IN (....)
只匹配少量数据,可以改写 in
为 not in
,以减少对 t01
表的过滤操作。
改写后的SQL 如下:
EXPLAIN SELECT * FROM t01 WHERE t01.id not in (SELECT id FROM t01_state WHERE issuc <> 'Y' or issuc is null);
4.2 执行计划
修改后的SQL,虽然使用了 filter
过滤方式,但由于子查询的 结果集很小,效率提高了不少。
test=# EXPLAIN SELECT * FROM t01 WHERE t01.id not in (SELECT id FROM t01_state WHERE issuc <> 'Y' or issuc is null);
QUERY PLAN
-------------------------------------------------------------------------------------
Seq Scan on t01 (cost=137119.30..346153.87 rows=5000023 width=37)
Filter: (NOT (hashed SubPlan 1))
SubPlan 1
-> Gather (cost=1000.00..137119.29 rows=1 width=4)
Workers Planned: 2
-> Parallel Seq Scan on t01_state (cost=0.00..136119.19 rows=1 width=4)
Filter: text <> 'Y'::text) OR (issuc IS NULL)
(7 行记录)
5 优化方案2:使用not between 代替 <>
5.1 优化思路
在此例中,可以使用 not between
替代 <>
操作符,这样可以提高子查询的执行效率。
EXPLAIN SELECT *
FROM t01
WHERE t01.id NOT IN (SELECT id FROM t01_state WHERE issuc NOT BETWEEN 'Y' AND 'Y' OR issuc IS NULL);
5.2 执行计划
QUERY PLAN
--------------------------------------------------------------------
Seq Scan on t01 (cost=17.35..209050.31 rows=4999958 width=37)
Filter: (NOT (hashed SubPlan 1))
SubPlan 1
-> Bitmap Heap Scan on t01_state (cost=13.33..17.34 rows=1 width=4)
Recheck Cond: text < 'Y'::text) OR ((issuc)::text > 'Y'::text) OR (issuc IS NULL)
-> BitmapOr (cost=13.33..13.33 rows=1 width=0)
-> Bitmap Index Scan on t01_state_issuc_idx (cost=0.00..4.44 rows=1 width=0)
Index Cond: text < 'Y'::text
-> Bitmap Index Scan on t01_state_issuc_idx (cost=0.00..4.44 rows=1 width=0)
Index Cond: text > 'Y'::text
-> Bitmap Index Scan on t01_state_issuc_idx (cost=0.00..4.44 rows=1 width=0)
Index Cond: (issuc IS NULL)
(12 行记录)