PostgreSQL SQL优化数据严重倾斜连接场景优化

#PostgreSQL #SQL优化 #数据严重倾斜连接场景优化

1 背景知识

本文介绍对于两个大表进行连接的场景。当 WHERE 条件的所过滤字段值的数据量存在高度倾斜，为了减少资源消耗，则可以考虑根据反向滤值操作。

2 数据准备

2.1 t01_state 表

1、状态表 t01_state 里面记录t01 中的多种状态

DROP TABLE t01_state;
CREATE TABLE t01_state(id int PRIMARY KEY,name varchar(40),issuc varchar(2));
 INSERT INTO t01_state VALUEStext),'Y';
 SELECT count(*) FROM t01_state;
UPDATE t01_state SET issuc = 'N' WHERE id <=100;

CREATE INDEX ON public.t01_state (issuc);
 SELECT issuc,count(*) FROM t01_state GROUP BY issuc ORDER BY issuc;

2.2 t01表

DROP TABLE t01;
CREATE TABLE t01(id int PRIMARY KEY,name varchar(40));
INSERT INTO t01 valuestext);
SELECT count(*) FROM t01;

3 优化前的情况

3.1 原始SQL

下面SQL 中的过滤条件为 issuc='Y' 。查看执行计划。

EXPLAIN SELECT * FROM t01 WHERE id IN (SELECT id FROM t01_state WHERE issuc = 'Y');

3.2 执行计划

执行计划解析：t01_state 数据千万级，两张表关联必然要消耗大量的资源。从下面执行计划可以看到，主要的时间消耗在hash join上。

---------------------------------------------------------------------------------
 Hash Semi Join  (cost=373103.12..1002454.73 rows=10000046 width=37)
   Hash Cond: (t01.id = t01_state.id)
   ->  Seq Scan on t01  (cost=0.00..184034.46 rows=10000046 width=37)
   ->  Hash  (cost=209037.06..209037.06 rows=10000165 width=4)
         ->  Seq Scan on t01_state  (cost=0.00..209037.06 rows=10000165 width=4)
               Filter: text = 'Y'::text

4 优化方案1：使用NOT IN 代替 IN

4.1 优化思路

因为 where 条件 id IN (....) 只匹配少量数据，可以改写 in 为 not in，以减少对 t01 表的过滤操作。

改写后的SQL 如下：

EXPLAIN SELECT * FROM t01 WHERE t01.id not in (SELECT id FROM t01_state WHERE issuc <> 'Y' or issuc is null);

4.2 执行计划

修改后的SQL，虽然使用了 filter 过滤方式，但由于子查询的结果集很小，效率提高了不少。

test=# EXPLAIN SELECT * FROM t01 WHERE t01.id not in (SELECT id FROM t01_state WHERE issuc <> 'Y' or issuc is null);
                                     QUERY PLAN
-------------------------------------------------------------------------------------
 Seq Scan on t01  (cost=137119.30..346153.87 rows=5000023 width=37)
   Filter: (NOT (hashed SubPlan 1))
   SubPlan 1
     ->  Gather  (cost=1000.00..137119.29 rows=1 width=4)
           Workers Planned: 2
           ->  Parallel Seq Scan on t01_state  (cost=0.00..136119.19 rows=1 width=4)
                 Filter: text <> 'Y'::text) OR (issuc IS NULL)
(7 行记录)

5 优化方案2：使用not between 代替 <>

5.1 优化思路

在此例中，可以使用 not between 替代 <> 操作符，这样可以提高子查询的执行效率。

EXPLAIN SELECT *
FROM t01
WHERE t01.id NOT IN (SELECT id FROM t01_state WHERE issuc NOT BETWEEN 'Y' AND 'Y' OR issuc IS NULL);

5.2 执行计划


                    QUERY PLAN
--------------------------------------------------------------------
 Seq Scan on t01  (cost=17.35..209050.31 rows=4999958 width=37)
   Filter: (NOT (hashed SubPlan 1))
   SubPlan 1
     ->  Bitmap Heap Scan on t01_state  (cost=13.33..17.34 rows=1 width=4)
           Recheck Cond: text < 'Y'::text) OR ((issuc)::text > 'Y'::text) OR (issuc IS NULL)
           ->  BitmapOr  (cost=13.33..13.33 rows=1 width=0)
                 ->  Bitmap Index Scan on t01_state_issuc_idx  (cost=0.00..4.44 rows=1 width=0)
                       Index Cond: text < 'Y'::text
                 ->  Bitmap Index Scan on t01_state_issuc_idx  (cost=0.00..4.44 rows=1 width=0)
                       Index Cond: text > 'Y'::text
                 ->  Bitmap Index Scan on t01_state_issuc_idx  (cost=0.00..4.44 rows=1 width=0)
                       Index Cond: (issuc IS NULL)
(12 行记录)