PostgreSQL 数据库性能统计信息直方图

#PostgreSQL #数据库性能 #统计信息直方图

1 环境准备

drop table if exists tb01;
create table tb01(id integer,name character varying,age integer);
insert into tb01 select generate_series(1,10000),'john',(random()*1000)::integer;
insert into tb01 select generate_series(10001,10100),'tom',(random()*1000)::integer;
ALTER TABLE tb01 ALTER COLUMN id set STATISTICS 10;
analyze tb01;

1.1 查看统计信息

select * from pg_stats where tablename='tb01' and attname='id';
-[ RECORD 1 ]----------+-------------------------------------------------------
schemaname             | public （表所在的schema）
tablename              | tb13  （表名）
attname                | id   （字段名）
inherited              | f  （是否是继承而来的字段，t：是；f：否）
null_frac              | 0  （null值的百分比，这里为0%）
avg_width              | 4  （该字段的平均长度）
n_distinct             | -1 （表示该字段的唯一值的个数，-1：表示该字段有唯一约束，大于0的整数，比如m：表示该字段有m个唯一值）
most_common_vals       | （高频值，这里没有，因为是主键）
most_common_freqs      | （高频值的出现的频率）
histogram_bounds       | {1,1010,2020,3030,4040,5050,6060,7070,8080,9090,10100}  (该字段除高频值以外值的的柱状图信息)
correlation            | 1 （表中记录的逻辑顺序与存储的物理顺序的关系，-1到1之间，1表示逻辑顺序与存储的物理顺序相同，-1表示逻辑顺序与存储的物理顺序相反） 
most_common_elems      | （该字段是数组元素的统计信息，高频元素）
most_common_elem_freqs | （该字段是数组元素的统计信息，高频元素出现的频率）
elem_count_histogram   | （该字段是数组元素的统计信息，该列元素唯一值个数平均分布柱状图）

序号	取值范围	数据占比
1	1 - 1010	0.1
2	1010 - 2020	0.1
3	2020 - 3030	0.1
4	3030 - 4040	0.1
5	4040 - 5050	0.1
6	5050 - 6060	0.1
7	6060 - 7070	0.1
8	7070 - 8080	0.1
9	8080 - 9090	0.1
10	9090 - 10100	0.1

如图所示，x轴表示id值，y轴表示的是记录数，表示PostgreSQL统计分析后的结果是，id字段值为1~1010的记录数有1009（1010-1）条，1010~2020的记录数有1010条，…. … ，展示的是id字段的值的分布区间图。
　　如果statistics target越大，取得的区间数就越多，统计就越精确。

1.2 数据占比计算

id中10至100范围数据占比

P(10,100) = （100-10)/(1010-0)*0.1  
= 0.089108

**id中100至8000范围数据占比**  
P(100,8000)= (1010-100)/(1010-0)*0.1 +0.1*6+(8000-7070)/(8080-7070)*0.1  
=0.09009900+0.6+0.092079207 
=0.7821782178

2 代价估算

2.1 查看统计信息中的函数

select reltuples from pg_class where relname ='tb01';

2.2 计算出预估的行数

rows = P(100,8000) * reltuples = 0.7821782178*10100 = 7899

2.3 查看执行计划

explain (analyze) select * from tb01 where id > 100 and id <8000;

3 统计信息精度不够导致的问题

delete from tb01 where id < 7999;

select  *from sys_stats where tablename ='tb01' and attname ='id';

analyze tb01;
explain (analyze) select * from tb01 where id > 100 and id <4000;