PostgreSQL Page页结构解析 B-Tree索引Special Space 存储结构

本文简单介绍了在PG数据库B-Tree索引的物理存储结构中Special space部分,包括根节点、左右兄弟节点等相关索引结构信息,以及初步探讨了PG在物理存储上如何保证索引的有序性。

环境准备

测试数据同上一节,索引文件raw data:


 hexdump -C $PGDATA/base/13758/16441
00000000  00 00 00 00 08 9f 79 01  00 00 00 00 48 00 f0 1f  |......y.....H...|
00000010  f0 1f 04 20 00 00 00 00  62 31 05 00 04 00 00 00  |... ....b1......|
00000020  01 00 00 00 00 00 00 00  01 00 00 00 00 00 00 00  |................|
00000030  00 00 00 00 00 00 00 00  00 00 00 00 00 00 f0 bf  |................|
00000040  01 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000050  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00001ff0  00 00 00 00 00 00 00 00  00 00 00 00 08 00 00 00  |................|
00002000  00 00 00 00 78 9e 79 01  00 00 00 00 28 00 b0 1f  |....x.y.....(...|
00002010  f0 1f 04 20 00 00 00 00  e0 9f 20 00 d0 9f 20 00  |... ...... ... .|
00002020  c0 9f 20 00 b0 9f 20 00  b0 9f 20 00 00 00 00 00  |.. ... ... .....|
00002030  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00003fb0  00 00 00 00 04 00 10 00  10 00 00 00 00 00 00 00  |................|
00003fc0  00 00 00 00 03 00 10 00  08 00 00 00 00 00 00 00  |................|
00003fd0  00 00 00 00 02 00 10 00  04 00 00 00 00 00 00 00  |................|
00003fe0  00 00 00 00 01 00 10 00  02 00 00 00 00 00 00 00  |................|
00003ff0  00 00 00 00 00 00 00 00  00 00 00 00 03 00 00 00  |................|
00004000

索引结构信息

索引物理存储结构在上一节已大体介绍,这里主要介绍索引的结构信息。通过pageinspect插件的bt_page_stats函数可以获得索引结构信息,包括root/leaf page,next & previous page:

1 定位root_page

SELECT * FROM bt_metap('pk_t_index');
 magic  | version | root | level | fastroot | fastlevel | last_cleanup_num_delpages | last_cleanup_num_tuples | allequalimage 
--------+---------+------+-------+----------+-----------+---------------------------+-------------------------+---------------
 340322 |       4 |    1 |     0 |        1 |         0 |                         0 |                      -1 | t
(1 row)

2 查看SpecialSpace.stats

postgres=# select * from bt_page_stats('pk_t_index',1);
 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo_level | btpo_flags 
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------------+------------
     1 | l    |          4 |          0 |            16 |      8192 |      8068 |         0 |         0 |          0 |          3
(1 row)

相关的数据结构如下:

//---------------------- src/include/access/nbtree.h



typedef struct BTPageOpaqueData
{
	BlockNumber btpo_prev;		/* left sibling, or P_NONE if leftmost */
	BlockNumber btpo_next;		/* right sibling, or P_NONE if rightmost */
	uint32		btpo_level;		/* tree level --- zero for leaf pages */
	uint16		btpo_flags;		/* flag bits, see below */
	BTCycleId	btpo_cycleid;	/* vacuum cycle ID of latest split */
} BTPageOpaqueData;

typedef BTPageOpaqueData *BTPageOpaque;

/* Bits defined in btpo_flags */
#define BTP_LEAF		(1 << 0)	/* leaf page, i.e. not internal page */
#define BTP_ROOT		(1 << 1)	/* root page (has no parent) */
#define BTP_DELETED		(1 << 2)	/* page has been deleted from tree */
#define BTP_META		(1 << 3)	/* meta-page */
#define BTP_HALF_DEAD	(1 << 4)	/* empty, but still in tree */
#define BTP_SPLIT_END	(1 << 5)	/* rightmost page of split group */
#define BTP_HAS_GARBAGE (1 << 6)	/* page has LP_DEAD tuples (deprecated) */
#define BTP_INCOMPLETE_SPLIT (1 << 7)	/* right sibling's downlink is missing */
#define BTP_HAS_FULLXID	(1 << 8)	/* contains BTDeletedPageData */

查询结果说明

输出 说明
type l(字母l),表示这个block(page)是leaf block
live_items 在这个block中有4个item(live_items=4),
dead_items 没有废弃的items(dead_items=0),
left sibling 没有left sibling(btpo_prev =0
right sibling 也没有right sibling(btpo_next =0)
btpo btpo是一个union,值为0,表示该page为叶子page,
btpo_flags 3即BTP_LEAF 或 BTP_ROOT,既是叶子page也是根page。
Warning

1.这些信息物理存储在先前介绍过的PageHeader中的pd_special指向的物理位置中,pd_special共占用2个字节:
2. Special Space 占用16 个字节。

page
16384(bytes)
page...
page
16384(bytes)
page...
page
16384(bytes)
page...
block number
block number
0th
0th
1th
1th
N-th
N-th
heap tuples
(record data)
heap tuples...
index file
index file
pd_lsn
pd_lsn
pd_checmsum
pd_checmsum
pd_flags
pd_flags
pd_lower
pd_lower
pd_upper
pd_upper
pd_special
pd_special
pd_pagesize_
version
pd_pagesize_...
pg_prune_xid
pg_prune_xid
1
1
2
2
tuple 1
tuple 1
line pointers
line pointers
free space
(hole)
free space...
pd_lower
pd_lower
pd_upper
pd_upper
lp_len
lp_len
lp_flags
lp_flags
lp_off
lp_off
2bits
2bits
15 bits
15 bits
15bits
15bits
t_tid
t_tid
t_info
t_info
ip_blkid
ip_blkid
ip_posid
ip_posid
4bytes
4bytes
2bytes
2bytes
2bytes
2bytes
8bytes
8bytes
values
values
4bytes
4bytes
IndexTupleData
IndexTupleData
Bitmap
Bitmap
4bytes
4bytes
8bytes
8bytes
16bytes
16bytes
Special Space
Special Space
2bytes
2bytes
btpo_prev
btpo_prev
btpo_next
btpo_next
btpo_level
btpo_level
btpo_flags
btpo_flags
btpo_cycleid
btpo_cycle...
4bytes
4bytes
4bytes
4bytes
4bytes
4bytes
2bytes
2bytes
2bytes
2bytes
16bytes
16bytes
24bytes
24bytes
8bytes
8bytes
2bytes
2bytes
2bytes
2bytes
2bytes
2bytes
2bytes
2bytes
2bytes
2bytes
2bytes
2bytes
4bytes
4bytes
0
0
Viewer does not support full SVG 1.1

物理位置分析

1 special space 计算位置

testdb=# select * from page_header(get_raw_page('pk_t_index',1));
    lsn     | checksum | flags | lower | upper | special | pagesize | version | prune_xid 
------------+----------+-------+-------+-------+---------+----------+---------+-----------
 1/DB0E5C98 |        0 |     0 |    40 |  8112 |    8176 |     8192 |       4 |         0
(1 row)

testdb=# select 8192+8176;
 ?column? 
----------
    16368
(1 row)

[xdb@localhost utf8db]$ hexdump -C $PGDATA/base/13758/16441 -s 16368 -n 16
00003ff0  00 00 00 00 00 00 00 00  00 00 00 00 03 00 00 00  |................|
00004000

2 btpo_prev

hexdump -C $PGDATA/base/13758/16441 -s 16368 -n 4
[postgres@node1 ~]$ hexdump -C $PGDATA/base/13758/16441 -s 16368 -n 4
00003ff0  00 00 00 00                                       |....|
00003ff4

3 btpo_next

hexdump -C $PGDATA/base/13758/16441 -s 16372 -n 4
[postgres@node1 ~]$ hexdump -C $PGDATA/base/13758/16441 -s 16372 -n 4
00003ff4  00 00 00 00                                       |....|
00003ff8

4 btpo_level

hexdump -C $PGDATA/base/13758/16441 -s 16376 -n 4
[postgres@node1 ~]$ hexdump -C $PGDATA/base/13758/16441 -s 16376 -n 4
00003ff8  00 00 00 00                                       |....|
00003ffc

5 btpo_flags

hexdump -C $PGDATA/base/13758/16441 -s 16380 -n 2
[postgres@node1 ~]$ hexdump -C $PGDATA/base/13758/16441 -s 16380 -n 2
00003ffc  03 00                                             |..|
00003ffe

6 btpo_cycleid

hexdump -C $PGDATA/base/13758/16441 -s 16382 -n 2
[postgres@node1 ~]$ hexdump -C $PGDATA/base/13758/16441 -s 16382 -n 2
00003ffa  00 00                                             |..|
00003ffc

索引有序性

我们都知道,B-Tree索引是有序的,下面我们看看在物理存储结构上如何保证有序性。
插入数据,id=18

testdb=# select * from page_header(get_raw_page('pk_t_index',1));
    lsn     | checksum | flags | lower | upper | special | pagesize | version | prune_xid 
------------+----------+-------+-------+-------+---------+----------+---------+-----------
 1/DB0E5C98 |        0 |     0 |    40 |  8112 |    8176 |     8192 |       4 |         0
(1 row)

testdb=# -- 插入数据,id=18
testdb=# insert into t_index values(18,'4','d');
INSERT 0 1
testdb=# select * from page_header(get_raw_page('pk_t_index',1));
    lsn     | checksum | flags | lower | upper | special | pagesize | version | prune_xid 
------------+----------+-------+-------+-------+---------+----------+---------+-----------
 1/DB0E6498 |        0 |     0 |    44 |  8096 |    8176 |     8192 |       4 |         0
(1 row)
-- dump索引页
[xdb@localhost utf8db]$ hexdump -C $PGDATA/base/13758/16441 -s 8192
00002000  01 00 00 00 98 64 0e db  00 00 00 00 2c 00 a0 1f  |.....d......,...|
00002010  f0 1f 04 20 00 00 00 00  e0 9f 20 00 d0 9f 20 00  |... ...... ... .|
00002020  c0 9f 20 00 b0 9f 20 00  a0 9f 20 00 00 00 00 00  |.. ... ... .....|
00002030  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00003fa0  00 00 00 00 05 00 10 00  12 00 00 00 00 00 00 00  |................|
00003fb0  00 00 00 00 04 00 10 00  10 00 00 00 00 00 00 00  |................|
00003fc0  00 00 00 00 03 00 10 00  08 00 00 00 00 00 00 00  |................|
00003fd0  00 00 00 00 02 00 10 00  04 00 00 00 00 00 00 00  |................|
00003fe0  00 00 00 00 01 00 10 00  02 00 00 00 00 00 00 00  |................|
00003ff0  00 00 00 00 00 00 00 00  00 00 00 00 03 00 00 00  |................|
00004000

插入数据,id=17

testdb=# -- 插入数据,id=17
testdb=# insert into t_index values(17,'4','d');
INSERT 0 1
testdb=# checkpoint;
CHECKPOINT
testdb=# -- 查看索引数据页头数据
testdb=# select * from page_header(get_raw_page('pk_t_index',1));
    lsn     | checksum | flags | lower | upper | special | pagesize | version | prune_xid 
------------+----------+-------+-------+-------+---------+----------+---------+-----------
 1/DB0E6808 |        0 |     0 |    48 |  8080 |    8176 |     8192 |       4 |         0
(1 row)
-- dump索引页
[xdb@localhost utf8db]$ hexdump  -C base/16477/26637 -s 8192
00002000  01 00 00 00 08 68 0e db  00 00 00 00 30 00 90 1f  |.....h......0...|
00002010  f0 1f 04 20 00 00 00 00  e0 9f 20 00 d0 9f 20 00  |... ...... ... .|
00002020  c0 9f 20 00 b0 9f 20 00  90 9f 20 00 a0 9f 20 00  |.. ... ... ... .|
00002030  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00003f90  00 00 00 00 06 00 10 00  11 00 00 00 00 00 00 00  |................|
00003fa0  00 00 00 00 05 00 10 00  12 00 00 00 00 00 00 00  |................|
00003fb0  00 00 00 00 04 00 10 00  10 00 00 00 00 00 00 00  |................|
00003fc0  00 00 00 00 03 00 10 00  08 00 00 00 00 00 00 00  |................|
00003fd0  00 00 00 00 02 00 10 00  04 00 00 00 00 00 00 00  |................|
00003fe0  00 00 00 00 01 00 10 00  02 00 00 00 00 00 00 00  |................|
00003ff0  00 00 00 00 00 00 00 00  00 00 00 00 03 00 00 00  |................|
00004000

索引的数据区,并没有按照大小顺序排序,\x11(17)在\x12(18)的后面(从尾部开始往前),但在索引页的头部ItemId区域是有序的,第5个ItemId(\x00209f90)指向的是17,而第6个ItemId(\x00209fa0)指向的是18,通过索引数据指针的有序保证索引有序性。

0.1 四、小结

小结一下,知识要点:
1、Special Space:介绍了索引PageHeaderData的Special Space的存储结构,通过Special Space可以获得B-Tree的root、左右sibling等信息;
2、有序性:索引Page通过索引项指针保证有序性。