PostgreSQL Page页结构解析 表的行的存储结构

本文介绍了PG数据页Page中存储的原始内容以及如何阅读它们,这一节主要介绍行数据(Items)。

环境准备

详见上一节,数据文件中的内容如下:

[postgres@node1 bin]$ hexdump -C $PGDATA/base/13758/16429
00000000  00 00 00 00 20 44 77 01  00 00 00 00 28 00 60 1f  |.... Dw.....(.`.|
00000010  00 20 04 20 00 00 00 00  d8 9f 4e 00 b0 9f 4e 00  |. . ......N...N.|
00000020  88 9f 4e 00 60 9f 4e 00  00 00 00 00 00 00 00 00  |..N.`.N.........|
00000030  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00001f60  e6 02 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00001f70  04 00 03 00 02 09 18 00  04 00 00 00 13 34 20 20  |.............4  |
00001f80  20 20 20 20 20 05 64 00  e5 02 00 00 00 00 00 00  |     .d.........|
00001f90  00 00 00 00 00 00 00 00  03 00 03 00 02 09 18 00  |................|
00001fa0  03 00 00 00 13 33 20 20  20 20 20 20 20 05 63 00  |.....3       .c.|
00001fb0  e4 02 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00001fc0  02 00 03 00 02 09 18 00  02 00 00 00 13 32 20 20  |.............2  |
00001fd0  20 20 20 20 20 05 62 00  e3 02 00 00 00 00 00 00  |     .b.........|
00001fe0  00 00 00 00 00 00 00 00  01 00 03 00 02 09 18 00  |................|
00001ff0  01 00 00 00 13 31 20 20  20 20 20 20 20 05 61 00  |.....1       .a.|
00002000

Items(Tuples)

typedef uint32 BlockNumber;
#define InvalidBlockNumber		((BlockNumber) 0xFFFFFFFF)
#define MaxBlockNumber			((BlockNumber) 0xFFFFFFFE)

通过以上定义,一个数据文件最大不超过31 TB .

每个Tuple包括两部分,第一部分是Tuple头部信息,第二部分是实际的数据。

HeapTupleHeader

相关数据结构如下:

1.    
    //--------------------- src/include/storage/off.h
    
2.  /*
    
3.  * OffsetNumber:
    
4.  *
    
5.  * this is a 1-based index into the linp (ItemIdData) array in the
    
6.  * header of each disk page.
    
7.  */
    
8.  typedef uint16 OffsetNumber;
    

10.  //--------------------- src/include/storage/block.h
    
11.  /*
    
12.  * BlockId:
    
13.  *
    
14.  * this is a storage type for BlockNumber. in other words, this type
    
15.  * is used for on-disk structures (e.g., in HeapTupleData) whereas
    
16.  * BlockNumber is the type on which calculations are performed (e.g.,
    
17.  * in access method code).
    
18.  *
    
19.  * there doesn't appear to be any reason to have separate types except
    
20.  * for the fact that BlockIds can be SHORTALIGN'd (and therefore any
    
21.  * structures that contains them, such as ItemPointerData, can also be
    
22.  * SHORTALIGN'd). this is an important consideration for reducing the
    
23.  * space requirements of the line pointer (ItemIdData) array on each
    
24.  * page and the header of each heap or index tuple, so it doesn't seem
    
25.  * wise to change this without good reason.
    
26.  */
    
27.  typedef struct BlockIdData
    
28.  {
    
29.  uint16 bi_hi;
    
30.  uint16 bi_lo;
    
31.  } BlockIdData;
    

33.  typedef BlockIdData *BlockId; /* block identifier */
    

35.  //--------------------- src/include/storage/itemptr.h
    
36.  /*
    
37.  * ItemPointer:
    
38.  *
    
39.  * This is a pointer to an item within a disk page of a known file
    
40.  * (for example, a cross-link from an index to its parent table).
    
41.  * blkid tells us which block, posid tells us which entry in the linp
    
42.  * (ItemIdData) array we want.
    
43.  *
    
44.  * Note: because there is an item pointer in each tuple header and index
    
45.  * tuple header on disk, it's very important not to waste space with
    
46.  * structure padding bytes. The struct is designed to be six bytes long
    
47.  * (it contains three int16 fields) but a few compilers will pad it to
    
48.  * eight bytes unless coerced. We apply appropriate persuasion where
    
49.  * possible. If your compiler can't be made to play along, you'll waste
    
50.  * lots of space.
    
51.  */
    
52.  typedef struct ItemPointerData
    
53.  {
    
54.  BlockIdData ip_blkid;
    
55.  OffsetNumber ip_posid;
    
56.  }
    

58.  //--------------------- src/include/access/htup_details.h
    
59.  typedef struct HeapTupleFields
    
60.  {
    
61.  TransactionId t_xmin; /* inserting xact ID */
    
62.  TransactionId t_xmax; /* deleting or locking xact ID */
    
63.  union
    
64.  {
    
65.  CommandId t_cid; /* inserting or deleting command ID, or both */
    
66.  TransactionId t_xvac; /* old-style VACUUM FULL xact ID */
    
67.  } t_field3;
    
68.  } HeapTupleFields;
    

70.  typedef struct DatumTupleFields
    
71.  {
    
72.  int32 datum_len_; /* varlena header (do not touch directly!) */
    

74.  int32 datum_typmod; /* -1, or identifier of a record type */
    

76.  Oid datum_typeid; /* composite type OID, or RECORDOID */
    

78.  /*
    
79.  * datum_typeid cannot be a domain over composite, only plain composite,
    
80.  * even if the datum is meant as a value of a domain-over-composite type.
    
81.  * This is in line with the general principle that CoerceToDomain does not
    
82.  * change the physical representation of the base type value.
    
83.  *
    
84.  * Note: field ordering is chosen with thought that Oid might someday
    
85.  * widen to 64 bits.
    
86.  */
    
87.  } DatumTupleFields;
    

89.  struct HeapTupleHeaderData
    
90.  {
    
91.  union
    
92.  {
    
93.  HeapTupleFields t_heap;
    
94.  DatumTupleFields t_datum;
    
95.  } t_choice;
    

97.  ItemPointerData t_ctid; /* current TID of this or newer tuple (or a
    
98.  * speculative insertion token) */
    

100.  /* Fields below here must match MinimalTupleData! */
    

102.  #define FIELDNO_HEAPTUPLEHEADERDATA_INFOMASK2 2
    
103.  uint16 t_infomask2; /* number of attributes + various flags */
    

105.  #define FIELDNO_HEAPTUPLEHEADERDATA_INFOMASK 3
    
106.  uint16 t_infomask; /* various flag bits, see below */
    

108.  #define FIELDNO_HEAPTUPLEHEADERDATA_HOFF 4
    
109.  uint8 t_hoff; /* sizeof header incl. bitmap, padding */
    

111.  /* ^ - 23 bytes - ^ */
    

113.  #define FIELDNO_HEAPTUPLEHEADERDATA_BITS 5
    
114.  bits8 t_bits[FLEXIBLE_ARRAY_MEMBER]; /* bitmap of NULLs */

116.  /* MORE DATA FOLLOWS AT END OF STRUCT */
    
117.  };

结构体展开,详见下表:

   
 Field Type Length Offset Description
 
 t_xmin TransactionId 4 bytes 0 insert XID stamp
 
 t_xmax TransactionId 4 bytes 4 delete XID stamp
 
 t_cid CommandId 4 bytes 8 insert and/or delete CID stamp (overlays with t_xvac)
 
 t_xvac TransactionId 4 bytes 8 XID for VACUUM operation moving a row version
 
 t_ctid ItemPointerData 6 bytes 12 current TID of this or newer row version
 
 t_infomask2 uint16 2 bytes 18 number of attributes, plus various flag bits
 
 t_infomask uint16 2 bytes 20 various flag bits
 
 t_hoff uint8 1 byte 22 offset to user data
 
//注意:t_cid和t_xvac为联合体,共用存储空间

从上一节我们已经得出第1个Tuple的偏移为8152,下面使用hexdump对其中的数据逐个解析:

xmin(4bytes)

[xdb@localhost ~]$ hexdump -C $PGDATA/base/13758/16429 -s 8152 -n 4
00001fd8  e3 02 00 00                                       |....|
00001fdc
[xdb@localhost ~]$ echo $((0x000002e3))
739

xmax(4bytes)

 [xdb@localhost ~]$ hexdump -C $PGDATA/base/13758/16429 -s 8156 -n 4
00001fdc  00 00 00 00                                       |....|
00001fe0

cid/_xvac(4bytes)

[xdb@localhost ~]$ hexdump -C $PGDATA/base/13758/16429 -s 8160 -n 4
00001fe0 00 00 00 00 |....|
00001fe

ctid(6bytes)

[xdb@localhost ~]$ hexdump -C $PGDATA/base/13758/16429 -s 8164 -n 6
00001fe4 00 00 00 00 01 00 |......|
00001fea
//ip_blkid=\x0000,即blockid=0
//ip_posid=\x0001,即posid=1,第1个tupl

infomask2(2bytes)

[xdb@localhost ~]$ hexdump -C $PGDATA/base/13758/16429 -s 8170 -n 2
00001fea 03 00 |..|
00001fec
//t_infomask2=\x0003,3代表什么意思?我们看看t_infomask2的说明
/*
* information stored in t_infomask2:
 */
 #define HEAP_NATTS_MASK 0x07FF /* 11 bits for number of attributes */
 /* bits 0x1800 are available */
 #define HEAP_KEYS_UPDATED 0x2000 /* tuple was updated and key cols
 * modified, or tuple deleted */
 #define HEAP_HOT_UPDATED 0x4000 /* tuple was HOT-updated */
 #define HEAP_ONLY_TUPLE 0x8000 /* this is heap-only tuple */
 #define HEAP2_XACT_MASK 0xE000 /* visibility-related bits */
 //根把十六进制值转换为二进制显示
 11111111111 #define HEAP_NATTS_MASK 0x07FF
 10000000000000 #define HEAP_KEYS_UPDATED 0x2000
 100000000000000 #define HEAP_HOT_UPDATED 0x4000
 1000000000000000 #define HEAP_ONLY_TUPLE 0x8000
 1110000000000000 #define HEAP2_XACT_MASK 0xE000
 1111111111111110 #define SpecTokenOffsetNumber 0xfffe

 //前(低)11位为属性(字段 )的个数,3意味着有3个属性(字段)

infomask(2bytes)

 [xdb@localhost ~]$ hexdump -C $PGDATA/base/13758/16429 -s 8172 -n 2
 00001fec  02 09                                             |..|
00001fee
  [xdb@localhost ~]$ echo $((0x0902))
2306
  [xdb@localhost ~]$ echo "obase=2;2306"|bc
100100000010

12 位  
9 位commit信息
//t_infomask=\x0902,十进制值为2306,二进制值为100100000010
 //t_infomask说明
 1 #define HEAP_HASNULL 0x0001 /* has null attribute(s) */
 10 #define HEAP_HASVARWIDTH 0x0002 /* has variable-width attribute(s) */
 100 #define HEAP_HASEXTERNAL 0x0004 /* has external stored attribute(s) */
 1000 #define HEAP_HASOID 0x0008 /* has an object-id field */
 10000 #define HEAP_XMAX_KEYSHR_LOCK 0x0010 /* xmax is a key-shared locker */
 100000 #define HEAP_COMBOCID 0x0020 /* t_cid is a combo cid */
 1000000 #define HEAP_XMAX_EXCL_LOCK 0x0040 /* xmax is exclusive locker */
 10000000 #define HEAP_XMAX_LOCK_ONLY 0x0080 /* xmax, if valid, is only a locker */
 /* xmax is a shared locker */
 #define HEAP_XMAX_SHR_LOCK (HEAP_XMAX_EXCL_LOCK | HEAP_XMAX_KEYSHR_LOCK)
 #define HEAP_LOCK_MASK (HEAP_XMAX_SHR_LOCK | HEAP_XMAX_EXCL_LOCK | \
 HEAP_XMAX_KEYSHR_LOCK)
 100000000 #define HEAP_XMIN_COMMITTED 0x0100 /* t_xmin committed */
 1000000000 #define HEAP_XMIN_INVALID 0x0200 /* t_xmin invalid/aborted */
 #define HEAP_XMIN_FROZEN (HEAP_XMIN_COMMITTED|HEAP_XMIN_INVALID)
 10000000000 #define HEAP_XMAX_COMMITTED 0x0400 /* t_xmax committed */
 100000000000 #define HEAP_XMAX_INVALID 0x0800 /* t_xmax invalid/aborted */
 1000000000000 #define HEAP_XMAX_IS_MULTI 0x1000 /* t_xmax is a MultiXactId */
 10000000000000 #define HEAP_UPDATED 0x2000 /* this is UPDATEd version of row */
 100000000000000 #define HEAP_MOVED_OFF 0x4000 /* moved to another place by pre-9.0
 * VACUUM FULL; kept for binary
 * upgrade support */
 1000000000000000 #define HEAP_MOVED_IN 0x8000 /* moved from another place by pre-9.0
 * VACUUM FULL; kept for binary
 * upgrade support */
 #define HEAP_MOVED (HEAP_MOVED_OFF | HEAP_MOVED_IN)
 1111111111110000 #define HEAP_XACT_MASK 0xFFF0 /* visibility-related bits */
 //\x0802,二进制100000000010表示第2位和第12位为1,
 //意味着存在可变长属性(HEAP_HASVARWIDTH),XMAX无效(HEAP_XMAX_INVAL 

hoff(1bytes)

1.  [xdb@localhost ~]$ hexdump -C $PGDATA/base/13758/16429 -s 8174 -n 1
2.  00001fee 18 |.|
3.  00001fef
4.  [xdb@localhost ~]$ echo $((0x18))
5.  24
6.  //用户数据开始偏移为24,即8152+24

Tuple(15bytes)

说完了Tuple的头部数据,接下来我们看看实际的数据存储。上一节我们得到Tuple总的长度是39,计算得到数据大小为39-24=15。

Warning

实际数据计算没有公式,是一个变长值,需要根据lp_len 算出总长,减去定长24 ,得到实际数据的长度。

Tuple 中的内容

[xdb@localhost ~]$ hexdump -C $PGDATA/base/13758/16429 -s 8176 -n 15
00001ff0  01 00 00 00 13 31 20 20  20 20 20 20 20 05 61     |.....1       .a|
00001fff
回顾我们的表结构:
create table t_page (id int,c1 char(8),c2 varchar(16));
第1个字段为int,第2个字段为定长字符,第3个字段为变长字符。
相应的数据:
id=\x00000001,数字1
c1=\x133120202020202020,字符串,无需高低位变换,第1个字节\x13为标志位,后面是字符'1'+7个空格
c2=\x0561,字符串,第1个字节\x05为标志位,后面是字符'a'

小结

page
8192(byte)
page
8192(byte)
page
8192(byte)
block number
0th
1th
N-th
heap tuples
(record data)
table file
pd_lsn
pd_checmsum
pd_flags
pd_lower
pd_upper
pd_special
pd_pagesize_
version
pg_prune_xid
1
2
tuple 1
tuple 2
line pointers
free space
(hole)
pd_lower
pd_upper
lp_len
lp_flags
lp_off
2bits
15 bits
15bits
偏移量
8152bits
xmin
4 bytes
xmax
4bytes
cid/_xvac
4bytes
ctid
6bytes
infomask2
2bytes
infomask
2bytes
hoff
1bytes
24bytes
偏移量
实际数据
15bytes
1bytes
24bytes
8bytes
2bytes
2bytes
2bytes
2bytes
2bytes
2bytes
4bytes
4bytes
39bytes

以上简单介绍了如何阅读Raw Datafile中的数据行信息,包括Tuple头部信息和用户数据。在空间使用上面,可以看PG到为了进行实际的数据查询和MVCC等机制,数据库添加了很多的额外信息,实际占用的空间大小比实际的数据要大很多,如果使用列式存储应可有效的压缩空间。