从源码出发,深度剖析字节对齐
字节对齐
当数据自然对齐时,CPU可以高效地对内存进行读写操作。因此,PostgreSQL中的每个数据类型都有一个特定的对齐要求。当多个属性连续存储在一个元组中时,在一个属性前插入填充物,以使其从所需的对齐边界开始。更好地理解这些对齐要求可能有助于在磁盘上存储元组时尽量减少所需的填充,从而节省磁盘空间。
PostgreSQL中的数据类型分为以下几类:
1.Pass-by-value, fixed-length:通过数值传递给Postgres内部程序并具有固定长度的数据类型就属于这一类。长度可以是1,2,或4(在64位系统中是8)个字节。2.Pass-by-reference, fixed-length:对于这些数据类型,来自内存堆页的地址引用被发送到内部Postgres程序。它们的长度也是固定的。3.Pass-by-reference, variable-length:对于可变长度的数据类型,Postgres在实际数据之前预置了一个变长头。它存储了一些关于数据在磁盘上的实际存储方式的信息(uncompressed, compressed or TOASTed)和数据的实际长度。对于TOASTed属性,实际数据被存储在一个单独的表中。在这些情况下,变长头在其TOAST表中的实际位置的一些信息数据的后面。
通常情况下,变长头在磁盘上的大小是1Byte。但是,如果数据不能使用TOAST,并且未压缩的数据大小超过126字节,则会使用4字节的头。比如说
postgres=# create table t1 (a varchar);
CREATE TABLE
postgres=# insert into t1 values(repeat('a',126));
INSERT 0 1
postgres=# insert into t1 values(repeat('a',127));
INSERT 0 1
postgres=# select pg_column_size(a) from t1;
pg_column_size
----------------
127
131
(2 rows)
此外,具有4字节变长头的属性需要被对齐到一个4字节对齐的内存位置。这可能会浪费多达3字节的额外填充空间。因此,对这些列进行一些仔细的长度限制可以节省空间。
**Pass-by_reference, variable length(cstring, unknown)**:最后,有两种数据类型,即ctring和unknown,它们是字符串文字(string literal)。它们可以从任何1字节对齐的边界开始存储。而且,它们不需要任何变长头。
你可以使用下面的查询来检查每种类型的对齐要求,其中tyname是数据类型的名称;tybyval值为真表示数据类型是通过值访问的,否则为假;typlen是数据类型的实际长度,然而对于可变长度的数据类型,它的值<0 (对于cstring和unknown是-2,否则是-1);tyalign是数据类型所需的对齐方式。
postgres=# select typname,typbyval,typlen,typalign from pg_type;
typname | typbyval | typlen | typalign
---------------------------------------+----------+--------+----------
bool | t | 1 | c
bytea | f | -1 | i
char | t | 1 | c
name | f | 64 | c
int8 | t | 8 | d
int2 | t | 2 | s
int2vector | f | -1 | i
int4 | t | 4 | i
regproc | t | 4 | i
...
在检查每个数据类型的对齐要求后,可以通过将它们按有利于对齐的方式来摆置以减少填充空间。例如,下面的表会浪费大量的磁盘空间用于填充
CREATE TABLE t1 (
, a char
, b int2 -- 1 byte of padding after a
, c char
, d int4 -- 3 bytes of padding after c
, e char
, f int8 -- 7 bytes of padding after e
);
如果你将列重新排序为double对齐列在前,int对齐、short对齐和char对齐列在后,你可以为每个元组节省(1+3+7)=11字节的空间。
CREATE TABLE t1 (
, f int8
, d int4
, b int2
, a char
, c char
, e char
);
在元组属性之前,是一个固定大小的23字节的元组头,然后是一个可选的空值位图和一个可选的对象ID。属性总是从一个MAXALIGN的边界开始,通常在64位操作系统上是8字节(或在32位操作系统上是4字节)。因此,一个最小元组头的有效大小是24字节(23字节的头+1字节的填充)。当空值位图存在时,它占据了足够的字节,每个数据列占据一个比特位。在这个比特位列表中,1表示非空,0表示空。当位图不存在时,所有的列都被认为是不空的。例如:
postgres=# select pg_column_size(row(NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL));
pg_column_size
----------------
24
(1 row)
这包括23字节的固定大小的元组头和一个由8字节组成的空值位图。现在,如果我们将列的数量增加到9个,它包括23字节的固定大小的元组头,一个由16字节组成的空位图和7字节的填充。
postgres=# select pg_column_size(row(NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL));
pg_column_size
----------------
32
(1 row)
在一个由数百万行组成的表中,每个元组节省几个字节最终可能会节省一些重要的存储空间。此外,如果我们能在一个数据页中容纳更多的元组,由于更少的I/O,性能还可以得到改善。
元组数据结构
了解了一些数据对齐的规则之后,回顾一下元组(Tuple)的结构
There is a fixed-size header (occupying 23 bytes on most machines), followed by an optional null bitmap, an optional object ID field, and the user data. The header is detailed in Table 70.4. The actual user data (columns of the row) begins at the offset indicated by
t_hoff
, which must always be a multiple of the MAXALIGN distance for the platform. The null bitmap is only present if the HEAP_HASNULL bit is set int_infomask
. If it is present it begins just after the fixed header and occupies enough bytes to have one bit per data column (that is, the number of bits that equals the attribute count int_infomask2
). In this list of bits, a 1 bit indicates not-null, a 0 bit is a null. When the bitmap is not present, all columns are assumed not-null. The object ID is only present if the HEAP_HASOID_OLD bit is set int_infomask
. If present, it appears just before thet_hoff
boundary. Any padding needed to maket_hoff
a MAXALIGN multiple will appear between the null bitmap and the object ID. (This in turn ensures that the object ID is suitably aligned.)
有一个固定大小的头(在大多数机器上占用23个字节),然后是一个可选的空值位图,一个可选的对象ID字段,以及用户数据。实际的用户数据(行的列)从
t_hoff
指示的偏移量开始,它必须始终是平台的MAXALIGN距离的倍数。只有在t_infomask
中设置了HEAP_HASNULL位,才会出现空值位图。如果它存在,它就在固定头之后开始,并占据足够的字节,以便每个数据列有一个比特(即等于t_infomask2
中属性计数的比特数)。在这个位列表中,1位表示非空,0位为空。当位图不存在时,所有的列都被认为是不空的。只有当HEAP_HASOID_OLD位在t_infomask
中被设置时,对象ID才会出现。如果存在,它就会出现在t_hoff
边界之前。任何使t_hoff
成为MAXALIGN倍数所需的填充将出现在空位图和对象ID之间。(这反过来又保证了对象ID的适当对齐。)
Tuple头部是由23byte固定大小的前缀和可选的NullBitMap构成。
大小端
后文会反复提到大小端,这里回顾一下
1.大端(Big-endian)是高位字节排放在内存的低地址端,低位字节排放在内存的高地址端,即正序排列,高尾端;2.小端(Little-endian)是低位字节排放在内存的低地址端,高位字节排放在内存的高地址端,即逆序排列,低尾端;
16bit宽的数0x1234在两种模式CPU内存中的存放方式(假设从地址0x4000开始存放)为:
32bit宽的数0x12345678在两种模式CPU内存中的存放方式(假设从地址0x4000开始存放)为:
变长数据类型
前文也提到了数据类型分为三类:传值(定长)、传引用(定长)和传引用(变长),变长的数据类型存储都是采用的varlena格式,官网上有这一块的说明:
All variable-length data types share the common header structure
struct varlena
, which includes the total length of the stored value and some flag bits. Depending on the flags, the data can be either inline or in a TOAST table; it might be compressed, too
/* ----------------
* Variable-length datatypes all share the 'struct varlena' header.
*
* NOTE: for TOASTable types, this is an oversimplification, since the value
* may be compressed or moved out-of-line. However datatype-specific routines
* are mostly content to deal with de-TOASTed values only, and of course
* client-side routines should never see a TOASTed value. But even in a
* de-TOASTed value, beware of touching vl_len_ directly, as its
* representation is no longer convenient. It's recommended that code always
* use macros VARDATA_ANY, VARSIZE_ANY, VARSIZE_ANY_EXHDR, VARDATA, VARSIZE,
* and SET_VARSIZE instead of relying on direct mentions of the struct fields.
* See postgres.h for details of the TOASTed form.
* ----------------
*/
struct varlena
{
char vl_len_[4]; /* Do not touch this field directly! */
char vl_dat[FLEXIBLE_ARRAY_MEMBER]; /* Data content is here */
};
struct varlena是一个通用的结构体,根据字节再转化为具体的对应格式
在postgres.h的头文件里,可以看到对于变长数据的这一块定义,以小端序为例
•xxxxxx00 4-byte length word, aligned, uncompressed data (up to 1G)•xxxxxx10 4-byte length word, aligned, compressed data (up to 1G)•00000001 1-byte length word, unaligned, TOAST pointer•xxxxxxx1 1-byte length word, unaligned, uncompressed data (up to 126b)
1.如果第一个字节等于000 00001
(小端序), 那么就是varattrib_1b_e
,用来存储和toast有关的 external 数据2.如果第一个字节的最高位等于1,且后7bit不全为0,那么就是varattrib_1b
,用来存储小数据,不能存储超过127byte的数据3.如果第一个字节的最高位等于0,那么就是varattrib_4b
,可以存储不超过1GB的数据
/*
* These structs describe the header of a varlena object that may have been
* TOASTed. Generally, don't reference these structs directly, but use the
* macros below.
*
* We use separate structs for the aligned and unaligned cases because the
* compiler might otherwise think it could generate code that assumes
* alignment while touching fields of a 1-byte-header varlena.
*/
typedef union
{
struct /* Normal varlena (4-byte length) */
{
uint32 va_header;
char va_data[FLEXIBLE_ARRAY_MEMBER];
} va_4byte;
struct /* Compressed-in-line format */
{
uint32 va_header;
uint32 va_tcinfo; /* Original data size (excludes header) and
* compression method; see va_extinfo */
char va_data[FLEXIBLE_ARRAY_MEMBER]; /* Compressed data */
} va_compressed;
} varattrib_4b;
typedef struct
{
uint8 va_header;
char va_data[FLEXIBLE_ARRAY_MEMBER]; /* Data begins here */
} varattrib_1b;
/* TOAST pointers are a subset of varattrib_1b with an identifying tag byte */
typedef struct
{
uint8 va_header; /* Always 0x80 or 0x01 */
uint8 va_tag; /* Type of datum */
char va_data[FLEXIBLE_ARRAY_MEMBER]; /* Type-specific data */
} varattrib_1b_e;
/*
* Bit layouts for varlena headers on big-endian machines:
*
* 00xxxxxx 4-byte length word, aligned, uncompressed data (up to 1G)
* 01xxxxxx 4-byte length word, aligned, *compressed* data (up to 1G)
* 10000000 1-byte length word, unaligned, TOAST pointer
* 1xxxxxxx 1-byte length word, unaligned, uncompressed data (up to 126b)
*
* Bit layouts for varlena headers on little-endian machines:
*
* xxxxxx00 4-byte length word, aligned, uncompressed data (up to 1G)
* xxxxxx10 4-byte length word, aligned, *compressed* data (up to 1G)
* 00000001 1-byte length word, unaligned, TOAST pointer
* xxxxxxx1 1-byte length word, unaligned, uncompressed data (up to 126b)
*
* The "xxx" bits are the length field (which includes itself in all cases).
* In the big-endian case we mask to extract the length, in the little-endian
* case we shift. Note that in both cases the flag bits are in the physically
* first byte. Also, it is not possible for a 1-byte length word to be zero;
* this lets us disambiguate alignment padding bytes from the start of an
* unaligned datum. (We now *require* pad bytes to be filled with zero!)
*
* In TOAST pointers the va_tag field (see varattrib_1b_e) is used to discern
* the specific type and length of the pointer datum.
*/
/*
* Endian-dependent macros. These are considered internal --- use the
* external macros below instead of using these directly.
*
* Note: IS_1B is true for external toast records but VARSIZE_1B will return 0
* for such records. Hence you should usually check for IS_EXTERNAL before
* checking for IS_1B.
*/
varattrib_1b类型
typedef struct
{
uint8 va_header;
char va_data[FLEXIBLE_ARRAY_MEMBER]; /* Data begins here */
} varattrib_1b;
va_header
只有 8 位,最高位是标记位,值为 1。剩余的 7 位表示数据长度,所以varattrib_1b
类型只是用于存储长度不超过127 (2^7,2的7次方)byte 的小数据。
-----------------------------------------
tag | length |
-----------------------------------------
1 bit | 7 bit |
-----------------------------------------
postgres=# create table test (s varchar);
CREATE TABLE
postgres=# insert into test VALUES ('abcd'),('abc');
INSERT 0 2
postgres=# select lp, t_data from
postgres-# heap_page_items(get_raw_page('test', 0));
lp | t_data
----+--------------
1 | \x0b61626364
2 | \x09616263
(2 rows)
postgres=# select pg_column_size(s) from test;
pg_column_size
----------------
5
4
(2 rows)
mydb=# create table test1 (s varchar);
CREATE TABLE
mydb=# insert into test1 values('中');
INSERT 0 1
mydb=# select lp, t_data from heap_page_items(get_raw_page('test1', 0));
lp | t_data
----+------------
1 | \x09e4b8ad
(1 row)
可以看到(下面都是基于UTF8的字符集,一个英文字符等于一个字节)
1.第一个0b也就是00001011(后面7个bit不全为0,所以是varattrib_1b的数据类型),代表header,剩下的0000101,就是长度,总长度5byte(包含header),所以纯元组的长度是4byte,61、62、63、64则是十六进制的abcd的ASCII码2.第二个09也就是00001001(后面7个bit不全为0,所以是varattrib_1b的数据类型),代表header,剩下的0000100,就是长度,总长度4byte(包含header),所以纯元组的长度是3byte,61、62、63则是十六进制的abc的ASCII码
varattrib_4b类型
varattrib_4b
格式会根据存储的数据是否被压缩过分为两种。最高第二位为1,则表示存储的数据是未压缩的。为0,则表示存储的数据是压缩过的。varattrib_4b
的定义如下,使用union
来表示这两种情况。对于未压缩的数据,使用va_4byte
结构体存储。对于压缩的数据,使用va_compressed
结构体存储。
typedef union
{
/* va_data存储的数据没有被压缩 */
struct
{
uint32 va_header;
char va_data[FLEXIBLE_ARRAY_MEMBER];
} va_4byte;
/* va_data存储的数据被压缩过了 */
struct
{
uint32 va_header;
uint32 va_rawsize; /* 原始数据的大小 */
char va_data[FLEXIBLE_ARRAY_MEMBER];
} va_compressed;
} varattrib_4b;
这两个结构体的第一个成员va_header
都是 uint32 类型,两者的格式是一样的。它最高位是标记位,值为 0。第二位表示是否没被压缩。剩下的 30 位表示数据的长度,所以只能支持不超过 1GB (2^30 - 1 bytes) 的数据。
--------------------------------------------------
tag | compress | length |
--------------------------------------------------
1 bit | 1 bit | 30 bit |
-------------------------------------------------
见如下样例:
postgres=# create table test (a varchar);
CREATE TABLE
postgres=# insert into test VALUES (repeat('+',126)),(repeat('-',127));
INSERT 0 2
postgres=# select lp, t_data from
postgres-# heap_page_items(get_raw_page('test', 0));
lp | t_data
----+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------------
1 | \xff2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b
2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b2b
2 | \x0c0200002d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d
2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d
(2 rows)
postgres=# select pg_column_size(a) from test;
pg_column_size
----------------
127
131
(2 rows)
•xxxxxx00 4-byte length word, aligned, uncompressed data (up to 1G)•xxxxxx10 4-byte length word, aligned, compressed data (up to 1G)•00000001 1-byte length word, unaligned, TOAST pointer•xxxxxxx1 1-byte length word, unaligned, uncompressed data (up to 126b)
第一个是ff,也就是11111111(后面7个bit不全为0,所以是varattrib_1b的数据类型),长度是127byte(header 1byte + 长度 126byte)
第二个超过了126byte,header是0c020000(4byte的header,所以是varattrib_4b的数据类型),最高位为0,第二位是0,表示未压缩过,剩余的30bit代表长度,也就是00000010|000011,长度是131byte,减去header的4byte,所以长度是127byte
那么再看看压缩和没有压缩的数据
mydb=# create table test(a varchar);
CREATE TABLE
mydb=# insert into test values(repeat('-',2004)),(repeat('-',2005));
INSERT 0 2
mydb=# select lp, t_data from heap_page_items(get_raw_page('test', 0));
1 | \x601f00002d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d
2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d
2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d
2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d
2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d
2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d
2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d
2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d
2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d
2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d
2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d
2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d
2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d
2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d
2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d
2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d
2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d
2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d
2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d
2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d
2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d
2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d
2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d
2 | \x8e000000d5070000fe2d0f01ff0f01ff0f01ff0f01ff0f01ff0f01ff0f01ff010f014b
(2 rows)
mydb=# \d+ test
Table "public.test"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
--------+-------------------+-----------+----------+---------+----------+--------------+-------------
a | character varying | | | | extended | |
Access method: heap
postgres=# select pg_column_size(a) from test;
pg_column_size
----------------
2008
35
(2 rows)
因为a列是extended,那么允许压缩和行外存储。一般会先压缩,如果还是太大,就会行外存储。这是大多数可以TOAST的数据类型的默认策略。
•xxxxxx00 4-byte length word, aligned, uncompressed data (up to 1G)•xxxxxx10 4-byte length word, aligned, compressed data (up to 1G)•00000001 1-byte length word, unaligned, TOAST pointer•xxxxxxx1 1-byte length word, unaligned, uncompressed data (up to 126b)
再参照这个式子,第一行数据,x601f0000,01100000 00011111,表示未压缩过的数据,剩下的是长度,也就是00011111011000,也就是2008,减去4byte的header,所以实际数据长度是2004
而第二行,也是依葫芦画瓢,\x8e00,1000 1110,0000 0000,表示对齐的压缩过的数据,压缩后的长度是0000 00001000 11,也就是35个字节,不过值得注意的是,header后面跟了一个uncompressed data的长度。
varattrib_1b_e类型
TOAST
先温故一下TOAST,TOAST是"The Oversized-Attribute Storage Technique"(超尺寸属性存储技术)的缩写,主要用于存储一个大字段的值。要理解TOAST,我们要先理解页(BLOCK)的概念。在PostgreSQL中,页是数据在文件存储中的基本单位,其大小是固定的且只能在编译期指定,之后无法修改,默认的大小为8KB。同时,PostgreSQL不允许一行数据跨页存储。那么对于超长的行数据,PostgreSQL就会启动TOAST,将大的字段压缩或切片成多个物理行存到另一张系统表中(TOAST表),这种存储方式叫行外存储。
只有特定的数据类型支持TOAST,因为那些整数、浮点数等不太长的数据类型是没有必要使用TOAST的。
在 PostgreSQL 中每个表字段有四种 TOAST 的策略:
•PLAIN :避免压缩和行外存储。只有那些不需要 TOAST 策略就能存放的数据类型允许选择(例如 int 类型),而对于 text 这类要求存储长度超过页大小的类型,是不允许采用此策略的。•EXTENDED:允许压缩和行外存储。一般会先压缩,如果还是太大,就会行外存储。这是大多数可以TOAST的数据类型的默认策略。•EXTERNAL:允许行外存储,但不许压缩。这让在text类型和bytea类型字段上的子串操作更快。类似字符串这种会对数据的一部分进行操作的字段,采用此策略可能获得更高的性能,因为不需要读取出整行数据再解压。•MAIN:允许压缩,但不许行外存储。不过实际上,为了保证过大数据的存储,行外存储在其它方式(例如压缩)都无法满足需求的情况下,作为最后手段还是会被启动。因此理解为:尽量不使用行外存储更贴切。
varattrib_1b_e
定义如下,它并不存储数据,只是指向了外部数据的地址。根据外部数据的存储位置,可以分为几种格式。首先看看它的定义:
/* TOAST pointers are a subset of varattrib_1b with an identifying tag byte */
typedef struct
{
uint8 va_header; /* Always 0x80 or 0x01 */
uint8 va_tag; /* Type of datum */
char va_data[FLEXIBLE_ARRAY_MEMBER]; /* Type-specific data */
} varattrib_1b_e;
/*
* Type tag for the various sorts of "TOAST pointer" datums. The peculiar
* value for VARTAG_ONDISK comes from a requirement for on-disk compatibility
* with a previous notion that the tag field was the pointer datum's length.
*/
typedef enum vartag_external
{
VARTAG_INDIRECT = 1, ---一个varlena指针,可以指向varatt_external,varatt_expanded,或者是varattrib_1b,varattrib_4b 类型的原始数据。
VARTAG_EXPANDED_RO = 2, ---外部数据是存储在内存中,只读
VARTAG_EXPANDED_RW = 3, ---外部数据是存储在内存中,读写
VARTAG_ONDISK = 18 ---表示外部数据存储在磁盘中
} vartag_external;
第二个字节的va_tag
表示类型,有上面四种。每种类型下, va_data
存储的格式都不是一样的。
看下例子:
mydb=# create table test(a varchar);
CREATE TABLE
mydb=# alter table test alter a set storage external ;
ALTER TABLE
mydb=# \d+ test
Table "public.test"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
--------+-------------------+-----------+----------+---------+----------+--------------+-------------
a | character varying | | | | external | |
Access method: heap
mydb=# insert into test values(repeat('-',2005));;
INSERT 0 1
mydb=# insert into test values('');
INSERT 0 1
mydb=# select lp, t_data from heap_page_items(get_raw_page('test', 0));
lp | t_data
----+----------------------------------------
1 | \x0112d9070000d50700003d44000036440000
2 | \x03
(2 rows)
mydb=# select relname from pg_class where oid = (select reltoastrelid from pg_class where relname='test');
relname
----------------
pg_toast_17459
(1 row)
mydb=# select * from pg_toast.pg_toast_17459 limit 1;
chunk_id | chunk_seq |
chunk_data
----------+-----------+-------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-----------------------------------------------------------
17469 | 0 | \x2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d
2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d
2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d
2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d
2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d
2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d
2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d
2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d
2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d
2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d
2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d
2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d
2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d
2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d
2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d
2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d
2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d
2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d
2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d
2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d
2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d
2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d
2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d2d
(1 row)
•xxxxxx00 4-byte length word, aligned, uncompressed data (up to 1G)•xxxxxx10 4-byte length word, aligned, compressed data (up to 1G)•00000001 1-byte length word, unaligned, TOAST pointer•xxxxxxx1 1-byte length word, unaligned, uncompressed data (up to 126b)
原始数据是,\x01|12|d9070000|d5070000|3d440000|36440000,为了方便,我简单切了一下,让我们逐个来分析一下
1.因为是1byte的header,\x01,也就是00000001,所以对应第三个,使用了TOAST2.十六进制的12,也就是十进制的18,对应 VARTAG_ONDISK = 18
,也就是说外部数据存储在磁盘中3.存储在磁盘的话,对应下面的varatt_external
结构体,所以前4个字节是Original data size (includes header),后四个字节是External saved size (doesn't),再是TOAST值的OID和TOAST表的OID4.TOAST的value id则是3d44,也就是0x443d,也就是17469,而TOAST的oid则是3644,也就是0x4436,换算成10进制,则是17462
typedef struct varatt_external
{
int32 va_rawsize; /* Original data size (includes header) */
int32 va_extsize; /* External saved size (doesn't) */
Oid va_valueid; /* Unique ID of value within TOAST table */
Oid va_toastrelid; /* RelID of TOAST table containing it */
} varatt_external;
mydb=# select reltoastrelid from pg_class where relname='test';
reltoastrelid
---------------
17462
(1 row)
mydb=# select relname from pg_class where oid = (select reltoastrelid from pg_class where relname='test');
relname
----------------
pg_toast_17459
(1 row)
mydb=# \d pg_toast.pg_toast_17459;
TOAST table "pg_toast.pg_toast_17459"
Column | Type
------------+---------
chunk_id | oid ---用来表示特定TOAST值的OID,具有同样chunk_id值的所有行组成原表的TOAST字段的一行数据。
chunk_seq | integer ---用来表示该行数据在整个数据中的位置。
chunk_data | bytea ---该Chunk实际的数据。
Owning table: "public.test"
Indexes:
"pg_toast_17459_index" PRIMARY KEY, btree (chunk_id, chunk_seq)
mydb=# select chunk_id from pg_toast.pg_toast_17459;
chunk_id
----------
17469
17469
(2 rows)
小结
通过pg_type我们可以知道一些数据类型的相关信息:
1.typbyval
判断数据类型的传递方式,是通过传值还是传引用。变长类型总是传引用,比如bytea、text。注意即使长度允许传值, typbyval
也可以为假。2.typlend
看到的 -2 对应的是cstring和unknown类型,-1对应的是其他变长类型。关于unknown,可以从这篇文章中瞥见一二:PostgreSQL 隐式类型转换探秘3.typalign
是对齐的方式,和C语言结构体类似c
= char
对齐,即不需要对齐。 s
= short
对齐(在大部分机器上为2字节)。 i
= int
对齐(在大部分机器上为4字节)。 d
= double
对齐(在很多机器上为8字节,但绝不是全部)
postgres=# select typname,typbyval,typlen,typalign from pg_type limit 10;
typname | typbyval | typlen | typalign
------------+----------+--------+----------
bool | t | 1 | c
bytea | f | -1 | i
char | t | 1 | c
name | f | 64 | c
int8 | t | 8 | d
int2 | t | 2 | s
int2vector | f | -1 | i
int4 | t | 4 | i
regproc | t | 4 | i
text | f | -1 | i
(10 rows)
postgres=# select typname,typbyval,typlen,typalign from pg_type where typlen = '-2';
typname | typbyval | typlen | typalign
---------+----------+--------+----------
unknown | f | -2 | c
cstring | f | -2 | c
(2 rows)
知道了数据类型的长度和对齐方式之后,就可以看下具体效果了,一行数据就可以节省这么多字节,上亿条的表可想而知可以节省多少IO。
postgres=# create table test1(a char,b int2,c char,d int4,e char,f int8);
CREATE TABLE
postgres=# create table test2(a int8,b int4,c int2,d char,e char,f char);
CREATE TABLE
postgres=# insert into test1 values('a',1,'b',1,'c',1);
INSERT 0 1
postgres=# insert into test2 values(1,1,1,'a','b','c');
INSERT 0 1
postgres=# create extension pageinspect ;
CREATE EXTENSION
postgres=# select lp, t_data from heap_page_items(get_raw_page('test1', 0));
lp | t_data
----+----------------------------------------------------
1 | \x056101000562000001000000056300000100000000000000
(1 row)
postgres=# select lp, t_data from heap_page_items(get_raw_page('test2', 0));
lp | t_data
----+--------------------------------------------
1 | \x0100000000000000010000000100056105620563
(1 row)
postgres=# select pg_column_size(test1.*) from test1;
pg_column_size
----------------
48
(1 row)
postgres=# select pg_column_size(test2.*) from test2;
pg_column_size
----------------
44
(1 row)
所以,我们在设计表结构的时候,需要依循下面的规则来设计:
1.非NULL的固定长度属性先放2.然后把NULL少的固定长度属性放在前面3.将所有可变长度的属性移到右边
按照上面的方式进行对比,可以发现TPS提升了1.27倍之多。
MAXALIGN
前文提到的MAXALIGN就是一个宏定义,有点晦涩难懂,我们只需要记住的是,PostgreSQL在为结构体申请空间时会考虑到字节对齐,所以计算一行大小的时候需要将对齐考虑进去。
#define MAXALIGN(LEN) TYPEALIGN(MAXIMUM_ALIGNOF, (LEN))
/* MAXALIGN covers only built-in types, not buffers */
#define TYPEALIGN(ALIGNVAL,LEN) \
(((uintptr_t) (LEN) + ((ALIGNVAL) - 1)) & ~((uintptr_t) ((ALIGNVAL) - 1)))
关于这个MAXALIGN,可以参考这篇文章:https://www.jqhtml.com/65596.html
到处都可以看到相关身影
#define SizeOfXLogLongPHD MAXALIGN(sizeof(XLogLongPageHeaderData))
其他数据类型的对齐方式
bytea
关于bytea,在官网上写了,bytea的存储大小是1字节或者4字节再加上实际的二进制串,通过前文的分析,这里的为什么有时是1,有时是4,就不难解释了。
Now, a short varlena (up to 126 data bytes) is reduced to a 1-byte header and we don't align it. To hide this from datatype-specific functions that don't want to deal with it, such a datum is considered "toasted" and will be expanded back to the normal 4-byte-header format by pg_detoast_datum.
Name | Storage Size | Description |
bytea | 1 or 4 bytes plus the actual binary string | variable-length binary string |
postgres=# create table t_bytea (b bool, ba bytea);
CREATE TABLE
postgres=# insert into t_bytea values(false, '\xFF');
INSERT 0 1
postgres=# insert into t_bytea values(false, '\xFFFF');
INSERT 0 1
postgres=# insert into t_bytea values(false, '\xFFFFFF');
INSERT 0 1
postgres=# insert into t_bytea values(false, '\xFFFFFFFF');
INSERT 0 1
postgres=# insert into t_bytea values(false, '\xFFFFFFFFFF');
INSERT 0 1
postgres=# insert into t_bytea values(false, '\xFFFFFFFFFFFF');
INSERT 0 1
postgres=# select pg_column_size(t_bytea.*), pg_column_size(b), pg_column_size(ba) from t_bytea;
pg_column_size | pg_column_size | pg_column_size
----------------+----------------+----------------
27 | 1 | 2
28 | 1 | 3
29 | 1 | 4
30 | 1 | 5
31 | 1 | 6
32 | 1 | 7
(6 rows)
postgres=# select pg_column_size(t_bytea.*), pg_column_size(b), pg_column_size(ba),length(ba) from t_bytea;
pg_column_size | pg_column_size | pg_column_size | length
----------------+----------------+----------------+--------
27 | 1 | 2 | 1
28 | 1 | 3 | 2
29 | 1 | 4 | 3
30 | 1 | 5 | 4
31 | 1 | 6 | 5
32 | 1 | 7 | 6
(6 rows)
text
postgres=# create table t_text (b bool, t text);
CREATE TABLE
postgres=# insert into t_text values(false, 'a');
INSERT 0 1
postgres=# insert into t_text values(false, 'aa');
INSERT 0 1
postgres=# insert into t_text values(false, 'aaa');
INSERT 0 1
postgres=# insert into t_text values(false, 'aaaa');
INSERT 0 1
postgres=# insert into t_text values(false, 'aaaaa');
INSERT 0 1
postgres=# insert into t_text values(false, 'aaaaaa');
INSERT 0 1
postgres=# select pg_column_size(t_text.*), pg_column_size(b), pg_column_size(t) from t_text;
pg_column_size | pg_column_size | pg_column_size
----------------+----------------+----------------
27 | 1 | 2
28 | 1 | 3
29 | 1 | 4
30 | 1 | 5
31 | 1 | 6
32 | 1 | 7
(6 rows)
text[]
postgres=# create table t_text_arr (b bool, arr text[]);
CREATE TABLE
postgres=# insert into t_text_arr values(false, ARRAY['a']);
INSERT 0 1
postgres=# insert into t_text_arr values(false, ARRAY['a']);
INSERT 0 1
postgres=# insert into t_text_arr values(false, ARRAY['aa']);
INSERT 0 1
postgres=# insert into t_text_arr values(false, ARRAY['aaa']);
INSERT 0 1
postgres=# insert into t_text_arr values(false, ARRAY['aaaa']);
INSERT 0 1
postgres=# insert into t_text_arr values(false, ARRAY['aaaaa']);
INSERT 0 1
postgres=# insert into t_text_arr values(false, ARRAY['aaaaaa']);
INSERT 0 1
postgres=# select pg_column_size(t_text_arr.*), pg_column_size(b), pg_column_size(arr) from t_text_arr;
pg_column_size | pg_column_size | pg_column_size
----------------+----------------+----------------
54 | 1 | 29
54 | 1 | 29
54 | 1 | 29
54 | 1 | 29
58 | 1 | 33
58 | 1 | 33
(6 rows)
text[]就比较夸张了,header占了很多字节。
总结
所以,按照字节对齐的要求,假如不存在空值位图和变长属性的话,每一行大小是可以算出来的。
固定长度的元组会进行适当的对齐:
而变长的元组会根据header的大小决定是否对齐,4byte的header会进行对齐(按照4byte),而1byte的header则不会进行对齐。
现在基本都是64位的计算机了,对齐的方式也有点差异
所以为什么资深存储人员为啥总叫你注意 IO 对齐,机械磁盘 IO 为什么要 512 对齐呢,SSD 盘为啥要 4K 对齐?看下面这个例子
场景:读 512 个字节,但是偏移不对齐?
步骤如下:
1.磁头摆到这 512 个字节数据所在 2 个扇区位置,对齐到扇区开始的偏移;2.读取 2 个完整扇区(1024 Bytes)的数据到内存;3.从这 1024 字节的内存中,copy 出用户要用的 512 个字节,给到用户;
开销:读放大。虽然读的是 1 个扇区的数据,但是偏移却没对齐,所以必须要读 2 个扇区,也放大了一倍的流量。
字节对齐最终目的还是为了提高效率,计算机从内存中取数据是按照一个固定长度的。以32位机为例,它每次取32个位,也就是4个字节,以int型数据为例,如果它在内存中存放的位置按4字节对齐,也就是说1个int的数据全部落在计算机一次取数的区间内,那么只需要取一次就可以了。假如是存放的位置是第三个字节,就得读取两次,然后copy出用户要的数据。
并且在现代大多数处理器中,由于内存相对于CPU还是慢的多,所以又引入了L1、L2、L3这几层Cache,Cache会被分为很多行(Cache Line),Cache Line可以简单的理解为CPU Cache中的最小缓存单位。目前主流的CPU Cache的Cache Line大小都是64Bytes
[postgres@xiongcc ~]$ cat /sys/devices/system/cpu/cpu1/cache/index0/coherency_line_size
64
[postgres@xiongcc ~]$ cat /sys/devices/system/cpu/cpu1/cache/index1/coherency_line_size
64
假设我们有一个512字节的一级缓存,那么按照64B的缓存单位大小来算,这个一级缓存所能存放的缓存个数就是512/64 = 8
个。当一行 Cache Line 被从内存拷贝到 Cache 里,Cache 里会为这个 Cache Line 创建一个条目。这个 Cache 条目里既包含了拷贝的内存数据,即 Cache Line,又包含了这行数据在内存里的位置等元数据信息。每一条Cache Line一般包含16个字节,32个字节或64个字节等。比如某个计算机一级Cache的Cache Line长度是32个字节,那么每段Cache Line总是会包含32个字节对齐的一段内存。现在有一个4字节的整数,如果它的地址不是4字节对齐的,那么就有可能访问它的时候,需要使用两条Cache Line,这增加了总线通讯量,而且增加了对Cache的使用量,而且使用的数据没有在Cache里面(这时需要将数据从内存调入Cache,会非常慢)的机会会增加,这些都降低了程序的速度。
巨人的肩膀
https://gitee.com/yueludanfeng/blog/blob/master/201103/20110329_01.md
https://www.postgresql.org/docs/current/pageinspect.html
https://yieldnull.com/blog/08a509370a04d5051eddbe84e15bb186b9ddcbad/
https://www.postgresql.org/docs/14/storage-page-layout.html
https://zhmin.github.io/posts/postgresql-varlena/
https://www.enterprisedb.com/postgres-tutorials/data-alignment-postgresql
http://cenalulu.github.io/linux/all-about-cpu-cache/
https://mp.weixin.qq.com/s/NwSSlO2t2hZ5L3i8D8bc8Q