GBase有奖征文1 | 数据平台开启转型之路:GBase 8a在复杂查询场景下的性能实践
数据平台开启转型之路:GBase 8a在复杂查询场景下的性能实践
数据库国产化转型是最近几年的热点,不管是从技术角度还是从行业政策角度来看,都会对我们的数据平台未来发展有着重要的影响,提前完成技术储备和POC验证测试是十分有必要的。
前 言
目前数据平台管理的数据主要是结构化和半结构化数据,结构化数据处理目前还是以Oracle和MySQL为主,其中核心业务数据具有典型特点:
部分业务表数据更新频繁,随着时间线会持续增长
自关联复杂查询多,查询响应时间要求高
操作记录型数据整体占比很高,随着时间线呈现单向增长,查询需求多变
数据生命周期管理
曾经有一段时间负责国产数据库研发工作,对主流的开源数据库和国产数据库有整体的调研了解。结合我们自身业务数据的特点,秉承稳步推进、维持业务连续性、最小投入以及数据零丢失的原则,计划引入MPP数据库集群架构,对复杂业务结构化查询进行引流,优化核心数据库平台成本高昂的IO、CPU和内存资源占用。
MPP数据库集群架构
大规模并行分析(MPP)数据库(Analytical Massively Parallel Processing (MPP) Databases)是针对分析工作负载进行了优化的数据库:聚合和处理大型数据集。部分MPP数据库产品有良好的SQL兼容能力和事务处理能力,对于我们来说,暂时不需要太多的处理节点,大多数分析围绕结构化数据展开,习惯使用传统RDBMS所遵循的ANSI标准SQL,所以考虑选用Greenplum/GBase 8a等开源和国产数据库产品。
GBase 8a MPP Cluster
GBase 8a MPP Cluster是在GBase 8a列存储数据库基础上开发的,基亍现代于计算MPP理念和Shared Nothing架构的幵行数据库集群:
Shared Nothing + MPP 架构
集群扁平架构
基亍列存储
高劢态扩展能力:横向扩展数百节点
高可用性:Safegroup机制
Greenplum 6
Greenplum数据库通过将数据和处理负载分布在多个服务器或者主机上来存储和处理大量的数据。Greenplum是基于PostgreSQL组成的MPP集群。Master是Greenplum数据库系统的入口,Master会协调与系统中其他称为Segment的数据库实例一起工作,Segment负责存储和处理数据。
典型场景验证
测试样本:
单表,记录数约2.6亿条,导出csv文件大小约50GB。
测试环境:
Oracle 11gR2 物理单机
三节点Greenplum 6beta1/7.0.0-alpha.0集群(2副本) 三台物理机
三节点GBase 8a MPP集群(1+1副本)三台虚拟机
测试说明:
两种场景:数据加载和典型复杂查询
考量维度:数据类型与SQL语法兼容性,数据加载性能,复杂SQL查询性能
测试数据分布具有一定倾斜性
测试环境较为简陋且未遵循严格的基准测试准则,除Oracle的数据库参数有简单优化外,Greenplum和GBase 8a均为安装默认配置,同时还受限于对各个数据库产品内部运行机制的熟悉程度,故本次结果不能作为严格的产品选型对比验证。
GBase 8a集群状态信息
[gbase@gbase8a-test-1 gcinstall]$ gcadmin
CLUSTER STATE: ACTIVE
CLUSTER MODE: NORMAL
======================================================================
| GBASE COORDINATOR CLUSTER INFORMATION |
======================================================================
| NodeName | IpAddress |gcware |gcluster |DataState |
----------------------------------------------------------------------
| coordinator1 | 192.168.151.234 | OPEN | OPEN | 0 |
----------------------------------------------------------------------
| coordinator2 | 192.168.151.235 | OPEN | OPEN | 0 |
----------------------------------------------------------------------
| coordinator3 | 192.168.151.237 | OPEN | OPEN | 0 |
----------------------------------------------------------------------
==================================================================
| GBASE DATA CLUSTER INFORMATION |
==================================================================
|NodeName | IpAddress |gnode |syncserver |DataState |
------------------------------------------------------------------
| node1 | 192.168.151.234 | OPEN | OPEN | 0 |
------------------------------------------------------------------
| node2 | 192.168.151.235 | OPEN | OPEN | 0 |
------------------------------------------------------------------
| node3 | 192.168.151.237 | OPEN | OPEN | 0 |
------------------------------------------------------------------
[gbase@gbase8a-test-1 gcinstall]$ gcadmin showdistribution node
Distribution ID: 1 | State: new | Total segment num: 6
====================================================================================================================================
|nodes | 192.168.151.234 | 192.168.151.235 | 192.168.151.237 |
------------------------------------------------------------------------------------------------------------------------------------
|primary | 1 | 2 | 3 |
|segments | 4 | 5 | 6 |
------------------------------------------------------------------------------------------------------------------------------------
|duplicate | 3 | 1 | 2 |
|segments 1| 5 | 6 | 4 |
====================================================================================================================================
GreenPlum集群状态信息
采取源码自编译部署
首次部署时使用gpdb-6X_STABLE版本:gpdb-6X_STABLE PostgreSQL 9.4.24 (Greenplum Database 6.0.0-beta.1 build dev)
后来升级为 PostgreSQL 9.6beta4 (Greenplum Database 7.0.0-alpha.0 build dev) on x86_64-unknown-linux-gnu
gpstate:dba-testsvr-1:dgadmin-[INFO]:-Starting gpstate with args:
gpstate:dba-testsvr-1:dgadmin-[INFO]:-local Greenplum Version: 'postgres (Greenplum Database) 7.0.0-alpha.0 build dev'
gpstate:dba-testsvr-1:dgadmin-[INFO]:-master Greenplum Version: 'PostgreSQL 9.6beta4 (Greenplum Database 7.0.0-alpha.0 build dev) on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 5.5.0, 64-bit compiled
gpstate:dba-testsvr-1:dgadmin-[INFO]:-Obtaining Segment details from master...
gpstate:dba-testsvr-1:dgadmin-[INFO]:-Gathering data from segments...
gpstate:dba-testsvr-1:dgadmin-[INFO]:-Greenplum instance status summary
gpstate:dba-testsvr-1:dgadmin-[INFO]:-----------------------------------------------------
gpstate:dba-testsvr-1:dgadmin-[INFO]:- Master instance = Active
gpstate:dba-testsvr-1:dgadmin-[INFO]:- Master standby = No master standby configured
gpstate:dba-testsvr-1:dgadmin-[INFO]:- Total segment instance count from metadata = 6
gpstate:dba-testsvr-1:dgadmin-[INFO]:-----------------------------------------------------
gpstate:dba-testsvr-1:dgadmin-[INFO]:- Primary Segment Status
gpstate:dba-testsvr-1:dgadmin-[INFO]:-----------------------------------------------------
gpstate:dba-testsvr-1:dgadmin-[INFO]:- Total primary segments = 3
gpstate:dba-testsvr-1:dgadmin-[INFO]:- Total primary segment valid (at master) = 3
gpstate:dba-testsvr-1:dgadmin-[INFO]:- Total primary segment failures (at master) = 0
gpstate:dba-testsvr-1:dgadmin-[INFO]:- Total number of postmaster.pid files missing = 0
gpstate:dba-testsvr-1:dgadmin-[INFO]:- Total number of postmaster.pid files found = 3
gpstate:dba-testsvr-1:dgadmin-[INFO]:- Total number of postmaster.pid PIDs missing = 0
gpstate:dba-testsvr-1:dgadmin-[INFO]:- Total number of postmaster.pid PIDs found = 3
gpstate:dba-testsvr-1:dgadmin-[INFO]:- Total number of /tmp lock files missing = 0
gpstate:dba-testsvr-1:dgadmin-[INFO]:- Total number of /tmp lock files found = 3
gpstate:dba-testsvr-1:dgadmin-[INFO]:- Total number postmaster processes missing = 0
gpstate:dba-testsvr-1:dgadmin-[INFO]:- Total number postmaster processes found = 3
gpstate:dba-testsvr-1:dgadmin-[INFO]:-----------------------------------------------------
gpstate:dba-testsvr-1:dgadmin-[INFO]:- Mirror Segment Status
gpstate:dba-testsvr-1:dgadmin-[INFO]:-----------------------------------------------------
gpstate:dba-testsvr-1:dgadmin-[INFO]:- Total mirror segments = 3
gpstate:dba-testsvr-1:dgadmin-[INFO]:- Total mirror segment valid (at master) = 3
gpstate:dba-testsvr-1:dgadmin-[INFO]:- Total mirror segment failures (at master) = 0
gpstate:dba-testsvr-1:dgadmin-[INFO]:- Total number of postmaster.pid files missing = 0
gpstate:dba-testsvr-1:dgadmin-[INFO]:- Total number of postmaster.pid files found = 3
gpstate:dba-testsvr-1:dgadmin-[INFO]:- Total number of postmaster.pid PIDs missing = 0
gpstate:dba-testsvr-1:dgadmin-[INFO]:- Total number of postmaster.pid PIDs found = 3
gpstate:dba-testsvr-1:dgadmin-[INFO]:- Total number of /tmp lock files missing = 0
gpstate:dba-testsvr-1:dgadmin-[INFO]:- Total number of /tmp lock files found = 3
gpstate:dba-testsvr-1:dgadmin-[INFO]:- Total number postmaster processes missing = 0
gpstate:dba-testsvr-1:dgadmin-[INFO]:- Total number postmaster processes found = 3
gpstate:dba-testsvr-1:dgadmin-[INFO]:- Total number mirror segments acting as primary segments = 0
gpstate:dba-testsvr-1:dgadmin-[INFO]:- Total number mirror segments acting as mirror segments = 3
gpstate:dba-testsvr-1:dgadmin-[INFO]:-----------------------------------------------------
数据加载
Oracle 11gR2:
create table record_detail
(
xxxxid numeric(20) not null,
xxxxxxid VARCHAR(20) not null,
xxxxxxxxxxserial VARCHAR(26) not null,
xxxxxxdate DATE not null,
xxxxxxdate DATE not null,
xxxxxxyear numeric(3) not null,
xxxce numeric(8,2) not null,
xxxxxid VARCHAR(7) not null,
xxxxxxxxxxtype VARCHAR(10),
xxxxxxflag numeric(1),
xxxxxxxunit CHAR(1),
xxxxxxxname VARCHAR(255),
xxxxxxxxxtion VARCHAR(20),
xxxxxxxxtype VARCHAR(10),
xxxxxxxxcode VARCHAR(10) not null,
xxxxxxxdate DATE,
xxxxxxxxxxid VARCHAR(6) not null,
xxxxxxxxxxid VARCHAR(6) not null,
xxxxdate DATE not null,
xxxxxxxid VARCHAR(20),
xxxxxxyear varchar(10),
xxxxxxxxxxxxxxxyear varchar(10)
) tablespace test_data;
SQL*Loader: Release 11.2.0.4.0
Copyright (c) 1982, 2011, Oracle and/or its affiliates. All rights reserved.
Control File: record_detail.ctl
Data File: record_detail.csv
Bad File: record_detail.bad
Discard File: record_detail.dsc
(Allow all discards)
Number to load: ALL
Number to skip: 0
Errors allowed: 10000
Continuation: none specified
Path used: Direct
Silent options: FEEDBACK
Table RECORD_DETAIL, loaded from every logical record.
Insert option in effect for this table: TRUNCATE
TRAILING NULLCOLS option in effect
...
Table RECORD_DETAIL:
259761870 Rows successfully loaded.
0 Rows not loaded due to data errors.
0 Rows not loaded because all WHEN clauses were failed.
0 Rows not loaded because all fields were null.
Date conversion cache disabled due to overflow (default size: 1000)
Bind array size not used in direct path.
Column array rows : 5000
Stream buffer bytes: 256000
Read buffer bytes: 1048576
Total logical records skipped: 0
Total logical records read: 259761870
Total logical records rejected: 0
Total logical records discarded: 0
Total stream buffers loaded by SQL*Loader main thread: 94724
Total stream buffers loaded by SQL*Loader load thread: 107011
Elapsed time was: 00:51:45.51
CPU time was: 00:45:25.52
Greenplum 6:
testcluster1=# \d record_detail;
Table "public.record_detail"
Column | Type | Modifiers
---------------------+-----------------------------+-----------
xxxxid | numeric(20,0) | not null
xxxxxxid | character varying(20) | not null
xxxxxxxxxxserial | character varying(26) | not null
xxxxxxdate | timestamp without time zone | not null
xxxxxxdate | timestamp without time zone | not null
xxxxxxyear | numeric(3,0) | not null
xxxce | numeric(8,2) | not null
xxxxxid | character varying(7) | not null
xxxxxxxxxxtype | character varying(10) |
xxxxxxflag | numeric(1,0) |
xxxxxxxunit | character(1) |
xxxxxxxname | character varying(255) |
xxxxxxxxxtion | character varying(20) |
xxxxxxxxtype | character varying(10) |
xxxxxxxxcode | character varying(10) | not null
xxxxxxxdate | timestamp without time zone |
xxxxxxxxxxid | character varying(6) | not null
xxxxxxxxxxid | character varying(6) | not null
xxxxdate | timestamp without time zone | not null
xxxxxxxid | character varying(20) |
xxxxxxyear | character varying(10) |
xxxxxxxxxxxxxxxyear | character varying(10) |
Distributed by: (xxxxid)
testcluster1=# COPY record_detail FROM '/home/dgadmin/record_detail.csv'
testcluster1-# WITH DELIMITER ','
testcluster1-# SEGMENT REJECT LIMIT 10 ROWS;
COPY 259761870
Time: 1138945.854 ms
GBase 8a:
create table record_detail
(
xxxxid numeric(20) not null,
xxxxxxid VARCHAR(20) not null,
xxxxxxxxxxserial VARCHAR(26) not null,
xxxxxxdate DATETIME not null,
xxxxxxdate DATETIME not null,
xxxxxxyear numeric(3) not null,
xxxce numeric(8,2) not null,
xxxxxid VARCHAR(7) not null,
xxxxxxxxxxtype VARCHAR(10),
xxxxxxflag numeric(1),
xxxxxxxunit CHAR(1),
xxxxxxxname VARCHAR(255),
xxxxxxxxxtion VARCHAR(20),
xxxxxxxxtype VARCHAR(10),
xxxxxxxxcode VARCHAR(10) not null,
xxxxxxxdate DATETIME,
xxxxxxxxxxid VARCHAR(6) not null,
xxxxxxxxxxid VARCHAR(6) not null,
xxxxdate DATETIME not null,
xxxxxxxid VARCHAR(20),
xxxxxxyear varchar(10),
xxxxxxxxxxxxxxxyear varchar(10)
) DISTRIBUTED BY('xxxxid');
gbase> load data infile 'file://192.168.151.234/home/gbase/record_detail.csv' INTO TABLE testdb.record_detail FIELDS TERMINATED BY ',';
Query OK, 259761870 rows affected (Elapsed: 00:09:25.35)
Task 29 finished, Loaded 259761870 records, Skipped 0 records
典型复杂查询
查询来源于实际的Oracle生产环境,具体文本略。
SQL均连续执行三次,取最优时间,以减少IO影响。
Oracle执行结果:
OPERATION_TYPE EXPEND_COUNT EXPEND_MONEY EXPEND_YEAR
-------------- ------------ ------------ -----------
add 962480 21121428 1068391
transfer 42637 1282589 42637
renew 320150 12078327 394506
Executed in 309.319 seconds
Greenplum执行结果:
operation_type | expend_count | expend_money | expend_year
----------------+--------------+--------------+-------------
transfer | 42637 | 1282589.00 | 42637
add | 962480 | 21121428.00 | 1068391
renew | 320150 | 12078327.00 | 394506
(3 rows)
Time: 33220.760 ms
GBase 8a:
+----------------+--------------+--------------+-------------+
| OPERATION_TYPE | EXPEND_COUNT | EXPEND_MONEY | EXPEND_YEAR |
+----------------+--------------+--------------+-------------+
| renew | 320150 | 12078327.00 | 394506 |
| transfer | 42637 | 1282589.00 | 42637 |
| add | 962480 | 21121428.00 | 1068391 |
+----------------+--------------+--------------+-------------+
3 rows in set (Elapsed: 00:00:05.92)
小 结
特性 | Oracle | GreenPlum | GBase 8a |
架构 | 单机 | 三节点集群 | 三节点集群 |
部署难易 | N/A | 简单 | 简单 |
数据类型兼容性 | N/A | 兼容 | 兼容 |
数据加载性能 | 3105秒 | 1139秒 | 565秒 |
SQL语法兼容性 | N/A | 微调 | 无需调整 |
复杂SQL查询性能 | 309秒 | 33秒 | 6秒 |
经过简单的对比分析,可以看出GBase 8a对结构化数据的复杂SQL查询性能加速方面具有较大的优势。在数据生命周期管理中引入MPP类数据库产品,既可以改善现有复杂查询的性能体验,又可以维持现有的SQL使用习惯,可以作为核心业务数据库平台到大数据平台的过渡。目前已经建设完成六节点GBase 8a MPP集群,将逐步接管部分非实时查询业务,实现核心数据库平台的负载分流。
THE END
—— 更多精彩,请戳图片 ——