查看原文
其他

GBase有奖征文优选文章1 | 数据平台开启转型之路:GBase 8a在复杂查询场景下的性能实践

黄永厚 GBASE数据库
2024-11-02


数据平台开启转型之路:GBase 8a在复杂查询场景下的性能实践


文 | 黄永厚

2021年初有幸参加了南大通用组织的线上免费GBase 8a GDCA培训课程,培训结束后本来想写一篇培训感受和学习心得,后来因为负责全力推进核心数据库平台同城双活项目的建设而暂时搁置。


数据库国产化转型是最近几年的热点,不管是从技术角度还是从行业政策角度来看,都会对我们的数据平台未来发展有着重要的影响,提前完成技术储备和POC验证测试是十分有必要的。


前  言


目前数据平台管理的数据主要是结构化和半结构化数据,结构化数据处理目前还是以Oracle和MySQL为主,其中核心业务数据具有典型特点:


  • 分业务表数据更新频繁,随着时间线会持续增长

  • 自关联复杂查询多,查询响应时间要求高

  • 操作记录型数据整体占比很高,随着时间线呈现单向增长,查询需求多变

  • 数据生命周期管理


曾经有一段时间负责国产数据库研发工作,对主流的开源数据库和国产数据库有整体的调研了解。结合我们自身业务数据的特点,秉承稳步推进、维持业务连续性、最小投入以及数据零丢失的原则,计划引入MPP数据库集群架构,对复杂业务结构化查询进行引流,优化核心数据库平台成本高昂的IO、CPU和内存资源占用。


MPP数据库集群架构


大规模并行分析(MPP)数据库(Analytical Massively Parallel Processing (MPP) Databases)是针对分析工作负载进行了优化的数据库:聚合和处理大型数据集。部分MPP数据库产品有良好的SQL兼容能力和事务处理能力,对于我们来说,暂时不需要太多的处理节点,大多数分析围绕结构化数据展开,习惯使用传统RDBMS所遵循的ANSI标准SQL,所以考虑选用Greenplum/GBase 8a等开源和国产数据库产品。


GBase 8a MPP Cluster


GBase 8a MPP Cluster是在GBase 8a列存储数据库基础上开发的,基亍现代于计算MPP理念和Shared Nothing架构的幵行数据库集群:

  • Shared Nothing + MPP 架构

  • 集群扁平架构

  • 基亍列存储

  • 高劢态扩展能力:横向扩展数百节点

  • 高可用性:Safegroup机制



Greenplum 6


Greenplum数据库通过将数据和处理负载分布在多个服务器或者主机上来存储和处理大量的数据。Greenplum是基于PostgreSQL组成的MPP集群。Master是Greenplum数据库系统的入口,Master会协调与系统中其他称为Segment的数据库实例一起工作,Segment负责存储和处理数据。



典型场景验证


测试样本:

单表,记录数约2.6亿条,导出csv文件大小约50GB。


测试环境:

  • Oracle 11gR2 物理单机

  • 三节点Greenplum 6beta1/7.0.0-alpha.0集群(2副本) 三台物理机

  • 三节点GBase 8a MPP集群(1+1副本)三台虚拟机


测试说明:

  1. 两种场景:数据加载和典型复杂查询

  2. 考量维度:数据类型与SQL语法兼容性,数据加载性能,复杂SQL查询性能

  3. 测试数据分布具有一定倾斜性

  4. 测试环境较为简陋且未遵循严格的基准测试准则,除Oracle的数据库参数有简单优化外,Greenplum和GBase 8a均为安装默认配置,同时还受限于对各个数据库产品内部运行机制的熟悉程度,故本次结果不能作为严格的产品选型对比验证。


GBase 8a集群状态信息


[gbase@gbase8a-test-1 gcinstall]$ gcadminCLUSTER STATE: ACTIVECLUSTER MODE: NORMAL======================================================================| GBASE COORDINATOR CLUSTER INFORMATION |======================================================================| NodeName | IpAddress |gcware |gcluster |DataState |----------------------------------------------------------------------| coordinator1 | 192.168.151.234 | OPEN | OPEN | 0 |----------------------------------------------------------------------| coordinator2 | 192.168.151.235 | OPEN | OPEN | 0 |----------------------------------------------------------------------| coordinator3 | 192.168.151.237 | OPEN | OPEN | 0 |----------------------------------------------------------------------==================================================================| GBASE DATA CLUSTER INFORMATION |==================================================================|NodeName | IpAddress |gnode |syncserver |DataState |------------------------------------------------------------------| node1 | 192.168.151.234 | OPEN | OPEN | 0 |------------------------------------------------------------------| node2 | 192.168.151.235 | OPEN | OPEN | 0 |------------------------------------------------------------------| node3 | 192.168.151.237 | OPEN | OPEN | 0 |------------------------------------------------------------------[gbase@gbase8a-test-1 gcinstall]$ gcadmin showdistribution node Distribution ID: 1 | State: new | Total segment num: 6====================================================================================================================================|nodes | 192.168.151.234 | 192.168.151.235 | 192.168.151.237 |------------------------------------------------------------------------------------------------------------------------------------|primary | 1 | 2 | 3 ||segments | 4 | 5 | 6 |------------------------------------------------------------------------------------------------------------------------------------|duplicate | 3 | 1 | 2 ||segments 1| 5 | 6 | 4 |====================================================================================================================================


 GreenPlum集群状态信息


采取源码自编译部署


首次部署时使用gpdb-6X_STABLE版本:gpdb-6X_STABLE PostgreSQL 9.4.24 (Greenplum Database 6.0.0-beta.1 build dev)
后来升级为 PostgreSQL 9.6beta4 (Greenplum Database 7.0.0-alpha.0 build dev) on x86_64-unknown-linux-gnu


gpstate:dba-testsvr-1:dgadmin-[INFO]:-Starting gpstate with args: gpstate:dba-testsvr-1:dgadmin-[INFO]:-local Greenplum Version: 'postgres (Greenplum Database) 7.0.0-alpha.0 build dev'gpstate:dba-testsvr-1:dgadmin-[INFO]:-master Greenplum Version: 'PostgreSQL 9.6beta4 (Greenplum Database 7.0.0-alpha.0 build dev) on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 5.5.0, 64-bit compiledgpstate:dba-testsvr-1:dgadmin-[INFO]:-Obtaining Segment details from master...gpstate:dba-testsvr-1:dgadmin-[INFO]:-Gathering data from segments...gpstate:dba-testsvr-1:dgadmin-[INFO]:-Greenplum instance status summarygpstate:dba-testsvr-1:dgadmin-[INFO]:-----------------------------------------------------gpstate:dba-testsvr-1:dgadmin-[INFO]:-   Master instance                                           = Activegpstate:dba-testsvr-1:dgadmin-[INFO]:-   Master standby                                            = No master standby configuredgpstate:dba-testsvr-1:dgadmin-[INFO]:-   Total segment instance count from metadata                = 6gpstate:dba-testsvr-1:dgadmin-[INFO]:-----------------------------------------------------gpstate:dba-testsvr-1:dgadmin-[INFO]:-   Primary Segment Statusgpstate:dba-testsvr-1:dgadmin-[INFO]:-----------------------------------------------------gpstate:dba-testsvr-1:dgadmin-[INFO]:-   Total primary segments                                    = 3gpstate:dba-testsvr-1:dgadmin-[INFO]:-   Total primary segment valid (at master)                   = 3gpstate:dba-testsvr-1:dgadmin-[INFO]:-   Total primary segment failures (at master)                = 0gpstate:dba-testsvr-1:dgadmin-[INFO]:-   Total number of postmaster.pid files missing              = 0gpstate:dba-testsvr-1:dgadmin-[INFO]:-   Total number of postmaster.pid files found                = 3gpstate:dba-testsvr-1:dgadmin-[INFO]:-   Total number of postmaster.pid PIDs missing               = 0gpstate:dba-testsvr-1:dgadmin-[INFO]:-   Total number of postmaster.pid PIDs found                 = 3gpstate:dba-testsvr-1:dgadmin-[INFO]:-   Total number of /tmp lock files missing                   = 0gpstate:dba-testsvr-1:dgadmin-[INFO]:-   Total number of /tmp lock files found                     = 3gpstate:dba-testsvr-1:dgadmin-[INFO]:-   Total number postmaster processes missing                 = 0gpstate:dba-testsvr-1:dgadmin-[INFO]:-   Total number postmaster processes found                   = 3gpstate:dba-testsvr-1:dgadmin-[INFO]:-----------------------------------------------------gpstate:dba-testsvr-1:dgadmin-[INFO]:-   Mirror Segment Statusgpstate:dba-testsvr-1:dgadmin-[INFO]:-----------------------------------------------------gpstate:dba-testsvr-1:dgadmin-[INFO]:-   Total mirror segments                                     = 3gpstate:dba-testsvr-1:dgadmin-[INFO]:-   Total mirror segment valid (at master)                    = 3gpstate:dba-testsvr-1:dgadmin-[INFO]:-   Total mirror segment failures (at master)                 = 0gpstate:dba-testsvr-1:dgadmin-[INFO]:-   Total number of postmaster.pid files missing              = 0gpstate:dba-testsvr-1:dgadmin-[INFO]:-   Total number of postmaster.pid files found                = 3gpstate:dba-testsvr-1:dgadmin-[INFO]:-   Total number of postmaster.pid PIDs missing               = 0gpstate:dba-testsvr-1:dgadmin-[INFO]:-   Total number of postmaster.pid PIDs found                 = 3gpstate:dba-testsvr-1:dgadmin-[INFO]:-   Total number of /tmp lock files missing                   = 0gpstate:dba-testsvr-1:dgadmin-[INFO]:-   Total number of /tmp lock files found                     = 3gpstate:dba-testsvr-1:dgadmin-[INFO]:-   Total number postmaster processes missing                 = 0gpstate:dba-testsvr-1:dgadmin-[INFO]:-   Total number postmaster processes found                   = 3gpstate:dba-testsvr-1:dgadmin-[INFO]:-   Total number mirror segments acting as primary segments   = 0gpstate:dba-testsvr-1:dgadmin-[INFO]:-   Total number mirror segments acting as mirror segments    = 3gpstate:dba-testsvr-1:dgadmin-[INFO]:-----------------------------------------------------


数据加载


Oracle 11gR2:


create table record_detail( xxxxid numeric(20) not null, xxxxxxid VARCHAR(20) not null, xxxxxxxxxxserial VARCHAR(26) not null, xxxxxxdate DATE not null, xxxxxxdate DATE not null, xxxxxxyear numeric(3) not null, xxxce numeric(8,2) not null, xxxxxid VARCHAR(7) not null, xxxxxxxxxxtype VARCHAR(10), xxxxxxflag numeric(1), xxxxxxxunit CHAR(1), xxxxxxxname VARCHAR(255), xxxxxxxxxtion VARCHAR(20), xxxxxxxxtype VARCHAR(10), xxxxxxxxcode VARCHAR(10) not null, xxxxxxxdate DATE, xxxxxxxxxxid VARCHAR(6) not null, xxxxxxxxxxid VARCHAR(6) not null, xxxxdate DATE not null, xxxxxxxid VARCHAR(20), xxxxxxyear varchar(10), xxxxxxxxxxxxxxxyear varchar(10)) tablespace test_data;SQL*Loader: Release 11.2.0.4.0Copyright (c) 1982, 2011, Oracle and/or its affiliates. All rights reserved.Control File: record_detail.ctlData File: record_detail.csv Bad File: record_detail.bad Discard File: record_detail.dsc (Allow all discards)Number to load: ALLNumber to skip: 0Errors allowed: 10000Continuation: none specifiedPath used: DirectSilent options: FEEDBACKTable RECORD_DETAIL, loaded from every logical record.Insert option in effect for this table: TRUNCATETRAILING NULLCOLS option in effect...Table RECORD_DETAIL: 259761870 Rows successfully loaded. 0 Rows not loaded due to data errors. 0 Rows not loaded because all WHEN clauses were failed. 0 Rows not loaded because all fields were null. Date conversion cache disabled due to overflow (default size: 1000)Bind array size not used in direct path.Column array rows : 5000Stream buffer bytes: 256000Read buffer bytes: 1048576Total logical records skipped: 0Total logical records read: 259761870Total logical records rejected: 0Total logical records discarded: 0Total stream buffers loaded by SQL*Loader main thread: 94724Total stream buffers loaded by SQL*Loader load thread: 107011Elapsed time was: 00:51:45.51CPU time was: 00:45:25.52


Greenplum 6:


testcluster1=# \d record_detail;                 Table "public.record_detail"       Column        |            Type             | Modifiers ---------------------+-----------------------------+----------- xxxxid              | numeric(20,0)               | not null xxxxxxid            | character varying(20)       | not null xxxxxxxxxxserial    | character varying(26)       | not null xxxxxxdate          | timestamp without time zone | not null xxxxxxdate          | timestamp without time zone | not null xxxxxxyear          | numeric(3,0)                | not null xxxce               | numeric(8,2)                | not null xxxxxid             | character varying(7)        | not null xxxxxxxxxxtype      | character varying(10)       | xxxxxxflag          | numeric(1,0)                | xxxxxxxunit         | character(1)                | xxxxxxxname         | character varying(255)      | xxxxxxxxxtion       | character varying(20)       | xxxxxxxxtype        | character varying(10)       | xxxxxxxxcode        | character varying(10)       | not null xxxxxxxdate         | timestamp without time zone | xxxxxxxxxxid        | character varying(6)        | not null xxxxxxxxxxid        | character varying(6)        | not null xxxxdate            | timestamp without time zone | not null xxxxxxxid           | character varying(20)       | xxxxxxyear          | character varying(10)       | xxxxxxxxxxxxxxxyear | character varying(10)       | Distributed by: (xxxxid)
testcluster1=# COPY record_detail FROM '/home/dgadmin/record_detail.csv'testcluster1-#  WITH DELIMITER ','testcluster1-#  SEGMENT REJECT LIMIT 10 ROWS;
COPY 259761870Time: 1138945.854 ms


GBase 8a:


create table record_detail(  xxxxid              numeric(20) not null,  xxxxxxid            VARCHAR(20) not null,  xxxxxxxxxxserial    VARCHAR(26) not null,  xxxxxxdate          DATETIME not null,  xxxxxxdate          DATETIME not null,  xxxxxxyear          numeric(3) not null,  xxxce               numeric(8,2) not null,  xxxxxid             VARCHAR(7) not null,  xxxxxxxxxxtype      VARCHAR(10),  xxxxxxflag          numeric(1),  xxxxxxxunit         CHAR(1),  xxxxxxxname         VARCHAR(255),  xxxxxxxxxtion       VARCHAR(20),  xxxxxxxxtype        VARCHAR(10),  xxxxxxxxcode        VARCHAR(10) not null,  xxxxxxxdate         DATETIME,  xxxxxxxxxxid        VARCHAR(6) not null,  xxxxxxxxxxid        VARCHAR(6) not null,  xxxxdate            DATETIME not null,  xxxxxxxid           VARCHAR(20),  xxxxxxyear          varchar(10),  xxxxxxxxxxxxxxxyear varchar(10)) DISTRIBUTED BY('xxxxid');
gbase> load data infile 'file://192.168.151.234/home/gbase/record_detail.csv' INTO TABLE testdb.record_detail FIELDS TERMINATED BY ',';                      Query OK, 259761870 rows affected (Elapsed: 00:09:25.35)Task 29 finished, Loaded 259761870 records, Skipped 0 records


典型复杂查询


查询来源于实际的Oracle生产环境,具体文本略。


SQL均连续执行三次,取最优时间,以减少IO影响。


Oracle执行结果:


OPERATION_TYPE EXPEND_COUNT EXPEND_MONEY EXPEND_YEAR-------------- ------------ ------------ -----------add 962480 21121428 1068391transfer 42637 1282589 42637renew 320150 12078327 394506
Executed in 309.319 seconds


Greenplum执行结果:


operation_type | expend_count | expend_money | expend_year ----------------+--------------+--------------+------------- transfer       |        42637 |   1282589.00 |       42637 add            |       962480 |  21121428.00 |     1068391 renew          |       320150 |  12078327.00 |      394506(3 rows)
Time: 33220.760 ms


GBase 8a:


+----------------+--------------+--------------+-------------+| OPERATION_TYPE | EXPEND_COUNT | EXPEND_MONEY | EXPEND_YEAR |+----------------+--------------+--------------+-------------+| renew          |       320150 |  12078327.00 |      394506 || transfer       |        42637 |   1282589.00 |       42637 || add            |       962480 |  21121428.00 |     1068391 |+----------------+--------------+--------------+-------------+3 rows in set (Elapsed: 00:00:05.92)


小  结


特性OracleGreenPlumGBase 8a
架构单机三节点集群三节点集群
部署难易N/A简单简单
数据类型兼容性N/A兼容兼容
数据加载性能3105秒1139秒565秒
SQL语法兼容性N/A微调无需调整
复杂SQL查询性能309秒33秒6秒


经过简单的对比分析,可以看出GBase 8a对结构化数据的复杂SQL查询性能加速方面具有较大的优势。在数据生命周期管理中引入MPP类数据库产品,既可以改善现有复杂查询的性能体验,又可以维持现有的SQL使用习惯,可以作为核心业务数据库平台到大数据平台的过渡。目前已经建设完成六节点GBase 8a MPP集群,将逐步接管部分非实时查询业务,实现核心数据库平台的负载分流。



THE END




往期文章 

新闻资讯

GBASE数据实现数据层面同城双活之应用

南大通用GBase数据库在国家石油天然气管网集团有限公司的应用

南大通用GBase数据库奠基轨道交通国产信息化

生态合作

GBASE数据库2月份适配汇总

南大通用GBase 8c数据库与泛微软件完成互认证 共同搭建统一数字化办公平台

南大通用加入龙蜥社区 共同构建生态体系

培训活动

培训|“老”程序员的“新”学堂 —— 优秀学员谈GBase初体验

GBase技术征文大赛倒计时|千元大奖等你来拿!

认证培训 | 欢迎参加GBase 8a MPP CLuster数据库3月训练营

技术干货

趣说GBase 8a数据库集群(一)

趣说GBase 8a数据库集群(二)

趣说GBase 8a数据库集群(三)—之高可用特性


继续滑动看下一个
GBASE数据库
向上滑动看下一个

您可能也对以下帖子感兴趣

文章有问题?点此查看未经处理的缓存