数据开发优化实战案例：Group by数据倾斜

Original 涤生-莫哥涤生大数据

2024-12-05

涤生数据开发优化实战之：group by 数据倾斜场景！

问题描述

在计算用户国家，平台，版本等指标时，由于用户分布是不均匀的,所以在group by时就会导致数据明显倾斜，对于数据量较小，或者非distinct计算时影响不是很明显，但是对于数据量大的任务、如留存、事件表人数计算(特别是留存)，如果没有特别处理，一般计算30天留存都是在1.5小时以上的时长。

需求描述

1. 计算充值用户30天的留存(APP留存,行为留存)
2. 筛选维度: 国家,版本,平台,用户类型(APP),消费位置; 其中国家维度需要中东地区,all,中东各国家, 其他维度均需要all；

Step1.国家汇总优化

-- =================================== 中东地区 + ALL + 各国家用户 ===================================

create temporary table tb_user_all as-- 基本用户信息select dt ,uid ,concat(nation,'_',substring(md5(uid),0,2)) nation ,concat(appkey,'_',substring(md5(uid),0,2)) appkey ,concat(ver,'_',substring(md5(uid),0,2)) ver

   ,concat(case is_new_user when 1 then '新用户' when 0 then '老用户' else 'unknown' end ,'_',substring(md5(uid),0,2)) is_new_user

from tdyh.dwd_xxx_user_uid_day_detailwhere dt>='${date-30}' and dt<='${date+30}'

union all-- ALL（国家）用户信息select dt ,uid ,concat('all','_',substring(md5(uid),0,2)) nation ,concat(appkey,'_',substring(md5(uid),0,2)) appkey ,concat(ver,'_',substring(md5(uid),0,2)) ver

   ,concat(case is_new_user when 1 then '新用户' when 0 then '老用户' else 'unknown' end ,'_',substring(md5(uid),0,2)) is_new_user

from tdyh.dwd_xxx_user_uid_day_detailwhere dt>='${date-30}' and dt<='${date+30}'

union all-- 中东地区汇总用户select a.dt ,a.uid ,concat('中东地区','_',substring(md5(uid),0,2)) nation ,concat(a.appkey,'_',substring(md5(uid),0,2)) appkey ,concat(a.ver,'_',substring(md5(uid),0,2)) ver

   ,concat(case a.is_new_user when 1 then '新用户' when 0 then '老用户' else 'unknown' end,'_',substring(md5(uid),0,2)) is_new_user

from ( select dt,uid,nation,appkey,ver,is_new_user from tdyh.dwd_xxx_user_uid_day_detail

      where dt>='${date-30}' and dt<='${date+30}' and nation not in ('印度尼西亚','印度','越南','泰国','巴西')

) ainner join ( select area, nation from tdyh.dim_xxx_area_nation_config where area='中东地区' ) bon a.nation=b.nation;

1. 针对不同国家信息在上层进行汇总，便于后续写代码，无需多次 union all计算
> 存在问题：当后续将日活表作为右表关联时，无法统计unknown,且需要将其设置为all, 即 nvl(t0.nation,'all')；
2. 对国家，版本等维度均加上 substring(md5(uid),0,2) 后缀，即将已有维度按uid打散；

存在问题：a. 此处用 _ 分割，有可能在版本中出现 _ 版本，但这部分版本会很少，且属于脏数据所以可以忽略，其他维度基本不会出现该情况 ; b. md5是以英文 + 数字组成，所以取前两位理论会组成（26+10）^2 = 1296，但实测生成的新为应为 ${md5_N} * ${old_dim},如appkey=256*2=512个维度

3. 当关联中东地区汇总时，可以将非中东地区的数据剔除，如：印度,印度尼西亚,巴西,越南,泰国等。

Step2.原始数据关联维度

create temporary table tb_charge_user asselect t0.dt ,t0.uid ,nvl(t1.ver,'unknown') ver ,nvl(t1.appkey,'unknown') sys ,nvl(t1.nation,'all') nation ,nvl(t1.is_new_user,'unknown') is_new_user ,case t0.opt_type when '1' then '我的钱包' when '2' then '游戏内' when '3' then '语音房点击充值' when '4' then '语音房钻石不够' when 'all' then 'all' else '其他' end opt_typefrom ( select dt,uid,opt_type from tdyh.dim_xxx_charge_order_success_user_day_detail where dt>='${date-30}' and dt<='${date}' ) t0left join ( select dt,uid,nation,appkey,ver,is_new_user from tb_user_all where dt>='${date-30}' and dt<='${date}' ) t1on t0.uid=t1.uid and t0.dt=t1.dt;

1. 关联国家时,未关联到的需要用all代替,而非unknown, 导致问题无法计算 unknown 部分数据。

Step3.计算留存（app留存为例）

-- 设置小表与大表关联时的reduce数set mapred.reduce.tasks=100;select dt ,'app留存' as res_type ,nation ,nvl(sys,'all') sys ,nvl(ver,'all') ver ,nvl(is_new_user,'all') is_new_user ,opt_type ,sum(d0_dau) dau ,round(sum(d1_dau)/sum(d0_dau),4) d1 ,... ,round(sum(d30_dau)/sum(d0_dau),4) d30from ( select dt ,split(nation,'_')[0] nation ,split(sys,'_')[0] sys ,split(ver,'_')[0] ver ,split(is_new_user,'_')[0] is_new_user ,opt_type ,count(distinct if(dt_num=0,uid1,null)) d0_dau ,... ,count(distinct if(dt_num=30,uid1,null)) d30_dau from ( select a.dt ,a.ver ,a.sys ,a.nation ,a.is_new_user ,a.opt_type

               ,datediff(concat(substr(b.dt,1,4),'-',substr(b.dt,5,2),'-',substr(b.dt,7,2)),concat(substr(a.dt,1,4),'-',substr(a.dt,5,2),'-',substr(a.dt,7,2))) dt_num

,b.uid uid1 from ( select dt,uid,ver,sys,nation,is_new_user,opt_type from tb_charge_user ) a left join ( select dt,uid from tb_user_all where dt>='${date-30}' and dt<='${date+30}' group by dt,uid ) b on a.uid=b.uid ) t group by dt,opt_type,is_new_user,sys,ver,nation ) ttgroup by dt,opt_type,is_new_user,sys,ver,nationgrouping sets ((dt,opt_type,nation,is_new_user), (dt,opt_type,nation,sys), (dt,opt_type,nation,ver), (dt,opt_type,nation,is_new_user,sys), (dt,opt_type,nation,is_new_user,ver), (dt,opt_type,nation,sys,ver), (dt,opt_type,nation,is_new_user,sys,ver), (dt,opt_type,nation))

1. 用户数据表与每日APP活跃用户通过uid关联计算, 此处数据明显膨胀,简单理解每一个用户最多膨胀30倍(30天留存)；

2. 由于数据膨胀,且在计算用户时没有distinct,则肯定会出现数据倾斜(此处中各维度可以保证uid唯一,可以去掉distinct)；

3. 最里层计算各基础指标的用户数, 因为各维度采用 md5(uid)+${dim},所以不同分区的数据肯定不会出现一样的情况,全局distinct可以写成各分区分别distinct再sum汇总,且不要在内部grouping sets；

4. 外层对计算好的结果再进行汇总计算, 此处再 grouping sets；

5. 在内部进行left join时，右表一定要保证数据量最小，本样例 dt、uid并不唯一，所以进行group by；

6. 有时在left join时，生成的reduce数量很小，所以此处可以强制设置reduce的个数，因为此处不会出现数据倾斜问题，故reduce越大速度越快，一般100个即可，此处设置不会影响上部分SQL,但会影响下部分SQL；

优化结果: 由原来2小时20分最后变成半个小时；

优化点

1. 数据倾斜；
2. 计算留存时left join右表进行group by；

经验总结

本优化主要解决大数据量数据倾斜问题,且在count中使用 distinct，而对未使用distinct的任务优化效果并不明显，以及对读写耗时型任务优化效果也不是很明显。

涤生大数据往期精彩推荐