其他
Hive高阶分析函数
示例:实现同一数据集的多重group by操作。事实上GROUPING SETS是多个GROUP BY进行UNION ALL操作的简单表达。
数据:
+----------+----------+---+
|year_month|date |pv |
+----------+----------+---+
|2015-04 |2015-04-12|10 |
|2015-03 |2015-03-10|2 |
|2015-03 |2015-03-10|9 |
|2015-04 |2015-04-12|4 |
|2015-04 |2015-04-13|5 |
|2015-04 |2015-04-13|6 |
|2015-04 |2015-04-16|7 |
|2015-03 |2015-03-12|3 |
+----------+----------+---+
SQL示例:
SELECT
year_month,
date,
SUM(pv) AS pv,
GROUPING__ID
FROM data
GROUP BY year_month,date
GROUPING SETS (year_month,date)
ORDER BY GROUPING__ID;
结果:
+----------+----------+----+------------+
|year_month|date |pv |grouping__id|
+----------+----------+----+------------+
|2015-03 |null |14 |1 |
|2015-04 |null |32 |1 |
|null |2015-04-12|14 |2 |
|null |2015-03-10|11 |2 |
|null |2015-04-16|7 |2 |
|null |2015-04-13|11 |2 |
|null |2015-03-12|3 |2 |
+----------+----------+----+------------+
等价于:
SELECT year_month,NULL,SUM(pv) AS pv,1 AS GROUPING__ID FROM data GROUP BY year_month
UNION ALL
SELECT NULL,date, SUM(pv) AS pv,2 AS GROUPING__ID FROM data GROUP BY date;
GROUPING__ID,表示结果属于哪一个分组集合。
根据GROUP BY的维度的所有组合进行聚合。
SELECT
year_month,
date,
SUM(pv) AS pv,
GROUPING__ID
FROM data
GROUP BY year_month,date
WITH CUBE
ORDER BY GROUPING__ID;
结果:
+----------+----------+----+------------+
|year_month|date |pv |grouping__id|
+----------+----------+----+------------+
|2015-04 |2015-04-12|14 |0 |
|2015-04 |2015-04-16|7 |0 |
|2015-03 |2015-03-10|11 |0 |
|2015-03 |2015-03-12|3 |0 |
|2015-04 |2015-04-13|11 |0 |
|2015-04 |null |32 |1 |
|2015-03 |null |14 |1 |
|null |2015-03-10|11 |2 |
|null |2015-04-12|14 |2 |
|null |2015-04-13|11 |2 |
|null |2015-03-12|3 |2 |
|null |2015-04-16|7 |2 |
|null |null |46 |3 |
+----------+----------+----+------------+
等价于:
SELECT NULL,NULL, SUM(pv) AS pv,0 AS GROUPING__ID FROM data
UNION ALL
SELECT year_month,NULL, SUM(pv) AS pv,1 AS GROUPING__ID FROM data GROUP BY year_month
UNION ALL
SELECT NULL,date, SUM(pv) AS pv,2 AS GROUPING__ID FROM data GROUP BY date
UNION ALL
SELECT year_month,date, SUM(pv) AS pv,3 AS GROUPING__ID FROM data GROUP BY year_month,date;
CUBE的子集,以最左侧的维度为主,从该维度进行层级聚合。
--比如,以year_month维度进行层级聚合:
SELECT
year_month,
date,
SUM(pv) AS pv,
GROUPING__ID
FROM data
GROUP BY year_month,date
WITH ROLLUP
ORDER BY GROUPING__ID;
结果:
+----------+----------+----+------------+
|year_month|date |pv |grouping__id|
+----------+----------+----+------------+
|2015-03 |2015-03-10|11 |0 |
|2015-03 |2015-03-12|3 |0 |
|2015-04 |2015-04-13|11 |0 |
|2015-04 |2015-04-12|14 |0 |
|2015-04 |2015-04-16|7 |0 |
|2015-04 |null |32 |1 |
|2015-03 |null |14 |1 |
|null |null |46 |3 |
+----------+----------+----+------------+
可以实现这样的上钻过程:月天的PV -> 月的PV -> 总PV。
--把year_month和date调换顺序,则以date维度进行层级聚合:
SELECT
date,
year_month,
SUM(pv) AS pv,
GROUPING__ID
FROM data
GROUP BY date,year_month
WITH ROLLUP
ORDER BY GROUPING__ID
结果:
+----------+----------+----+------------+
|date |year_month|pv |grouping__id|
+----------+----------+----+------------+
|2015-04-13|2015-04 |11 |0 |
|2015-04-12|2015-04 |14 |0 |
|2015-04-16|2015-04 |7 |0 |
|2015-03-10|2015-03 |11 |0 |
|2015-03-12|2015-03 |3 |0 |
|2015-04-12|null |14 |1 |
|2015-03-10|null |11 |1 |
|2015-04-16|null |7 |1 |
|2015-04-13|null |11 |1 |
|2015-03-12|null |3 |1 |
|null |null |46 |3 |
+----------+----------+----+------------+
上面介绍的是Hive中常用的高阶分析函数,下面介绍一些其他实用函数:
将分组内的数据放入到一个集合中,具有去重的功能。
--统计每个用户具体哪些天访问过
select
user_id,
collect_set(visit_date) over(partition by user_id) as visit_date_set
from wedw_tmp.tmp_url_info;
collect_list:和collect_set一样,但是没有去重功能。
数组内排序,通常结合collect_set或者collect_list使用。如collect_list为例子,可以发现日期并不是按照顺序组合的,这里有需求需要按照时间升序的方式来组合。
--按照时间升序来组合
select
user_id,
sort_array(collect_list(visit_date) over(partition by user_id)) as visit_date_set
from wedw_tmp.tmp_url_info;
用于解析url相关的参数,直接上sql。
select
visit_url,
parse_url(visit_url, 'HOST') as url_host, --解析host
parse_url(visit_url, 'PATH') as url_path, --解析path
parse_url(visit_url, 'QUERY') as url_query,--解析请求参数
parse_url(visit_url, 'REF') as url_ref, --解析ref
parse_url(visit_url, 'PROTOCOL') as url_protocol, --解析协议
parse_url(visit_url, 'AUTHORITY') as url_authority,--解析author
parse_url(visit_url, 'FILE') as url_file, --解析filepath
parse_url(visit_url, 'USERINFO') as url_user_info --解析userinfo
from wedw_tmp.tmp_url_info;
推荐文章: