一文搞定Hive函数,建议收藏!
The following article is from 数据猿温大大 Author by Welsh_wen
导读:工欲善其事必先利其器,hive是我们与大数据打交道的工具,所以我们需要熟练使用hive的一些函数,这样后面的数据开发就得心应手啦。
废话不多说直接上干货。
目录如下
1、数据准备
2、字符函数
3、聚合函数
4、数学函数
5、时间函数
6、窗口函数
7、条件函数
1 数据准备
首先我们创建一张收集用户流量页面的SQL表,同学们可以直接将其在mysql下创建:
/*
SQLyog Ultimate v12.09 (64 bit)
MySQL - 5.7.16-log : Database - dmall_ware_gaea
*********************************************************************
*/
/*!40101 SET NAMES utf8 */;
/*!40101 SET SQL_MODE=''*/;
/*!40014 SET @OLD_UNIQUE_CHECKS=@@UNIQUE_CHECKS, UNIQUE_CHECKS=0 */;
/*!40014 SET @OLD_FOREIGN_KEY_CHECKS=@@FOREIGN_KEY_CHECKS, FOREIGN_KEY_CHECKS=0 */;
/*!40101 SET @OLD_SQL_MODE=@@SQL_MODE, SQL_MODE='NO_AUTO_VALUE_ON_ZERO' */;
/*!40111 SET @OLD_SQL_NOTES=@@SQL_NOTES, SQL_NOTES=0 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/`dmall_ware_gaea` /*!40100 DEFAULT CHARACTER SET utf8 */;
USE `dmall_ware_gaea`;
/*Table structure for table `user_view` */
DROP TABLE IF EXISTS `user_view`;
CREATE TABLE `user_view` (
`site_id` char(4) DEFAULT NULL,
`user_name` char(11) DEFAULT NULL,
`pv` int(4) DEFAULT NULL,
`dt` char(16) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
/*Data for the table `user_view` */
insert into `user_view`(`site_id`,`user_name`,`pv`,`dt`) values ('A10','Sone',2,'20200801'),('A10','welsh',3,'20200801'),('A10','Sone',16,'20200801'),('A10','Albert',20,'20200802'),('A10','GG',32,' 20200801'),('A20','Albert',42,' 20200801'),('A20','welsh',10,'20200801'),('A20','welsh',15,'20200802'),('A10','Albert',20,'20200801'),('A20','Sone',NULL,'20200802'),('A20','welsh',15,'20200802'),('A20','Albert',10,'20200802'),('A10','Jojo',16,'20200802'),('A20','welsh',35,'20200803'),('A10','welsh',33,'20200803'),('A20','Sone',66,'20200803'),('A20','Jojo',15,'20200802'),('A10','Albert',53,'20200803'),('A10','Jojo',12,'20200803'),('A20','GG',35,'20200803'),('A20','J.K',30,'20200803');
/*!40101 SET SQL_MODE=@OLD_SQL_MODE */;
/*!40014 SET FOREIGN_KEY_CHECKS=@OLD_FOREIGN_KEY_CHECKS */;
/*!40014 SET UNIQUE_CHECKS=@OLD_UNIQUE_CHECKS */;
/*!40111 SET SQL_NOTES=@OLD_SQL_NOTES */;
预览:
site_id | user_name | pv | dt |
---|---|---|---|
A10 | Sone | 2 | 20200801 |
A10 | welsh | 3 | 20200801 |
A10 | Sone | 16 | 20200801 |
A10 | Albert | 20 | 20200802 |
A10 | GG | 32 | 20200801 |
A20 | Albert | 42 | 20200801 |
A20 | welsh | 10 | 20200801 |
A20 | welsh | 15 | 20200802 |
A10 | Albert | 20 | 20200801 |
A20 | Sone | \N | 20200802 |
A20 | welsh | 15 | 20200802 |
A20 | Albert | 10 | 20200802 |
A10 | Jojo | 16 | 20200802 |
A20 | welsh | 35 | 20200803 |
A10 | welsh | 33 | 20200803 |
A20 | Sone | 66 | 20200803 |
A20 | Jojo | 15 | 20200802 |
A10 | Albert | 53 | 20200803 |
A10 | Jojo | 12 | 20200803 |
A20 | GG | 35 | 20200803 |
A20 | J.K | 30 | 20200803 |
2 字符函数
说明:对字符进行拼接、截取、去空格
枚举:concat、concat_ws、substring、trim、lpad、rpad、split、find_in_set
2.1 concat
说明:拼接字符
SELECT CONCAT(user_name,dt) FROM user_view
# 输出:
"welsh20200801"
"Albert20200801"
...
2.2 concat_ws
说明:拼接字符且分割
SELECT CONCAT_WS(':',user_name,dt) FROM user_view
# 输出:
"welsh:20200801"
"Albert:20200801"
...
2.3 substring
说明:截取字符串
用法:subString(col, intstart, int len)
SELECT CONCAT_WS(':',user_name,dt) FROM user_view
# 输出:
"welsh:20200801"
"Albert:20200801"
...
2.4 trim
说明:去掉两边的空格
用法:trim(col)
select trim(' welsh ')
# 输出:
"welsh"
2.5 repeat
说明:复制函数
用法:repeat(string str, int n)
select repeat('welsh',2)
# 输出:
"welshwelsh"
2.6 lpad
说明:填充函数,默认从左开始补充
用法:lpad(string str, int len, string pad)
select lpad('welsh',10, 'ddd')
# 输出:
"dddddwelsh"
2.7 rpad
说明:右补充函数,默认从右开始补充
用法:rpad(string str, int len, string pad)
select rpad('welsh',10, 'ddd')
# 输出:
"welshddddd"
2.8 split:
说明:分割函数,返回list
用法:split(string str, stringpat)
select split('welshUAlbertUGG','U')
# 输出:
["welsh","Albert","GG"]
2.9 find_in_set:
说明:查找函数,返回首次出现该字符位置
用法:find_in_set(string str, string strList)
select find_in_set('welsh','Albert,and,welsh,go,to,Swimming')
# 输出:
3
3 聚合函数
说明:对数据汇总、相加、平均、最大值、最小值
枚举:count、sum、avg、min、max、collect_list、collect_set
3.1 count
说明:汇总,若使用distinct则是去重后再汇总
用法:count(*),count(distint col)
# count统计包含null值总数
select count(*) from user_view
# count 不含null值总数
select count(pv) from user_view
# count(distinct col)统计去重总数
select count(distinct user_name) from user_view
3.2 sum
说明:相加,若使用distinct则是去重后再汇总
用法:sum(*),sum(distint col)
# sum 统计总值
select SUM(pv) FROM user_view
# sum 统计去重后总值
SELECT SUM(DISTINCT pv) FROM user_view
3.3 avg
说明:平均值,若使用distinct则是去重后在求平均值
用法:avg(*),avg(distint col)
# avg平均值
SELECT avg(pv) FROM user_view
# avg(distinct pv)去重后平均值
SELECT avg(distinct pv) FROM user_view
# min最小值
SELECT min(pv) FROM user_view
# max最大值
SELECT max(pv) FROM user_view
3.4 collect_list
说明:将字段组装成一个list,没有去重
用法:collect_list(col)
select collect_list(user_name) from dmall_gaea_analysis.user_view;
# 输出:
["Sone","welsh","Sone","Albert","GG","Albert","welsh","welsh","Albert","Sone","welsh","Albert","Jojo","welsh","welsh","Sone","Jojo","Albert","Jojo","GG","J.K"]
3.4 collect_set
说明:将字段组装成一个list,去重
用法:collect_set(col)
select collect_set(user_name) from dmall_gaea_analysis.user_view;
# 输出:
["Sone","welsh","Albert","GG","Jojo","J.K"]
4 数学函数
说明:对数据球方差、标准偏差、样本标准层
枚举:variance、stddev_pop、stddev_samp
# variance方差
SELECT variance(pv) FROM user_view
# stddev_pop标准偏差
SELECT stddev_pop(pv) FROM user_view
# stddev_samp样本标准偏差
SELECT stddev_samp(pv) FROM user_view
5 时间函数
说明:时间获取、格式化、2个时间相差、时间增加、时间减少
枚举:unix_timestamp、FROM_UNIXTIME、to_date、weekofyear、weekofyear、datediff、date_add、date_sub
5.1 unix_timestamp
说明:获取当前时间戳
用法:unix_timestamp()
SELECT unix_timestamp()
# 输出:
1600226901
5.2 FROM_UNIXTIME
说明:格式化时间戳,通常与unix_timestamp()一起用,获取当前时间
用法:FROM_UNIXTIME()
SELECT FROM_UNIXTIME(unix_timestamp(),'yyyyMMdd')
# 输出:
20200916
5.3 to_date
说明:格式化时间
用法:to_date()
SELECT to_date('2020-09-10 10:03:01') as now_time
# 输出:
2020-09-10
5.4 weekofyear
说明:返回当前周
用法:weekofyear()
SELECT weekofyear('2020-09-08 10:03:01') as now_time
# 输出:
37
5.5 datediff
说明:日期相差天数
用法:datediff()
select datediff('2020-09-09','2020-08-08')
# 输出:
32
5.6 date_add
说明:日期增加
用法:date_add()
select date_add('2020-09-08',10) as date_time
# 输出:
2020-09-18
5.7 date_sub
说明:日期减少N天
用法:date_sub()
select date_sub('2020-09-08',10) as date_time
# 输出:
2020-08-29
6 窗口函数
说明:常用于对现有数据的排名
枚举:row_number、RANK、DENSE_RANK
row_number(): 分组后,从1开始排名,遇到相同值按照表中记录的顺序进行排列
RANK():分组后,从1开始排名,遇到相同值会在名次中留下空位
DENSE_RANK():分组后,从1开始排名,遇到相同值不会留下空位
select
user_name,pv,
row_number() over (partition by site_id,dt order by pv desc) as ord_1,
RANK() over (partition by site_id,dt order by pv desc) as ord_2,
DENSE_RANK() over (partition by site_id,dt order by pv desc) as ord_3
from dmall_gaea_analysis.user_view where dt='20200803' and site_id='A20'
输出:
user_name | pv | ord_1 | ord_2 | ord_3 |
---|---|---|---|---|
Sone | 66 | 1 | 1 | 1 |
welsh | 35 | 2 | 2 | 2 |
GG | 35 | 3 | 2 | 2 |
J.K | 30 | 4 | 4 | 3 |
结论:由于welsh 与 GG 的pv值一样,所以根据规则排名如下
row_number() 排名:1234
RANK() 排名:1224
DENSE_RANK() 排名:1223
7 条件函数
说明:常用于对null进行处理
枚举:case
select
user_name,
case when pv is null then 0 else pv end as pv
from dmall_gaea_analysis.user_view where pv is null
# 输出:
user_name pv
Sone 0
今天的分享就到这里,谢谢大家。
有用的话,文末分享、点赞、在看~
干货推荐:
2020-11-11
2020-11-10
2020-11-09
社群福利:
关于我们:
入群请联系小助手:iom1128『紫霞仙子』
!关注不迷路~ 各种福利、资源定期分享!
👇文末戳阅读原文
回复 【hive】:即可获得Hive编程指南资料,让你轻松掌握它;
回复【06】:即可获取经典大数据电子书籍,大厂的橄榄枝正向你招手!