算法人必懂的进阶SQL知识,4道面试常考题
以下文章来源于小小挖掘机 ,作者石晓文
作者 | 石晓文
来源|小小挖掘机(ID:wAlsjwj)
近期在不同群里有小伙伴们提出了一些在面试和笔试中遇到的Hive SQL问题,Hive作为算法工程师的一项必备技能,在面试中也是极有可能被问到的,所以有备无患,本文将对这四道题进行详细的解析,还是有一定难度的,希望你看完本文能够有所收获。
1、多列转多行
第一道题目是这样的:
假设现有一张Hive表,
元数据格式为:
字段:
id stirng
tim string
数据格式如下:
a,b,c,d 2:00,3:00,4:00,5:00
f,b,c,d 1:10,2:20,3:30,4:40
需要变成:
a 2:00
b 3:00
c 4:00
d 5:00
a,b,c,d 2:00,3:00,4:00,5:00
f,b,c,d 1:10,2:20,3:30,4:40
select
id,tim,single_tim
from
default.a1
lateral view explode(split(tim,',')) t as single_tim
这样一行变成了16行,而我们仅仅需要的是其中能够对齐下标的四行。所以在进行explode的时候,我们期望不仅仅能够能够获得数组里的每个值,还希望能够得到其对应的下标,这样在对两列同时进行explode的时候,保留数组下标相同的四行就可以了。这里我们会用到posexplode函数。
select
id,tim,single_id_index,single_id
from
default.a1
lateral view posexplode(split(id,',')) t as single_id_index,single_id
select
id,tim,single_id,single_tim
from
default.a1
lateral view posexplode(split(id,',')) t as single_id_index,single_id
lateral view posexplode(split(tim,',')) t as single_tim_index,single_tim
where
single_id_index = single_tim_index
2、排序后相邻两行均值
第二题的原始数据如下:
要求如下:
lag(字段名,N) over(partition by 分组字段 order by 排序字段 排序方式)
lead(字段名,N) over(partition by 分组字段 order by 排序字段 排序方式)
select
year,chr,if(pre_val is null,val,(val + pre_val) / 2.0) as avg_val
from
(
select
year,chr,val,
lag(val,1) over(partition by year order by chr asc) as pre_val
from
default.a2
) a
3、获取字符串索引列表
1011
0101
=> 取到每一行中1所对应的索引列表,索引从1开始
0101 2,4
1011 1,3,4
select
id,stri,concat_ws(',',collect_list(index)) as indices
from
(
select
id,stri,chr,cast(index + 1 as string) as index
from
default.abcg
lateral view posexplode(split(stri,'')) ids as index,chr
where
chr = '1'
) a
group by
id,stri
4、分块排序
最后一题感觉是比较有难度的一道题目:
2014,1
2015,1
2017,0
2018,0
2019,1
2020,1
2021,1
2022,0
2023,0
=>
2014,1,1
2015,1,2
2017,0,1
2018,0,2
2019,1,1
2020,1,2
2021,1,3
2022,0,1
2023,0,2
select year,
num,
row_number() over(partition by min_year order by year asc) as new_rank
from
(
select year,
base.num as num,
min_year,
row_number() over(partition by base.year order by min_year desc) as rank
from (
select *
from default.a3
) base
inner join (
select min_year,
num,
pre_num
from (
select year as min_year,
num,
lag(num,1) over(order by year) as pre_num
from default.a3
) a
where num!=pre_num
or pre_num is null
) min_year
on base.num = min_year.num
where base.year >= min_year.min_year
) cc
where rank = 1
order by year
select year as min_year,
num,
lag(num,1) over(order by year) as pre_num
from default.a3
select min_year,
num,
pre_num
from (
select year as min_year,
num,
lag(num,1) over(order by year) as pre_num
from default.a3
) a
where num!=pre_num
or pre_num is null
select year,
num,
min_year
from
(
select year,
base.num as num,
min_year,
row_number() over(partition by base.year order by min_year desc) as rank
from (
select *
from default.a3
) base
inner join (
select min_year,
num,
pre_num
from (
select year as min_year,
num,
lag(num,1) over(order by year) as pre_num
from default.a3
) a
where num!=pre_num
or pre_num is null
) min_year
on base.num = min_year.num
where base.year >= min_year.min_year
) cc
where rank = 1
order by year
◆
精彩推荐
◆
5大必知的图算法,附Python代码实现
如何用爬虫技术帮助孩子秒到心仪的幼儿园(基础篇)
Python传奇:30年崛起之路 2019年最新华为、BAT、美团、头条、滴滴面试题目及答案汇总
阿里巴巴杨群:高并发场景下Python的性能挑战