拼多多面经分享:24个「数据分析师」岗位面试题和答案解析
点击上方“Python数据科学”,星标公众号
重磅干货,第一时间送达
01
P(A|B)表示输入词B实际为A的概率
P(B|A)表示词A错输为B的概率,可以根据AB的相似度计算(如编辑距离)
P(A)是词A出现的频率,统计获得
P(B)对于所有候选的A都一样,所以可以省去
02
方案1(没考虑到偶数个数的情况):
set @m = (select count(*)/2 from table)
select column from table order by column limit @m, 1
方案2(考虑偶数个数,中位数是中间两个数的平均):
set @index = -1
select avg(table.column)
from
(select @index:=@index+1 as index, column
from table order by column) as t
where t.index in (floor(@index/2),ceiling(@index/2))
select avg(distinct column) from table
select column, count(*) from table group by column order by column desc limit 1(emmm,好像用到count了)
03
限制树深
剪枝
限制叶节点数量
正则化项
增加数据
bagging(subsample、subfeature、低维空间投影)
数据增强(加入有杂质的数据)
早停
04
理解:朴素贝叶斯是在已知一些先验概率的情况下,由果索因的一种方法
其它:朴素的意思是假设了事件相互独立
05
能应用于非线性可分的情况
最后分类时由支持向量决定,复杂度取决于支持向量的数目而不是样本空间的维度,避免了维度灾难
具有鲁棒性:因为只使用少量支持向量,抓住关键样本,剔除冗余样本
高维低样本下性能好,如文本分类
模型训练复杂度高
难以适应多分类问题
核函数选择没有较好的方法论
06
初始化k个点
根据距离点归入k个类中
更新k个类的类中心
重复②③,直到收敛或达到迭代次数
07
SET @row_number = 0; SELECT (@row_number:=@row_number + 1) AS num FROM table
08
运营活动
产品变动
技术故障
设计漏洞(如产生可以撸羊毛的设计)
竞品
用户偏好
节假日
社会事件(如产生舆论)
09
明确需求,需求方的目的是什么
拆解任务
制定可执行方案
推进
验收
10
map阶段:读取HDFS中的文件,解析成<k,v>的形式,并对<k,v>进行分区(默认一个区),将相同k的value放在一个集合中
reduce阶段:将map的输出copy到不同的reduce节点上,节点对map的输出进行合并、排序
11
select t1.date
from Tourists as t1, Tourists as t2, Tourists as t3
on t1.id = (t2.id+1) and t2.id = (t3.id+1)
where t1.visits >100 and t2.visits>100 and t3.visits>100
12
update salary
set gender = replace('mf', gender, '')
13
每个特征列都完全匹配的情况下
最多有一个特征列不匹配,其他19个特征列都完全匹配,但哪个列不匹配未知
select A.id,
((case A.d1 when B.d1 then 1 else 0) +
(case A.d2 when B.d2 then 1 else 0) +
...) as count_match
from A left join B
on A.d1 = B.d1
14
select uid1, uid2, sum(result) as dot
from
(select t1.uid as uid1, t2.uid as uid2, t1.goods_id, t1.star*t2.star as result
from t as t1, t as t2
on t1.goods_id = t2.goods_id) as t
group by goods_id
15
select count(*) from class
group by teacher having count(*) > 1
select teacher, count(course) as count_course
from class
group by teacher
16
select knight from table
group by vote_knight
order by count(vote_knight) limit 1
17
员工表为employee,字段为id,employee_name,belong_dormitory_id,belong_department_id;
宿舍表为dormitory,字段为id,dormitory_number;
部门表为department,字段为id,department_name
select dormitory_number, department_name, count(employee_name) as count_employee
from employee as e
left join dormitory as dor on e.belong_dormitory_id = dor.id
left join department as dep on e.belong_department_id = dep.id
18
set @sum = (select sum(frequency)+1 as sum from table)
set @index = 0
set @last_index = 0
select avg(distinct t.frequecy)
from
(select @last_index := @index, @index := @index+frequency as index, frequency
from table) as t
where t.index in (floor(@sum/2), ceiling(@sum/2))
or (floor(@sum/2) > t.last_index and ceiling(@sum.2) <= t.index)
19
select t1.class, avg(distinct t1.score) as median
from table t1, table t2 on t1.id = t2.id
group by t1.class, t1.score
having sum(case when t1.score >= t2.score then 1else 0 end) >=
(select count(*)/2 from table where table.class = t1.class)
and
having sum(case when t1.score <= t2.score then 1else 0 end) >=
(select count(*)/2 from table where table.class = t1.class)
20
select count(distinct user_id) as c from table group by month(pay_time) order by c desc limit 3
select order_id, order_amount from ((select user_id, max(pay_time) as mt from table group by user_id where DATEDIFF(pay_time, NOW()) = -1 as t1) left join table as t2 where t1.user_id = t2.user_id and t1.mt == t2.pay_time)
21
select * from a
where a.user_id exsit (select user_id from b)
22
select * from table order by rand() limit 2000
set @target = 0
set @count_user = 0
select @target:=@target+10 as age_right, *
from table as t1
where t1.age >=@target-10 and t1.age < (@target)
and t1.id in
(select floor(count(*)*0.1) from table as t2
where t1.age >=@target-10 and t1.age < (@target)
order by rand() limit ??)
注:mysql下按百分比取数没有想到比较好的方法,因为limit后面不能接变量。想到的方法是先计算出每个年龄段的总数,然后计算出1%是多少,接着给每一行加一个递增+1的行标,当行标=1%时,结束
23
select visit_date, count(distince user_id)
group by visit_date
select t1.date
from table t1, table t2, ..., table t7
on t1.visit_date = (t2.visit_date+1) and t2.visit_date = (t3.visit_date+1)
and ... and t6.visit_date = (t7.visit_date+1)
24
select day(visit_date), count(distinct user_id)
from table
where user_id not in
(select user_id from table
where day(visit_date) < date_sub(visit_date, interval 7day))
# 三日留存# 先计算每个平台7日前的新用户数量select t1.plat, t1.c/t2.c as retention_3(select plat, count(distinct user_id)from tablegroup by plat, user_idhaving day(min(visit_date)) = date_sub(now(), interval 7 day)) as t1left join(select plat, count(distinct user_id) as cfrom tablegroup by user_id having count(user_id) > 0having day(min(visit_date)) = date_sub(now(), interval 7 day)and day(max(visit_date)) > date_sub(now(), interval 7 day)and day(max(visit_date)) <= date_sub(now(), interval 4day)) as t2on t1.plat = t2.plat