Pinduoduo Interview Notes: 24 "Data Analyst" Interview Questions with Answers and Explanations

Python数据科学 2020-09-13


Author: 稻蛙

Source: CSDN

https://blog.csdn.net/u013382288/article/details/80450360

This article is a repost; if it infringes on any rights, please contact us to have it removed

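Q: State Bayes' theorem and explain an application scenario

P(A|B) = P(B|A) * P(A) / P(B)

Take search query correction as an example. Let A be the correct word and B the word that was typed. Then:

P(A|B) is the probability that the typed word B was meant to be A
P(B|A) is the probability that word A gets mistyped as B; it can be estimated from the similarity between A and B (for example, edit distance)
P(A) is the frequency of word A, obtained from corpus statistics
P(B) is the same for every candidate A, so it can be dropped when ranking candidates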
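The ranking described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the word counts are made up to stand in for a real query log, and the error model P(B|A) = 0.1 ** edit_distance is an arbitrary choice, not something the original answer specifies.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


# Hypothetical word frequencies; a real system would count them from logs.
WORD_COUNTS = Counter({"python": 50, "pylon": 2, "typhoon": 5})


def correct(typed: str, counts=WORD_COUNTS) -> str:
    """Return the candidate A that maximizes P(B|A) * P(A); P(B) is dropped."""
    total = sum(counts.values())

    def score(word: str) -> float:
        prior = counts[word] / total                    # P(A)
        likelihood = 0.1 ** edit_distance(word, typed)  # assumed P(B|A)
        return prior * likelihood

    return max(counts, key=score)
```

With these made-up counts, `correct("pyhton")` picks "python": two candidates are equally close in edit distance, and the prior P(A) breaks the tie.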

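Q: How do you write SQL to get the median, mean and mode (other than by using count)?

1. Median

Option 1 (does not handle an even number of rows). Here tbl and col are placeholders for the real table and column names; MySQL only accepts a variable after LIMIT inside a prepared statement:

SET @m = (SELECT COUNT(*) DIV 2 FROM tbl);
PREPARE stmt FROM 'SELECT col FROM tbl ORDER BY col LIMIT ?, 1';
EXECUTE stmt USING @m;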

Option 2 (handles an even number of rows, where the median is the average of the two middle values):

SET @idx = -1;
SELECT AVG(t.col)
FROM (SELECT @idx := @idx + 1 AS idx, col FROM tbl ORDER BY col) AS t
WHERE t.idx IN (FLOOR(@idx / 2), CEILING(@idx / 2));

2. Mean

SELECT AVG(col) FROM tbl;

(AVG(DISTINCT col) would drop duplicate values first and give a different result.)

3. Mode

SELECT col, COUNT(*) AS cnt FROM tbl GROUP BY col ORDER BY cnt DESC LIMIT 1;

(Admittedly, this one does use count.)

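Q: How do you keep a decision tree from overfitting?

Limit the tree depth
Prune the tree
Limit the number of leaf nodes
Add a regularization term
Add more data
Bagging (subsampling rows, subsampling features, projection into a lower-dimensional space)
Data augmentation (adding noisy data)
Early stopping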

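Q: Explain your understanding of naive Bayes

Understanding: naive Bayes reasons from effect back to cause. Given known prior probabilities, it infers the most likely cause (the class) from the observed evidence (the features)

Also: "naive" refers to the assumption that the features are independent of one another given the class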
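Written out, the independence assumption lets the joint likelihood factor into per-feature terms. The notation below (class y, features x_1 to x_n) is introduced here for illustration and does not appear in the original answer:

```latex
P(y \mid x_1, \dots, x_n) \;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y),
\qquad
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
```

The denominator P(x_1, ..., x_n) is the same for every class, so it can be ignored when choosing the most probable class.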

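Q: What are the advantages of SVM?

Advantages:

Works on data that is not linearly separable (through kernel functions)
The final classifier is determined by the support vectors, so its complexity depends on the number of support vectors rather than on the dimensionality of the sample space, which avoids the curse of dimensionality
Robust: only a small number of support vectors are used, which captures the key samples and discards redundant ones
Performs well with high-dimensional data and few samples, as in text classification

Drawbacks:

Training is computationally expensive
Hard to adapt to multi-class problems
There is no good methodology for choosing the kernel function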

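Q: How does K-means work?

1. Initialize k points as cluster centers
2. Assign each point to one of the k clusters according to its distance to the centers
3. Update the center of each of the k clusters
4. Repeat steps 2 and 3 until convergence or until the maximum number of iterations is reached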
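The loop above can be sketched in plain Python. This is a minimal illustration, not a production implementation: it assumes squared Euclidean distance and simply takes the first k points as the initial centers, neither of which the original answer specifies.

```python
def kmeans(points, k, max_iter=100):
    """Plain k-means on a list of equal-length numeric tuples."""
    # Assumption: seed the centers with the first k points.
    centers = [tuple(p) for p in points[:k]]
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assignment step: put each point in the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[dists.index(min(dists))].append(p)
        # Update step: move each center to the mean of its cluster.
        new_centers = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
            if cluster else centers[i]  # keep the old center for an empty cluster
            for i, cluster in enumerate(clusters)
        ]
        if new_centers == centers:  # converged: no center moved
            break
        centers = new_centers
    return centers, clusters
```

For example, `kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], 2)` separates the two points near the origin from the two points near (10, 10).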

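Q: Answer a SQL question aloud (it requires a row number)

Generating a row number in MySQL (tbl is a placeholder table name; MySQL 8.0 and later also provide the ROW_NUMBER() window function):

SET @row_number = 0;
SELECT (@row_number := @row_number + 1) AS num FROM tbl;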

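Q: Business case: how would you analyze a drop in next-day retention?

With a business problem, the key is to ask the right question first, and only then to break the problem down and solve it.

1. Two-layer model

Segment by user profile, channel, product and step in the user journey to pin down exactly where next-day retention dropped

2. Metric breakdown

Next-day retention rate = Σ users retained the next day / users acquired on the day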
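The segmentation and the metric breakdown can be combined in a small helper that computes the rate per segment and reports which segment fell the most between two periods. This is a sketch under assumptions: the row layout (segment, new_users, retained_next_day) and both function names are hypothetical.

```python
def retention_by_segment(rows):
    """rows: iterable of (segment, new_users, retained_next_day).

    Returns {segment: next-day retention rate}, summing rows per segment.
    """
    totals = {}
    for segment, new_users, retained in rows:
        acc = totals.setdefault(segment, [0, 0])
        acc[0] += new_users
        acc[1] += retained
    return {seg: kept / new for seg, (new, kept) in totals.items() if new}


def biggest_drop(before, after):
    """Segment whose retention rate fell the most between two periods."""
    drops = {seg: before[seg] - after[seg] for seg in before if seg in after}
    return max(drops, key=drops.get)
```

Feeding it one set of rows per period, for instance per channel, shows whether the overall drop is concentrated in a single segment or spread across all of them.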

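3. Root-cause analysis

Internal:

Operations campaigns
Product changes
Technical failures
Design loopholes (for example, a design that invites bonus hunting)

External:

Competitors
User preferences
Holidays
Social events (for example, ones that stir up public opinion)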

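Q: What is your general approach to handling a request? Give an example

Clarify the request: what is the requester actually trying to achieve?
Break the task down
Draw up an actionable plan
Drive it forward
Review and sign off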

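Q: How do Hadoop and MapReduce work?

1. How Hadoop works

Files are stored in a distributed fashion on HDFS and computation is decomposed with MapReduce; the rest is skipped for now

2. How MapReduce works

Map phase: read files from HDFS, parse them into <k,v> pairs and partition the pairs (one partition by default); values that share the same k are collected into one set

Reduce phase: the map output is copied to the reduce nodes, where each node merges and sorts what it receives and then applies the reduce function to the values of each key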
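The two phases can be imitated in a single process with a word count, the usual introductory example. This is only a sketch of the data flow: the three function names are hypothetical, and a real Hadoop job distributes each phase across machines.

```python
from collections import defaultdict


def map_phase(lines):
    """Parse each input line into <word, 1> pairs."""
    for line in lines:
        for word in line.split():
            yield word, 1


def shuffle(pairs, num_partitions=1):
    """Partition the pairs by key and collect the values of each key together."""
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[hash(key) % num_partitions][key].append(value)
    return partitions


def reduce_phase(partitions):
    """Within each partition, walk the keys in sorted order and sum their values."""
    result = {}
    for part in partitions:
        for key in sorted(part):
            result[key] = sum(part[key])
    return result


counts = reduce_phase(shuffle(map_phase(["a b a", "b c"]), num_partitions=2))
```

Which partition a key lands in does not change the final counts, because every occurrence of the same key is routed to the same partition.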

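Q: A database table Tourists records the number of visitors to a scenic spot on each day of July:

id date visits
1 2017-07-01 100
……

Conveniently, the id column equals the day of the month. Select the dates on which more than 100 visitors came for three consecutive days. For the example above the output is:

date
2017-07-01
……

-- t1 is the last day of a run of three consecutive days
SELECT t1.date
FROM Tourists AS t1
JOIN Tourists AS t2 ON t1.id = t2.id + 1
JOIN Tourists AS t3 ON t2.id = t3.id + 1
WHERE t1.visits > 100 AND t2.visits > 100 AND t3.visits > 100;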

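Q: In a salary table salary, the gender values male (m) and female (f) for the month 2017-07 turned out to be swapped. Fix the data with a single UPDATE statement. For example, the table contains:

id name gender salary month
1 A m 1000 2017-06
2 B f 1010 2017-06

-- REPLACE('mf', 'm', '') yields 'f' and REPLACE('mf', 'f', '') yields 'm'
UPDATE salary
SET gender = REPLACE('mf', gender, '')
WHERE month = '2017-07';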

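Q: Table A has 21 columns: the first is id and the rest are feature columns named d1 to d20, with 100,000 rows in total. Another table B, called the pattern table, has the same structure as A and 50,000 rows. Find the rows in A whose features match a pattern in B, and record the corresponding ids.

A row counts as a match in either of two cases:

Every feature column matches exactly
At most one feature column differs while the other 19 match exactly, but which column differs is unknown

-- Joining on a single column such as d1 would miss rows where d1 is the one mismatch,
-- so every pair of rows is compared; this is expensive at these table sizes
SELECT id, count_match
FROM (
  SELECT A.id,
         ((CASE A.d1 WHEN B.d1 THEN 1 ELSE 0 END) +
          (CASE A.d2 WHEN B.d2 THEN 1 ELSE 0 END) +
          ...) AS count_match
  FROM A CROSS JOIN B
) AS m
WHERE count_match >= 19;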

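Q: User ratings of goods are represented as sparse vectors and stored in a database table t with the columns uid, goods_id and star. uid is the user id, goods_id is the goods id, and star is the user's rating of that item, from 1 to 5. We want the inner product of every pair of vectors. Here the inner product means: for two different users, take the goods that both of them rated, multiply the two ratings for each such item, and sum the products.

Example: the table contains the following rows:

U0 g0 2
U0 g1 4
U1 g0 3
U1 g1 1

The computed result is:

U0 U1 2*3+4*1=10
……

-- t1.uid < t2.uid keeps each pair of different users exactly once
SELECT uid1, uid2, SUM(result) AS dot
FROM (
  SELECT t1.uid AS uid1, t2.uid AS uid2, t1.goods_id, t1.star * t2.star AS result
  FROM t AS t1
  JOIN t AS t2 ON t1.goods_id = t2.goods_id AND t1.uid < t2.uid
) AS p
GROUP BY uid1, uid2;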
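One way to check the intended semantics is to load the four sample rows into an in-memory SQLite database from Python and run a pairwise self-join. This is a sketch: the query is written for SQLite rather than MySQL, and the `t1.uid < t2.uid` condition is an assumption about how "two different users" should be paired.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (uid TEXT, goods_id TEXT, star INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [("U0", "g0", 2), ("U0", "g1", 4), ("U1", "g0", 3), ("U1", "g1", 1)],
)

# Join the table to itself on goods_id, keep each unordered user pair once,
# and sum the products of the two ratings per pair.
rows = conn.execute(
    """
    SELECT t1.uid, t2.uid, SUM(t1.star * t2.star) AS dot
    FROM t AS t1
    JOIN t AS t2 ON t1.goods_id = t2.goods_id AND t1.uid < t2.uid
    GROUP BY t1.uid, t2.uid
    """
).fetchall()
```

For the sample data, `rows` holds the single pair (U0, U1) with a dot product of 10, which matches the worked example in the question.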

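Q: Count the teachers who teach more than one course, and output a table of the number of courses each teacher teaches

Assume a table class with the columns id, teacher, course

1. Count the teachers who teach more than one course

SELECT COUNT(*) AS teacher_count
FROM (SELECT teacher FROM class GROUP BY teacher HAVING COUNT(DISTINCT course) > 1) AS m;

2. Output the number of courses taught by each teacher

SELECT teacher, COUNT(course) AS count_course
FROM class
GROUP BY teacher;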

Q: Four people elect a knight. Count the votes and output the name of the real knight

Assume a table tbl with the columns id, knight, vote_knight

SELECT vote_knight
FROM tbl
GROUP BY vote_knight
ORDER BY COUNT(*) DESC
LIMIT 1;

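Q: Given an employee table, a dormitory table and a department table, produce a table of the headcount of each department in each dormitory building

Assume:

The employee table is employee, with the columns id, employee_name, belong_dormitory_id, belong_department_id;

The dormitory table is dormitory, with the columns id, dormitory_number;

The department table is department, with the columns id, department_name

SELECT dor.dormitory_number, dep.department_name, COUNT(e.employee_name) AS count_employee
FROM employee AS e
LEFT JOIN dormitory AS dor ON e.belong_dormitory_id = dor.id
LEFT JOIN department AS dep ON e.belong_department_id = dep.id
GROUP BY dor.dormitory_number, dep.department_name;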

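Q: Given a table of numbers and their frequencies, find the median of the numbers

Assume a table tbl with the columns id, number, frequency

-- Each row covers the positions (last_idx, idx] of the sorted, expanded data.
-- This relies on MySQL evaluating the user variables row by row in sorted order.
SET @sum = (SELECT SUM(frequency) + 1 FROM tbl);
SET @idx = 0;
SET @last_idx = 0;
SELECT AVG(t.number) AS median
FROM (
  SELECT number, @last_idx := @idx AS last_idx, @idx := @idx + frequency AS idx
  FROM tbl ORDER BY number
) AS t
WHERE (t.last_idx < FLOOR(@sum / 2) AND FLOOR(@sum / 2) <= t.idx)
   OR (t.last_idx < CEILING(@sum / 2) AND CEILING(@sum / 2) <= t.idx);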
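The cumulative-frequency idea is easier to follow outside SQL. The sketch below walks the sorted (number, frequency) pairs and picks the values at the one or two middle positions of the expanded data; the function name is hypothetical.

```python
def median_from_frequencies(pairs):
    """pairs: iterable of (number, frequency). Returns the median of the expanded data."""
    ordered = sorted(pairs)
    total = sum(freq for _, freq in ordered)
    if total == 0:
        raise ValueError("empty frequency table")
    # 1-based middle positions; they coincide when total is odd.
    lo_pos, hi_pos = (total + 1) // 2, (total + 2) // 2
    lo_val = hi_val = None
    seen = 0  # running cumulative frequency
    for number, freq in ordered:
        seen += freq
        if lo_val is None and seen >= lo_pos:
            lo_val = number
        if hi_val is None and seen >= hi_pos:
            hi_val = number
            break
    return (lo_val + hi_val) / 2
```

For instance, the table (1, 2), (2, 2) expands to 1, 1, 2, 2, so the two middle values are 1 and 2 and the median is 1.5.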

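Q: Median: given one grade sheet that combines three classes, find the median score of each class

Assume a table tbl with the columns id, class, score

-- The inner query keeps the rows whose score has at least half of the class at or below it
-- and at least half at or above it; the outer query averages the one or two distinct values left
SELECT class, AVG(DISTINCT score) AS median
FROM (
  SELECT t1.class, t1.score
  FROM tbl AS t1
  JOIN tbl AS t2 ON t1.class = t2.class
  GROUP BY t1.class, t1.id, t1.score
  HAVING SUM(CASE WHEN t1.score >= t2.score THEN 1 ELSE 0 END) >= COUNT(*) / 2
     AND SUM(CASE WHEN t1.score <= t2.score THEN 1 ELSE 0 END) >= COUNT(*) / 2
) AS m
GROUP BY class;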

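Q: The transaction table has the columns user_id, order_id, pay_time, order_amount

Write SQL to find the 3 days with the most paying users in the past month (hint: users must be deduplicated)

Write SQL to find, for each user, the order ID and amount of their last payment yesterday

SELECT DATE(pay_time) AS pay_date, COUNT(DISTINCT user_id) AS c
FROM tbl
WHERE pay_time >= DATE_SUB(CURDATE(), INTERVAL 1 MONTH)
GROUP BY DATE(pay_time)
ORDER BY c DESC
LIMIT 3;

SELECT t2.user_id, t2.order_id, t2.order_amount
FROM (
  SELECT user_id, MAX(pay_time) AS mt
  FROM tbl
  WHERE DATEDIFF(pay_time, NOW()) = -1
  GROUP BY user_id
) AS t1
JOIN tbl AS t2 ON t1.user_id = t2.user_id AND t1.mt = t2.pay_time;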

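Q: A PV table a (columns user_id, goods_id) and a click table b (user_id, goods_id) each hold 500,000 rows. While guarding against data skew, write one SQL statement that finds the user_ids common to both tables together with their goods_ids

SELECT *
FROM a
WHERE EXISTS (SELECT 1 FROM b WHERE b.user_id = a.user_id);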

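Q: The table has the columns user_id, reg_time, age

Write one SQL statement that randomly samples 2000 users by user_id

Write one SQL statement that samples 1% of the users in each age band (one band per 10 years, such as (0,10))

1. Randomly sample 2000 users

SELECT * FROM tbl ORDER BY RAND() LIMIT 2000;

2. Sample 1% of the users in each age band

(Unfinished attempt: the LIMIT value is left open, see the note below.)

SET @target = 0;
SET @count_user = 0;
SELECT @target := @target + 10 AS age_right, t1.*
FROM tbl AS t1
WHERE t1.age >= @target - 10 AND t1.age < @target
  AND t1.user_id IN (
    SELECT FLOOR(COUNT(*) * 0.01) FROM tbl AS t2
    WHERE t1.age >= @target - 10 AND t1.age < @target
    ORDER BY RAND() LIMIT ??
  );

Note: I could not think of a good way to take a percentage of rows in MySQL, because LIMIT cannot be followed by a variable. The approach I had in mind: first compute the total for each age band, then work out what 1% of it is, then give every row an incrementing row number, and stop once the row number reaches the 1% mark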
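Outside SQL, the per-band sampling described in the note is straightforward. The sketch below groups users into 10-year bands and draws a fixed fraction from each; the function name and the choice to round the sample size up (so that a small band still contributes at least one user) are assumptions, not part of the original answer.

```python
import math
import random
from collections import defaultdict


def sample_by_age_band(users, fraction=0.01, band_width=10, seed=None):
    """users: iterable of (user_id, age).

    Returns {band_start: [sampled user_ids]}, one entry per non-empty age band.
    """
    rng = random.Random(seed)
    bands = defaultdict(list)
    for user_id, age in users:
        # An age of 37 with band_width 10 falls into the band starting at 30.
        bands[(age // band_width) * band_width].append(user_id)
    return {
        start: rng.sample(ids, math.ceil(len(ids) * fraction))
        for start, ids in sorted(bands.items())
    }
```

Passing a `seed` makes the draw reproducible, which helps when the sample has to be audited later.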

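Q: The user login log table has the columns user_id, log_id, session_id, plat, visit_date

Write SQL for the average number of users logging in per day over the last 30 days

Write SQL for the number of users who visited on 7 or more consecutive days in the last 30 days

1. Average number of users logging in per day over the last 30 days

SELECT AVG(c) AS avg_daily_users
FROM (
  SELECT visit_date, COUNT(DISTINCT user_id) AS c
  FROM tbl
  WHERE visit_date >= DATE_SUB(CURDATE(), INTERVAL 30 DAY)
  GROUP BY visit_date
) AS d;

2. Number of users who visited on 7 or more consecutive days in the last 30 days

SELECT COUNT(DISTINCT t1.user_id)
FROM tbl AS t1
JOIN tbl AS t2 ON t2.user_id = t1.user_id AND t1.visit_date = DATE_ADD(t2.visit_date, INTERVAL 1 DAY)
JOIN tbl AS t3 ON t3.user_id = t1.user_id AND t2.visit_date = DATE_ADD(t3.visit_date, INTERVAL 1 DAY)
...
JOIN tbl AS t7 ON t7.user_id = t1.user_id AND t6.visit_date = DATE_ADD(t7.visit_date, INTERVAL 1 DAY)
WHERE t7.visit_date >= DATE_SUB(CURDATE(), INTERVAL 30 DAY);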
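The seven-way self-join is hard to read, so it can help to state the streak logic procedurally. The sketch below deduplicates each user's visit dates, walks them in order and counts the current run of consecutive days; the function name and the (user_id, date) input layout are assumptions, and the 30-day filter is left to the caller.

```python
from datetime import date, timedelta


def users_with_streak(visits, min_days=7):
    """visits: iterable of (user_id, date).

    Returns the set of users with at least min_days consecutive visit days.
    """
    days_by_user = {}
    for user_id, day in visits:
        days_by_user.setdefault(user_id, set()).add(day)  # dedupe same-day logins

    qualified = set()
    for user_id, days in days_by_user.items():
        run, prev = 0, None
        for day in sorted(days):
            # Extend the run if this day directly follows the previous one.
            if prev is not None and day - prev == timedelta(days=1):
                run += 1
            else:
                run = 1
            prev = day
            if run >= min_days:
                qualified.add(user_id)
                break
    return qualified
```

The answer to the interview question is then the size of the returned set.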

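Q: A table has the columns user_id, visit_date, page_name, plat

Count the new users arriving on each of the last 7 days

For each access channel plat, compute the 3-day and 7-day retention rates of the users who were new 7 days ago

1. New users arriving on each of the last 7 days

-- A user is new on the date of their first visit
SELECT first_date, COUNT(*) AS new_users
FROM (SELECT user_id, MIN(visit_date) AS first_date FROM tbl GROUP BY user_id) AS f
WHERE first_date >= DATE_SUB(CURDATE(), INTERVAL 7 DAY)
GROUP BY first_date;

2. 3-day and 7-day retention of the users who were new 7 days ago, per channel

-- 3-day retention: n holds the new users of each platform from 7 days ago,
-- r their return visits within the following 3 days.
-- For 7-day retention, widen the window to INTERVAL 7 DAY.
SELECT n.plat,
       COUNT(DISTINCT r.user_id) / COUNT(DISTINCT n.user_id) AS retention_3
FROM (
  SELECT plat, user_id, MIN(visit_date) AS first_date
  FROM tbl
  GROUP BY plat, user_id
  HAVING MIN(visit_date) = DATE_SUB(CURDATE(), INTERVAL 7 DAY)
) AS n
LEFT JOIN tbl AS r
  ON r.user_id = n.user_id AND r.plat = n.plat
 AND r.visit_date > n.first_date
 AND r.visit_date <= DATE_ADD(n.first_date, INTERVAL 3 DAY)
GROUP BY n.plat;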

- end -
