Crawler in Practice: Scraping the Zhihu Question “How Can a College Student Earn 10,000 Yuan?”

From 大邓, 大邓和他的Python, 2019-04-26

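Lately I have become very interested in the topic of making money and have followed many money-making questions on Zhihu. Quite a few of the answers are high quality, but plenty also carry hidden self-promotion. Still, Zhihu's data is fairly complete, so it is perfectly usable for text analysis.

I won't go into how crawlers work here. If you are unfamiliar with crawler principles and the related libraries but want to get started quickly, take a look at our course.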

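Target URL

Question: How can you earn 10,000 yuan in college?

Tuition plus a year of living expenses at university comes to at least 10,000 yuan, so how can one earn 10,000? Link: https://www.zhihu.com/question/34011097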

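Analyzing the Request

We know that Zhihu serves its answer data as JSON, so we want to find the pattern in the URLs that return it. Press F12 to open the developer tools, select the XHR filter, and keep scrolling down the page; many requests will flash by in the Network panel.

After going through them, one request stands out. Clicking it shows its details.

The response is in JSON format,

and it does contain the users' answer data.

Now let's copy the URL we found and analyze it.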

https://www.zhihu.com/api/v4/questions/34011097/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_labeled%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cbadge%5B%2A%5D.topics&limit=5&offset=10&platform=desktop&sort_by=default
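
To make the percent-encoded URL easier to read, it can be split into its parts with the standard library. The following is only a sketch: the `include` value is shortened to three fields for readability, so the `url` string here is an abbreviated stand-in for the captured URL, not the full original.

```python
from urllib.parse import urlsplit, parse_qs

# Abbreviated copy of the captured URL (the long include=... list is shortened).
url = ('https://www.zhihu.com/api/v4/questions/34011097/answers'
       '?include=data%5B%2A%5D.is_normal%2Ccontent%2Cvoteup_count'
       '&limit=5&offset=10&platform=desktop&sort_by=default')

parts = urlsplit(url)
params = parse_qs(parts.query)  # parse_qs also decodes %5B, %2A, %2C, ...

# The ID sits in the path, just before the trailing "answers" segment.
question_id = parts.path.split('/')[-2]

print('question id:', question_id)
print('include    :', params['include'][0])
print('limit      :', params['limit'][0])
print('offset     :', params['offset'][0])
```

Decoded this way, `include` turns out to be a comma-separated list of the fields requested for each answer, while `limit` and `offset` are ordinary query parameters.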

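The URL has two parts we can adjust: 34011097 and offset. The first is the question ID. The second is the pagination offset, that is, the position in the answer list at which the returned batch starts (limit=5 appears to set how many answers each request returns). Tidying up the URL above gives the URL template base_url: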

base_url = 'https://www.zhihu.com/api/v4/questions/{question_id}/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_labeled%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cbadge%5B%2A%5D.topics&limit=5&offset={offset}&platform=desktop&sort_by=default'
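
As a rough illustration of how such a template might be used, the sketch below fills in `question_id` and steps `offset` to produce one URL per page. The template is abbreviated (the long `include=...` value is left out), `page_urls` is a hypothetical helper name, and stepping `offset` by `limit` is an assumption based on the `limit=5&offset=10` pair seen in the captured URL.

```python
# Abbreviated template: the long include=... value from the article is omitted.
base_url = ('https://www.zhihu.com/api/v4/questions/{question_id}/answers'
            '?limit=5&offset={offset}&platform=desktop&sort_by=default')


def page_urls(question_id, pages, limit=5):
    """Yield one URL per page, advancing offset by `limit` each time (assumed)."""
    for page in range(pages):
        yield base_url.format(question_id=question_id, offset=page * limit)


urls = list(page_urls('34011097', pages=3))
for u in urls:
    print(u)
```

Because the full template contains no curly braces other than the two placeholders, `str.format` should work on it unchanged.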

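Disguising the Request

One more thing to note: crawlers usually need to disguise the request headers, and for a site like Zhihu we may also need cookies. I created a settings.py file to hold the cookies, the headers, the URL template BASE_URL, and QUESTION_ID.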

COOKIES = {'cookie': 'your cookies'}
HEADERS = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36'}
BASE_URL = 'https://www.zhihu.com/api/v4/questions/{question_id}/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_labeled%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cbadge%5B%2A%5D.topics&limit=5&offset={offset}&platform=desktop&sort_by=default'
QUESTION_ID = '34011097'
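
To show how these settings could come together, here is a minimal sketch that builds (but does not send) a disguised request with the standard library's `urllib.request`. The article may well use a different HTTP client. The constants are short stand-ins for the values in settings.py, `COOKIE_STRING` and `build_request` are hypothetical names, and sending the cookies as one raw `cookie` header is an assumption.

```python
from urllib.request import Request

# Stand-ins for settings.py (abbreviated; not the article's full values).
HEADERS = {'user-agent': 'Mozilla/5.0'}
COOKIE_STRING = 'your cookies'  # hypothetical placeholder, copy yours from the browser
BASE_URL = ('https://www.zhihu.com/api/v4/questions/{question_id}/answers'
            '?limit=5&offset={offset}&platform=desktop&sort_by=default')
QUESTION_ID = '34011097'


def build_request(offset):
    """Return a Request for one page of answers, with disguised headers."""
    url = BASE_URL.format(question_id=QUESTION_ID, offset=offset)
    headers = dict(HEADERS)             # copy so the shared dict is not mutated
    headers['cookie'] = COOKIE_STRING   # assumption: cookies sent as a raw header
    return Request(url, headers=headers)


req = build_request(offset=0)
print(req.full_url)
print(req.header_items())
```

Passing `req` to `urllib.request.urlopen` would actually send it; that step is left out here so the sketch runs without network access.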

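Data Storage

Next we create zhihu.py to hold the crawler logic. Because Zhihu's data is all JSON, it is more clearly structured and cleaner than HTML. To keep later analysis flexible, we store the raw data as far as possible. That is why we use the jsonlines library here, which stores JSON data line by line (one JSON object per line). If you are unfamiliar with it, bookmark the article "jsonlines库：高效率的保存多个python对象" (the jsonlines library: efficiently saving multiple Python objects).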
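
The "one JSON object per line" idea can be sketched with nothing but the standard library's `json` module, which is swapped in here for jsonlines so that the example runs without extra installs. With the jsonlines library itself, `jsonlines.open(path, mode='a')` and the writer's `write(obj)` play roughly the same role. The helper names and the sample records below are made up for illustration; they are not real Zhihu data.

```python
import json
import os
import tempfile


def append_records(path, records):
    """Append each record to `path` as one JSON object per line."""
    with open(path, 'a', encoding='utf-8') as f:
        for record in records:
            # ensure_ascii=False keeps Chinese text readable in the file.
            f.write(json.dumps(record, ensure_ascii=False) + '\n')


def read_records(path):
    """Load every non-empty line of `path` back into a Python object."""
    with open(path, encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]


# Made-up records standing in for raw answer data.
sample = [{'id': 1, 'content': '回答一'}, {'id': 2, 'content': '回答二'}]

path = os.path.join(tempfile.mkdtemp(), 'answers.jsonl')
append_records(path, sample)
restored = read_records(path)
print(len(restored), 'records read back')
```

Appending line by line means each fetched batch can be written as soon as it arrives, and a crash midway loses at most the batch being written.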