Don't know Python? No problem. A hands-on guide to scraping the Douban Top 250 movies and the Bilibili rankings with Web Scraper
I've shared quite a few Python articles before:
Scraping the movies recommended by thousands of users under Zhihu's movie topic with Python, so you won't run out of films this National Day holiday
Download all of a WeChat official account's articles in one click, with export to PDF, HTML, Markdown, Excel, CHM and other formats
Back up your Weibo in one click, export it as a PDF, and analyze the account's data with Python
Python is usually the convenient choice for scraping data, but if you don't know it yet, the Chrome extension Web Scraper is a good alternative. Below we'll scrape the Douban Top 250 movies and the Bilibili rankings with both Python and Web Scraper.
Scraping Douban movies with Python
Open the Douban Top 250 page: https://movie.douban.com/top250
We want to grab each movie's title, rank, score, and one-line synopsis. A Python scraping job generally boils down to four steps: request the page, parse it, extract the data, and save it. Here is a simple Python script.
import bs4
import pandas as pd
import requests


def request_url(url):
    # Fetch one page of the Top 250 list with browser-like headers
    headers = {
        'Referer': 'https://movie.douban.com',
        'Host': 'movie.douban.com',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36'
    }
    response = requests.get(url, headers=headers)
    return response.text


def movie_info(html):
    # Parse one page and return a list of dicts, one per movie
    data = []
    soup = bs4.BeautifulSoup(html, 'html.parser')
    items = soup.select('li > div.item')
    for item in items:
        # A few movies have no one-line synopsis, so guard against an empty selection
        desc_item = item.select('div.info > div.bd > p.quote > span')
        desc = ''
        if desc_item is not None and len(desc_item) > 0:
            desc = desc_item[0].text
        data.append({
            'url': item.select('div.info > div.hd > a')[0]['href'],
            'title': item.select('div.info > div.hd > a > span')[0].text,
            'rank': item.select('div.pic > em')[0].text,
            'score': item.select('div.info > div.bd > div.star > span.rating_num')[0].text,
            'desc': desc,
        })
        # e.g. {'url': 'https://movie.douban.com/subject/1292052/', 'title': '肖申克的救赎',
        #       'rank': '1', 'score': '9.7', 'desc': '希望让人自由。'}
    return data


if __name__ == '__main__':
    # The list spans 10 pages of 25 movies each: start=0, 25, ..., 225
    urls = ['https://movie.douban.com/top250?start={0}&filter='.format(i * 25) for i in range(10)]
    pages = [request_url(url) for url in urls]
    movies_per_page = [movie_info(page) for page in pages]
    data = []
    for page in movies_per_page:
        for movie in page:
            data.append(movie)
    print(data)
    df = pd.DataFrame(data)
    df.to_csv("douban_movies.csv", encoding="utf_8_sig", index=False)
Running the script produces a CSV file. Note that a few movies have no synopsis, for example Stephen Chow's 《九品芝麻官》 https://movie.douban.com/subject/1297518/
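If you want to see exactly which titles came back without a synopsis, a quick pandas check works. This is just a small sketch against the douban_movies.csv produced by the script above, not part of the original workflow.

import pandas as pd

df = pd.read_csv('douban_movies.csv')
# Rows whose 'desc' column is missing or blank
missing = df[df['desc'].isna() | (df['desc'].str.strip() == '')]
print(len(missing), 'movies have no synopsis')
print(missing[['rank', 'title', 'url']])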
Scraping Douban movies with Web Scraper
Web Scraper is a free Chrome extension: you build a sitemap and it scrapes the matching data, no code required, which the author reckons covers more than 95% of websites (blog lists, Zhihu answers, Weibo comments, and so on). Extension page: https://chrome.google.com/webstore/detail/web-scraper-free-web-scra/jnhgnonknehpejjnehehllkliplmbmhn . If you can't reach Google, reply "Python" to this official account to get the crx file I've downloaded: rename its extension to .rar, unzip it into a folder, then load it in Chrome as an unpacked extension.
The Web Scraper workflow has four steps: create a sitemap, add selectors (the scraping rules), run the scraper, and export the results as a CSV file.
Open the Chrome DevTools and you'll see a new Web Scraper tab containing Sitemaps, Sitemap and Create new sitemap; click Create new sitemap to set up a scraping task.
Then click Element preview and you can see that all the movie elements have been picked up; since one page holds more than one movie, make sure the Multiple box is checked.
Click Selector graph to see how the selectors you've created relate to each other.
Finally, Export sitemap exports the whole task as a JSON string. The [0-250:25] in the start URL is Web Scraper's range syntax, which expands start= from 0 upwards in steps of 25 so every page of the list gets visited. You can copy my JSON below and import it (Create new sitemap → Import sitemap) to start scraping the Douban movie data right away.
{"_id":"douban","startUrl":["https://movie.douban.com/top250?start=[0-250:25]&filter="],"selectors":[{"id":"row","type":"SelectorElement","parentSelectors":["_root"],"selector":".grid_view li","multiple":true,"delay":0},{"id":"电影名","type":"SelectorText","parentSelectors":["row"],"selector":"span.title","multiple":false,"regex":"","delay":0},{"id":"豆瓣链接","type":"SelectorLink","parentSelectors":["row"],"selector":".hd a","multiple":false,"delay":0},{"id":"电影排名","type":"SelectorText","parentSelectors":["row"],"selector":"em","multiple":false,"regex":"","delay":0},{"id":"电影简介","type":"SelectorText","parentSelectors":["row"],"selector":"span.inq","multiple":false,"regex":"","delay":0},{"id":"豆瓣评分","type":"SelectorText","parentSelectors":["row"],"selector":"span.rating_num","multiple":false,"regex":"","delay":0}]}
That's all it takes to scrape with Web Scraper: no code, and the job is done. The first attempt can still feel tricky, especially if you're not familiar with page structure, so when I have time I'll record a video you can follow along with (leave a comment at the end or add me on WeChat if you run into problems). Next, let's point Web Scraper at the Bilibili rankings: https://www.bilibili.com/v/popular/rank/all
First, a preview of what the scrape returns.
To save you the setup, here is the JSON string for this task as well; you can import it and scrape directly.
{"_id":"bilibili_rank","startUrl":["https://www.bilibili.com/v/popular/rank/all"],"selectors":[{"id":"row","type":"SelectorElement","parentSelectors":["_root"],"selector":"li.rank-item","multiple":true,"delay":0},{"id":"视频排名","type":"SelectorText","parentSelectors":["row"],"selector":"div.num","multiple":false,"regex":"","delay":0},{"id":"视频标题","type":"SelectorText","parentSelectors":["row"],"selector":"a.title","multiple":false,"regex":"","delay":0},{"id":"播放量","type":"SelectorText","parentSelectors":["row"],"selector":".detail > span:nth-of-type(1)","multiple":false,"regex":"","delay":0},{"id":"弹幕数","type":"SelectorText","parentSelectors":["row"],"selector":"span:nth-of-type(2)","multiple":false,"regex":"","delay":0},{"id":"up主","type":"SelectorText","parentSelectors":["row"],"selector":"a span","multiple":false,"regex":"","delay":0},{"id":"视频链接","type":"SelectorLink","parentSelectors":["row"],"selector":"a.title","multiple":false,"delay":0},{"id":"点赞数","type":"SelectorText","parentSelectors":["视频链接"],"selector":"span.like","multiple":false,"regex":"","delay":0},{"id":"投币数","type":"SelectorText","parentSelectors":["视频链接"],"selector":"span.coin","multiple":false,"regex":"","delay":0},{"id":"收藏数","type":"SelectorText","parentSelectors":["视频链接"],"selector":"span.collect","multiple":false,"regex":"","delay":0}]}