[Scraping in Practice] Scraping Amazon's Top 100 Bestselling Books
Author: Zhao Bingjie, School of Finance, Zhongnan University of Economics and Law
Editor: Kou Xiaoxuan
Technical editor: Zhang Xinyue
In the post "Getting Started with Web Scraping: Basic Use of the requests Library, with an Amazon Book Page as an Example", we covered scraping a single page in detail. But what if you want to loop through the pages and scrape every one of these bestsellers? In today's post we work through it step by step.
First, import the third-party libraries and set the request headers:
import requests
from lxml import etree
import pandas as pd
import re

url = "https://www.amazon.cn/gp/bestsellers/digital-text/116169071/ref=zg_bs_pg_2?ie=UTF8&pg=1"
# Request headers copied from a browser session so the request looks like ordinary browser traffic
headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Connection": "keep-alive",
    "downlink": "10",
    "ect": "4g",
    "Host": "www.amazon.cn",
    "rtt": "100",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-User": "?1",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.183 Safari/537.36"
}
Next, send a GET request and convert the returned source code into an Element object so it can be queried with XPath:

html = requests.get(url, headers = headers)
tree = etree.HTML(html.text)  # convert the page source into an Element object
print(html.text)              # inspect the scraped page source
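Before writing any XPath it is worth confirming that Amazon actually returned the bestseller list rather than an anti-bot interstitial, which it sometimes serves to automated requests. A minimal sanity check on the response (the keyword test is only a heuristic of our own, not anything Amazon documents):

# Quick sanity check before parsing
print(html.status_code)  # expect 200
if "robot" in html.text.lower() or "captcha" in html.text.lower():
    print("Looks like an anti-bot page; try adjusting the headers or slowing down.")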
With the source in hand, write an XPath expression for each field and extract the titles, authors, prices, and book links; zip() then iterates over the four lists in parallel:

title_xpath = "//li/span/div/span/a/div/text()"                            # XPath for the book title
author_xpath = "//li/span/div/span/div[1]/span/text()"                     # XPath for the author
price_xpath = "//li/span/div/span/div[position() > 2]/a/span/span/text()"  # XPath for the price
book_url_xpath = "//li/span/div/span/a/@href"                              # XPath for the book link

title_list = tree.xpath(title_xpath)
title_list = [re.sub(r"\s", "", title) for title in title_list]  # strip whitespace from the titles

author_list = tree.xpath(author_xpath)

price_list = tree.xpath(price_xpath)

book_url_list = tree.xpath(book_url_xpath)
book_url_list = ["https://www.amazon.cn/" + book_url for book_url in book_url_list]  # prepend the domain to the relative href

for title, author, price, book_url in zip(title_list, author_list, price_list, book_url_list):
    print(title + "," + author + "," + price + "," + book_url)
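One caveat with the zip() above: it stops at the shortest of the four lists, so if one XPath misses an element for some book (a missing author line, say), the remaining rows silently fall out of alignment. If you would rather see the gap than hide it, itertools.zip_longest makes it explicit; a small sketch:

from itertools import zip_longest

# zip_longest pads the shorter lists with a fill value instead of truncating
for title, author, price, book_url in zip_longest(
        title_list, author_list, price_list, book_url_list, fillvalue="N/A"):
    print(title, author, price, book_url, sep=",")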
Putting everything together, we now loop over the pages, scrape each one, and accumulate the results:

# Import the third-party libraries and set the request headers
import requests
from lxml import etree
import pandas as pd
import re

headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Connection": "keep-alive",
    "downlink": "10",
    "ect": "4g",
    "Host": "www.amazon.cn",
    "rtt": "100",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-User": "?1",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.183 Safari/537.36"
}

# XPath expressions for each field
title_xpath = "//li/span/div/span/a/div/text()"
author_xpath = "//li/span/div/span/div[1]/span/text()"
price_xpath = "//li/span/div/span/div[position() > 2]/a/span/span/text()"
book_url_xpath = "//li/span/div/span/a/@href"

# Lists that accumulate the results from every page
all_title = []
all_author = []
all_price = []
all_book_url = []

# Loop over the pages and scrape each one
for p in range(1, 3):
    url = "https://www.amazon.cn/gp/bestsellers/digital-text/116169071/ref=zg_bs_pg_2?ie=UTF8&pg=%d" % p

    # Fetch and parse the current page; this must happen inside the loop,
    # otherwise every iteration would re-parse the same page
    html = requests.get(url, headers = headers)
    tree = etree.HTML(html.text)

    title_list = tree.xpath(title_xpath)
    title_list = [re.sub(r"\s", "", title) for title in title_list]
    all_title.extend(title_list)  # append this page's results to the running list

    author_list = tree.xpath(author_xpath)
    all_author.extend(author_list)

    price_list = tree.xpath(price_xpath)
    all_price.extend(price_list)

    book_url_list = tree.xpath(book_url_xpath)
    book_url_list = ["https://www.amazon.cn/" + book_url for book_url in book_url_list]
    all_book_url.extend(book_url_list)
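When requesting several pages in a row, pausing briefly between requests is both polite and less likely to trigger Amazon's anti-bot measures. A minimal sketch of how the loop above could be slowed down (the 2-second delay is an arbitrary choice of ours):

import time

for p in range(1, 3):
    url = "https://www.amazon.cn/gp/bestsellers/digital-text/116169071/ref=zg_bs_pg_2?ie=UTF8&pg=%d" % p
    html = requests.get(url, headers = headers)  # headers as defined above
    # ... parse and collect as before ...
    time.sleep(2)  # pause between pages; the exact delay is arbitrary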
Finally, assemble the four lists into a DataFrame and write it to an Excel file:

file = r"D:\Top100畅销书\Top100畅销书.xlsx"
df = pd.DataFrame(data = [all_title, all_author, all_price, all_book_url]).T
df.columns = ["书名", "作者", "价格", "Url"]  # title, author, price, URL
df.to_excel(file, index = None)
print("程序执行完毕")  # "program finished"
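Because the four lists are collected independently, it is worth checking that they ended up the same length before trusting the spreadsheet; a mismatch would mean one of the XPath expressions missed items on some page. A quick check:

# Verify that the four columns line up
lengths = {len(all_title), len(all_author), len(all_price), len(all_book_url)}
assert len(lengths) == 1, "column lengths differ: %s" % lengths
print(df.shape)  # expect (number of books, 4)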