Web Scraping in Practice: Scraping Equity Crowdfunding Data from 聚募网 (dreammove.cn)
Author: 陈志玲
Copy editor: 余术玲
Technical editor: 张邯
Open the target site in Chrome and go to the list of completed projects (https://www.dreammove.cn/list/index.html?industry=0&type=8&city=0).
import requests
import time
import json
2. Parsing the JSON responses to collect the project IDs from every first-level page
headers = {"Accept": "application/json, text/javascript, */*; q=0.01",
           "Referer": "https://www.dreammove.cn/list/index.html?industry=0&type=8&city=0",
           "User-Agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Mobile Safari/537.36",
           "X-Requested-With": "XMLHttpRequest"}
The Request URL in the headers panel of the screenshot above:
When we scroll the page on the left, the Request URL in the headers panel on the right changes to:
Scrolling to the last page shows that the change follows a pattern. The first number that varies in the URL is the page number, and the second string of digits is a timestamp: the time at which we viewed the page, counted from 1970-01-01 00:00:00 (here in milliseconds). Python's default timestamp, time.time(), is a float in seconds, so we convert seconds to milliseconds to obtain a 13-digit timestamp.
timestamp = int(round(time.time()*1000))
url = f"https://www.dreammove.cn/list/get_list.html?type=8&industry=0&city=0&offset=1&keyword=&_={timestamp}"
raw_html = requests.get(url, headers=headers)
html_text = raw_html.text
print(json.loads(html_text)["data"]["list"])
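The call above returns the project list for the first page. The same extraction can be exercised on a hypothetical payload shaped like the API's response (the field values here are made up for illustration):

```python
import json

# Hypothetical payload mimicking the get_list.html response shape (assumption)
sample = '{"status": "1", "data": {"list": [{"id": "GQ001", "project_name": "demo"}]}}'
projects = json.loads(sample)["data"]["list"]
print(len(projects), projects[0]["id"])  # 1 GQ001
```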
ProjectInfo = []
for i in range(1, 23):
    url = f"https://www.dreammove.cn/list/get_list.html?type=8&industry=0&city=0&offset={i}&keyword=&_={timestamp}"
    raw_html = requests.get(url, headers=headers)
    html_text = raw_html.text
    project = json.loads(html_text)["data"]["list"]
    ProjectInfo.extend(project)
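The loop issues 22 requests, one per list page. For reference, the URLs it constructs can be sketched as below (in a real run it is also polite to add a short time.sleep between requests):

```python
import time

# 13-digit millisecond timestamp, as derived earlier
timestamp = int(round(time.time() * 1000))
urls = [
    f"https://www.dreammove.cn/list/get_list.html?type=8&industry=0&city=0"
    f"&offset={i}&keyword=&_={timestamp}"
    for i in range(1, 23)
]
print(len(urls))  # 22 pages
```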
3. Scraping the team information on each project's second-level page
headers = {"Accept": "application/json, text/javascript, */*; q=0.01",
           "Accept-Encoding": "gzip, deflate, br",
           "Accept-Language": "zh-CN,zh;q=0.9",
           "Connection": "keep-alive",
           "Cookie": "PHPSESSID=m2el8qb3r83d3u6f6hvashqd85;Hm_lvt_c18b08cac9b94bf4628c0277d3a4d7de=1566483953,1566521937,1566607126,1566630977;Hm_lpvt_c18b08cac9b94bf4628c0277d3a4d7de=1566631899",
           "Host": "www.dreammove.cn",
           "Referer": "https://www.dreammove.cn/project/detail/id/GQ15639333500017933.html",
           "User-Agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Mobile Safari/537.36",
           "X-Requested-With": "XMLHttpRequest"}
timestamp = int(round(time.time() * 1000))
teamInfo = []
for b in range(0, 198):  # 198 completed projects in total
    url = f"https://www.dreammove.cn/project/project_team/id/{ProjectId[b]}?_={timestamp}"
    raw_html = requests.get(url, headers=headers)
    html_text = raw_html.text
    teamInfoi = json.loads(html_text)["data"]["team_list"]
    if isinstance(teamInfoi, list):  # team_list is not always a list
        teamInfo.extend(teamInfoi)
for var in teamInfo:
    print(var)
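The type check matters because team_list does not always come back as a list: for some projects the field is empty or of another type. A small self-contained illustration, using made-up payloads:

```python
import json

# Two hypothetical responses: one with a real team list, one without (assumption)
samples = [
    '{"data": {"team_list": [{"name": "Alice", "position": "CEO"}]}}',
    '{"data": {"team_list": ""}}',
]
teamInfo = []
for text in samples:
    team = json.loads(text)["data"]["team_list"]
    if isinstance(team, list):  # idiomatic form of the __class__ == list check
        teamInfo.extend(team)
print(len(teamInfo))  # only the list-typed response contributes
```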
Putting it all together, the complete script:
import requests
import time
import json
headers = {"Accept": "application/json, text/javascript, */*; q=0.01",
           "Referer": "https://www.dreammove.cn/list/index.html?industry=0&type=8&city=0",
           "User-Agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Mobile Safari/537.36",
           "X-Requested-With": "XMLHttpRequest"}
timestamp = int(round(time.time()*1000))
ProjectInfo = []
for i in range(1, 23):
    url = f"https://www.dreammove.cn/list/get_list.html?type=8&industry=0&city=0&offset={i}&keyword=&_={timestamp}"
    raw_html = requests.get(url, headers=headers)
    html_text = raw_html.text
    project = json.loads(html_text)["data"]["list"]
    ProjectInfo.extend(project)
#print(ProjectInfo)
Project_FileName = "C:\\CrowdFunding\\dreammove\\ProjectInfo.csv"
VarName =['id','update_time','province_name','subsite_id','is_open','industry','type','open_flag','project_name','step','seo_string','abstract','cover','project_phase','member_count','province','city','address','company_name','project_url','uid','over_time','vote_leader_step','stage','is_agree','is_del','agreement_id','barcode','sort','display_subsite_id','need_fund','real_fund','project_valuation','final_valuation','min_lead_fund','min_follow_fund','total_fund','agree_total_fund','leader_flag','leader_id','read_cnt','follow_cnt','inverstor_cnt','comment_cnt','nickname','short_name','site_url','site_logo','storelevel','industry_name']
with open(Project_FileName, "w", encoding="gb18030") as f:
    f.write("\t".join(VarName) + "\n")
    for EachInfo in ProjectInfo:
        tempInfo = []
        for key in VarName:
            if key in EachInfo:
                tempInfo.append(str(EachInfo[key]).replace("\n", "").replace("\t", "").replace("\r", ""))
            else:
                tempInfo.append("")
        f.write("\t".join(tempInfo) + "\n")
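The manual join-and-replace works, but the standard csv module handles missing keys and the delimiter for us. An equivalent sketch writing to an in-memory buffer (field list truncated and record made up for the example):

```python
import csv
import io

VarName = ["id", "project_name", "city"]                       # truncated for the sketch
ProjectInfo = [{"id": "GQ001", "project_name": "demo\nname"}]  # hypothetical record

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=VarName, delimiter="\t",
                        restval="", extrasaction="ignore")
writer.writeheader()
for row in ProjectInfo:
    # strip embedded newlines, as the original code does by hand
    writer.writerow({k: str(v).replace("\n", "") for k, v in row.items()})
print(buf.getvalue().splitlines()[0])  # tab-separated header row
```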
with open(Project_FileName, "r", encoding="gb18030") as f:
    final_Info = f.readlines()
ProjectId = []
for i in range(1, len(final_Info)):
    ProjectId.append(final_Info[i].split("\t")[0])
#print(final_Info)
#print(ProjectId)
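Since id is the first column of every data row, splitting each line on the tab delimiter and skipping the header recovers the project IDs. With hypothetical file contents:

```python
# Hypothetical lines in the tab-separated layout written earlier (assumption)
final_Info = ["id\tproject_name\n", "GQ001\tdemo\n", "GQ002\tother\n"]
ProjectId = [line.split("\t")[0] for line in final_Info[1:]]
print(ProjectId)  # ['GQ001', 'GQ002']
```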
headers = {"Accept": "application/json, text/javascript, */*; q=0.01",
           "Accept-Encoding": "gzip, deflate, br",
           "Accept-Language": "zh-CN,zh;q=0.9",
           "Connection": "keep-alive",
           "Cookie": "PHPSESSID=m2el8qb3r83d3u6f6hvashqd85;Hm_lvt_c18b08cac9b94bf4628c0277d3a4d7de=1566483953,1566521937,1566607126,1566630977;Hm_lpvt_c18b08cac9b94bf4628c0277d3a4d7de=1566631899",
           "Host": "www.dreammove.cn",
           "Referer": "https://www.dreammove.cn/project/detail/id/GQ15639333500017933.html",
           "User-Agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Mobile Safari/537.36",
           "X-Requested-With": "XMLHttpRequest"}
timestamp = int(round(time.time()*1000))
teamInfo=[]
for b in range(0, 198):  # 198 completed projects in total
    url = f"https://www.dreammove.cn/project/project_team/id/{ProjectId[b]}?_={timestamp}"
    raw_html = requests.get(url, headers=headers)
    html_text = raw_html.text
    teamInfoi = json.loads(html_text)["data"]["team_list"]
    if isinstance(teamInfoi, list):  # team_list is not always a list
        teamInfo.extend(teamInfoi)
for var in teamInfo:
    print(var)
About us
The WeChat official account "Stata and Python数据分析" shares practical data-processing knowledge for Stata, Python, and other software; reposts and tips are welcome. We are a big-data processing and analysis team of graduate and undergraduate students led by Professor 李春涛.