Architecture of a Crawler Platform and Framework Selection
Introduction: this article explains in detail the principles and frameworks for developing crawlers in Python; hopefully you will find it useful.
Let's start with the design of a crawler platform. A crawler platform has to support several different crawling modes, so it generally needs to include:

- Maintenance of crawl rules: when the platform receives a crawl request, it must be able to crawl automatically according to the matching rules.
- A crawl job scheduler: the platform is responsible for scheduling crawl tasks, for example timed scheduling and polling-based scheduling.
- Support for both asynchronous bulk crawling and real-time crawling. Asynchronous crawling means the crawled data is not returned immediately and a single crawl task may run for a long time; real-time crawling means the data must come back right away, which requires very short turnaround and generally suits only small amounts of data.
- Output handling: crawled data can be written to files such as CSV or JSON and then handed to a data-processing engine. For example, CSV files can be loaded into a big-data platform through data exchange, or the crawled data can be pushed into Kafka and then cleaned and processed by a streaming job (Spark, Storm or Flink) before being written to a database (see the sketch after this list).
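As a rough illustration of that last step, here is a minimal sketch of pushing crawled records into Kafka with the kafka-python client. The broker address and the topic name crawler_results are assumptions made for this example, not part of the original platform.

# Minimal sketch, assuming the kafka-python package and a broker on localhost:9092.
# The topic name "crawler_results" is made up for illustration.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    # serialize each record as a UTF-8 JSON message
    value_serializer=lambda v: json.dumps(v, ensure_ascii=False).encode("utf-8"),
)

def publish_item(item):
    # one message per crawled record, for downstream stream processing
    # (Spark / Storm / Flink) to clean and load into a database
    producer.send("crawler_results", item)

publish_item({"name": "example app"})
producer.flush()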
Scrapy is a good fit for the asynchronous, bulk style of crawling. Install it and create a project with a spider:

pip install scrapy
scrapy startproject zj_scrapy
cd zj_scrapy
scrapy genspider sjqq "sj.qq.com"
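After these commands the generated project should look roughly like the layout below (the file names come from Scrapy's default project template):

zj_scrapy/
├── scrapy.cfg
└── zj_scrapy/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── sjqq.py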
# -*- coding: utf-8 -*-
# zj_scrapy/items.py
# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class ZjScrapyItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()
# -*- coding: utf-8 -*-
# zj_scrapy/spiders/sjqq.py
import scrapy
from scrapy.http import HtmlResponse

from zj_scrapy.items import ZjScrapyItem


class SjqqSpider(scrapy.Spider):
    name = 'sjqq'
    allowed_domains = ['sj.qq.com']
    start_urls = ['https://sj.qq.com/myapp/category.htm?orgame=1&categoryId=114']

    def parse(self, response: HtmlResponse):
        # each <li> under the app list holds one app entry
        name_list = response.xpath('/html/body/div[3]/div[2]/ul/li')
        print("=============", response.headers)
        for each in name_list:
            item = ZjScrapyItem()
            name = each.xpath('./div/div/a[1]/text()').extract()
            item['name'] = name[0]
            yield item
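To sanity-check the XPath expressions before running the whole spider, Scrapy's interactive shell can be used, for example:

scrapy shell "https://sj.qq.com/myapp/category.htm?orgame=1&categoryId=114"
>>> response.xpath('/html/body/div[3]/div[2]/ul/li')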
# -*- coding: utf-8 -*-
# zj_scrapy/pipelines.py
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
class ZjScrapyPipeline(object):
    def process_item(self, item, spider):
        print("+++++++++++++++++++", item['name'])
        # 'cc' is a spider argument; see the schedule.json example below for
        # how it can be passed in when the job is scheduled through Scrapyd
        print("-------------------", spider.cc)
        return item
The pipeline is enabled, and a few options adjusted, in zj_scrapy/settings.py:

ITEM_PIPELINES = {
    'zj_scrapy.pipelines.ZjScrapyPipeline': 300,
}
FEED_EXPORT_ENCODING = 'utf-8'
# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 32
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}
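With these settings in place the spider can be run locally and its items exported to a JSON file through Scrapy's feed export (the output file name apps.json is arbitrary):

scrapy crawl sjqq -o apps.json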
To schedule and manage crawl jobs, the project can be deployed to Scrapyd:
pip install scrapyd
pip install scrapyd-client
scrapyd-deploy <target> -p <project> --version <version>
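For scrapyd-deploy to know where to upload the egg, a deploy target has to be configured in the project's scrapy.cfg. A minimal sketch, where the target name local and the Scrapyd address are assumptions for a local setup:

[settings]
default = zj_scrapy.settings

[deploy:local]
url = http://localhost:6800/
project = zj_scrapy

The project would then be deployed with, for example, scrapyd-deploy local -p zj_scrapy.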
A deployed spider is started through the schedule.json endpoint:

curl http://localhost:6800/schedule.json -d project=zj_scrapy -d spider=sjqq
{
"node_name": "ZJPH-0321",
"status": "ok",
"jobid": "dd7f10aca76e11e99b656c4b90156b7e"
}
The parameters accepted by schedule.json are:
project (string, required) – the project name
spider (string, required) – the spider name
setting (string, optional) – a Scrapy setting to use when running the spider
jobid (string, optional) – the job id; a previously started spider has an id, and this parameter is optional
_version (string, optional) – the version of the project to use; if not specified, the latest deployed version is started
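Any other -d parameter is passed to the spider as a spider argument, which is one way the spider.cc attribute read in the pipeline above could be supplied. A sketch (the value 1234 is made up for illustration):

curl http://localhost:6800/schedule.json -d project=zj_scrapy -d spider=sjqq -d cc=1234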
addversion.json adds a version (egg) to a project, creating the project if it does not yet exist. Parameters:
project (string, required) – the project name
version (string, required) – the project version (scrapyd-deploy fills in the current timestamp when no version is given)
egg (file, required) – a Python egg containing the project's code
curl http://localhost:6800/addversion.json -F project=myproject -F version=r23 -F egg=@myproject.egg
cancel.json cancels a running or pending job. Parameters:
project (string, required) – the project name
job (string, required) – the job id
curl http://localhost:6800/cancel.json -d project=myproject -d job=6487ec79947edab326d6db28a2d86511e8247444
listprojects.json lists the projects uploaded to the Scrapyd server:
curl http://localhost:6800/listprojects.json
listversions.json lists the versions available for a project. Parameters:
project (string, required) – the project name
curl http://localhost:6800/listversions.json?project=myproject
listspiders.json lists the spiders in the last (or the specified) version of a project. Parameters:
project (string, required) – the project name
_version (string, optional) – the version of the project to examine
curl http://localhost:6800/listspiders.json?project=myproject
listjobs.json lists the pending, running and finished jobs of a project. Parameters:
project (string, optional) – restrict results to the given project name
curl http://localhost:6800/listjobs.json?project=myproject | python -m json.tool
delversion.json deletes a version of a project. Parameters:
project (string, required) – the project name
version (string, required) – the project version
curl http://localhost:6800/delversion.json -d project=myproject -d version=r99
delproject.json deletes a project and all its uploaded versions. Parameters:
project (string, required) – the project name
curl http://localhost:6800/delproject.json -d project=myproject
For real-time crawling of small amounts of data, a plain synchronous crawler built on requests and BeautifulSoup is often enough:

# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
import time


class SyncCrawlSjqq(object):

    def parser(self, url):
        req = requests.get(url)
        soup = BeautifulSoup(req.text, "lxml")
        # every <li> in the app list is one app entry
        name_list = soup.find(class_='app-list clearfix')('li')
        names = []
        for name in name_list:
            app_name = name.find('a', class_="name ofh").text
            names.append(app_name)
        return names


if __name__ == '__main__':
    syncCrawlSjqq = SyncCrawlSjqq()
    t1 = time.time()
    url = "https://sj.qq.com/myapp/category.htm?orgame=1&categoryId=114"
    print(syncCrawlSjqq.parser(url))
    t2 = time.time()
    print('Plain synchronous approach, total time: %s' % (t2 - t1))
To make the real-time crawler callable by other systems, it can be wrapped in a small Flask HTTP service that returns the result immediately:

# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
from flask import Flask, request, Response
import json

app = Flask(__name__)


class SyncCrawlSjqq(object):

    def parser(self, url):
        req = requests.get(url)
        soup = BeautifulSoup(req.text, "lxml")
        name_list = soup.find(class_='app-list clearfix')('li')
        names = []
        for name in name_list:
            app_name = name.find('a', class_="name ofh").text
            names.append(app_name)
        return names


@app.route('/getSyncCrawlSjqqResult', methods=['GET'])
def getSyncCrawlSjqqResult():
    syncCrawlSjqq = SyncCrawlSjqq()
    return Response(json.dumps(syncCrawlSjqq.parser(request.args.get("url"))),
                    mimetype="application/json")


if __name__ == '__main__':
    app.run(port=3001, host='0.0.0.0', threaded=True)
    # app.run(port=3001, host='0.0.0.0', processes=3)
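The service can then be called with a URL-encoded target page, for example:

curl "http://localhost:3001/getSyncCrawlSjqqResult?url=https%3A%2F%2Fsj.qq.com%2Fmyapp%2Fcategory.htm%3Forgame%3D1%26categoryId%3D114"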
When more pages need to be crawled, the same parser can be driven by a thread pool:

# -*- coding: utf-8 -*-
from concurrent.futures import ThreadPoolExecutor, wait, ALL_COMPLETED
import requests
from bs4 import BeautifulSoup
import time


class SyncCrawlSjqqMultiProcessing(object):

    def parser(self, url):
        req = requests.get(url)
        soup = BeautifulSoup(req.text, "lxml")
        name_list = soup.find(class_='app-list clearfix')('li')
        names = []
        for name in name_list:
            app_name = name.find('a', class_="name ofh").text
            names.append(app_name)
        return names


if __name__ == '__main__':
    url = "https://sj.qq.com/myapp/category.htm?orgame=1&categoryId=114"
    executor = ThreadPoolExecutor(max_workers=20)
    syncCrawlSjqqMultiProcessing = SyncCrawlSjqqMultiProcessing()
    t1 = time.time()
    # submit the crawl to the pool instead of calling it inline, so that
    # many such tasks could run concurrently
    future_tasks = [executor.submit(syncCrawlSjqqMultiProcessing.parser, url)]
    wait(future_tasks, return_when=ALL_COMPLETED)
    for future in future_tasks:
        print(future.result())
    t2 = time.time()
    print('Thread-pool approach, total time: %s' % (t2 - t1))
Compared with single-threaded execution, the multi-threaded version is noticeably faster when many pages are crawled.
Author: 张永清
Source: https://www.cnblogs.com/laoqing