AntNest: a concise and fast asynchronous crawler framework

admin 2019-04-06 Views: 289

AntNest

A concise and fast asynchronous crawler framework (Python 3.6+), in only about 600 lines of code.

Features

  • An out-of-the-box HTTP client
  • An Item extractor that lets you declare explicitly how to parse data from a response (supports xpath, jpath or regex)
  • A convenient workflow built on the "ensure_future" and "as_completed" APIs (a short sketch follows this list)
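
For readers new to those asyncio primitives, the sketch below shows the general pattern the framework builds on: schedule coroutines eagerly, then consume results as they complete. This is plain asyncio, not AntNest's own API, and the names are illustrative.

import asyncio

async def fetch(i):
    # stand-in for a real request coroutine
    await asyncio.sleep(0.1)
    return i

async def main():
    # schedule the work eagerly...
    tasks = [asyncio.ensure_future(fetch(i)) for i in range(5)]
    # ...and handle each result as soon as it is ready
    for finished in asyncio.as_completed(tasks):
        print(await finished)

asyncio.get_event_loop().run_until_complete(main())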

Installation

pip install ant_nest

Usage

Create a demo project:

>>> ant_nest -c examples

The following files will be created automatically:

drwxr-xr-x 5 bruce staff 160 Jun 30 18:24 ants
-rw-r--r-- 1 bruce staff 208 Jun 26 22:59 settings.py

Suppose we want to fetch GitHub's trending repositories. Let's create "examples/ants/example2.py":

from ant_nest import *
from yarl import URL


class GithubAnt(Ant):
    """Crawl trending repositories from github"""
    item_pipelines = [
        ItemFieldReplacePipeline(
            ('meta_content', 'star', 'fork'),
            excess_chars=('\r', '\n', '\t', ' '))
    ]
    concurrent_limit = 1  # save the website`s and your bandwidth!

    def __init__(self):
        super().__init__()
        self.item_extractor = ItemExtractor(dict)
        self.item_extractor.add_pattern(
            'xpath', 'title', '//h1/strong/a/text()')
        self.item_extractor.add_pattern(
            'xpath', 'author', '//h1/span/a/text()', default='Not found')
        self.item_extractor.add_pattern(
            'xpath', 'meta_content',
            '//div[@class="repository-meta-content col-11 mb-1"]//text()',
            extract_type=ItemExtractor.EXTRACT_WITH_JOIN_ALL)
        self.item_extractor.add_pattern(
            'xpath',
            'star', '//a[@class="social-count js-social-count"]/text()')
        self.item_extractor.add_pattern(
            'xpath', 'fork', '//a[@class="social-count"]/text()')

    async def crawl_repo(self, url):
        """Crawl information from one repo"""
        response = await self.request(url)
        # extract item from response
        item = self.item_extractor.extract(response)
        item['origin_url'] = response.url
        await self.collect(item)  # let item go through pipelines(be cleaned)
        self.logger.info('*' * 70 + 'I got one hot repo!\n' + str(item))

    async def run(self):
        """App entrance, our play ground"""
        response = await self.request('https://github.com/explore')
        for url in response.html_element.xpath(
                '/html/body/div[4]/div[2]/div/div[2]/div[1]/article//h1/a[2]/'
                '@href'):
            # crawl many repos with our coroutines pool
            self.schedule_coroutine(
                self.crawl_repo(response.url.join(URL(url))))
        self.logger.info('Waiting...')
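
As a side note, "schedule_coroutine" together with "concurrent_limit = 1" above amounts to running the crawl coroutines through a bounded pool. A rough standalone illustration of that idea with plain asyncio and a semaphore (illustrative names, not AntNest code):

import asyncio

CONCURRENT_LIMIT = 1  # mirrors the class attribute above

async def crawl_repo(url, semaphore):
    # the semaphore caps how many crawls run at the same time
    async with semaphore:
        await asyncio.sleep(0.1)  # stand-in for the real HTTP request
        print('crawled', url)

async def run(urls):
    semaphore = asyncio.Semaphore(CONCURRENT_LIMIT)
    await asyncio.gather(*(crawl_repo(u, semaphore) for u in urls))

urls = ['https://github.com/explore', 'https://github.com/trending']
asyncio.get_event_loop().run_until_complete(run(urls))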

Then we can list all runnable crawlers (under the "examples" folder):

>>> ant_nest -l
ants.example2.GithubAnt

Run it! (without debug log):

>>> ant_nest -a ants.example2.GithubAnt
INFO:GithubAnt:Opening
INFO:GithubAnt:Waiting...
INFO:GithubAnt:**********************************************************************I got one hot repo!
{'title': 'NLP-progress', 'author': 'sebastianruder', 'meta_content': 'Repository to track the progress in Natural Language Processing (NLP), including the datasets and the current state-of-the-art for the most common NLP tasks.', 'star': '3,743', 'fork': '327', 'origin_url': URL('https://github.com/sebastianruder/NLP-progress')}
INFO:GithubAnt:**********************************************************************I got one hot repo!
{'title': 'material-dashboard', 'author': 'creativetimofficial', 'meta_content': 'Material Dashboard - Open Source Bootstrap 4 Material Design Adminhttps://demos.creative-tim.com/materi…', 'star': '6,032', 'fork': '187', 'origin_url': URL('https://github.com/creativetimofficial/material-dashboard')}
INFO:GithubAnt:**********************************************************************I got one hot repo!
{'title': 'mkcert', 'author': 'FiloSottile', 'meta_content': "A simple zero-config tool to make locally-trusted development certificates with any names you'd like.", 'star': '2,311', 'fork': '60', 'origin_url': URL('https://github.com/FiloSottile/mkcert')}
INFO:GithubAnt:**********************************************************************I got one hot repo!
{'title': 'pure-bash-bible', 'author': 'dylanaraps', 'meta_content': ' A collection of pure bash alternatives to external processes.', 'star': '6,385', 'fork': '210', 'origin_url': URL('https://github.com/dylanaraps/pure-bash-bible')}
INFO:GithubAnt:**********************************************************************I got one hot repo!
{'title': 'flutter', 'author': 'flutter', 'meta_content': 'Flutter makes it easy and fast to build beautiful mobile apps.https://flutter.io', 'star': '30,579', 'fork': '1,337', 'origin_url': URL('https://github.com/flutter/flutter')}
INFO:GithubAnt:**********************************************************************I got one hot repo!
{'title': 'Java-Interview', 'author': 'crossoverJie', 'meta_content': '\\u200d Java related : basic, concurrent, algorithm https://crossoverjie.top/categories/J…', 'star': '4,687', 'fork': '409', 'origin_url': URL('https://github.com/crossoverJie/Java-Interview')}
INFO:GithubAnt:Closed
INFO:GithubAnt:Get 7 Request in total
INFO:GithubAnt:Get 7 Response in total
INFO:GithubAnt:Get 6 dict in total
INFO:GithubAnt:Run GithubAnt in 18.157656 seconds

We can configure our crawler through class attributes:

class Ant(abc.ABC):
    response_pipelines: List[Pipeline] = []
    request_pipelines: List[Pipeline] = []
    item_pipelines: List[Pipeline] = []
    request_cls = Request
    response_cls = Response
    request_timeout = DEFAULT_TIMEOUT.total
    request_retries = 3
    request_retry_delay = 5
    request_proxies: List[Union[str, URL]] = []
    request_max_redirects = 10
    request_allow_redirects = True
    response_in_stream = False
    connection_limit = 100  # see "TCPConnector" in "aiohttp"
    connection_limit_per_host = 0
    concurrent_limit = 100
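
A subclass only needs to override the attributes it cares about. For instance (the values below are illustrative, not project defaults):

class MyAnt(Ant):
    # be gentler with the target site
    concurrent_limit = 10
    request_timeout = 30
    request_retries = 5
    request_retry_delay = 10
    # route requests through a local proxy (hypothetical address)
    request_proxies = ['http://127.0.0.1:8080']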