
A tiny, customizable, multiprocess and multiproxy crawler.

Project description


A highly customizable crawler that uses multiprocessing and proxies to download one or more websites, following the given filter, search and save functions.

Remember that DDoS is illegal. Please do not use this software for illegal purposes.

Installing TinyCrawler

pip install tinycrawler

Preview (test case)

Here is a console preview of the crawler running test_base.py.


Usage example

from tinycrawler import TinyCrawler, Log, Statistics
from bs4 import BeautifulSoup, SoupStrainer
import pandas as pd
from requests import Response
from urllib.parse import urlparse
import os
import json


def html_sanitization(html: str) -> str:
    """Return sanitized html."""
    return html.replace("WRONG CONTENT", "RIGHT CONTENT")


def get_product_name(response: Response) -> str:
    """Return product name from given Response object."""
    return response.url.split("/")[-1].split(".html")[0]


def get_product_category(soup: BeautifulSoup) -> str:
    """Return product category from given BeautifulSoup object."""
    return soup.find_all("span")[-2].get_text()


def parse_tables(html: str, path: str, strainer: SoupStrainer):
    """Parse table at given strained html object saving them as csv at given path."""
    for table in BeautifulSoup(
            html, "lxml", parse_only=strainer).find_all("table"):
        df = pd.read_html(html_sanitization(str(table)))[0].drop(0)
        table_name = df.columns[0]
        df.set_index(table_name, inplace=True)
        df.to_csv("{path}/{table_name}.csv".format(
            path=path, table_name=table_name))


def parse_metadata(html: str, path: str, strainer: SoupStrainer):
    """Parse metadata from given strained html and saves them as json at given path."""
    with open("{path}/metadata.json".format(path=path), "w") as f:
        json.dump({
            "category":
            get_product_category(
                BeautifulSoup(html, "lxml", parse_only=strainer))
        }, f)


def parse(response: Response):
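    """Parse the given Response, saving its tables and metadata under a per-product directory."""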
    path = "{root}/{product}".format(
        root=urlparse(response.url).netloc, product=get_product_name(response))
    if not os.path.exists(path):
        os.makedirs(path)
    parse_tables(
        response.text, path,
        SoupStrainer(
            "table",
            attrs={"class": "table table-hover table-condensed table-fixed"}))

    parse_metadata(
        response.text, path,
        SoupStrainer("span"))


def url_validator(url: str, logger: Log, statistics: Statistics) -> bool:
    """Return a boolean representing whether the crawler should parse the given url."""
    return url.startswith("https://www.example.com/it/alimenti")


def file_parser(response: Response, logger: Log, statistics: Statistics):
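    """Parse the response only when it points to an html page."""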
    if response.url.endswith(".html"):
        parse(response)


seed = "https://www.example.com/it/alimenti"
crawler = TinyCrawler(follow_robots_txt=False)
crawler.set_file_parser(file_parser)
crawler.set_url_validator(url_validator)

crawler.load_proxies("http://mytestserver.domain", "proxies.json")

crawler.run(seed)

Proxies should be in the following format:

[
  {
    "ip": "89.236.17.108",
    "port": 3128,
    "type": [
      "https",
      "http"
    ]
  },
  {
    "ip": "128.199.141.151",
    "port": 3128,
    "type": [
      "https",
      "http"
    ]
  }
]
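
As a minimal sketch, a proxies.json file in this format can be produced with plain json.dump; the file name matches the load_proxies call above, and the entries are the sample values shown here:

import json

# Illustrative only: write a list of proxy dictionaries, in the format shown
# above, to the proxies.json file passed to crawler.load_proxies(...).
proxies = [
    {"ip": "89.236.17.108", "port": 3128, "type": ["https", "http"]},
    {"ip": "128.199.141.151", "port": 3128, "type": ["https", "http"]}
]

with open("proxies.json", "w") as f:
    json.dump(proxies, f, indent=2)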

License

This software is released under the MIT license.

Download files

Download the file for your platform. If you are not sure which to choose, learn more about installing packages.

Source distribution

tinycrawler-1.7.5.tar.gz (16.2 kB)
