A tiny, customizable, multiprocess and multi-proxy crawler.
Project description
A highly customizable crawler that uses multiprocessing and proxies to download one or more websites, according to a given filter, search and save function.
Remember that DDoS is illegal. Please do not use this software for any illegal purpose.
Installing TinyCrawler
pip install tinycrawler
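Once installed, the crawler only needs a file parser callback and a seed URL to get going. The snippet below is a minimal sketch, not part of the original example: the seed URL and the callback are placeholders, and depending on your setup you may also need to load proxies with load_proxies, as shown in the full example further below.

from tinycrawler import TinyCrawler


def file_parser(response, logger, statistics):
    # Placeholder callback: just print the url of every downloaded page.
    print(response.url)


crawler = TinyCrawler(follow_robots_txt=False)
crawler.set_file_parser(file_parser)
crawler.run("https://www.example.com")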
Preview (test case)
This is a preview of the console output when running test_base.py.
Usage example
from tinycrawler import TinyCrawler, Log, Statistics
from bs4 import BeautifulSoup, SoupStrainer
import pandas as pd
from requests import Response
from urllib.parse import urlparse
import os
import json


def html_sanitization(html: str) -> str:
    """Return sanitized html."""
    return html.replace("WRONG CONTENT", "RIGHT CONTENT")


def get_product_name(response: Response) -> str:
    """Return product name from given Response object."""
    return response.url.split("/")[-1].split(".html")[0]


def get_product_category(soup: BeautifulSoup) -> str:
    """Return product category from given BeautifulSoup object."""
    return soup.find_all("span")[-2].get_text()


def parse_tables(html: str, path: str, strainer: SoupStrainer):
    """Parse tables in given strained html, saving them as csv at given path."""
    for table in BeautifulSoup(
            html, "lxml", parse_only=strainer).find_all("table"):
        df = pd.read_html(html_sanitization(str(table)))[0].drop(0)
        table_name = df.columns[0]
        df.set_index(table_name, inplace=True)
        df.to_csv("{path}/{table_name}.csv".format(
            path=path, table_name=table_name))


def parse_metadata(html: str, path: str, strainer: SoupStrainer):
    """Parse metadata from given strained html and save it as json at given path."""
    with open("{path}/metadata.json".format(path=path), "w") as f:
        json.dump({
            "category":
                get_product_category(
                    BeautifulSoup(html, "lxml", parse_only=strainer))
        }, f)


def parse(response: Response):
    """Extract tables and metadata from given Response object."""
    path = "{root}/{product}".format(
        root=urlparse(response.url).netloc,
        product=get_product_name(response))
    if not os.path.exists(path):
        os.makedirs(path)
    parse_tables(
        response.text, path,
        SoupStrainer(
            "table",
            attrs={"class": "table table-hover table-condensed table-fixed"}))
    parse_metadata(
        response.text, path,
        SoupStrainer("span"))


def url_validator(url: str, logger: Log, statistics: Statistics) -> bool:
    """Return a boolean representing whether the crawler should parse given url."""
    return url.startswith("https://www.example.com/it/alimenti")


def file_parser(response: Response, logger: Log, statistics: Statistics):
    """Parse only the downloaded pages that look like product pages."""
    if response.url.endswith(".html"):
        parse(response)


seed = "https://www.example.com/it/alimenti"

crawler = TinyCrawler(follow_robots_txt=False)
crawler.set_file_parser(file_parser)
crawler.set_url_validator(url_validator)
crawler.load_proxies("http://mytestserver.domain", "proxies.json")
crawler.run(seed)
Proxies should use the following format:
[
{
"ip": "89.236.17.108",
"port": 3128,
"type": [
"https",
"http"
]
},
{
"ip": "128.199.141.151",
"port": 3128,
"type": [
"https",
"http"
]
}
]
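If you collect proxies programmatically, a file in this format can be produced with the standard json module. The sketch below is only an illustration: the proxy addresses are the placeholders from the example above, and proxies.json is the file name passed to load_proxies in the usage example.

import json

# Placeholder proxy list, reusing the example addresses shown above.
proxies = [
    {"ip": "89.236.17.108", "port": 3128, "type": ["https", "http"]},
    {"ip": "128.199.141.151", "port": 3128, "type": ["https", "http"]}
]

# Write the list to proxies.json, the file name used by load_proxies above.
with open("proxies.json", "w") as f:
    json.dump(proxies, f, indent=2)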
License
This software is released under the MIT license.