A tool to collect and validate URLs in static files (code and documentation)

Project description

urlchecker-python

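This is a Python module that collects URLs from static files (code and documentation), then tests them and reports broken links. If you want to use it as a GitHub action, see urlchecker-action. A container-based version is also available at quay.io/urlstechie/urlchecker. As of version 0.0.26 the checks use multiprocessing and run faster; you can set URLCHECKER_WORKERS to change the number of workers (defaults to 9). If you don't want multiprocessing, use version 0.0.25 or earlier.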
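As an illustration of the worker setting, the sketch below reads URLCHECKER_WORKERS with a fallback to 9. It is not urlchecker's own code: the helper name worker_count and the handling of invalid or non-positive values are assumptions made for the example.

```python
import os

DEFAULT_WORKERS = 9  # default stated in the description above


def worker_count(environ=None):
    """Return the worker count from URLCHECKER_WORKERS, or the default."""
    environ = os.environ if environ is None else environ
    raw = environ.get("URLCHECKER_WORKERS", "")
    try:
        value = int(raw)
    except ValueError:
        # Unset or non-numeric: fall back to the default (an assumption)
        return DEFAULT_WORKERS
    # Zero or negative values also fall back to the default (an assumption)
    return value if value > 0 else DEFAULT_WORKERS


print(worker_count({"URLCHECKER_WORKERS": "4"}))
print(worker_count({}))
```

Passing a plain dictionary instead of os.environ keeps the helper easy to try out without touching the real environment.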

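Module Documentation

Detailed documentation of the code is available at urlchecker-python.readthedocs.io

Usage

Install

You can install urlchecker from PyPI. Before installing, we recommend that you install fake-useragent from the following link: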

pip install git+https://github.com/danger89/fake-useragent.git

Then install urlchecker

$ pip install urlchecker

or install directly from the repository

$ git clone https://github.com/urlstechie/urlchecker-python.git
$ cd urlchecker-python
$ python setup.py install

Installation places an executable, urlchecker, on your path.

$ which urlchecker
/home/vanessa/anaconda3/bin/urlchecker

Check Local Folder

Your most likely use case is checking a local directory of static files (documentation or code). In that case, use urlchecker check:

$ urlchecker check --help
usage: urlchecker check [-h] [-b BRANCH] [--subfolder SUBFOLDER] [--cleanup] [--serial] [--no-check-certs]
                        [--force-pass] [--no-print] [--verbose] [--file-types FILE_TYPES] [--files FILES]
                        [--exclude-urls EXCLUDE_URLS] [--exclude-patterns EXCLUDE_PATTERNS]
                        [--exclude-files EXCLUDE_FILES] [--save SAVE] [--retry-count RETRY_COUNT] [--timeout TIMEOUT]
                        path

positional arguments:
  path                  the local path or GitHub repository to clone and check

options:
  -h, --help            show this help message and exit
  -b BRANCH, --branch BRANCH
                        if cloning, specify a branch to use (defaults to main)
  --subfolder SUBFOLDER
                        relative subfolder path within path (if not specified, we use root)
  --cleanup             remove root folder after checking (defaults to False, no cleaup)
  --serial              run checks in serial (no multiprocess)
  --no-check-certs      Allow urls to validate that fail certificate checks
  --force-pass          force successful pass (return code 0) regardless of result
  --no-print            Skip printing results to the screen (defaults to printing to console).
  --verbose             Print file names for failed urls in addition to the urls.
  --file-types FILE_TYPES
                        comma separated list of file extensions to check (defaults to .md,.py)
  --files FILES         comma separated list of exact files or patterns to check.
  --exclude-urls EXCLUDE_URLS
                        comma separated links to exclude (no spaces)
  --exclude-patterns EXCLUDE_PATTERNS
                        comma separated list of patterns to exclude (no spaces)
  --exclude-files EXCLUDE_FILES
                        comma separated list of files and patterns to exclude (no spaces)
  --save SAVE           Path to a csv file to save results to.
  --retry-count RETRY_COUNT
                        retry count upon failure (defaults to 2, one retry).
  --timeout TIMEOUT     timeout (seconds) to provide to the requests library (defaults to 5)

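You have a lot of flexibility to define patterns of URLs or files to skip, along with the retry count and the timeout (in seconds). The most basic usage checks an entire directory. Let's clone and check the urlchecker-action repository: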

$ git clone https://github.com/urlstechie/urlchecker-action.git
$ cd urlchecker-action

and run the simplest command to check the current working directory (.):

$ urlchecker check .
           original path: .
              final path: /tmp/urlchecker-action
               subfolder: None
                  branch: master
                 cleanup: False
              file types: ['.md', '.py']
                   files: []
               print all: True
           urls excluded: []
   url patterns excluded: []
  file patterns excluded: []
              force pass: False
             retry count: 2
                    save: None
                 timeout: 5

 /tmp/urlchecker-action/README.md 
 --------------------------------
https://github.com/urlstechie/urlchecker-action/blob/master/LICENSE
https://github.com/r-hub/docs/blob/bc1eac71206f7cb96ca00148dcf3b46c6d25ada4/.github/workflows/pr.yml
https://img.shields.io/static/v1?label=Marketplace&message=urlchecker-action&color=blue?style=flat&logo=github
https://github.com/rseng/awesome-rseng
https://github.com/rseng/awesome-rseng/blob/5f5cb78f8392cf10aec2f3952b305ae9611029c2/.github/workflows/urlchecker.yml
https://github.com/HPC-buildtest/buildtest-framework/actions?query=workflow%3A%22Check+URLs%22
https://www.codefactor.io/repository/github/urlstechie/urlchecker-action/badge
https://github.com/berlin-hack-and-tell/berlinhackandtell.rocks
https://github.com/urlstechie/urlchecker-action/issues
https://github.com/USRSE/usrse.github.io
https://github.com/berlin-hack-and-tell/berlinhackandtell.rocks/actions?query=workflow%3ACommands
https://github.com/USRSE/usrse.github.io/blob/abcbed5f5703e0d46edb9e8850eea8bb623e3c1c/.github/workflows/urlchecker.yml
https://github.com/urlstechie/urlchecker-action/releases
https://img.shields.io/badge/license-MIT-brightgreen
https://github.com/r-hub/docs/actions?query=workflow%3ACommands
https://github.com/rseng/awesome-rseng/actions?query=workflow%3AURLChecker
https://github.com/buildtesters/buildtest
https://github.com/r-hub/docs
https://www.codefactor.io/repository/github/urlstechie/urlchecker-action
https://github.com/urlstechie/URLs-checker-test-repo
https://github.com/marketplace/actions/urlchecker-action
https://github.com/actions/checkout
https://github.com/SuperKogito/URLs-checker/issues/1,https://github.com/SuperKogito/URLs-checker/issues/2
https://github.com/SuperKogito/URLs-checker/issues/1,https://github.com/SuperKogito/URLs-checker/issues/2
https://github.com/USRSE/usrse.github.io/actions?query=workflow%3A%22Check+URLs%22
https://github.com/SuperKogito/Voice-based-gender-recognition/issues
https://github.com/buildtesters/buildtest/blob/v0.9.1/.github/workflows/urlchecker.yml
https://github.com/berlin-hack-and-tell/berlinhackandtell.rocks/blob/master/.github/workflows/urlchecker-pr-label.yml

 /tmp/urlchecker-action/examples/README.md 
 -----------------------------------------
https://github.com/urlstechie/urlchecker-action/releases
https://github.com/urlstechie/urlchecker-action/issues
https://help.github.com/en/actions/reference/events-that-trigger-workflows


Done. The following urls did not pass:
https://github.com/SuperKogito/URLs-checker/issues/1,https://github.com/SuperKogito/URLs-checker/issues/2

The URL that didn't pass is an example parameter for the library! Let's add a simple pattern to exclude it.

$ urlchecker check --exclude-patterns SuperKogito .
           original path: .
              final path: /tmp/urlchecker-action
               subfolder: None
                  branch: master
                 cleanup: False
              file types: ['.md', '.py']
                   files: []
               print all: True
           urls excluded: []
   url patterns excluded: ['SuperKogito']
  file patterns excluded: []
              force pass: False
             retry count: 2
                    save: None
                 timeout: 5

 /tmp/urlchecker-action/README.md 
 --------------------------------
https://github.com/urlstechie/urlchecker-action/blob/master/LICENSE
https://github.com/urlstechie/urlchecker-action/issues
https://github.com/rseng/awesome-rseng/actions?query=workflow%3AURLChecker
https://github.com/USRSE/usrse.github.io/actions?query=workflow%3A%22Check+URLs%22
https://github.com/actions/checkout
https://github.com/USRSE/usrse.github.io/blob/abcbed5f5703e0d46edb9e8850eea8bb623e3c1c/.github/workflows/urlchecker.yml
https://github.com/r-hub/docs/blob/bc1eac71206f7cb96ca00148dcf3b46c6d25ada4/.github/workflows/pr.yml
https://github.com/berlin-hack-and-tell/berlinhackandtell.rocks/blob/master/.github/workflows/urlchecker-pr-label.yml
https://github.com/rseng/awesome-rseng
https://www.codefactor.io/repository/github/urlstechie/urlchecker-action/badge
https://github.com/urlstechie/URLs-checker-test-repo
https://www.codefactor.io/repository/github/urlstechie/urlchecker-action
https://github.com/r-hub/docs
https://github.com/berlin-hack-and-tell/berlinhackandtell.rocks
https://github.com/buildtesters/buildtest
https://img.shields.io/badge/license-MIT-brightgreen
https://github.com/urlstechie/urlchecker-action/releases
https://github.com/marketplace/actions/urlchecker-action
https://img.shields.io/static/v1?label=Marketplace&message=urlchecker-action&color=blue?style=flat&logo=github
https://github.com/r-hub/docs/actions?query=workflow%3ACommands
https://github.com/HPC-buildtest/buildtest-framework/actions?query=workflow%3A%22Check+URLs%22
https://github.com/buildtesters/buildtest/blob/v0.9.1/.github/workflows/urlchecker.yml
https://github.com/berlin-hack-and-tell/berlinhackandtell.rocks/actions?query=workflow%3ACommands
https://github.com/USRSE/usrse.github.io
https://github.com/rseng/awesome-rseng/blob/5f5cb78f8392cf10aec2f3952b305ae9611029c2/.github/workflows/urlchecker.yml

 /tmp/urlchecker-action/examples/README.md 
 -----------------------------------------
https://help.github.com/en/actions/reference/events-that-trigger-workflows
https://github.com/urlstechie/urlchecker-action/issues
https://github.com/urlstechie/urlchecker-action/releases


Done. All URLS passed.
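To make the effect of an exclude pattern concrete, here is a minimal sketch of the filtering step. It assumes that a pattern excludes any URL containing it as a substring and that excluded URLs must match exactly; is_excluded is a hypothetical helper, not urlchecker's API.

```python
def is_excluded(url, exclude_urls=(), exclude_patterns=()):
    """Return True if the URL is listed exactly or contains any pattern."""
    if url in exclude_urls:
        return True
    return any(pattern in url for pattern in exclude_patterns)


urls = [
    "https://github.com/SuperKogito/URLs-checker/issues/1",
    "https://github.com/urlstechie/urlchecker-action/issues",
]

# Keep only the URLs that the pattern does not exclude
kept = [u for u in urls if not is_excluded(u, exclude_patterns=["SuperKogito"])]
print(kept)
```

A short pattern such as SuperKogito is convenient, but under this substring assumption it also skips every other URL that contains the same text.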

We can also filter by file type. To do that (for example, to check only certain types of files), we can do any of the following:

# Check only html files
urlchecker check --file-types *.html .

# Check hidden files
urlchecker check --file-types ".*" .

# Check hidden files and html files
urlchecker check --file-types ".*,*.html" .

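Note that while some patterns work without quotes, quoting is recommended in most cases, because a pattern will not work as expected if the shell expands any part of it. By default, urlchecker checks Python and Markdown files. If a multiprocessing worker raises an error, you can also add --serial to run the checks in serial and test. This makes the run slower, but helps with debugging.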

$ urlchecker check . --files "content/docs/hacking/contributing/documentation/index.md" --serial
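The file-type patterns shown earlier can be tried out in Python before passing them to the command line. This sketch assumes glob-style matching on the file name (via the standard fnmatch module) plus a plain extension match for values such as .md; matches_file_types is a hypothetical helper, and the real matching rules may differ.

```python
import os
from fnmatch import fnmatch


def matches_file_types(path, file_types):
    """Return True if the file name ends with, or glob-matches, any pattern."""
    name = os.path.basename(path)
    return any(name.endswith(p) or fnmatch(name, p) for p in file_types)


# Same value as the quoted command line argument ".*,*.html"
file_types = ".*,*.html".split(",")

candidates = ["docs/index.html", ".github/.hidden", "README.md", "setup.py"]
selected = [path for path in candidates if matches_file_types(path, file_types)]
print(selected)
```

Without quotes, a shell may expand *.html into the matching file names in the current directory before urlchecker ever sees the pattern.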

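Check GitHub Repository

But wouldn't it be easier to skip cloning the repository first? Of course! We can specify a GitHub URL instead, and add --cleanup if we want the folder removed afterwards.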

$ urlchecker check https://github.com/SuperKogito/SuperKogito.github.io.git

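If you specify any arguments for a white list (or any kind of expected list), make sure that you provide a comma-separated list without any spaces:

$ urlchecker check --exclude-files=README.md,_config.yml .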
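A short sketch of why the list must not contain spaces: an unquoted space makes the shell split the value into separate arguments. shlex.split from the standard library approximates POSIX shell word splitting; split_csv_arg is a hypothetical helper for the comma splitting.

```python
import shlex


def split_csv_arg(value):
    """Split a comma separated value into a list, dropping empty items."""
    return [item for item in value.split(",") if item]


good = shlex.split("urlchecker check --exclude-files=README.md,_config.yml .")
bad = shlex.split("urlchecker check --exclude-files=README.md, _config.yml .")

# Without spaces, the whole list stays inside one argument
good_files = split_csv_arg(good[2].split("=", 1)[1])
# With a space, the list is cut short and "_config.yml" becomes its own argument
bad_files = split_csv_arg(bad[2].split("=", 1)[1])

print(good_files)
print(bad_files, bad[3])
```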

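Save Results

If you want to save your results to a file, perhaps for some kind of record or other data analysis, you can provide the --save argument: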

$ urlchecker check --save results.csv .

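The saved file contains a comma-separated table of URLs and their results. The result options are "passed" and "failed", and the default header is URL,RESULT. If you want to change these defaults (for example, to use a tab separator or a different header), you can do so by calling the function from within Python. Here is an example of the default file, which should satisfy most use cases: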

URL,RESULT
https://github.com/SuperKogito,passed
https://www.google.com/,passed
https://github.com/SuperKogito/Voice-based-gender-recognition/issues,passed
https://github.com/SuperKogito/Voice-based-gender-recognition,passed
https://github.com/SuperKogito/spafe/issues/4,passed
https://github.com/SuperKogito/Voice-based-gender-recognition/issues/2,passed
https://github.com/SuperKogito/spafe/issues/5,passed
https://github.com/SuperKogito/URLs-checker/blob/master/README.md,passed
https://img.shields.io/,passed
https://github.com/SuperKogito/spafe/,passed
https://github.com/SuperKogito/spafe/issues/3,passed
https://www.google.com/,passed
https://github.com/SuperKogito,passed
https://github.com/SuperKogito/spafe/issues/8,passed
https://github.com/SuperKogito/spafe/issues/7,passed
https://github.com/SuperKogito/Voice-based-gender-recognition/issues/1,passed
https://github.com/SuperKogito/spafe/issues,passed
https://github.com/SuperKogito/URLs-checker/issues,passed
https://github.com/SuperKogito/spafe/issues/2,passed
https://github.com/SuperKogito/URLs-checker,passed
https://github.com/SuperKogito/spafe/issues/6,passed
https://github.com/SuperKogito/spafe/issues/1,passed
https://github.com/SuperKogito/URLs-checker/README.md,failed
https://github.com/SuperKogito/URLs-checker/issues/3,failed
https://none.html,failed
https://github.com/SuperKogito/URLs-checker/issues/2,failed
https://github.com/SuperKogito/URLs-checker/README.md,failed
https://github.com/SuperKogito/URLs-checker/issues/1,failed
https://github.com/SuperKogito/URLs-checker/issues/4,failed
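A saved file in this shape can be read back with the standard csv module. The sketch below uses a three-row inline sample instead of a real results.csv so that it runs on its own; summarize is a hypothetical helper.

```python
import csv
import io
from collections import Counter

# Inline stand-in for a saved results file with the default header
sample = """URL,RESULT
https://github.com/SuperKogito,passed
https://www.google.com/,passed
https://none.html,failed
"""


def summarize(handle):
    """Count results by status and collect the unique failed URLs."""
    rows = list(csv.DictReader(handle))
    counts = Counter(row["RESULT"] for row in rows)
    failed = sorted({row["URL"] for row in rows if row["RESULT"] == "failed"})
    return counts, failed


counts, failed = summarize(io.StringIO(sample))
print(dict(counts))
print(failed)
```

For a real file, open it with open("results.csv", newline="") and pass the handle to summarize.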

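Usage from Python

Checking a Path

If you want to check URLs outside of the provided client, this is fairly easy to do! Let's say we have a path, our current working directory, and want to check the .py and .md files (the default):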

from urlchecker.core.check import UrlChecker
import os

path = os.getcwd()
checker = UrlChecker(path)    
# UrlChecker:/home/vanessa/Desktop/Code/urlstechie/urlchecker-python

You can, of course, provide more arguments to derive the original file list:

checker = UrlChecker(
    path=path,
    file_types=[".md", ".py", ".rst"],
    include_patterns=[],
    exclude_files=["README.md", "LICENSE"],
    print_all=True,
)

Then run the checker like this:

checker.run()

or with more detail, where exclude_urls and exclude_patterns are lists that you define:

checker.run(
    exclude_urls=exclude_urls,
    exclude_patterns=exclude_patterns,
    retry_count=3,
    timeout=5,
)

The results object is returned and is also available at checker.results. It is a simple dictionary with "passed" and "failed" keys that lists the passes and fails across all files.

{'passed': ['https://github.com/SuperKogito/spafe/issues/4',
  'http://shachi.org/resources',
  'https://superkogito.github.io/blog/SpectralLeakageWindowing.html',
  'https://superkogito.github.io/figures/fig4.html',
  'https://github.com/urlstechie/urlchecker-test-repo',
  'https://www.google.com/',
  ...
  'https://github.com/SuperKogito',
  'https://img.shields.io/',
  'https://www.google.com/',
  'https://docs.pythonlang.cn/2'],
 'failed': ['https://github.com/urlstechie/urlschecker-python/tree/master',
  'https://github.com/SuperKogito/Voice-based-gender-recognition,passed',
  'https://github.com/SuperKogito/URLs-checker/README.md',
   ...
  'https://superkogito.github.io/tables',
  'https://github.com/SuperKogito/URLs-checker/issues/2',
  'https://github.com/SuperKogito/URLs-checker/README.md',
  'https://github.com/SuperKogito/URLs-checker/issues/4',
  'https://github.com/SuperKogito/URLs-checker/issues/3',
  'https://github.com/SuperKogito/URLs-checker/issues/1',
  'https://none.html']}
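Because the results are a plain dictionary, they are easy to post-process. The sketch below uses a small hand-written dictionary in the same shape rather than a real run; summarize_results is a hypothetical helper that removes duplicate URLs before counting.

```python
def summarize_results(results):
    """Return unique pass/fail counts and the sorted list of failed URLs."""
    passed = sorted(set(results.get("passed", [])))
    failed = sorted(set(results.get("failed", [])))
    return {"passed": len(passed), "failed": len(failed), "failed_urls": failed}


# Hand-written stand-in for checker.results
results = {
    "passed": ["https://www.google.com/", "https://www.google.com/"],
    "failed": ["https://none.html"],
}

summary = summarize_results(results)
print(summary)
```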

You can also look at checker.checks, which is a dictionary of result objects organized by file name:

for file_name, result in checker.checks.items(): 
    print() 
    print(result) 
    print("Total Results: %s " % result.count) 
    print("Total Failed: %s" % len(result.failed)) 
    print("Total Passed: %s" % len(result.passed)) 

...

UrlCheck:/home/vanessa/Desktop/Code/urlstechie/urlchecker-python/tests/test_files/sample_test_file.md
Total Results: 26 
Total Failed: 6
Total Passed: 20

UrlCheck:/home/vanessa/Desktop/Code/urlstechie/urlchecker-python/.pytest_cache/README.md
Total Results: 1 
Total Failed: 0
Total Passed: 1

UrlCheck:/home/vanessa/Desktop/Code/urlstechie/urlchecker-python/.eggs/pytest_runner-5.2-py3.7.egg/ptr.py
Total Results: 0 
Total Failed: 0
Total Passed: 0

UrlCheck:/home/vanessa/Desktop/Code/urlstechie/urlchecker-python/docs/source/conf.py
Total Results: 3 
Total Failed: 0
Total Passed: 3

For any result object, you can print the list of passed, failed, excluded (white-listed), or all URLs.

result.all                                                                                                                                                                       
['https://sphinx-doc.cn/en/master/usage/configuration.html',
 'https://docs.pythonlang.cn/3',
 'https://docs.pythonlang.cn/2']

result.failed                                                                                                                                                                    
[]

result.exclude
[]

result.passed                                                                                                                                                                    
['https://sphinx-doc.cn/en/master/usage/configuration.html',
 'https://docs.pythonlang.cn/3',
 'https://docs.pythonlang.cn/2']

result.count
3
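The per-file result objects can be reduced to a report of only the files that contain failures. This sketch uses a stand-in class, FakeResult, with just the passed and failed attributes shown above so that it runs without urlchecker installed; files_with_failures is a hypothetical helper, and with a real checker the checker.checks dictionary should be usable in place of the sample.

```python
from dataclasses import dataclass, field


@dataclass
class FakeResult:
    """Stand-in for a per-file result object (illustration only)."""
    passed: list = field(default_factory=list)
    failed: list = field(default_factory=list)


def files_with_failures(checks):
    """Map each file name to its failed URLs, skipping clean files."""
    return {name: list(res.failed) for name, res in checks.items() if res.failed}


checks = {
    "docs/source/conf.py": FakeResult(passed=["https://www.google.com/"]),
    "tests/sample.md": FakeResult(failed=["https://none.html"]),
}

report = files_with_failures(checks)
print(report)
```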

Checking a List of URLs

If you start with a list of URLs you want to check, you can do that too!

from urlchecker.core.urlproc import UrlCheckResult

urls = ['https://www.github.com', "https://github.com", "https://banana-pudding-doesnt-exist.com"]

# Instantiate an empty checker to extract urls
checker = UrlCheckResult()
File name None is undefined or does not exist, skipping extraction.

If you provide a file name, the URLs will be extracted for you.

checker = UrlCheckResult(
    file_name=file_name,
    exclude_patterns=exclude_patterns,
    exclude_urls=exclude_urls,
    print_all=True,
)

or you can provide all the arguments without the file name:

checker = UrlCheckResult(
    exclude_patterns=exclude_patterns,
    exclude_urls=exclude_urls,
    print_all=True,
)

If you don't provide a file name to extract URLs from, you can pass the URLs you defined directly to the check_urls function:

checker.check_urls(urls)

https://www.github.com
https://github.com
HTTPSConnectionPool(host='banana-pudding-doesnt-exist.com', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f989abdfa10>: Failed to establish a new connection: [Errno -2] Name or service not known'))
https://banana-pudding-doesnt-exist.com

You can, of course, specify a timeout and retry count:

checker.check_urls(urls, retry_count=retry_count, timeout=timeout)
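The command line help describes the default retry count of 2 as "one retry", which suggests the value counts total attempts. The sketch below illustrates that reading with a hypothetical fetch callable in place of a real HTTP request; check_with_retries is not part of urlchecker, and a real fetch would also be the place to apply the timeout.

```python
def check_with_retries(url, fetch, retry_count=2):
    """Call fetch(url) up to retry_count times; True on the first success."""
    for _ in range(retry_count):
        try:
            if fetch(url):
                return True
        except OSError:
            # Treat connection-style errors as a failed attempt and try again
            pass
    return False


calls = []


def flaky_fetch(url):
    """Fail on the first call and succeed on the second (test double)."""
    calls.append(url)
    if len(calls) < 2:
        raise OSError("temporary failure")
    return True


ok = check_with_retries("https://example.com", flaky_fetch, retry_count=2)
print(ok, len(calls))
```

With retry_count=1 the same flaky_fetch would be tried once and the URL would be reported as failed.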

After you run the checker, you can get all the URLs, the passed set, and the failed set:

checker.failed                                                                                                                                                                   
['https://banana-pudding-doesnt-exist.com']

checker.passed                                                                                                                                                                   
['https://www.github.com', 'https://github.com']

checker.all                                                                                                                                                                      
['https://www.github.com',
 'https://github.com',
 'https://banana-pudding-doesnt-exist.com']

checker.count                                                                                                                                                                    
3

If you have any questions, please don't hesitate to open an issue.

Docker

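A Docker container is provided if you want to build a base container with urlchecker, meaning that you don't need to install it on your host. You can build the container as follows: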

docker build -t urlchecker .

The entrypoint then exposes urlchecker.

docker run -it urlchecker

Development

Organization

The module is organized as follows:

├── client              # command line client
├── main                # functions for supported integrations (e.g., GitHub)
├── core                # core file and url processing tools
└── version.py          # package and versioning

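For example, in the "client" folder, each command exposed to the client (e.g., check) is named accordingly, e.g., client/check.py. GitHub functionality is provided in main/github.py. This organization should be fairly intuitive, so you can always find what you are looking for.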

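Drivers

To test more difficult URLs, we use a web driver, and there are two driver options to choose from.

Both can be used with selenium. The driver is optional, but is provided by default with our action. To install it, download the driver and make sure that selenium is installed: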

$ pip install urlchecker[selenium]

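and then do one of the following:

  1. Add it directly to your PATH
  2. Export the directory where the driver lives as URLCHECKER_DRIVERS_PATH
  3. Place it in the root of the urlchecker clone (it will be looked for there)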
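The three locations above can be sketched as a simple lookup. This is an illustration only: the search order, the helper name find_driver, and the file name example-driver are assumptions, and urlchecker's own lookup may work differently.

```python
import os
import shutil
import tempfile


def find_driver(name, environ=None, clone_root="."):
    """Look for a driver on PATH, then URLCHECKER_DRIVERS_PATH, then clone_root."""
    environ = os.environ if environ is None else environ
    # 1. Anywhere on PATH
    found = shutil.which(name, path=environ.get("PATH", ""))
    if found:
        return found
    # 2. The exported drivers directory, then 3. the root of the clone
    for directory in (environ.get("URLCHECKER_DRIVERS_PATH"), clone_root):
        if directory:
            candidate = os.path.join(directory, name)
            if os.path.isfile(candidate):
                return candidate
    return None


with tempfile.TemporaryDirectory() as drivers, tempfile.TemporaryDirectory() as root:
    # Create an empty placeholder file that stands in for a downloaded driver
    open(os.path.join(drivers, "example-driver"), "w").close()
    env = {"PATH": "", "URLCHECKER_DRIVERS_PATH": drivers}
    found = find_driver("example-driver", env, clone_root=root)
    missing = find_driver("no-such-driver", env, clone_root=root)

print(found, missing)
```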

Support

If you need help, or want to suggest a project for the organization, please open an issue.
