A tool to collect and validate URLs in static files (code and documentation)
Project description
urlchecker-python
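This is a Python module that collects URLs from static files (code and documentation), then tests them and reports broken links. If you want to use it as a GitHub Action, see urlchecker-action. A container-based version is also available at quay.io/urlstechie/urlchecker. As of version 0.0.26 we use multiprocessing so that checks run faster, and you can set URLCHECKER_WORKERS to change the number of workers (defaults to 9). If you don't want multiprocessing, use version 0.0.25 or earlier.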
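As a minimal sketch of how the worker count could be set when launching the client from a script: the helper name env_with_workers is hypothetical and not part of urlchecker; only the URLCHECKER_WORKERS variable comes from the description above.

```python
import os


def env_with_workers(workers, base_env=None):
    """Return a copy of the environment with URLCHECKER_WORKERS set."""
    if workers < 1:
        raise ValueError("workers must be a positive integer")
    env = dict(os.environ if base_env is None else base_env)
    env["URLCHECKER_WORKERS"] = str(workers)  # environment values are strings
    return env


env = env_with_workers(4, base_env={"PATH": "/usr/bin"})
print(env["URLCHECKER_WORKERS"])
# To apply it to a real run (requires urlchecker to be installed):
# subprocess.run(["urlchecker", "check", "."], env=env_with_workers(4))
```

Copying the environment instead of mutating os.environ keeps the setting local to the one child process.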
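Module Documentation
Detailed documentation of the code is available at urlchecker-python.readthedocs.io
Usage
Install
You can install urlchecker from pypi. Before installing, it is recommended that you install fake-useragent from the following link: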
pip install git+https://github.com/danger89/fake-useragent.git
Then install urlchecker
$ pip install urlchecker
or install directly from the repository
$ git clone https://github.com/urlstechie/urlchecker-python.git
$ cd urlchecker-python
$ python setup.py install
Installation places the urlchecker executable in your Python environment's bin directory.
$ which urlchecker
/home/vanessa/anaconda3/bin/urlchecker
Check Local Folder
Your most likely use case is checking a local directory of static files (documentation or code). In this case, you can use urlchecker check.
$ urlchecker check --help
usage: urlchecker check [-h] [-b BRANCH] [--subfolder SUBFOLDER] [--cleanup] [--serial] [--no-check-certs]
[--force-pass] [--no-print] [--verbose] [--file-types FILE_TYPES] [--files FILES]
[--exclude-urls EXCLUDE_URLS] [--exclude-patterns EXCLUDE_PATTERNS]
[--exclude-files EXCLUDE_FILES] [--save SAVE] [--retry-count RETRY_COUNT] [--timeout TIMEOUT]
path
positional arguments:
path the local path or GitHub repository to clone and check
options:
-h, --help show this help message and exit
-b BRANCH, --branch BRANCH
if cloning, specify a branch to use (defaults to main)
--subfolder SUBFOLDER
relative subfolder path within path (if not specified, we use root)
--cleanup remove root folder after checking (defaults to False, no cleaup)
--serial run checks in serial (no multiprocess)
--no-check-certs Allow urls to validate that fail certificate checks
--force-pass force successful pass (return code 0) regardless of result
--no-print Skip printing results to the screen (defaults to printing to console).
--verbose Print file names for failed urls in addition to the urls.
--file-types FILE_TYPES
comma separated list of file extensions to check (defaults to .md,.py)
--files FILES comma separated list of exact files or patterns to check.
--exclude-urls EXCLUDE_URLS
comma separated links to exclude (no spaces)
--exclude-patterns EXCLUDE_PATTERNS
comma separated list of patterns to exclude (no spaces)
--exclude-files EXCLUDE_FILES
comma separated list of files and patterns to exclude (no spaces)
--save SAVE Path to a csv file to save results to.
--retry-count RETRY_COUNT
retry count upon failure (defaults to 2, one retry).
--timeout TIMEOUT timeout (seconds) to provide to the requests library (defaults to 5)
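When driving the client from a script, the options in the help text above can be assembled as an argument list. The following is only a sketch: build_check_command is a hypothetical helper, not something urlchecker provides, and it covers just a handful of the flags listed above.

```python
def build_check_command(path, file_types=None, exclude_patterns=None,
                        retry_count=None, timeout=None, save=None):
    """Assemble an argument list for `urlchecker check` from keyword options."""
    cmd = ["urlchecker", "check"]
    if file_types:
        # Comma separated, no spaces, as the help text requires.
        cmd += ["--file-types", ",".join(file_types)]
    if exclude_patterns:
        cmd += ["--exclude-patterns", ",".join(exclude_patterns)]
    if retry_count is not None:
        cmd += ["--retry-count", str(retry_count)]
    if timeout is not None:
        cmd += ["--timeout", str(timeout)]
    if save:
        cmd += ["--save", save]
    cmd.append(path)  # the positional path goes last
    return cmd


cmd = build_check_command(".", file_types=[".md", ".rst"],
                          exclude_patterns=["SuperKogito"], timeout=10)
print(" ".join(cmd))
# With urlchecker installed, the list can be handed to subprocess.run(cmd).
```

Passing a list to subprocess.run avoids the shell entirely, so glob characters in patterns are never expanded by accident.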
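You have a lot of flexibility to define patterns of URLs or files to skip, along with the retry count and the timeout (in seconds). The most basic usage checks an entire directory. Let's clone and check the urlchecker action: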
$ git clone https://github.com/urlstechie/urlchecker-action.git
$ cd urlchecker-action
and run the simplest command to check the present working directory (.).
$ urlchecker check .
original path: .
final path: /tmp/urlchecker-action
subfolder: None
branch: master
cleanup: False
file types: ['.md', '.py']
files: []
print all: True
urls excluded: []
url patterns excluded: []
file patterns excluded: []
force pass: False
retry count: 2
save: None
timeout: 5
/tmp/urlchecker-action/README.md
--------------------------------
https://github.com/urlstechie/urlchecker-action/blob/master/LICENSE
https://github.com/r-hub/docs/blob/bc1eac71206f7cb96ca00148dcf3b46c6d25ada4/.github/workflows/pr.yml
https://img.shields.io/static/v1?label=Marketplace&message=urlchecker-action&color=blue?style=flat&logo=github
https://github.com/rseng/awesome-rseng
https://github.com/rseng/awesome-rseng/blob/5f5cb78f8392cf10aec2f3952b305ae9611029c2/.github/workflows/urlchecker.yml
https://github.com/HPC-buildtest/buildtest-framework/actions?query=workflow%3A%22Check+URLs%22
https://www.codefactor.io/repository/github/urlstechie/urlchecker-action/badge
https://github.com/berlin-hack-and-tell/berlinhackandtell.rocks
https://github.com/urlstechie/urlchecker-action/issues
https://github.com/USRSE/usrse.github.io
https://github.com/berlin-hack-and-tell/berlinhackandtell.rocks/actions?query=workflow%3ACommands
https://github.com/USRSE/usrse.github.io/blob/abcbed5f5703e0d46edb9e8850eea8bb623e3c1c/.github/workflows/urlchecker.yml
https://github.com/urlstechie/urlchecker-action/releases
https://img.shields.io/badge/license-MIT-brightgreen
https://github.com/r-hub/docs/actions?query=workflow%3ACommands
https://github.com/rseng/awesome-rseng/actions?query=workflow%3AURLChecker
https://github.com/buildtesters/buildtest
https://github.com/r-hub/docs
https://www.codefactor.io/repository/github/urlstechie/urlchecker-action
https://github.com/urlstechie/URLs-checker-test-repo
https://github.com/marketplace/actions/urlchecker-action
https://github.com/actions/checkout
https://github.com/SuperKogito/URLs-checker/issues/1,https://github.com/SuperKogito/URLs-checker/issues/2
https://github.com/SuperKogito/URLs-checker/issues/1,https://github.com/SuperKogito/URLs-checker/issues/2
https://github.com/USRSE/usrse.github.io/actions?query=workflow%3A%22Check+URLs%22
https://github.com/SuperKogito/Voice-based-gender-recognition/issues
https://github.com/buildtesters/buildtest/blob/v0.9.1/.github/workflows/urlchecker.yml
https://github.com/berlin-hack-and-tell/berlinhackandtell.rocks/blob/master/.github/workflows/urlchecker-pr-label.yml
/tmp/urlchecker-action/examples/README.md
-----------------------------------------
https://github.com/urlstechie/urlchecker-action/releases
https://github.com/urlstechie/urlchecker-action/issues
https://help.github.com/en/actions/reference/events-that-trigger-workflows
Done. The following urls did not pass:
https://github.com/SuperKogito/URLs-checker/issues/1,https://github.com/SuperKogito/URLs-checker/issues/2
The URL that didn't pass is an example parameter for the library! Let's add a simple pattern to exclude it.
$ urlchecker check --exclude-patterns SuperKogito .
original path: .
final path: /tmp/urlchecker-action
subfolder: None
branch: master
cleanup: False
file types: ['.md', '.py']
files: []
print all: True
urls excluded: []
url patterns excluded: ['SuperKogito']
file patterns excluded: []
force pass: False
retry count: 2
save: None
timeout: 5
/tmp/urlchecker-action/README.md
--------------------------------
https://github.com/urlstechie/urlchecker-action/blob/master/LICENSE
https://github.com/urlstechie/urlchecker-action/issues
https://github.com/rseng/awesome-rseng/actions?query=workflow%3AURLChecker
https://github.com/USRSE/usrse.github.io/actions?query=workflow%3A%22Check+URLs%22
https://github.com/actions/checkout
https://github.com/USRSE/usrse.github.io/blob/abcbed5f5703e0d46edb9e8850eea8bb623e3c1c/.github/workflows/urlchecker.yml
https://github.com/r-hub/docs/blob/bc1eac71206f7cb96ca00148dcf3b46c6d25ada4/.github/workflows/pr.yml
https://github.com/berlin-hack-and-tell/berlinhackandtell.rocks/blob/master/.github/workflows/urlchecker-pr-label.yml
https://github.com/rseng/awesome-rseng
https://www.codefactor.io/repository/github/urlstechie/urlchecker-action/badge
https://github.com/urlstechie/URLs-checker-test-repo
https://www.codefactor.io/repository/github/urlstechie/urlchecker-action
https://github.com/r-hub/docs
https://github.com/berlin-hack-and-tell/berlinhackandtell.rocks
https://github.com/buildtesters/buildtest
https://img.shields.io/badge/license-MIT-brightgreen
https://github.com/urlstechie/urlchecker-action/releases
https://github.com/marketplace/actions/urlchecker-action
https://img.shields.io/static/v1?label=Marketplace&message=urlchecker-action&color=blue?style=flat&logo=github
https://github.com/r-hub/docs/actions?query=workflow%3ACommands
https://github.com/HPC-buildtest/buildtest-framework/actions?query=workflow%3A%22Check+URLs%22
https://github.com/buildtesters/buildtest/blob/v0.9.1/.github/workflows/urlchecker.yml
https://github.com/berlin-hack-and-tell/berlinhackandtell.rocks/actions?query=workflow%3ACommands
https://github.com/USRSE/usrse.github.io
https://github.com/rseng/awesome-rseng/blob/5f5cb78f8392cf10aec2f3952b305ae9611029c2/.github/workflows/urlchecker.yml
/tmp/urlchecker-action/examples/README.md
-----------------------------------------
https://help.github.com/en/actions/reference/events-that-trigger-workflows
https://github.com/urlstechie/urlchecker-action/issues
https://github.com/urlstechie/urlchecker-action/releases
Done. All URLS passed.
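The effect of an exclude pattern can be sketched in a few lines. This assumes plain substring matching, which is consistent with the second run above (no URL containing SuperKogito is listed any more), but the library's actual matching rules are not documented in this section; split_excluded is a hypothetical helper.

```python
def split_excluded(urls, exclude_urls=(), exclude_patterns=()):
    """Partition urls into (kept, excluded) using exact and substring rules."""
    kept, excluded = [], []
    for url in urls:
        exact = url in exclude_urls
        # Assumption: a pattern excludes a URL when it occurs anywhere in it.
        by_pattern = any(pattern in url for pattern in exclude_patterns)
        (excluded if exact or by_pattern else kept).append(url)
    return kept, excluded


urls = [
    "https://github.com/urlstechie/urlchecker-action/issues",
    "https://github.com/SuperKogito/URLs-checker/issues/1",
    "https://github.com/SuperKogito/Voice-based-gender-recognition/issues",
]
kept, excluded = split_excluded(urls, exclude_patterns=["SuperKogito"])
print(kept)
```

Note that a broad pattern also drops URLs that would have passed, so a narrower pattern is usually the safer choice.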
We can also filter by file type. If we want to do this (for example, to check only certain kinds of files), we can do any of the following:
# Check only html files
urlchecker check --file-types *.html .
# Check hidden files
urlchecker check --file-types ".*" .
# Check hidden files and html files
urlchecker check --file-types ".*,*.html" .
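Note that while some patterns work without quotes, quoting is recommended in most cases: if the shell expands any part of the pattern, it will not work as expected. By default, urlchecker checks Python and Markdown files. If the multiprocessing workers hit an error, you can also add --serial to run the checks in serial and test. This makes the run slower, but helps with debugging.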
$ urlchecker check . --files "content/docs/hacking/contributing/documentation/index.md" --serial
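To make the file type patterns shown earlier more concrete, here is a rough sketch of glob-style matching with the standard fnmatch module. It assumes that a bare extension such as .md behaves like the glob *.md; how urlchecker itself matches file types is not shown here, and matches_file_types is a hypothetical name.

```python
from fnmatch import fnmatch


def matches_file_types(file_name, file_types):
    """Return True if file_name matches any extension or glob in file_types."""
    for file_type in file_types.split(","):
        # Treat a bare extension such as ".md" as the glob "*.md".
        pattern = file_type if "*" in file_type else "*" + file_type
        if fnmatch(file_name, pattern):
            return True
    return False


print(matches_file_types("index.html", ".*,*.html"))
print(matches_file_types(".gitignore", ".*,*.html"))
print(matches_file_types("setup.py", ".*,*.html"))
```

If the shell expanded an unquoted *.html before the program started, the value received would be a list of file names rather than the pattern, which is why quoting is the safer habit.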
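Check GitHub Repository
But wouldn't it be nicer not to have to clone the repository first? Of course! We can specify a GitHub URL instead, and add --cleanup if we want the folder removed afterwards.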
$ urlchecker check https://github.com/SuperKogito/SuperKogito.github.io.git
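If you specify any arguments for an exclude list (or any other kind of expected list), make sure to provide a comma-separated list without spaces: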
$ urlchecker check --exclude-files=README.md,_config.yml .
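The reason for the no-spaces rule is easy to see by splitting a command line the way a POSIX shell would. This small illustration uses only the standard shlex module and does not call urlchecker:

```python
import shlex

good = shlex.split("urlchecker check --exclude-files=README.md,_config.yml .")
bad = shlex.split("urlchecker check --exclude-files=README.md, _config.yml .")

print(good)
print(bad)
```

In the second case the space turns _config.yml into a separate argument, so it is no longer part of the --exclude-files value.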
Save Results
If you want to save your results to a file, perhaps as a record or for later analysis, you can provide the --save argument:
$ urlchecker check --save results.csv .
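The saved file is a table of comma-separated values listing each URL and its result. The result options are "passed" and "failed", and the default header is URL,RESULT. If you want to change these defaults (for example, to use a tab separator or a different header), you can do so when calling the function from Python. Here is an example of the default file, which should cover most use cases: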
URL,RESULT
https://github.com/SuperKogito,passed
https://www.google.com/,passed
https://github.com/SuperKogito/Voice-based-gender-recognition/issues,passed
https://github.com/SuperKogito/Voice-based-gender-recognition,passed
https://github.com/SuperKogito/spafe/issues/4,passed
https://github.com/SuperKogito/Voice-based-gender-recognition/issues/2,passed
https://github.com/SuperKogito/spafe/issues/5,passed
https://github.com/SuperKogito/URLs-checker/blob/master/README.md,passed
https://img.shields.io/,passed
https://github.com/SuperKogito/spafe/,passed
https://github.com/SuperKogito/spafe/issues/3,passed
https://www.google.com/,passed
https://github.com/SuperKogito,passed
https://github.com/SuperKogito/spafe/issues/8,passed
https://github.com/SuperKogito/spafe/issues/7,passed
https://github.com/SuperKogito/Voice-based-gender-recognition/issues/1,passed
https://github.com/SuperKogito/spafe/issues,passed
https://github.com/SuperKogito/URLs-checker/issues,passed
https://github.com/SuperKogito/spafe/issues/2,passed
https://github.com/SuperKogito/URLs-checker,passed
https://github.com/SuperKogito/spafe/issues/6,passed
https://github.com/SuperKogito/spafe/issues/1,passed
https://github.com/SuperKogito/URLs-checker/README.md,failed
https://github.com/SuperKogito/URLs-checker/issues/3,failed
https://none.html,failed
https://github.com/SuperKogito/URLs-checker/issues/2,failed
https://github.com/SuperKogito/URLs-checker/README.md,failed
https://github.com/SuperKogito/URLs-checker/issues/1,failed
https://github.com/SuperKogito/URLs-checker/issues/4,failed
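A saved file in this default layout can be summarized with the standard csv module. The sketch below embeds a few rows from the example above in a string so that it runs on its own; with a real file you would open results.csv instead.

```python
import csv
import io
from collections import Counter

# A few rows copied from the example file above; a real run would use
# open("results.csv", newline="") instead of io.StringIO.
sample = io.StringIO(
    "URL,RESULT\n"
    "https://github.com/SuperKogito,passed\n"
    "https://www.google.com/,passed\n"
    "https://none.html,failed\n"
    "https://github.com/SuperKogito/URLs-checker/issues/1,failed\n"
)

rows = list(csv.DictReader(sample))  # the header row supplies the keys
counts = Counter(row["RESULT"] for row in rows)
failed = [row["URL"] for row in rows if row["RESULT"] == "failed"]

print(counts["passed"], counts["failed"])
print(failed)
```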
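Usage from Python
Checking a Path
If you want to check URLs outside of the provided client, this is fairly easy to do! Let's say we have a path, our present working directory, and we want to check .py and .md files (the default):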
from urlchecker.core.check import UrlChecker
import os
path = os.getcwd()
checker = UrlChecker(path)
# UrlChecker:/home/vanessa/Desktop/Code/urlstechie/urlchecker-python
Of course, you can also provide more arguments to derive the original file list:
checker = UrlChecker(
    path=path,
    file_types=[".md", ".py", ".rst"],
    include_patterns=[],
    exclude_files=["README.md", "LICENSE"],
    print_all=True,
)
You can then run the checker like this:
checker.run()
Or with more detail, for example to exclude URLs:
checker.run(
    exclude_urls=exclude_urls,
    exclude_patterns=exclude_patterns,
    retry_count=3,
    timeout=5,
)
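You will get back the results object, which is also available at checker.results: a simple dictionary with "passed" and "failed" keys that lists passes and failures across all files.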
{'passed': ['https://github.com/SuperKogito/spafe/issues/4',
'http://shachi.org/resources',
'https://superkogito.github.io/blog/SpectralLeakageWindowing.html',
'https://superkogito.github.io/figures/fig4.html',
'https://github.com/urlstechie/urlchecker-test-repo',
'https://www.google.com/',
...
'https://github.com/SuperKogito',
'https://img.shields.io/',
'https://www.google.com/',
'https://docs.pythonlang.cn/2'],
'failed': ['https://github.com/urlstechie/urlschecker-python/tree/master',
'https://github.com/SuperKogito/Voice-based-gender-recognition,passed',
'https://github.com/SuperKogito/URLs-checker/README.md',
...
'https://superkogito.github.io/tables',
'https://github.com/SuperKogito/URLs-checker/issues/2',
'https://github.com/SuperKogito/URLs-checker/README.md',
'https://github.com/SuperKogito/URLs-checker/issues/4',
'https://github.com/SuperKogito/URLs-checker/issues/3',
'https://github.com/SuperKogito/URLs-checker/issues/1',
'https://none.html']}
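A dictionary of this shape is easy to post-process. As a small sketch, the snippet below uses a hand-written stand-in for checker.results (so it runs without urlchecker installed), de-duplicates the failures, and derives an exit code for a CI script. The exit code convention is an assumption for illustration.

```python
# Stand-in for checker.results with the same "passed"/"failed" keys.
results = {
    "passed": ["https://www.google.com/", "https://img.shields.io/"],
    "failed": [
        "https://github.com/SuperKogito/URLs-checker/README.md",
        "https://none.html",
        "https://github.com/SuperKogito/URLs-checker/README.md",
    ],
}

# The same URL can appear more than once, so de-duplicate while keeping order.
unique_failed = list(dict.fromkeys(results["failed"]))
exit_code = 1 if unique_failed else 0

for url in unique_failed:
    print("FAILED", url)
# In a script, finish with sys.exit(exit_code) to signal the failure to CI.
```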
You can also look at checker.checks, a dictionary of result objects organized by file name:
for file_name, result in checker.checks.items():
    print()
    print(result)
    print("Total Results: %s " % result.count)
    print("Total Failed: %s" % len(result.failed))
    print("Total Passed: %s" % len(result.passed))
...
UrlCheck:/home/vanessa/Desktop/Code/urlstechie/urlchecker-python/tests/test_files/sample_test_file.md
Total Results: 26
Total Failed: 6
Total Passed: 20
UrlCheck:/home/vanessa/Desktop/Code/urlstechie/urlchecker-python/.pytest_cache/README.md
Total Results: 1
Total Failed: 0
Total Passed: 1
UrlCheck:/home/vanessa/Desktop/Code/urlstechie/urlchecker-python/.eggs/pytest_runner-5.2-py3.7.egg/ptr.py
Total Results: 0
Total Failed: 0
Total Passed: 0
UrlCheck:/home/vanessa/Desktop/Code/urlstechie/urlchecker-python/docs/source/conf.py
Total Results: 3
Total Failed: 0
Total Passed: 3
For any result object, you can print the list of passed, failed, excluded, or all URLs.
result.all
['https://sphinx-doc.cn/en/master/usage/configuration.html',
'https://docs.pythonlang.cn/3',
'https://docs.pythonlang.cn/2']
result.failed
[]
result.exclude
[]
result.passed
['https://sphinx-doc.cn/en/master/usage/configuration.html',
'https://docs.pythonlang.cn/3',
'https://docs.pythonlang.cn/2']
result.count
3
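To pull out only the files that need attention, the per-file results can be filtered by their failed lists. The sketch below uses a hypothetical FileResult dataclass as a stand-in for the library's result objects, exposing only the passed and failed attributes shown above, so that it runs without urlchecker installed.

```python
from dataclasses import dataclass, field


@dataclass
class FileResult:
    """Hypothetical stand-in exposing only the attributes shown above."""
    passed: list = field(default_factory=list)
    failed: list = field(default_factory=list)


def files_with_failures(checks):
    """Map each file name to its failed URLs, skipping clean files."""
    return {name: res.failed for name, res in checks.items() if res.failed}


checks = {
    "docs/source/conf.py": FileResult(passed=["https://www.google.com/"]),
    "tests/test_files/sample_test_file.md": FileResult(
        passed=["https://img.shields.io/"], failed=["https://none.html"]
    ),
}
print(files_with_failures(checks))
```

With a real checker, the same function could be applied to checker.checks, assuming its values expose a failed list as in the examples above.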
Checking a List of URLs
If you start with a list of URLs you want to check, you can do that too!
from urlchecker.core.urlproc import UrlCheckResult
urls = ['https://www.github.com', "https://github.com", "https://banana-pudding-doesnt-exist.com"]
# Instantiate an empty checker to extract urls
checker = UrlCheckResult()
File name None is undefined or does not exist, skipping extraction.
If you provide a file name, the URLs are extracted for you.
checker = UrlCheckResult(
    file_name=file_name,
    exclude_patterns=exclude_patterns,
    exclude_urls=exclude_urls,
    print_all=True,
)
Or you can provide all of the arguments except the file name:
checker = UrlCheckResult(
    exclude_patterns=exclude_patterns,
    exclude_urls=exclude_urls,
    print_all=True,
)
If you don't provide a file name to extract URLs from, you can pass the URLs you defined directly to the check_urls function:
checker.check_urls(urls)
https://www.github.com
https://github.com
HTTPSConnectionPool(host='banana-pudding-doesnt-exist.com', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f989abdfa10>: Failed to establish a new connection: [Errno -2] Name or service not known'))
https://banana-pudding-doesnt-exist.com
Of course, you can specify a timeout and a retry count:
checker.check_urls(urls, retry_count=retry_count, timeout=timeout)
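The interplay of retry_count and timeout can be sketched with a generic retry loop. This is not the library's implementation: check_with_retries and flaky_fetch are hypothetical, the fetcher is faked so that no network is needed, and treating only status 200 as a pass is an assumption. Following the help text ("defaults to 2, one retry"), retry_count is read here as the total number of attempts.

```python
def check_with_retries(url, fetch, retry_count=2, timeout=5):
    """Call fetch(url, timeout) up to retry_count times; True on first success."""
    for _ in range(retry_count):
        try:
            if fetch(url, timeout) == 200:
                return True
        except OSError:
            pass  # a connection error counts as a failed attempt
    return False


calls = []


def flaky_fetch(url, timeout):
    """Fake fetcher: fails on the first call, succeeds afterwards."""
    calls.append(url)
    if len(calls) == 1:
        raise OSError("temporary failure")
    return 200


ok = check_with_retries("https://example.com", flaky_fetch)
print(ok, len(calls))
```

With retry_count=1 the same flaky URL would be reported as failed, which is why a transient error is a good reason to keep at least one retry.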
After you run the checker, you can get all of the URLs, the passed set, and the failed set:
checker.failed
['https://banana-pudding-doesnt-exist.com']
checker.passed
['https://www.github.com', 'https://github.com']
checker.all
['https://www.github.com',
'https://github.com',
'https://banana-pudding-doesnt-exist.com']
checker.count
3
If you have any questions, please don't hesitate to open an issue.
Docker
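A Docker container is provided if you want to build a base container with urlchecker, which means you don't need to install it on your host. You can build the container as follows: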
docker build -t urlchecker .
The entrypoint then exposes urlchecker.
docker run -it urlchecker
Development
Organization
The module is organized as follows:
├── client # command line client
├── main # functions for supported integrations (e.g., GitHub)
├── core # core file and url processing tools
└── version.py # package and versioning
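For example, in the "client" folder, each command exposed by the client (e.g., check) is named accordingly, e.g., client/check.py. Functions for GitHub are provided in main/github.py. This organization should be fairly intuitive, so you can always find what you are looking for.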
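Drivers
To test more difficult URLs, we use a web driver. You can choose between the following:
- Chrome Driver
- Gecko Driver (Firefox)
Both can be used with selenium. The driver is optional, but it ships with our action by default. To install it yourself, download the driver and make sure selenium is installed: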
$ pip install urlchecker[selenium]
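and then do one of the following:
- add the driver directly to your path
- export the directory where the driver lives as URLCHECKER_DRIVERS_PATH
- place it in the root of the urlchecker clone (it will be looked for there)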
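The three options above amount to a lookup over a few locations. A minimal sketch of such a lookup follows; find_driver is a hypothetical helper, the search order is an assumption, and urlchecker's own discovery logic may differ. The demo creates an empty placeholder file in a temporary directory instead of a real driver.

```python
import os
import shutil
import tempfile


def find_driver(name, environ=None, clone_root="."):
    """Return the first existing driver path from the three places listed above."""
    environ = os.environ if environ is None else environ
    # 1. Anywhere on PATH (shutil.which also requires the file to be executable).
    on_path = shutil.which(name, path=environ.get("PATH", ""))
    if on_path:
        return on_path
    # 2. The directory exported as URLCHECKER_DRIVERS_PATH.
    # 3. The root of the urlchecker clone.
    for folder in (environ.get("URLCHECKER_DRIVERS_PATH"), clone_root):
        if folder:
            candidate = os.path.join(folder, name)
            if os.path.isfile(candidate):
                return candidate
    return None


# Demo with a throwaway directory standing in for URLCHECKER_DRIVERS_PATH.
with tempfile.TemporaryDirectory() as drivers_dir:
    open(os.path.join(drivers_dir, "geckodriver"), "w").close()
    env = {"PATH": "", "URLCHECKER_DRIVERS_PATH": drivers_dir}
    found = find_driver("geckodriver", environ=env,
                        clone_root=os.path.join(drivers_dir, "clone"))
    found_ok = found == os.path.join(drivers_dir, "geckodriver")
    print(found_ok)
```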
Support
If you need help, or want to suggest a project for the organization, please open an issue
Hashes for urlchecker-0.0.35.tar.gz
Algorithm | Hash digest
---|---
SHA256 | e303c4d240f3e00e21583bf9636e6792b4d7abd889de5b3114e78662cd2b7f45
MD5 | 7503f2358a0819d2011eab676eb451e5
BLAKE2b-256 | 9711b87b3e014d93bfb7dc288fe907fefdf401bf2a35494cdaef9438d2e701e5
Hashes for urlchecker-0.0.35-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | a6297b4627ead89a71dcc1c03a366fd9380640bf2a02d7135edcd48ff7a44290
MD5 | c92c09038db0d2908dd58549bcf92e09
BLAKE2b-256 | 087eef9950a258bcbb5d2770aef1bd681a471d12b871d55f1f6883c8ac41514d