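Package for extracting software repository metadata
Project description
Scraper
Scraper is a tool for scraping and visualizing open source data from various code hosting platforms, such as GitHub.com, GitHub Enterprise, GitLab.com, hosted GitLab, and Bitbucket Server.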
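Getting Started: Code.gov
Code.gov is a newly launched US Federal Government website that gives the public access to metadata about the government's custom-developed software. The site needs this metadata to function, and this Python library can help produce it!
To get started, you will need a GitHub Personal Auth Token to make requests to the GitHub API. Set it in your environment or shell rc file under the name GITHUB_API_TOKEN: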
$ export GITHUB_API_TOKEN=XYZ
$ echo "export GITHUB_API_TOKEN=XYZ" >> ~/.bashrc
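Before starting a long scrape, it can help to confirm that the token is actually visible to the Python process. The following is a minimal sketch and not part of Scraper itself: the helper name `read_github_token` is hypothetical, and only the variable name `GITHUB_API_TOKEN` comes from the instructions above.

```python
import os


def read_github_token(environ=os.environ):
    """Return the GitHub token from the environment, or raise if it is unset.

    Hypothetical helper for illustration only; it is not part of Scraper.
    """
    token = environ.get("GITHUB_API_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "GITHUB_API_TOKEN is not set; export it in your shell or rc file"
        )
    return token


if __name__ == "__main__":
    try:
        read_github_token()
        print("GITHUB_API_TOKEN is set")
    except RuntimeError as err:
        print(err)
```

Passing the environment as an argument keeps the check easy to test without touching the real shell environment.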
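Additionally, to perform the labor hours estimation, you will need to install cloc into your environment. This is typically done with a package manager such as npm or homebrew.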
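A quick way to check that cloc is reachable is to look it up on the PATH. This is a sketch under the assumption that the executable is named `cloc`; the helper `find_cloc` is hypothetical and is not how Scraper itself locates the tool.

```python
import shutil


def find_cloc(which=shutil.which):
    """Return the path to the `cloc` executable, or None if it is not on PATH.

    Hypothetical helper for illustration only; it is not part of Scraper.
    """
    return which("cloc")


if __name__ == "__main__":
    path = find_cloc()
    print(f"cloc found at {path}" if path else "cloc not found on PATH")
```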
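Then, to generate a code.json file for your agency, you will need a config.json file that lists the platforms you will connect to and scrape data from. An example config file can be found in demo.json. Once you have your config file, you are ready to install and run the scraper!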
# Install Scraper from a local copy of this repository
$ pip install -e .
# OR
# Install Scraper from PyPI
$ pip install llnl-scraper
# Run Scraper with your config file ``config.json``
$ scraper --config config.json
A full example of the resulting code.json file can be found here.
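Config File Options
The configuration file is a JSON file that specifies which repository platforms to pull projects from, along with settings that can override incomplete or inaccurate data returned by the scrape.
The basic structure is: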
{
// REQUIRED
"contact_email": "...", // Used when the contact email cannot be found otherwise
// OPTIONAL
"agency": "...", // Your agency abbreviation here
"organization": "...", // The organization within the agency
"permissions": { ... }, // Object containing default values for usageType and exemptionText
// Platform configurations, described in more detail below
"GitHub": [ ... ],
"GitLab": [ ... ],
"Bitbucket": [ ... ],
}
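The structure above can be checked before a run with a few lines of Python. This is only a sketch: `check_config` is a hypothetical helper, not part of Scraper, and it encodes just two things taken from the listing, namely that `contact_email` is required and that each platform section is a list. It works on an already-parsed dict, since Python's `json` module does not accept `//` comments like the annotations shown above.

```python
# Platform section names as they appear in this document's listings.
PLATFORM_KEYS = ("GitHub", "GitLab", "Bitbucket", "TFS")


def check_config(config):
    """Return a list of human-readable problems found in a parsed config dict.

    Hypothetical helper for illustration only; Scraper may apply different
    or additional checks.
    """
    problems = []

    # contact_email is the only key marked REQUIRED in the listing.
    email = config.get("contact_email")
    if not isinstance(email, str) or not email:
        problems.append("contact_email is required")

    # Each platform section, when present, is shown as a list of instances.
    for key in PLATFORM_KEYS:
        if key in config and not isinstance(config[key], list):
            problems.append(f"{key} must be a list of instance configurations")

    return problems


if __name__ == "__main__":
    print(check_config({"contact_email": "opensource@example.gov", "GitHub": []}))
```

An empty list means the sketch found nothing to complain about; it does not guarantee that Scraper will accept the file.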
"GitHub": [
{
"url": "https://github.com", // GitHub.com or GitHub Enterprise URL to inventory
"token": null, // Private token for accessing this GitHub instance
"public_only": true, // Only inventory public repositories
"connect_timeout": 4, // The timeout in seconds for connecting to the server
"read_timeout": 10, // The timeout in seconds to wait for a response from the server
"orgs": [ ... ], // List of organizations to inventory
"repos": [ ... ], // List of single repositories to inventory
"exclude": [ ... ] // List of organizations / repositories to exclude from inventory
}
],
"GitLab": [
{
"url": "https://gitlab.com", // GitLab.com or hosted GitLab instance URL to inventory
"token": null, // Private token for accessing this GitLab instance
"fetch_languages": false, // Include individual calls to API for language metadata. Very slow, so defaults to false. (eg, for 191 projects on internal server, 5 seconds for False, 12 minutes, 38 seconds for True)
"orgs": [ ... ], // List of organizations to inventory
"repos": [ ... ], // List of single repositories to inventory
"exclude": [ ... ] // List of groups / repositories to exclude from inventory
}
]
"Bitbucket": [
{
"url": "https://bitbucket.internal", // Base URL for a Bitbucket Server instance
"username": "", // Username to authenticate with
"password": "", // Password to authenticate with
"token": "", // Token to authenticate with, if supplied username and password are ignored
"exclude": [ ... ] // List of projects / repositories to exclude from inventory
}
]
"TFS": [
{
"url": "https://tfs.internal", // Base URL for a Team Foundation Server (TFS) or Visual Studio Team Services (VSTS) or Azure DevOps instance
"token": null, // Private token for accessing this TFS instance
"exclude": [ ... ] // List of projects / repositories to exclude from inventory
}
]
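Because the listings above are annotated with `//` comments, they cannot be pasted into a strict JSON file as-is. One way to produce a valid config.json is to build it in Python and serialize it. The sketch below is an assumption-laden example rather than the project's recommended workflow: `build_config` and `write_config` are hypothetical helpers, the email address and organization name are placeholders, and only a subset of the GitHub keys from the listing is included. Which keys Scraper allows you to omit is not stated here.

```python
import json


def build_config(contact_email, github_orgs):
    """Build a small config dict using keys from the GitHub listing above.

    Hypothetical helper for illustration only; it is not part of Scraper.
    """
    return {
        "contact_email": contact_email,
        "GitHub": [
            {
                "url": "https://github.com",
                "token": None,  # serialized as JSON null, as in the listing
                "public_only": True,
                "orgs": list(github_orgs),
            }
        ],
    }


def write_config(config, path="config.json"):
    """Write the config dict to disk as comment-free JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(config, handle, indent=2)
        handle.write("\n")


if __name__ == "__main__":
    # Placeholder values; replace with your own contact address and orgs.
    write_config(build_config("opensource@example.gov", ["example-org"]))
```

The resulting file can then be passed to the command shown earlier, `scraper --config config.json`.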
License
Scraper is released under the MIT license. For more details, see the LICENSE file.
LLNL-CODE-705597
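Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
llnl-scraper-0.14.0.tar.gz (27.8 kB)
Built Distribution
llnl_scraper-0.14.0-py3-none-any.whl (32.2 kB)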
Hashes for llnl-scraper-0.14.0.tar.gz
Algorithm | Hash digest
---|---
SHA256 | 881fbe04c0f0df3dfe6a887413bfd126921f2ec3344f5d9e797629be0aaab60d
MD5 | 99ea32f736954c72b620c2ad007bc3b8
BLAKE2b-256 | a1a9d32afd4ad6c1ca185856ab62c421e6920c8d9555c349765ea470060220ec
Hashes for llnl_scraper-0.14.0-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 015e080d24888ef2d48aa9f4602bed866e373adc59276b42bdb1b59ed6d9ad2a
MD5 | 86edefd15a1f1ddbb40a3518831c0dc8
BLAKE2b-256 | 1ff77928878103c1a03c4be1d850f98645de9413a97c927f10ddbb18cfbc2fb7