Package for extracting software repository metadata

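Scraper

Scraper is a tool for scraping and visualizing open source data from various code hosting platforms, such as GitHub.com, GitHub Enterprise, GitLab.com, hosted GitLab, and Bitbucket Server.

Getting Started: Code.gov

Code.gov is a newly launched website of the US Federal Government that gives the public access to metadata about the government's custom-developed software. The site needs that metadata to function, and this Python library can help produce it.

To get started, you will need a GitHub Personal Auth Token to make requests to the GitHub API. Set it in your environment or shell rc file under the name GITHUB_API_TOKEN: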

    $ export GITHUB_API_TOKEN=XYZ

    $ echo "export GITHUB_API_TOKEN=XYZ" >> ~/.bashrc
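
Before starting a long scrape, it can help to confirm that the token is actually visible to Python. The snippet below is a minimal sketch of such a check; the helper name `get_github_token` is hypothetical and is not part of Scraper.

```python
import os


def get_github_token(env=None):
    """Return the GitHub token from the environment, or raise a clear error.

    Hypothetical helper for illustration only; not part of Scraper.
    """
    # Allow a plain dict to be passed in so the check is easy to test.
    env = os.environ if env is None else env
    token = env.get("GITHUB_API_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "GITHUB_API_TOKEN is not set; export it in your shell or rc file."
        )
    return token
```

Calling `get_github_token()` with no arguments reads from the current process environment, so it reflects whatever the shell exported before Python was started.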

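Additionally, to perform the labor hours estimation, you will need to install cloc into your environment. This is typically done with a package manager such as npm or homebrew.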
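
One way to verify that cloc is installed is to look for it on the PATH. The sketch below uses only the Python standard library; `find_cloc` is a hypothetical helper name, and the check only confirms that an executable called `cloc` can be found, not that Scraper will use that particular copy.

```python
import shutil


def find_cloc(which=shutil.which):
    """Return the path to the cloc executable, or None if it is not on PATH.

    Hypothetical helper for illustration only; the `which` parameter exists
    so the lookup can be replaced in tests.
    """
    return which("cloc")


if __name__ == "__main__":
    path = find_cloc()
    if path is None:
        print("cloc was not found on PATH; install it before estimating labor hours.")
    else:
        print(f"cloc found at {path}")
```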

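Then, to generate a code.json file for your agency, you will need a config.json file that lists the platforms you will connect to and scrape data from. An example config file can be found in demo.json. Once you have your config file, you are ready to install and run the scraper!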

    # Install Scraper from a local copy of this repository
    $ pip install -e .
    # OR
    # Install Scraper from PyPI
    $ pip install llnl-scraper

    # Run Scraper with your config file ``config.json``
    $ scraper --config config.json

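A full example of the resulting code.json file can be found here.

Config File Options

The configuration file is a JSON file that specifies which repository platforms to pull projects from, along with settings that can be used to override incomplete or inaccurate data returned by the scrape.

The basic structure is: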

    {
        // REQUIRED
        "contact_email": "...",  // Used when the contact email cannot be found otherwise

        // OPTIONAL
        "agency": "...",         // Your agency abbreviation here
        "organization": "...",   // The organization within the agency
        "permissions": { ... },  // Object containing default values for usageType and exemptionText

        // Platform configurations, described in more detail below
        "GitHub": [ ... ],
        "GitLab": [ ... ],
        "Bitbucket": [ ... ],
        "TFS": [ ... ],
    }

Each platform key holds a list of instance configurations:

    "GitHub": [
        {
            "url": "https://github.com",  // GitHub.com or GitHub Enterprise URL to inventory
            "token": null,                // Private token for accessing this GitHub instance
            "public_only": true,          // Only inventory public repositories

            "connect_timeout": 4,  // The timeout in seconds for connecting to the server
            "read_timeout": 10,    // The timeout in seconds to wait for a response from the server

            "orgs": [ ... ],    // List of organizations to inventory
            "repos": [ ... ],   // List of single repositories to inventory
            "exclude": [ ... ]  // List of organizations / repositories to exclude from inventory
        }
    ],

    "GitLab": [
        {
            "url": "https://gitlab.com",  // GitLab.com or hosted GitLab instance URL to inventory
            "token": null,                // Private token for accessing this GitLab instance
            "fetch_languages": false,     // Make individual API calls for language metadata. Very slow, so defaults to false (e.g., for 191 projects on an internal server: 5 seconds with false, 12 minutes 38 seconds with true)

            "orgs": [ ... ],    // List of organizations to inventory
            "repos": [ ... ],   // List of single repositories to inventory
            "exclude": [ ... ]  // List of groups / repositories to exclude from inventory
        }
    ]

    "Bitbucket": [
        {
            "url": "https://bitbucket.internal",  // Base URL for a Bitbucket Server instance
            "username": "",                       // Username to authenticate with
            "password": "",                       // Password to authenticate with
            "token": "",                          // Token to authenticate with; if supplied, username and password are ignored

            "exclude": [ ... ]  // List of projects / repositories to exclude from inventory
        }
    ]

    "TFS": [
        {
            "url": "https://tfs.internal",  // Base URL for a Team Foundation Server (TFS), Visual Studio Team Services (VSTS), or Azure DevOps instance
            "token": null,                  // Private token for accessing this TFS instance

            "exclude": [ ... ]  // List of projects / repositories to exclude from inventory
        }
    ]
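
The structure above can also be assembled programmatically, which avoids hand-editing mistakes such as stray commas. The following is a minimal sketch, not part of Scraper: `build_config`, `dump_config`, and `write_config` are hypothetical helpers, and only the key names are taken from the listings above. The `//` comments in those listings are annotations; the sketch emits plain JSON without comments.

```python
import json


def build_config(contact_email, github_orgs=(), agency=None, organization=None):
    """Assemble a config dict that follows the documented structure.

    Hypothetical helper for illustration only; not part of Scraper.
    """
    if not contact_email:
        # contact_email is the only key the documentation marks as REQUIRED.
        raise ValueError("contact_email is required")

    config = {"contact_email": contact_email}
    if agency:
        config["agency"] = agency
    if organization:
        config["organization"] = organization
    if github_orgs:
        config["GitHub"] = [
            {
                "url": "https://github.com",
                "token": None,  # serialized as JSON null
                "public_only": True,
                "orgs": list(github_orgs),
            }
        ]
    return config


def dump_config(config):
    """Serialize the config as indented, comment-free JSON text."""
    return json.dumps(config, indent=4) + "\n"


def write_config(config, path="config.json"):
    """Write the serialized config to disk."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(dump_config(config))
```

For example, `write_config(build_config("opensource@example.gov", github_orgs=["example-org"]))` would produce a config.json that inventories the public repositories of a single GitHub organization; the email address and organization name here are placeholders.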

License

Scraper is released under the MIT license. For more details, see the LICENSE file.

LLNL-CODE-705597
