Machine Learning, Statistics and Utilities around Developer Productivity, Company Productivity and Project Productivity

Project description

devml

Machine Learning, Statistics and Utilities around Developer Productivity

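A few handy bits of functionality

  • Check out all repositories in a GitHub organization

  • Convert a tree of checked-out repositories on disk into a pandas DataFrame

  • Compute statistics on the combined DataFrames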

Installation

pip install devml

This installs a command-line tool, dml, and the devml library. Both are referenced in the documentation below.

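Get environment setup

The code is written to support Python 3.6 or later. You can get it here: https://pythonlang.cn/downloads/release/python-360/

An easy way to run the project locally is to check out the repo and, in the root of the repo, run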

make setup

This creates a virtual environment in ~/.devml

Then source that virtual environment

source ~/.devml/bin/activate

Run make all (installs, lints and tests)

make all

#Example output
#(.devml) ➜  devml git:(master) make all
#pip install -r requirements.txt
#Requirement already satisfied: pytest in /Users/noahgift/.devml/lib/python3.6/site-packages (from -r requirements.txt (line #1)
---------- coverage: platform darwin, python 3.6.2-final-0 -----------
Name                       Stmts   Miss  Cover
----------------------------------------------
devml/__init__.py              1      0   100%
devml/author_stats.py          6      6     0%
devml/fetch_repo.py           54     42    22%
devml/mkdata.py               84     21    75%
devml/org_stats.py            76     55    28%
devml/post_processing.py      50     35    30%
devml/state.py                29      9    69%
devml/stats.py                55     43    22%
devml/ts.py                   29     14    52%
devml/util.py                 12      4    67%
dml.py                       111     66    41%
----------------------------------------------
TOTAL                        507    295    42%
....

If you don't use virtual environments or don't want to, no problem: just run make all. If you have Python 3.6 installed, it should probably work.

make all

Explore Jupyter Notebooks on GitHub Organizations

You can explore combined datasets using this example notebook as a starting point:

https://github.com/noahgift/devml/blob/master/notebooks/github_data_exploration.ipynb

Pallets Project

Explore Jupyter Notebooks on Repository Churn

You can explore the file metadata example here:

https://github.com/noahgift/devml/blob/master/notebooks/repo_file_exploration.ipynb

All Files Churned by type

Pallets Project Relative Churn by file type

Summary Churn Statistics by type

Pallets Project by file type Churn statistics

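Expected configuration

The command-line tool expects a project directory that contains a config.json file. In config.json you need to provide a GitHub personal access token. For how to create one, see: https://help.github.com/articles/creating-a-personal-access-token-for-the-command-line/

Alternatively, you can pass these values through the Python API or as command-line options. They stand for the following:

  • org: GitHub organization (used to clone the entire tree of repositories)

  • checkout_dir: location to check the repositories out to

  • oath: personal access token generated by GitHub (the config key is spelled oath)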

➜  devml git:(master) ✗ cat project/config.json
{
    "project" :
        {
            "org":"pallets",
            "checkout_dir": "/tmp/checkout",
            "oath": "<keygenerated from Github>"
        }

}
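
A minimal sketch of reading this file from Python, assuming the layout shown above. The helper name load_project_config is hypothetical and is not devml's own loader; it only illustrates how the three keys map to values.

```python
import json


def load_project_config(path):
    """Return (checkout_dir, oath, org) from a config.json laid out as above.

    Hypothetical helper for illustration; not part of devml.
    """
    with open(path) as handle:
        project = json.load(handle)["project"]
    # "oath" is the key name the config file uses for the GitHub token.
    return project["checkout_dir"], project["oath"], project["org"]
```

A missing key raises KeyError, which is probably what you want for a token the tool cannot run without.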

Basic command-line usage

You can get statistics for a single checkout, or for a directory full of checkouts, as follows

dml gstats author --path ~/src/mycompanyrepo(s)
Top Commits By Author:                     author_name  commits
0                     John Smith     3059
1                      Sally Joe     2995
2                   Greg Mathews     2194
3                 Jim Mayflower      1448

Basic API usage (converting a tree of repos into a pandas DataFrame)

In [1]: from devml import (mkdata, stats)

In [2]: org_df = mkdata.create_org_df(path="/src/mycompanyrepo(s)")
In [3]: author_counts = stats.author_commit_count(org_df)

In [4]: author_counts.head()
Out[4]:
      author_name  commits
0       John Smith     3059
1        Sally Joe     2995
2     Greg Mathews     2194
3    Jim Mayflower     1448
4   Truck Pritter      1441

Clone all repos in a GitHub organization using the API

In [1]: from devml import (mkdata, stats, state, fetch_repo)

In [2]: dest, token, org = state.get_project_metadata("../project/config.json")
In [3]: fetch_repo.clone_org_repos(token, org,
        dest, branch="master")
2017-10-14 17:11:36,590 - devml - INFO - Creating Checkout Root:  /tmp/checkout
2017-10-14 17:11:37,346 - devml - INFO - Found Repo # 1 REPO NAME: flask , URL: git@github.com:pallets/flask.git
2017-10-14 17:11:37,347 - devml - INFO - Found Repo # 2 REPO NAME: pallets-sphinx-themes , URL: git@github.com:pallets/pallets-sphinx-themes.git
2017-10-14 17:11:37,347 - devml - INFO - Found Repo # 3 REPO NAME: markupsafe , URL: git@github.com:pallets/markupsafe.git
2017-10-14 17:11:37,348 - devml - INFO - Found Repo # 4 REPO NAME: jinja , URL: git@github.com:pallets/jinja.git
2017-10-14 17:11:37,349 - devml - INFO - Found Repo # 5 REPO NAME: werkzeug , URL: git@githu
In [4]: !ls -l /tmp/checkout
total 0
drwxr-xr-x  21 noahgift  wheel  672 Oct 14 17:11 click
drwxr-xr-x  25 noahgift  wheel  800 Oct 14 17:11 flask
drwxr-xr-x  11 noahgift  wheel  352 Oct 14 17:11 flask-docs
drwxr-xr-x  12 noahgift  wheel  384 Oct 14 17:11 flask-ext-migrate
drwxr-xr-x   8 noahgift  wheel  256 Oct 14 17:11 flask-snippets
drwxr-xr-x  14 noahgift  wheel  448 Oct 14 17:11 flask-website
drwxr-xr-x  18 noahgift  wheel  576 Oct 14 17:11 itsdangerous
drwxr-xr-x  23 noahgift  wheel  736 Oct 14 17:11 jinja
drwxr-xr-x  18 noahgift  wheel  576 Oct 14 17:11 markupsafe
drwxr-xr-x   4 noahgift  wheel  128 Oct 14 17:11 meta
drwxr-xr-x  10 noahgift  wheel  320 Oct 14 17:11 pallets-sphinx-themes
drwxr-xr-x   9 noahgift  wheel  288 Oct 14 17:11 pocoo-sphinx-themes
drwxr-xr-x  15 noahgift  wheel  480 Oct 14 17:11 website
drwxr-xr-x  25 noahgift  wheel  800 Oct 14 17:11 werkzeug

Advanced CLI-Author: get activity statistics for a checkout or a tree of checkouts, and sort them

 ➜  devml git:(master) ✗ dml gstats activity --path /tmp/checkout --sort active_days

Top Unique Active Days:               author_name  active_days active_duration  active_ratio
86         Armin Ronacher          989       3817 days      0.260000
501  Markus Unterwaditzer          342       1820 days      0.190000
216            David Lord          129        712 days      0.180000
664           Ron DuPlain           78        854 days      0.090000
444         Kenneth Reitz           68       2566 days      0.030000
197      Daniel Neuhäuser           42       1457 days      0.030000
297          Georg Brandl           41       1337 days      0.030000
196     Daniel Neuhäuser           36        435 days      0.080000
450      Keyan Pishdadian           28        885 days      0.030000
169     Christopher Grebs           28       1515 days      0.020000
666    Ronny Pfannschmidt           27       3060 days      0.010000
712           Simon Sapin           22        793 days      0.030000
372           Jeff Widman           19        840 days      0.020000
427    Julen Ruiz Aizpuru           16         36 days      0.440000
21                 Adrian           16       1935 days      0.010000
569        Nicholas Wiles           14        197 days      0.070000
912                lord63           14        692 days      0.020000
756           ThiefMaster           12       1287 days      0.010000
763       Thomas Waldmann           11       1560 days      0.010000
628            Priit Laes           10       1567 days      0.010000
23        Adrian Moennich           10        521 days      0.020000
391  Jochen Kupperschmidt           10       3060 days      0.000000

Advanced CLI-Churn: get churn by file type

Get the top ten files with the extension .py, sorted by churn count

✗ dml gstats churn --path /Users/noahgift/src/flask --limit 10 --ext .py
2017-10-15 12:10:55,783 - devml.post_processing - INFO - Running churn cmd: [git log --name-only --pretty=format:] at path [/Users/noahgift/src/flask]
                       files  churn_count  line_count extension  \
1            b'flask/app.py'          316      2183.0       .py
3        b'flask/helpers.py'          176      1019.0       .py
5    b'tests/flask_tests.py'          127         NaN       .py
7                b'flask.py'          104         NaN       .py
8                b'setup.py'           80       112.0       .py
10           b'flask/cli.py'           75       759.0       .py
11      b'flask/wrappers.py'           70       194.0       .py
12      b'flask/__init__.py'           65        49.0       .py
13           b'flask/ctx.py'           62       415.0       .py
14  b'tests/test_helpers.py'           62       888.0       .py

    relative_churn
1             0.14
3             0.17
5              NaN
7              NaN
8             0.71
10            0.10
11            0.36
12            1.33
13            0.15
14            0.07
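
The log line above shows that the churn command runs git log --name-only --pretty=format:. A minimal sketch of turning that kind of output into per-file churn counts follows; churn_counts is a hypothetical helper, not devml's implementation, and the sample text is made up.

```python
from collections import Counter


def churn_counts(git_log_output):
    """Count how often each path appears in `git log --name-only` output."""
    paths = (line.strip() for line in git_log_output.splitlines())
    # Blank lines are skipped; every remaining line is a path
    # touched by one commit.
    return Counter(path for path in paths if path)


# Made-up sample: three commits touching four paths in total.
sample = "flask/app.py\nsetup.py\n\nflask/app.py\n\nflask/app.py\nflask/cli.py\n"
print(churn_counts(sample).most_common(1))
```

Counter.most_common(n) gives the same "top n files" view that --limit provides on the command line.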

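Get descriptive statistics for the .py extension and compare them with another repository

In this example, flask, this repo, and cpython are compared by median churn.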

(.devml) ➜  devml git:(master) dml gstats metachurn --path /Users/noahgift/src/flask --ext .py --statistic median
2017-10-15 12:39:44,781 - devml.post_processing - INFO - Running churn cmd: [git log --name-only --pretty=format:] at path [/Users/noahgift/src/flask]
MEDIAN Statistics:

           churn_count  line_count  relative_churn
extension
.py                  2        85.0            0.13
(.devml) ➜  devml git:(master) dml gstats metachurn --path /Users/noahgift/src/devml --ext .py --statistic median
2017-10-15 12:40:10,999 - devml.post_processing - INFO - Running churn cmd: [git log --name-only --pretty=format:] at path [/Users/noahgift/src/devml]
MEDIAN Statistics:

           churn_count  line_count  relative_churn
extension
.py                  1        62.5            0.02

(.devml) ➜  devml git:(master) dml gstats metachurn --path /Users/noahgift/src/cpython --ext .py --statistic median
2017-10-15 12:42:19,260 - devml.post_processing - INFO - Running churn cmd: [git log --name-only --pretty=format:] at path [/Users/noahgift/src/cpython]
MEDIAN Statistics:

           churn_count  line_count  relative_churn
extension
.py                  7       169.5             0.1
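
The per-extension median can be sketched with the standard library alone. This is an illustration of the grouping idea under simple assumptions, not devml's code; median_by_extension and the sample records are hypothetical.

```python
import os
from collections import defaultdict
from statistics import median


def median_by_extension(records):
    """Group (path, value) pairs by file extension and take each median."""
    groups = defaultdict(list)
    for path, value in records:
        extension = os.path.splitext(path)[1]
        groups[extension].append(value)
    return {ext: median(values) for ext, values in groups.items()}


# Hypothetical churn counts, not taken from any real repository.
records = [("a.py", 1), ("b.py", 2), ("c.py", 9), ("README.md", 4)]
print(median_by_extension(records))
```

The median is less sensitive than the mean to a few heavily modified files, which is one reason to prefer it when comparing repositories.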

Get relative churn for an author

dml gstats authorchurnmeta --author "Armin Ronacher" --path /tmp/checkout/flask --ext .py

#He has 6.5% median relative churn...very good.

count    193.000000
mean       0.331860
std        0.625431
min        0.001000
25%        0.030000
50%        0.065000
75%        0.250000
max        3.000000
Name: author_rel_churn, dtype: float64

Compare the CPython active ratio with the Linux active ratio

# Linux Development Active Ratio
dml gstats activity --path /Users/noahgift/src/linux --sort active_days

                       author_name  active_days active_duration  active_ratio
14541                 Takashi Iwai         1677       4590 days      0.370000
4382                  Eric Dumazet         1460       4504 days      0.320000
3641               David S. Miller         1428       4513 days      0.320000
7216                 Johannes Berg         1329       4328 days      0.310000
8717                Linus Torvalds         1281       4565 days      0.280000
275                        Al Viro         1249       4562 days      0.270000
9915         Mauro Carvalho Chehab         1227       4464 days      0.270000
9375                    Mark Brown         1198       4187 days      0.290000
3172                 Dan Carpenter         1158       3972 days      0.290000
12979                 Russell King         1141       4602 days      0.250000
1683                      Axel Lin         1040       2720 days      0.380000
400                   Alex Deucher         1036       3497 days      0.300000


# CPython Development Active Ratio

            author_name  active_days active_duration  active_ratio
146    Guido van Rossum         2256       9673 days      0.230000
301   Raymond Hettinger         1361       5635 days      0.240000
128          Fred Drake         1239       5335 days      0.230000
47    Benjamin Peterson         1234       3494 days      0.350000
132        Georg Brandl         1080       4091 days      0.260000
375      Victor Stinner          980       2818 days      0.350000
235     Martin v. Löwis          958       5266 days      0.180000
36       Antoine Pitrou          883       3376 days      0.260000
362          Tim Peters          869       5060 days      0.170000
164         Jack Jansen          800       4998 days      0.160000
24   Andrew M. Kuchling          743       4632 days      0.160000
330    Serhiy Storchaka          720       1759 days      0.410000
44         Barry Warsaw          696       8485 days      0.080000
52         Brett Cannon          681       5278 days      0.130000
262        Neal Norwitz          559       2573 days      0.220000

In this analysis, Guido van Rossum (CPython) has a 23% probability of working on a given day, and Linus Torvalds (Linux) has a 28% probability.
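
The active_ratio values in these tables are consistent with active_days divided by active_duration in days (for example, 2256 / 9673 ≈ 0.23). A minimal sketch under that assumption follows; activity_stats is a hypothetical helper, and treating the duration as the span between first and last commit day is an assumption, not confirmed from devml's source.

```python
from datetime import date


def activity_stats(commit_dates):
    """Return (active_days, duration_days, active_ratio) for one author."""
    if not commit_dates:
        return 0, 0, 0.0
    unique_days = set(commit_dates)
    active_days = len(unique_days)
    # Assumption: duration is the span between first and last commit day.
    duration = (max(unique_days) - min(unique_days)).days
    ratio = round(active_days / duration, 2) if duration else 0.0
    return active_days, duration, ratio


# Hypothetical commit days for one author (two commits on the same day).
days = [date(2017, 1, 1), date(2017, 1, 1), date(2017, 1, 5), date(2017, 1, 11)]
print(activity_stats(days))  # (3, 10, 0.3)
```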

Deletion statistics

Find all deleted files in a repository

dml gstats deleted --path /Users/noahgift/src/flask

DELETION STATISTICS

                                                 files          ext
0                        b'tests/test_deprecations.py'          .py
1                       b'scripts/flask-07-upgrade.py'          .py
2                             b'flask/ext/__init__.py'          .py
3                                  b'flask/exthook.py'          .py
4                        b'scripts/flaskext_compat.py'          .py
5                                 b'tests/test_ext.py'          .py

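FAQ

What is churn and why should I care?

Code churn is the number of times a file has been modified. Relative churn is the number of modifications relative to the file's lines of code. Research on software defects has shown that relative code churn is highly predictive of defects, i.e., the greater the relative churn, the higher the number of defects.

"Increase in relative code churn measures is accompanied by an increase in system defect density;"

You can read the entire study here: https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/icse05churn.pdf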
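
As a worked example of that definition, the relative_churn column in the flask table earlier matches churn_count divided by line_count, rounded to two decimals. The function below is a hypothetical sketch of that arithmetic, not devml's implementation.

```python
def relative_churn(churn_count, line_count):
    """Modifications per line of code, rounded to two decimals."""
    if not line_count:
        return None  # line count unavailable
    return round(churn_count / line_count, 2)


# Figures from the flask churn table earlier in this document.
print(relative_churn(316, 2183))  # flask/app.py -> 0.14
print(relative_churn(65, 49))     # flask/__init__.py -> 1.33
```

A value above 1, as for flask/__init__.py, means the file has been modified more times than it currently has lines.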
