YData open-source tools for data quality.
Project description
ydata-quality
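ydata_quality is an open-source Python library for assessing data quality throughout the multiple stages of data pipeline development.
A holistic view of the data can only be obtained by looking at it from multiple dimensions, and ydata_quality evaluates these dimensions in a modular way, wrapped into a single data quality engine. This repository contains the core Python source scripts and walkthrough tutorials.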
Quickstart
The source code is currently hosted on GitHub at: https://github.com/ydataai/ydata-quality
Binary installers for the latest released version are available at the Python Package Index (PyPI):
pip install ydata-quality
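To confirm that the install succeeded, you can query the installed distribution's version with the standard library. This is a minimal sketch: the helper name `installed_version` is ours for illustration and is not part of ydata_quality.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # "ydata-quality" is the distribution name used in the pip command above.
    version = installed_version("ydata-quality")
    if version is None:
        print("ydata-quality is not installed; run: pip install ydata-quality")
    else:
        print(f"ydata-quality {version} is installed")
```

Note that the distribution name uses a hyphen (`ydata-quality`) while the importable package uses an underscore (`ydata_quality`).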
Comprehensive quality check in a few lines of code
from ydata_quality import DataQuality
import pandas as pd
# load the data
df = pd.read_csv('./datasets/transformed/census_10k.csv')
# create a DataQuality object from the main class that holds all quality modules
dq = DataQuality(df=df)
# run the tests and output a summary of the quality tests
results = dq.evaluate()
Warnings:
TOTAL: 5 warning(s)
Priority 1: 1 warning(s)
Priority 2: 4 warning(s)
Priority 1 - heavy impact expected:
* [DUPLICATES - DUPLICATE COLUMNS] Found 1 columns with exactly the same feature values as other columns.
Priority 2 - usage allowed, limited human intelligibility:
* [DATA RELATIONS - HIGH COLLINEARITY - NUMERICAL] Found 3 numerical variables with high Variance Inflation Factor (VIF>5.0). The variables listed in results are highly collinear with other variables in the dataset. These will make model explainability harder and potentially give way to issues like overfitting. Depending on your end goal you might want to remove the highest VIF variables.
* [ERRONEOUS DATA - PREDEFINED ERRONEOUS DATA] Found 1960 ED values in the dataset.
* [DATA RELATIONS - HIGH COLLINEARITY - CATEGORICAL] Found 10 categorical variables with significant collinearity (p-value < 0.05). The variables listed in results are highly collinear with other variables in the dataset and sorted descending according to propensity. These will make model explainability harder and potentially give way to issues like overfitting. Depending on your end goal you might want to remove variables following the provided order.
* [DUPLICATES - EXACT DUPLICATES] Found 3 instances with exact duplicate feature values.
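To make the two DUPLICATES warnings above concrete, here is a small standard-library sketch of what "exact duplicate rows" and "duplicate columns" mean. It is independent of ydata_quality and does not reproduce the library's implementation; the function names and the list-of-dicts table layout are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Hashable, List, Sequence

Row = Dict[str, Hashable]


def exact_duplicate_rows(rows: Sequence[Row]) -> List[Row]:
    """Return each row that appears more than once, listed a single time."""
    # Sort the items so that key order within a row does not matter.
    counts = Counter(tuple(sorted(row.items())) for row in rows)
    return [dict(key) for key, n in counts.items() if n > 1]


def duplicate_columns(rows: Sequence[Row]) -> Dict[str, List[str]]:
    """Map the first column of each group to later columns with identical values."""
    first_seen: Dict[tuple, str] = {}
    duplicates: Dict[str, List[str]] = {}
    columns = list(rows[0]) if rows else []
    for col in columns:
        values = tuple(row[col] for row in rows)
        if values in first_seen:
            duplicates.setdefault(first_seen[values], []).append(col)
        else:
            first_seen[values] = col
    return duplicates


if __name__ == "__main__":
    # Toy table invented for this example; it is not the census dataset.
    table = [
        {"age": 39, "education": "Bachelors", "age_copy": 39},
        {"age": 50, "education": "Masters", "age_copy": 50},
        {"age": 39, "education": "Bachelors", "age_copy": 39},
    ]
    print(exact_duplicate_rows(table))
    print(duplicate_columns(table))
```

A duplicated column carries no extra information, which is consistent with that warning being reported at Priority 1 in the summary.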
In addition to the summary, you can retrieve the list of detected warnings for detailed inspection.
# retrieve a list of data quality warnings
warnings = dq.get_warnings()
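Once you have a list of warnings, a common next step is to group or filter them by priority. The sketch below does not touch the library's warning objects: `WarningRecord` is a hypothetical stand-in that assumes each warning carries a category, a test name, and a priority, as the printed summary suggests. The sample records mirror the five warnings in the summary above.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class WarningRecord:
    """Hypothetical stand-in for a quality warning; not the library's own class."""
    category: str
    test: str
    priority: int


# Records transcribed from the summary printed by dq.evaluate() above.
SAMPLE_WARNINGS = [
    WarningRecord("Duplicates", "Duplicate Columns", 1),
    WarningRecord("Data Relations", "High Collinearity - Numerical", 2),
    WarningRecord("Erroneous Data", "Predefined Erroneous Data", 2),
    WarningRecord("Data Relations", "High Collinearity - Categorical", 2),
    WarningRecord("Duplicates", "Exact Duplicates", 2),
]


def count_by_priority(warnings: Iterable[WarningRecord]) -> Dict[int, int]:
    """Count warnings per priority, ordered from most to least severe."""
    counts = Counter(w.priority for w in warnings)
    return dict(sorted(counts.items()))


def with_priority(warnings: Iterable[WarningRecord], priority: int) -> List[WarningRecord]:
    """Keep only the warnings that have the given priority."""
    return [w for w in warnings if w.priority == priority]


if __name__ == "__main__":
    for priority, n in count_by_priority(SAMPLE_WARNINGS).items():
        print(f"Priority {priority}: {n} warning(s)")
    for w in with_priority(SAMPLE_WARNINGS, 1):
        print(f"[{w.category.upper()} - {w.test.upper()}]")
```

Handling Priority 1 items first follows the summary's own ordering, where Priority 1 is described as having a heavy expected impact.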
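Examples
Here you can find walkthrough tutorials and examples to get familiar with the different modules of ydata_quality.
To dive deeper into a specific module and understand how it works, see the tutorial notebooks in the repository.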
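Contributing
We are open to collaboration! If you want to start contributing, you only need to:
- Search for an issue you would like to work on. Issues for newcomers are labeled "good first issue".
- Create a PR that solves the issue.
- We review every PR and either accept it or ask for revisions.
You can also join the discussion in the #data-quality channel on our Slack, and request features or bug fixes by opening issues in the repository.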
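Support
For support in using this library, please join the #help Slack channel. The Slack community is very friendly and quick to answer questions about the use and development of the library.
License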
Project details
Download files
Source Distribution
ydata-quality-0.1.0.tar.gz (55.2 kB)
Built Distribution
ydata_quality-0.1.0-py2.py3-none-any.whl (64.8 kB)
Hashes for ydata-quality-0.1.0.tar.gz
Algorithm | Hash digest
---|---
SHA256 | e32bd5074a490d27606b42004d4d7cfbaaeeac21a403d9e40955ac8fb92c12c8
MD5 | 13b4c26a62ab96a2623bb8b9338231a3
BLAKE2b-256 | 552b71637a81f184ca8cc609c3259aeb729c110c4b693ca71df5cdaa8c2335c7
Hashes for ydata_quality-0.1.0-py2.py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | a213c8503de0b257b9bf8df5a68a0fdff94a7a7a966c8d39a52a450513c10c2b
MD5 | ea00953b0ac0c17aa45e0443b03d218c
BLAKE2b-256 | 7963913e43c202a1ee2f65384b40ff6bec981c8af5af21726fdadd764c69c8da
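If you download one of these files manually, you can compare its SHA256 digest against the value listed above. The following is a minimal sketch using only the standard library; the helper names and the local file path are assumptions for illustration.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_digest(path: str, expected_hex: str) -> bool:
    """Return True when the file's SHA256 digest equals the expected hex string."""
    expected = expected_hex.strip().lower()
    return hmac.compare_digest(sha256_of_file(path), expected)


if __name__ == "__main__":
    # Assumed download location; adjust to wherever the file was saved.
    sdist = Path("ydata-quality-0.1.0.tar.gz")
    # SHA256 digest of the source distribution, copied from the table above.
    expected = "e32bd5074a490d27606b42004d4d7cfbaaeeac21a403d9e40955ac8fb92c12c8"
    if sdist.exists():
        print("digest OK" if matches_digest(str(sdist), expected) else "digest MISMATCH")
    else:
        print(f"{sdist} not found; download it first")
```

Reading in chunks keeps memory use constant regardless of file size, which matters little for a 55 kB archive but makes the helper reusable for larger downloads.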