datapatch
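A Python library for defining rule-based overrides on messy data. For example, imagine trying to import a dataset in which every row is associated with a country, and the country names were typed in by humans. You might find country names like Northkorea or Greet Britain that you want to normalise. datapatch provides a mechanism for building a flexible lookup table (usually stored as a YAML file) that catches and repairs such data issues.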
Installation
You can install datapatch from the Python Package Index:
pip install datapatch
Example
Given a YAML file like this:
countries:
  normalize: true
  lowercase: true
  asciify: true
  options:
    - match: Frankreich
      value: France
    - match:
        - Northkorea
        - Nordkorea
        - Northern Korea
        - NKorea
        - DPRK
      value: North Korea
    - contains: Britain
      value: Great Britain
The file can then be used to apply data patches to raw input:
from datapatch import read_lookups, LookupException
lookups = read_lookups("countries.yml")
countries = lookups.get("countries")
# This will apply the patch or default to the original string if none exists.
# iter_data() stands in for whatever yields your rows as dicts.
for row in iter_data():
    raw = row.get("Country")
    row["Country"] = countries.get_value(raw, default=raw)
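To make the `match` and `contains` rules concrete, here is a minimal standalone sketch of how such a lookup might behave. It does not use datapatch and is not its implementation: `OPTIONS`, `normalize_text` and this `get_value` are hypothetical names, and normalisation is simplified to lowercasing and collapsing whitespace.

```python
import re

# Hypothetical, simplified stand-in for the `countries` lookup above.
OPTIONS = [
    {"match": ["Frankreich"], "value": "France"},
    {
        "match": ["Northkorea", "Nordkorea", "Northern Korea", "NKorea", "DPRK"],
        "value": "North Korea",
    },
    {"contains": ["Britain"], "value": "Great Britain"},
]


def normalize_text(text):
    """Lowercase and collapse runs of whitespace (an assumed simplification)."""
    return re.sub(r"\s+", " ", text).strip().lower()


def get_value(raw, default=None):
    """Return the value of the first option that applies, else `default`."""
    if raw is None:
        return default
    norm = normalize_text(raw)
    for option in OPTIONS:
        # `match` requires the whole normalised string to be equal.
        if any(norm == normalize_text(m) for m in option.get("match", [])):
            return option["value"]
        # `contains` only requires a normalised substring.
        if any(normalize_text(c) in norm for c in option.get("contains", [])):
            return option["value"]
    return default


for raw in ["  NORTHKOREA ", "Greet Britain", "Germany"]:
    print(repr(raw), "->", get_value(raw, default=raw))
```

In this sketch an exact `match` is tried before `contains` within each option, and options are checked in file order; whether datapatch resolves overlaps the same way is not stated in this README.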
Extended options
A number of options are available to configure how data patches are applied:
countries:
  # If you mark a lookup as required, a value that matches no options will
  # throw a `datapatch.exc:LookupException`.
  required: true
  # Normalisation will remove many special characters and collapse multiple spaces
  normalize: false
  # By default, normalize performs transliteration across alphabets (Путин -> Putin).
  # Set asciify to false if you want to keep non-ASCII alphabets as is.
  asciify: false
  options:
    - match: Francois
      value: France
  # This is a shorthand for defining options that have just one `match` and
  # one `value` defined:
  map:
    Luxemborg: Luxembourg
    Lux: Luxembourg
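As an illustration of the `required` flag and the `map` shorthand, the following standalone sketch expands a mapping into one-match, one-value options and raises when a required lookup finds nothing. It is an assumption-laden model, not datapatch code: `expand_map`, `lookup` and `NoMatchError` are hypothetical names (the latter standing in for `LookupException`), and matching is exact because `normalize` is off.

```python
class NoMatchError(Exception):
    """Hypothetical stand-in for datapatch's LookupException."""


def expand_map(mapping):
    """Turn a `map` shorthand into a list of one-match, one-value options."""
    return [{"match": key, "value": value} for key, value in mapping.items()]


def lookup(options, raw, required=False, default=None):
    """Exact-match lookup; raise if `required` and nothing matches."""
    for option in options:
        if option["match"] == raw:
            return option["value"]
    if required:
        raise NoMatchError(f"no option matches {raw!r}")
    return default


# Options written out in full, plus the entries from the `map` shorthand.
options = [{"match": "Francois", "value": "France"}]
options += expand_map({"Luxemborg": "Luxembourg", "Lux": "Luxembourg"})

print(lookup(options, "Lux"))
try:
    lookup(options, "Atlantis", required=True)
except NoMatchError as exc:
    print("error:", exc)
```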
Result objects
You can also attach further details to a result and access them:
countries:
  options:
    - match: Frankreich
      # These can be arbitrary attributes:
      label: France
      code: FR
The match is then available as a result object that exposes these details as attributes:
from datapatch import read_lookups, LookupException
lookups = read_lookups("countries.yml")
countries = lookups.get("countries")
result = countries.match("Frankreich")
print(result.label, result.code)
assert result.capital is None, result.capital
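The final assertion above shows that an attribute not defined in the YAML reads as `None` rather than raising. One way such an object could be modelled in plain Python is sketched below; the `Result` class here is hypothetical and is not datapatch's actual class.

```python
class Result:
    """Hypothetical result object: undefined attributes read as None."""

    def __init__(self, **attributes):
        self._attributes = dict(attributes)

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails. Private names
        # still raise, so copying and pickling behave as usual.
        if name.startswith("_"):
            raise AttributeError(name)
        return self._attributes.get(name)


result = Result(label="France", code="FR")
print(result.label, result.code)
assert result.capital is None, result.capital
```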
License
datapatch is licensed under the terms of the MIT license, which is included as LICENSE.