Hidden alignment conditional random field, a discriminative string edit distance

Project description

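Hidden alignment conditional random field for classifying string pairs: a learnable edit distance.

This is part of the Dedupe.io cloud service and open source toolset for de-duplicating and finding fuzzy matches in your data: [https://dedupe.io](https://dedupe.io)

This package aims to implement the HACRF machine learning model with a sklearn-like interface. It includes ways to fit a model to training examples and to score new examples.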

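The model takes string pairs as input and classifies them into any number of classes. In McCallum's original paper the model was applied to the database deduplication problem. Each database entry was paired with every other entry, and the model, trained on examples of matches and mismatches, classified each pair as a "match" or a "mismatch".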
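The pairing step described above can be sketched with the standard library. This is only an illustration of "every entry paired with every other entry"; the `candidate_pairs` helper and the sample records are hypothetical and are not part of pyhacrf.

```python
from itertools import combinations


def candidate_pairs(records):
    """Pair every record with every other record exactly once."""
    return list(combinations(records, 2))


# Hypothetical database entries, used only for illustration.
records = ['Jon Smith', 'John Smith', 'Jane Doe']
pairs = candidate_pairs(records)

for pair in pairs:
    print(pair)
# n records give n * (n - 1) / 2 pairs, so 3 records give 3 pairs.
```

Because the number of pairs grows quadratically with the number of records, exhaustive pairing like this is only practical for small datasets.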

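I also tried to use it as a learnable string edit distance for normalizing noisy text. See "A Conditional Random Field for Discriminatively-trained Finite-state String Edit Distance" by McCallum, Bellare, and Pereira, and the report "Conditional Random Fields for Noisy text normalisation" by Dirko Coetsee.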
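For contrast, here is a minimal sketch of the classic fixed-cost (Levenshtein) edit distance. It is not the HACRF model and not code from this package: every insertion, deletion, and substitution costs exactly 1, whereas a learnable edit distance fits such weights from training pairs.

```python
def levenshtein(a, b):
    """Classic edit distance with unit cost for insert, delete and substitute."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


for noisy, clean in [('helloooo', 'hello'), ('h0me', 'home'), ('krazii', 'crazy')]:
    print(noisy, clean, levenshtein(noisy, clean))
# helloooo hello 3
# h0me home 1
# krazii crazy 3
```

With unit costs, 'helloooo' is as far from 'hello' as 'krazii' is from 'crazy'. A trained model can weight edits differently, for example by making repeated trailing letters cheap.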

Example

from pyhacrf import StringPairFeatureExtractor, Hacrf

training_X = [('helloooo', 'hello'), # Matching examples
              ('h0me', 'home'),
              ('krazii', 'crazy'),
              ('non matching string example', 'no really'), # Non-matching examples
              ('and another one', 'yep')]
training_y = ['match',
              'match',
              'match',
              'non-match',
              'non-match']

# Extract features
feature_extractor = StringPairFeatureExtractor(match=True, numeric=True)
training_X_extracted = feature_extractor.fit_transform(training_X)

# Train model
model = Hacrf(l2_regularization=1.0)
model.fit(training_X_extracted, training_y)

# Evaluate
from sklearn.metrics import confusion_matrix
predictions = model.predict(training_X_extracted)

print(confusion_matrix(training_y, predictions))
# Output:
# [[0 3]
#  [2 0]]

print(model.predict_proba(training_X_extracted))
# Output:
# [[ 0.94914812  0.05085188]
#  [ 0.92397711  0.07602289]
#  [ 0.86756034  0.13243966]
#  [ 0.05438812  0.94561188]
#  [ 0.02641275  0.97358725]]
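Class probabilities like those printed above can be turned into decisions that abstain when the model is unsure. The sketch below is plain Python and does not call pyhacrf. The `label_pairs` helper, the sample probabilities, and the column order given in `classes` are all assumptions made for illustration; check which column corresponds to which class in your own model before relying on it.

```python
def label_pairs(pairs, probabilities, classes, threshold=0.5):
    """Attach the most probable class to each pair, or 'uncertain' below threshold."""
    results = []
    for pair, row in zip(pairs, probabilities):
        best = max(range(len(row)), key=row.__getitem__)
        label = classes[best] if row[best] >= threshold else 'uncertain'
        results.append((pair, label, row[best]))
    return results


# Hypothetical inputs: the column order in `classes` is an assumption.
pairs = [('h0me', 'home'), ('krazii', 'crazy'), ('and another one', 'yep')]
probabilities = [[0.95, 0.05], [0.60, 0.40], [0.03, 0.97]]
classes = ['match', 'non-match']

results = label_pairs(pairs, probabilities, classes, threshold=0.9)
for pair, label, confidence in results:
    print(pair, label, confidence)
```

Pairs labelled 'uncertain' could be sent for manual review rather than being merged or discarded automatically.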

Dependencies

This package depends on numpy. The LBFGS optimizer in pylbfgs is used, but alternative optimizers can be passed.

Install

Install by running:

python setup.py install

or from PyPI:

pip install pyhacrf

Developing

Clone from the repository, then

pip install -r requirements.txt
cython pyhacrf/*.pyx
python setup.py install

To deploy the package to PyPI, make sure you have compiled the *.pyx files into *.c files.
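A small pre-release check for that requirement could look like the sketch below. The `missing_c_files` helper is hypothetical and not part of the package. It only looks for a `.c` file next to each `.pyx` file in the `pyhacrf` directory used by the commands above, and it does not verify that the `.c` file is up to date.

```python
from pathlib import Path


def missing_c_files(package_dir):
    """Return the names of .pyx files in package_dir that have no sibling .c file."""
    return sorted(
        pyx.name
        for pyx in Path(package_dir).glob('*.pyx')
        if not pyx.with_suffix('.c').exists()
    )


if __name__ == '__main__':
    missing = missing_c_files('pyhacrf')
    if missing:
        print('Not yet compiled:', ', '.join(missing))
    else:
        print('No .pyx file is missing its .c file.')
```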

Download files

Download the file for your platform.

Source distribution

pyhacrf_datamade-0.2.8.tar.gz (355.7 kB)

Built distributions

pyhacrf_datamade-0.2.8-pp310-pypy310_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (249.9 kB), PyPy, manylinux: glibc 2.17+ x86-64
pyhacrf_datamade-0.2.8-pp310-pypy310_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (243.1 kB), PyPy, manylinux: glibc 2.17+ ARM64
pyhacrf_datamade-0.2.8-pp310-pypy310_pp73-macosx_11_0_arm64.whl (179.0 kB), PyPy, macOS 11.0+ ARM64
pyhacrf_datamade-0.2.8-pp310-pypy310_pp73-macosx_10_15_x86_64.whl (193.4 kB), PyPy, macOS 10.15+ x86-64
pyhacrf_datamade-0.2.8-pp39-pypy39_pp73-win_amd64.whl (184.5 kB), PyPy, Windows x86-64
pyhacrf_datamade-0.2.8-pp39-pypy39_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (249.7 kB), PyPy, manylinux: glibc 2.17+ x86-64
pyhacrf_datamade-0.2.8-pp39-pypy39_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (242.9 kB), PyPy, manylinux: glibc 2.17+ ARM64
pyhacrf_datamade-0.2.8-pp39-pypy39_pp73-macosx_11_0_arm64.whl (178.7 kB), PyPy, macOS 11.0+ ARM64
pyhacrf_datamade-0.2.8-pp39-pypy39_pp73-macosx_10_15_x86_64.whl (193.1 kB), PyPy, macOS 10.15+ x86-64
pyhacrf_datamade-0.2.8-pp38-pypy38_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (252.0 kB), PyPy, manylinux: glibc 2.17+ x86-64
pyhacrf_datamade-0.2.8-pp37-pypy37_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (254.0 kB), PyPy, manylinux: glibc 2.17+ x86-64

pyhacrf_datamade-0.2.8-cp312-cp312-win_amd64.whl (192.7 kB), CPython 3.12, Windows x86-64
pyhacrf_datamade-0.2.8-cp312-cp312-win32.whl (162.9 kB), CPython 3.12, Windows x86
pyhacrf_datamade-0.2.8-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB), CPython 3.12, manylinux: glibc 2.17+ x86-64
pyhacrf_datamade-0.2.8-cp312-cp312-manylinux_2_17_i686.manylinux2014_i686.whl (1.2 MB), CPython 3.12, manylinux: glibc 2.17+ i686
pyhacrf_datamade-0.2.8-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (1.2 MB), CPython 3.12, manylinux: glibc 2.17+ ARM64
pyhacrf_datamade-0.2.8-cp312-cp312-macosx_11_0_arm64.whl (202.1 kB), CPython 3.12, macOS 11.0+ ARM64
pyhacrf_datamade-0.2.8-cp312-cp312-macosx_10_9_x86_64.whl (213.4 kB), CPython 3.12, macOS 10.9+ x86-64
pyhacrf_datamade-0.2.8-cp312-cp312-macosx_10_9_universal2.whl (404.2 kB), CPython 3.12, macOS 10.9+ universal2 (ARM64, x86-64)

pyhacrf_datamade-0.2.8-cp311-cp311-win_amd64.whl (190.5 kB), CPython 3.11, Windows x86-64
pyhacrf_datamade-0.2.8-cp311-cp311-win32.whl (160.4 kB), CPython 3.11, Windows x86
pyhacrf_datamade-0.2.8-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB), CPython 3.11, manylinux: glibc 2.17+ x86-64
pyhacrf_datamade-0.2.8-cp311-cp311-manylinux_2_17_i686.manylinux2014_i686.whl (1.2 MB), CPython 3.11, manylinux: glibc 2.17+ i686
pyhacrf_datamade-0.2.8-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (1.2 MB), CPython 3.11, manylinux: glibc 2.17+ ARM64
pyhacrf_datamade-0.2.8-cp311-cp311-macosx_11_0_arm64.whl (198.8 kB), CPython 3.11, macOS 11.0+ ARM64
pyhacrf_datamade-0.2.8-cp311-cp311-macosx_10_9_x86_64.whl (209.4 kB), CPython 3.11, macOS 10.9+ x86-64
pyhacrf_datamade-0.2.8-cp311-cp311-macosx_10_9_universal2.whl (396.8 kB), CPython 3.11, macOS 10.9+ universal2 (ARM64, x86-64)

pyhacrf_datamade-0.2.8-cp310-cp310-win_amd64.whl (190.6 kB), CPython 3.10, Windows x86-64
pyhacrf_datamade-0.2.8-cp310-cp310-win32.whl (161.9 kB), CPython 3.10, Windows x86
pyhacrf_datamade-0.2.8-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB), CPython 3.10, manylinux: glibc 2.17+ x86-64
pyhacrf_datamade-0.2.8-cp310-cp310-manylinux_2_17_i686.manylinux2014_i686.whl (1.1 MB), CPython 3.10, manylinux: glibc 2.17+ i686
pyhacrf_datamade-0.2.8-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (1.1 MB), CPython 3.10, manylinux: glibc 2.17+ ARM64
pyhacrf_datamade-0.2.8-cp310-cp310-macosx_11_0_arm64.whl (198.7 kB), CPython 3.10, macOS 11.0+ ARM64
pyhacrf_datamade-0.2.8-cp310-cp310-macosx_10_9_x86_64.whl (208.9 kB), CPython 3.10, macOS 10.9+ x86-64
pyhacrf_datamade-0.2.8-cp310-cp310-macosx_10_9_universal2.whl (396.3 kB), CPython 3.10, macOS 10.9+ universal2 (ARM64, x86-64)

pyhacrf_datamade-0.2.8-cp39-cp39-win_amd64.whl (191.8 kB), CPython 3.9, Windows x86-64
pyhacrf_datamade-0.2.8-cp39-cp39-win32.whl (163.2 kB), CPython 3.9, Windows x86
pyhacrf_datamade-0.2.8-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB), CPython 3.9, manylinux: glibc 2.17+ x86-64
pyhacrf_datamade-0.2.8-cp39-cp39-manylinux_2_17_i686.manylinux2014_i686.whl (1.1 MB), CPython 3.9, manylinux: glibc 2.17+ i686
pyhacrf_datamade-0.2.8-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (1.1 MB), CPython 3.9, manylinux: glibc 2.17+ ARM64
pyhacrf_datamade-0.2.8-cp39-cp39-macosx_11_0_arm64.whl (199.8 kB), CPython 3.9, macOS 11.0+ ARM64
pyhacrf_datamade-0.2.8-cp39-cp39-macosx_10_9_x86_64.whl (210.2 kB), CPython 3.9, macOS 10.9+ x86-64
pyhacrf_datamade-0.2.8-cp39-cp39-macosx_10_9_universal2.whl (398.6 kB), CPython 3.9, macOS 10.9+ universal2 (ARM64, x86-64)

pyhacrf_datamade-0.2.8-cp38-cp38-win_amd64.whl (193.9 kB), CPython 3.8, Windows x86-64
pyhacrf_datamade-0.2.8-cp38-cp38-win32.whl (165.9 kB), CPython 3.8, Windows x86
pyhacrf_datamade-0.2.8-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB), CPython 3.8, manylinux: glibc 2.17+ x86-64
pyhacrf_datamade-0.2.8-cp38-cp38-manylinux_2_17_i686.manylinux2014_i686.whl (1.1 MB), CPython 3.8, manylinux: glibc 2.17+ i686
pyhacrf_datamade-0.2.8-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (1.2 MB), CPython 3.8, manylinux: glibc 2.17+ ARM64
pyhacrf_datamade-0.2.8-cp38-cp38-macosx_11_0_arm64.whl (195.1 kB), CPython 3.8, macOS 11.0+ ARM64
pyhacrf_datamade-0.2.8-cp38-cp38-macosx_10_9_x86_64.whl (217.7 kB), CPython 3.8, macOS 10.9+ x86-64
pyhacrf_datamade-0.2.8-cp38-cp38-macosx_10_9_universal2.whl (401.1 kB), CPython 3.8, macOS 10.9+ universal2 (ARM64, x86-64)

pyhacrf_datamade-0.2.8-cp37-cp37m-win_amd64.whl (192.3 kB), CPython 3.7m, Windows x86-64
pyhacrf_datamade-0.2.8-cp37-cp37m-win32.whl (163.8 kB), CPython 3.7m, Windows x86
pyhacrf_datamade-0.2.8-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB), CPython 3.7m, manylinux: glibc 2.17+ x86-64
pyhacrf_datamade-0.2.8-cp37-cp37m-manylinux_2_17_i686.manylinux2014_i686.whl (1.0 MB), CPython 3.7m, manylinux: glibc 2.17+ i686
pyhacrf_datamade-0.2.8-cp37-cp37m-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (1.1 MB), CPython 3.7m, manylinux: glibc 2.17+ ARM64
pyhacrf_datamade-0.2.8-cp37-cp37m-macosx_10_9_x86_64.whl (215.8 kB), CPython 3.7m, macOS 10.9+ x86-64

pyhacrf_datamade-0.2.8-cp36-cp36m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB), CPython 3.6m, manylinux: glibc 2.17+ x86-64
pyhacrf_datamade-0.2.8-cp36-cp36m-manylinux_2_17_i686.manylinux2014_i686.whl (1.0 MB), CPython 3.6m, manylinux: glibc 2.17+ i686
pyhacrf_datamade-0.2.8-cp36-cp36m-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (1.1 MB), CPython 3.6m, manylinux: glibc 2.17+ ARM64
pyhacrf_datamade-0.2.8-cp36-cp36m-macosx_10_9_x86_64.whl (209.2 kB), CPython 3.6m, macOS 10.9+ x86-64
