
An open-source Python library for drift detection in machine learning systems


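Frouros is a Python library for drift detection in machine learning systems. It provides a combination of classical and more recent algorithms for both concept drift and data drift detection.

"Everything changes and nothing stands still"

"You could not step twice into the same river"

Heraclitus of Ephesus (535-475 BCE)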


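⚡️ Quickstart

🔄 Concept drift

As a quick example, we can take the breast cancer dataset, induce concept drift in it, and show how to use a concept drift detector such as DDM (Drift Detection Method). The example also shows how concept drift affects accuracy.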

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

from frouros.detectors.concept_drift import DDM, DDMConfig
from frouros.metrics import PrequentialError

np.random.seed(seed=31)

# Load breast cancer dataset
X, y = load_breast_cancer(return_X_y=True)

# Split train (70%) and test (30%)
(
    X_train,
    X_test,
    y_train,
    y_test,
) = train_test_split(X, y, train_size=0.7, random_state=31)

# Define and fit model
pipeline = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("model", LogisticRegression()),
    ]
)
pipeline.fit(X=X_train, y=y_train)

# Detector configuration and instantiation
config = DDMConfig(
    warning_level=2.0,
    drift_level=3.0,
    min_num_instances=25,  # minimum number of instances before checking for concept drift
)
detector = DDM(config=config)

# Metric to compute the prequential error (accuracy is 1 - error)
metric = PrequentialError(alpha=1.0)  # alpha=1.0 (no fading) gives the plain error rate, i.e. 1 - accuracy

def stream_test(X_test, y_test, y, metric, detector):
    """Simulate a data stream over X_test and y_test (the true labels).

    The y argument is unused and kept only so existing calls still work.
    """
    drift_flag = False
    # Distinct loop names avoid shadowing the X and y defined at module level
    for i, (X_i, y_i) in enumerate(zip(X_test, y_test)):
        y_pred = pipeline.predict(X_i.reshape(1, -1))
        error = 1 - (y_pred.item() == y_i.item())
        metric_error = metric(error_value=error)
        _ = detector.update(value=error)
        status = detector.status
        if status["drift"] and not drift_flag:
            drift_flag = True
            print(f"Concept drift detected at step {i}. Accuracy: {1 - metric_error:.4f}")
    if not drift_flag:
        print("No concept drift detected")
    print(f"Final accuracy: {1 - metric_error:.4f}\n")

# Simulate data stream (assuming test label available after each prediction)
# No concept drift is expected to occur
stream_test(
    X_test=X_test,
    y_test=y_test,
    y=y,
    metric=metric,
    detector=detector,
)
# >> No concept drift detected
# >> Final accuracy: 0.9766

# IMPORTANT: Induce/simulate concept drift in the last part (20%)
# of y_test by modifying some labels (approx. 50%), thereby changing P(y|X)
drift_size = int(y_test.shape[0] * 0.2)
y_test_drift = y_test[-drift_size:]
modify_idx = np.random.rand(*y_test_drift.shape) <= 0.5
y_test_drift[modify_idx] = (y_test_drift[modify_idx] + 1) % len(np.unique(y_test))
y_test[-drift_size:] = y_test_drift

# Reset detector and metric
detector.reset()
metric.reset()

# Simulate data stream (assuming test label available after each prediction)
# Concept drift is expected to occur because of the label modification
stream_test(
    X_test=X_test,
    y_test=y_test,
    y=y,
    metric=metric,
    detector=detector,
)
# >> Concept drift detected at step 142. Accuracy: 0.9510
# >> Final accuracy: 0.8480

More concept drift examples can be found here.
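To make the roles of `warning_level`, `drift_level` and `min_num_instances` more concrete, the following is a minimal sketch of the DDM decision rule as described by Gama et al. (2004). It is not Frouros code: the `SimpleDDM` class and its attribute names are hypothetical and exist only for illustration. The idea is to track the running error rate `p` and its standard deviation `s`, remember the lowest `p + s` seen so far, and flag drift once the current `p + s` exceeds `p_min + drift_level * s_min`.

```python
import math


class SimpleDDM:
    """Illustrative sketch of the DDM decision rule (hypothetical, not Frouros code)."""

    def __init__(self, warning_level=2.0, drift_level=3.0, min_num_instances=25):
        self.warning_level = warning_level
        self.drift_level = drift_level
        self.min_num_instances = min_num_instances
        self.reset()

    def reset(self):
        self.n = 0
        self.error_rate = 0.0
        self.min_error_rate = math.inf
        self.min_std = math.inf
        self.warning = False
        self.drift = False

    def update(self, error):
        """Consume one 0/1 prediction error and refresh the warning/drift flags."""
        self.n += 1
        # Incremental mean of the 0/1 error stream
        self.error_rate += (error - self.error_rate) / self.n
        std = math.sqrt(self.error_rate * (1.0 - self.error_rate) / self.n)
        if self.n < self.min_num_instances:
            return
        # Remember the lowest p + s observed so far
        if self.error_rate + std < self.min_error_rate + self.min_std:
            self.min_error_rate, self.min_std = self.error_rate, std
        level = self.error_rate + std
        self.warning = level > self.min_error_rate + self.warning_level * self.min_std
        self.drift = level > self.min_error_rate + self.drift_level * self.min_std


# Synthetic error stream: 10% errors for 200 steps, then 50% errors
stream = [1 if i % 10 == 0 else 0 for i in range(200)]
stream += [1 if i % 2 == 0 else 0 for i in range(100)]

detector = SimpleDDM()
drift_step = None
for step, error in enumerate(stream):
    detector.update(error)
    if detector.drift:
        drift_step = step
        break
print(f"Drift flagged at step {drift_step}")
```

On this synthetic stream the flag is raised shortly after the error rate jumps at step 200. Raising `drift_level` makes the rule more conservative, while lowering it makes it react sooner at the cost of more false alarms.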

📊 Data drift

As a quick example, we can take the iris dataset, induce data drift in it, and show how to use a data drift detector such as the Kolmogorov-Smirnov test.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

from frouros.detectors.data_drift import KSTest

np.random.seed(seed=31)

# Load iris dataset
X, y = load_iris(return_X_y=True)

# Split train (70%) and test (30%)
(
    X_train,
    X_test,
    y_train,
    y_test,
) = train_test_split(X, y, train_size=0.7, random_state=31)

# Set the feature index to which detector is applied
feature_idx = 0

# IMPORTANT: Induce/simulate data drift in the selected feature of X_test by
# applying some Gaussian noise, thereby changing P(X)
X_test[:, feature_idx] += np.random.normal(
    loc=0.0,
    scale=3.0,
    size=X_test.shape[0],
)

# Define and fit model
model = DecisionTreeClassifier(random_state=31)
model.fit(X=X_train, y=y_train)

# Set significance level for hypothesis testing
alpha = 0.001
# Define and fit detector
detector = KSTest()
_ = detector.fit(X=X_train[:, feature_idx])

# Apply detector to the selected feature of X_test
result, _ = detector.compare(X=X_test[:, feature_idx])

# Check if drift is taking place
if result.p_value <= alpha:
    print(f"Data drift detected at feature {feature_idx}")
else:
    print(f"No data drift detected at feature {feature_idx}")
# >> Data drift detected at feature 0
# Therefore, we can reject H0 (both samples come from the same distribution).

More data drift examples can be found here.
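For intuition about what the test compares, here is a small pure-Python sketch of the two-sample Kolmogorov-Smirnov statistic, which is the largest vertical gap between the two empirical distribution functions. The function names `ks_statistic` and `ks_critical_value` are hypothetical and this is not how Frouros computes the test; the critical value uses the usual large-sample approximation rather than an exact p-value.

```python
import math


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the largest gap between the two ECDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    d_max = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both pointers past every value equal to x (handles ties)
        while i < len(a) and a[i] == x:
            i += 1
        while j < len(b) and b[j] == x:
            j += 1
        d_max = max(d_max, abs(i / len(a) - j / len(b)))
    return d_max


def ks_critical_value(n, m, alpha):
    """Large-sample critical value for sample sizes n and m at level alpha."""
    return math.sqrt(-0.5 * math.log(alpha / 2.0)) * math.sqrt((n + m) / (n * m))


reference = list(range(100))           # stand-in for a training feature
shifted = [x + 50 for x in reference]  # same shape, shifted location

d = ks_statistic(reference, shifted)
threshold = ks_critical_value(len(reference), len(shifted), alpha=0.001)
print(f"D = {d:.3f}, critical value = {threshold:.3f}, drift = {d > threshold}")
```

When the statistic exceeds the critical value, H0 (both samples come from the same distribution) is rejected at the chosen significance level, which mirrors the `p_value <= alpha` check in the example above.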

🛠 Installation

Frouros can be installed via pip:

pip install frouros
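One way to confirm that the installation succeeded is to ask the standard library for the installed version. The helper name `installed_version` below is hypothetical; `importlib.metadata` is part of Python 3.8 and later.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Prints a version string if frouros is installed, otherwise None
print(installed_version("frouros"))
```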

🕵🏻‍♂️️ Drift detection methods

The currently implemented detectors are listed in the following table.

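| Drift detector | Type | Family | Univariate (U) / Multivariate (M) | Numerical (N) / Categorical (C) | Method | Reference |
|---|---|---|---|---|---|---|
| Concept drift | Streaming | Change detection | U | N | BOCD | Adams and MacKay (2007) |
| Concept drift | Streaming | Change detection | U | N | CUSUM | Page (1954) |
| Concept drift | Streaming | Change detection | U | N | Geometric moving average | Roberts (1959) |
| Concept drift | Streaming | Change detection | U | N | Page Hinkley | Page (1954) |
| Concept drift | Streaming | Statistical process control | U | N | DDM | Gama et al. (2004) |
| Concept drift | Streaming | Statistical process control | U | N | ECDD-WT | Ross et al. (2012) |
| Concept drift | Streaming | Statistical process control | U | N | EDDM | Baena-García et al. (2006) |
| Concept drift | Streaming | Statistical process control | U | N | HDDM-A | Frias-Blanco et al. (2014) |
| Concept drift | Streaming | Statistical process control | U | N | HDDM-W | Frias-Blanco et al. (2014) |
| Concept drift | Streaming | Statistical process control | U | N | RDDM | Barros et al. (2017) |
| Concept drift | Streaming | Window based | U | N | ADWIN | Bifet and Gavalda (2007) |
| Concept drift | Streaming | Window based | U | N | KSWIN | Raab et al. (2020) |
| Concept drift | Streaming | Window based | U | N | STEPD | Nishida and Yamauchi (2007) |
| Data drift | Batch | Distance based | U | N | Bhattacharyya distance | Bhattacharyya (1946) |
| Data drift | Batch | Distance based | U | N | Earth Mover's distance | Rubner et al. (2000) |
| Data drift | Batch | Distance based | U | N | Energy distance | Székely et al. (2013) |
| Data drift | Batch | Distance based | U | N | Hellinger distance | Hellinger (1909) |
| Data drift | Batch | Distance based | U | N | Histogram intersection normalized complement | Swain and Ballard (1991) |
| Data drift | Batch | Distance based | U | N | Jensen-Shannon distance | Lin (1991) |
| Data drift | Batch | Distance based | U | N | Kullback-Leibler divergence | Kullback and Leibler (1951) |
| Data drift | Batch | Distance based | M | N | Maximum Mean Discrepancy | Gretton et al. (2012) |
| Data drift | Batch | Distance based | U | N | Population Stability Index | Wu and Olson (2010) |
| Data drift | Batch | Statistical test | U | N | Anderson-Darling test | Scholz and Stephens (1987) |
| Data drift | Batch | Statistical test | U | N | Baumgartner-Weiss-Schindler test | Baumgartner et al. (1998) |
| Data drift | Batch | Statistical test | U | C | Chi-square test | Pearson (1900) |
| Data drift | Batch | Statistical test | U | N | Cramér-von Mises test | Cramér (1902) |
| Data drift | Batch | Statistical test | U | N | Kolmogorov-Smirnov test | Massey Jr (1951) |
| Data drift | Batch | Statistical test | U | N | Kuiper's test | Kuiper (1960) |
| Data drift | Batch | Statistical test | U | N | Mann-Whitney U test | Mann and Whitney (1947) |
| Data drift | Batch | Statistical test | U | N | Welch's t-test | Welch (1947) |
| Data drift | Streaming | Distance based | M | N | Maximum Mean Discrepancy | Gretton et al. (2012) |
| Data drift | Streaming | Statistical test | U | N | Incremental Kolmogorov-Smirnov test | dos Reis et al. (2016) |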
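To illustrate how the distance-based family differs from the statistical tests, here is a rough pure-Python sketch of one of the listed measures, the Population Stability Index. Both samples are binned on a shared grid and the per-bin proportions are compared. The function names, the equal-width binning and the `eps` floor for empty bins are assumptions made for this illustration and are not taken from Frouros.

```python
import math


def bin_proportions(values, lo, width, num_bins, eps=1e-6):
    """Fraction of values per equal-width bin, floored at eps to keep logs finite."""
    counts = [0] * num_bins
    for v in values:
        # Clamp so the maximum value lands in the last bin
        k = min(int((v - lo) / width), num_bins - 1)
        counts[k] += 1
    return [max(c / len(values), eps) for c in counts]


def population_stability_index(reference, test, num_bins=10):
    """PSI between two 1-D samples using bins shared by both samples."""
    lo = min(min(reference), min(test))
    hi = max(max(reference), max(test))
    if hi == lo:
        return 0.0  # both samples are constant and identical
    width = (hi - lo) / num_bins
    p = bin_proportions(reference, lo, width, num_bins)
    q = bin_proportions(test, lo, width, num_bins)
    # Each term is non-negative, so PSI is zero only when the histograms match
    return sum((q_k - p_k) * math.log(q_k / p_k) for p_k, q_k in zip(p, q))


reference = list(range(100))
shifted = [x + 50 for x in reference]
psi_same = population_stability_index(reference, reference)
psi_shifted = population_stability_index(reference, shifted)
print(psi_same, psi_shifted)
```

Unlike a statistical test, a distance yields no p-value, so a threshold on the distance has to be chosen separately to decide when drift is declared.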

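❗ What is and what is not Frouros?

Unlike other libraries that, in addition to providing drift detection algorithms, include other functionality such as anomaly/outlier detection, adversarial detection, or imbalance learning, Frouros has and will ONLY have one purpose: drift detection.

We firmly believe that machine learning libraries and frameworks should not follow the "jack of all trades, master of none" principle. Instead, they should focus on a single task and do it well.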

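✅ Who is using Frouros?

Frouros is actively being used by the following projects to implement drift detection in machine learning pipelines:

If you want your project listed here, do not hesitate to send us a pull request.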

👍 Contributing

Check out the contribution section.

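💬 Citation

Although the Frouros paper is still a preprint, you can cite the preprint version if needed (the entry will be replaced once the paper is published).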

@article{cespedes2022frouros,
  title={Frouros: A Python library for drift detection in machine learning systems},
  author={C{\'e}spedes-Sisniega, Jaime and L{\'o}pez-Garc{\'\i}a, {\'A}lvaro },
  journal={arXiv preprint arXiv:2208.06868},
  year={2022}
}

📝 License

Frouros is open-source software licensed under the BSD-3-Clause license.

🙏 Acknowledgements

Frouros has received funding from the Agencia Estatal de Investigación, Unidad de Excelencia María de Maeztu, ref. MDM-2017-0765.
