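An open-source Python library for drift detection in machine learning systems
Project description
Frouros is a Python library for drift detection in machine learning systems. It provides a combination of classical and more recent algorithms for both concept and data drift detection.
"Everything changes and nothing stands still"
"You could not step twice into the same river"
Heraclitus of Ephesus (535-475 BCE)
⚡️ Quickstart
🔄 Concept drift
As a quick example, we can take the breast cancer dataset, induce concept drift in it, and show how to use a concept drift detector such as DDM (Drift Detection Method). We can also see how concept drift affects accuracy.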
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from frouros.detectors.concept_drift import DDM, DDMConfig
from frouros.metrics import PrequentialError
np.random.seed(seed=31)
# Load breast cancer dataset
X, y = load_breast_cancer(return_X_y=True)
# Split train (70%) and test (30%)
(
    X_train,
    X_test,
    y_train,
    y_test,
) = train_test_split(X, y, train_size=0.7, random_state=31)
# Define and fit model
pipeline = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("model", LogisticRegression()),
    ]
)
pipeline.fit(X=X_train, y=y_train)
# Detector configuration and instantiation
config = DDMConfig(
    warning_level=2.0,
    drift_level=3.0,
    min_num_instances=25,  # minimum number of instances before checking for concept drift
)
detector = DDM(config=config)
# Metric to compute accuracy
metric = PrequentialError(alpha=1.0)  # alpha=1.0 is equivalent to normal accuracy
def stream_test(X_test, y_test, y, metric, detector):
    """Simulate data stream over X_test and y_test. y is the true label."""
    drift_flag = False
    for i, (X, y) in enumerate(zip(X_test, y_test)):
        y_pred = pipeline.predict(X.reshape(1, -1))
        error = 1 - (y_pred.item() == y.item())
        metric_error = metric(error_value=error)
        _ = detector.update(value=error)
        status = detector.status
        if status["drift"] and not drift_flag:
            drift_flag = True
            print(f"Concept drift detected at step {i}. Accuracy: {1 - metric_error:.4f}")
    if not drift_flag:
        print("No concept drift detected")
    print(f"Final accuracy: {1 - metric_error:.4f}\n")
# Simulate data stream (assuming test label available after each prediction)
# No concept drift is expected to occur
stream_test(
    X_test=X_test,
    y_test=y_test,
    y=y,
    metric=metric,
    detector=detector,
)
# >> No concept drift detected
# >> Final accuracy: 0.9766
# IMPORTANT: Induce/simulate concept drift in the last part (20%)
# of y_test by modifying some labels (approx. 50%), therefore changing P(y|X)
drift_size = int(y_test.shape[0] * 0.2)
y_test_drift = y_test[-drift_size:]
modify_idx = np.random.rand(*y_test_drift.shape) <= 0.5
y_test_drift[modify_idx] = (y_test_drift[modify_idx] + 1) % len(np.unique(y_test))
y_test[-drift_size:] = y_test_drift
# Reset detector and metric
detector.reset()
metric.reset()
# Simulate data stream (assuming test label available after each prediction)
# Concept drift is expected to occur because of the label modification
stream_test(
    X_test=X_test,
    y_test=y_test,
    y=y,
    metric=metric,
    detector=detector,
)
# >> Concept drift detected at step 142. Accuracy: 0.9510
# >> Final accuracy: 0.8480
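To make the `warning_level` and `drift_level` parameters used above more concrete, the following is a minimal, dependency-free sketch of the DDM rule as it is usually described (Gama et al., 2004). The class name `SimpleDDM`, the helper `first_drift_index`, and all internals are illustrative assumptions, not Frouros's implementation. Unlike a production detector, this sketch does not reset itself after signalling drift.

```python
import math


class SimpleDDM:
    """Illustrative DDM sketch (hypothetical class, NOT the Frouros implementation)."""

    def __init__(self, warning_level=2.0, drift_level=3.0, min_num_instances=25):
        self.warning_level = warning_level
        self.drift_level = drift_level
        self.min_num_instances = min_num_instances
        self.reset()

    def reset(self):
        self.n = 0                # number of instances seen
        self.errors = 0           # number of misclassifications seen
        self.p_min = math.inf     # error rate at the best point so far
        self.s_min = math.inf     # its standard deviation
        self.status = {"warning": False, "drift": False}

    def update(self, error):
        """Consume one 0/1 error value and return the current status."""
        self.n += 1
        self.errors += error
        p = self.errors / self.n              # running error rate
        s = math.sqrt(p * (1 - p) / self.n)   # binomial standard deviation
        if self.n < self.min_num_instances:
            return self.status
        # Remember the point where p + s was lowest
        if p + s < self.p_min + self.s_min:
            self.p_min, self.s_min = p, s
        level = p + s
        # Strict inequalities so an error-free stream (s_min == 0) never triggers
        self.status = {
            "warning": level > self.p_min + self.warning_level * self.s_min,
            "drift": level > self.p_min + self.drift_level * self.s_min,
        }
        return self.status


def first_drift_index(stream, detector):
    """Return the first step at which the detector reports drift, or None."""
    for i, error in enumerate(stream):
        if detector.update(error)["drift"]:
            return i
    return None


# Synthetic error stream: 10% errors for 200 steps, then every prediction is wrong
stream = [1 if i % 10 == 0 else 0 for i in range(200)] + [1] * 100
print(first_drift_index(stream, SimpleDDM()))
```

The detector only compares the current error level against the best level seen so far, which is why it needs the true label after each prediction, as the streaming example above assumes.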
More concept drift examples can be found here.
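The quick-start example passes `alpha=1.0` to `PrequentialError` and notes that this is equivalent to normal accuracy. A common definition of prequential error with a fading factor keeps two decayed sums, one of the errors and one of the instance count, and reports their ratio. The sketch below follows that definition. The class name `FadingPrequentialError` is hypothetical, and whether Frouros computes the metric in exactly this way is an assumption.

```python
class FadingPrequentialError:
    """Illustrative prequential error with fading factor (hypothetical class)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # fading factor in (0, 1]; 1.0 means no forgetting
        self.reset()

    def reset(self):
        self.weighted_errors = 0.0
        self.weighted_count = 0.0

    def __call__(self, error_value):
        # Older observations are multiplied by alpha at every step
        self.weighted_errors = error_value + self.alpha * self.weighted_errors
        self.weighted_count = 1.0 + self.alpha * self.weighted_count
        return self.weighted_errors / self.weighted_count


# With alpha=1.0 the result is the plain running mean of the errors
plain = FadingPrequentialError(alpha=1.0)
for e in [0, 1, 1, 0]:
    plain_value = plain(e)

# With alpha<1.0 recent errors weigh more than old ones
faded = FadingPrequentialError(alpha=0.5)
for e in [1, 0]:
    faded_value = faded(e)

print(plain_value, faded_value)
```

A fading factor below 1.0 makes the metric react faster after a drift, at the cost of a noisier estimate.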
📊 Data drift
As a quick example, we can take the iris dataset, induce data drift in it, and show how to use a data drift detector such as the Kolmogorov-Smirnov test.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from frouros.detectors.data_drift import KSTest
np.random.seed(seed=31)
# Load iris dataset
X, y = load_iris(return_X_y=True)
# Split train (70%) and test (30%)
(
    X_train,
    X_test,
    y_train,
    y_test,
) = train_test_split(X, y, train_size=0.7, random_state=31)
# Set the feature index to which detector is applied
feature_idx = 0
# IMPORTANT: Induce/simulate data drift in the selected feature of X_test by
# applying some Gaussian noise, therefore changing P(X)
X_test[:, feature_idx] += np.random.normal(
    loc=0.0,
    scale=3.0,
    size=X_test.shape[0],
)
# Define and fit model
model = DecisionTreeClassifier(random_state=31)
model.fit(X=X_train, y=y_train)
# Set significance level for hypothesis testing
alpha = 0.001
# Define and fit detector
detector = KSTest()
_ = detector.fit(X=X_train[:, feature_idx])
# Apply detector to the selected feature of X_test
result, _ = detector.compare(X=X_test[:, feature_idx])
# Check if drift is taking place
if result.p_value <= alpha:
    print(f"Data drift detected at feature {feature_idx}")
else:
    print(f"No data drift detected at feature {feature_idx}")
# >> Data drift detected at feature 0
# Therefore, we can reject H0 (both samples come from the same distribution).
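For readers who want to see what the two-sample Kolmogorov-Smirnov test measures, here is a small standard-library sketch. It computes the KS statistic (the largest gap between the two empirical CDFs) and a p-value from the usual asymptotic series. The function names `ks_statistic` and `ks_p_value_asymptotic` are hypothetical, the p-value is an approximation that is only reasonable for moderately large samples, and none of this is claimed to match how Frouros's `KSTest` computes its result.

```python
import math
from bisect import bisect_right


def ks_statistic(reference, test):
    """Largest absolute difference between the two empirical CDFs."""
    ref_sorted, test_sorted = sorted(reference), sorted(test)
    n, m = len(ref_sorted), len(test_sorted)
    d = 0.0
    # The supremum is attained at one of the observed values
    for x in ref_sorted + test_sorted:
        cdf_ref = bisect_right(ref_sorted, x) / n
        cdf_test = bisect_right(test_sorted, x) / m
        d = max(d, abs(cdf_ref - cdf_test))
    return d


def ks_p_value_asymptotic(d, n, m, terms=100):
    """Asymptotic p-value approximation (assumption: samples are not tiny)."""
    if d <= 0.0:
        return 1.0
    en = math.sqrt(n * m / (n + m))          # effective sample size (square root)
    lam = (en + 0.12 + 0.11 / en) * d        # small-sample correction
    total = sum(
        (-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2)
        for k in range(1, terms + 1)
    )
    return min(1.0, max(0.0, 2.0 * total))


# Reference sample versus a shifted copy of it (synthetic data)
reference = list(range(50))
shifted = [x + 30 for x in reference]

d = ks_statistic(reference, shifted)
p_value = ks_p_value_asymptotic(d, len(reference), len(shifted))
alpha = 0.001
print("drift" if p_value <= alpha else "no drift")
```

As in the example above, drift is flagged when the p-value falls at or below the chosen significance level `alpha`.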
More data drift examples can be found here.
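🛠 Installation
Frouros can be installed via pip:
pip install frouros
🕵🏻‍♂️ Drift detection methods
The currently implemented detectors are listed below.
❗ What is and what is not Frouros?
Unlike other libraries that, in addition to drift detection algorithms, include other functionality such as anomaly/outlier detection, adversarial detection, or imbalanced learning, Frouros has and will only ever have one purpose: drift detection.
We firmly believe that machine learning libraries and frameworks should not follow the "jack of all trades, master of none" principle. Instead, they should focus on a single task and do it well.
✅ Who is using Frouros?
Frouros is actively being used by the following projects to implement drift detection in machine learning pipelines:
If you want your project listed here, do not hesitate to send us a pull request.
👍 Contributing
Check out the contribution section.
💬 Citation
Although the Frouros paper is still a preprint, you can cite the preprint version (this entry will be replaced by the paper once it is published).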
@article{cespedes2022frouros,
  title={Frouros: A Python library for drift detection in machine learning systems},
  author={C{\'e}spedes-Sisniega, Jaime and L{\'o}pez-Garc{\'\i}a, {\'A}lvaro},
  journal={arXiv preprint arXiv:2208.06868},
  year={2022}
}
📝 License
Frouros is open-source software licensed under the BSD-3-Clause license.
🙏 Acknowledgements
Frouros has received funding from the Agencia Estatal de Investigación, Unidad de Excelencia María de Maeztu, ref. MDM-2017-0765.
Download files
Source Distribution
Built Distribution
Hashes for frouros-0.8.0.tar.gz
Algorithm | Hash digest
---|---
SHA256 | aeca3180eea5e6d279a716ed3230fc8dafcba15782fcdcf8281acf818569b1f5
MD5 | 1e981aaa6fcfb479964c063a7bd76008
BLAKE2b-256 | 67af72f2b051d80b7fd4a752085020648f8046dcbf81086fbda07378a95e50d0
Hashes for frouros-0.8.0-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 5a5459b89ee77ab6149e18888501bd7d30f886a664b8b4cbbe67935e8e4d14cb
MD5 | 0032197a2df0e7e4230d51ccf260e914
BLAKE2b-256 | 1ab362988e8ebd7a87c5715f7dfd8c3a1feaf73d72e8034ee55a60ff262cd23a