
Python package to sanitize machine-learning-related labels in a standard way.

Project description

Sanitize ML Labels


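Sanitize ML Labels is a Python package designed to standardize and clean up machine-learning-related labels. It currently supports over 100 labels, including metrics and model names.

If you have machine-learning-related labels and find yourself repeatedly renaming and cleaning them with the same formatting, this package ensures they are always sanitized in a standard way.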

How do I install this package?

You can install it with pip:

pip install sanitize_ml_labels

Usage examples

Here are some common use cases for normalizing labels.

Example for metrics

from sanitize_ml_labels import sanitize_ml_labels

labels = [
    "acc",
    "loss",
    "auroc",
    "lr"
]

assert sanitize_ml_labels(labels) == [
    "Accuracy",
    "Loss",
    "AUROC",
    "Learning rate"
]

Example for models

from sanitize_ml_labels import sanitize_ml_labels

labels = [
    "mlp",
    "cnn",
    "ffNN",
    "Feed-forward neural network",
    "perceptron",
    "recurrent neural network",
    "LStM"
]

assert sanitize_ml_labels(labels) == [
    "MLP",
    "CNN",
    "FFNN",
    "FFNN",
    "Perceptron",
    "RNN",
    "LSTM"
]

assert sanitize_ml_labels("vanilla mlp") == "MLP"
assert sanitize_ml_labels("vanilla cnn") == "CNN"

assert sanitize_ml_labels([
    "Large Language Model",
    "transe",
    "Generative Pre-trained Transformer",
    "Graph Convolutional Neural Network",
    "Convolutional Graph Neural Network",
    "Graph Neural Network",
    "Graph Attention Network",
    "Graph Attention Neural Network",
]) == ["LLM","TransE","GPT","GCN","GCN","GNN","GAT","GAT"]

Sometimes every model label carries a prefix such as "vanilla", "simple", or "basic". This package can remove these prefixes for you.

from sanitize_ml_labels import sanitize_ml_labels

labels = [
    "vanilla mlp",
    "vanilla cnn",
    "vanilla ffnn",
    "vanilla perceptron"
]

assert sanitize_ml_labels(labels) == ["MLP", "CNN", "FFNN", "Perceptron"]
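
For intuition only, prefix removal of this kind can be sketched with the standard library. The helper `strip_generic_prefix` and the prefix tuple below are hypothetical names introduced for illustration; they are not the package's implementation, and the package's actual matching rules may differ. Unlike the package, this sketch does not normalize the remaining label.

```python
import re

# Hypothetical prefix list, taken from the prose above; the package's real
# list and matching rules may differ.
GENERIC_PREFIXES = ("vanilla", "simple", "basic")

# Match one of the prefixes at the start of the label, followed by whitespace,
# so that words merely starting with a prefix (e.g. "simplex") are left alone.
_PREFIX_PATTERN = re.compile(
    r"^(?:" + "|".join(GENERIC_PREFIXES) + r")\s+",
    flags=re.IGNORECASE,
)


def strip_generic_prefix(label: str) -> str:
    """Return the label without a leading generic prefix, if present."""
    return _PREFIX_PATTERN.sub("", label.strip())
```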

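Corner cases

Sometimes you may encounter hyphenated terms that need to be correctly identified and normalized. We use a heuristic based on an extended list of over 45K hyphenated English words, originally sourced from the Metadata consulting website.

The lookup heuristic, written by Tommaso Fontana, ensures that hyphenated words are identified efficiently and accurately.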

from sanitize_ml_labels import sanitize_ml_labels

# Running the following
assert sanitize_ml_labels("non-existent-edges-in-graph") == "Non-existent edges in graph"
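
The general idea of a word-list lookup can be sketched as follows. This is a simplified illustration under stated assumptions, not the heuristic the package ships: `KNOWN_HYPHENATED` is a tiny hypothetical stand-in for the 45K-word list, `split_hyphens` is a made-up name, and only two-part hyphenated words are considered.

```python
# Tiny stand-in for the large word list mentioned above (hypothetical).
KNOWN_HYPHENATED = {"non-existent", "feed-forward", "pre-trained"}


def split_hyphens(label: str, known=KNOWN_HYPHENATED) -> str:
    """Replace hyphens with spaces, except inside known hyphenated words."""
    parts = label.split("-")
    words = []
    index = 0
    while index < len(parts):
        # Try to join the current part with the next one and look it up.
        if index + 1 < len(parts):
            candidate = parts[index] + "-" + parts[index + 1]
            if candidate.lower() in known:
                words.append(candidate)
                index += 2
                continue
        words.append(parts[index])
        index += 1
    sentence = " ".join(words)
    # Capitalize only the first character, leaving the rest untouched.
    return sentence[:1].upper() + sentence[1:]
```

A set lookup keeps each check at constant average cost, so the scan stays linear in the number of hyphen-separated parts regardless of the word-list size.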

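Extra utilities

Besides label sanitization, the package also provides methods to check how a metric is normalized.

Is normalized metric

Checks whether the metric falls within the range [0, 1].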

from sanitize_ml_labels import is_normalized_metric

assert not is_normalized_metric("MSE")
assert is_normalized_metric("acc")
assert is_normalized_metric("accuracy")
assert is_normalized_metric("AUROC")
assert is_normalized_metric("auprc")

Is absolutely normalized metric

Checks whether the metric falls within the range [-1, 1].

from sanitize_ml_labels import is_absolutely_normalized_metric

assert not is_absolutely_normalized_metric("auprc")
assert is_absolutely_normalized_metric("MCC")
assert is_absolutely_normalized_metric("Markedness")
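
One plausible use of these two checks is fixing plot axis limits for bounded metrics. The sketch below is an assumption about how they might be combined, not part of the package: `axis_limits` is a hypothetical helper, and the two checkers are passed in as arguments so the snippet runs without the package installed.

```python
from typing import Callable, Optional, Tuple


def axis_limits(
    metric: str,
    is_normalized: Callable[[str], bool],
    is_absolutely_normalized: Callable[[str], bool],
) -> Optional[Tuple[float, float]]:
    """Return fixed axis limits for bounded metrics, or None otherwise."""
    # Metrics known to lie in [0, 1] get the tighter range first.
    if is_normalized(metric):
        return (0.0, 1.0)
    # Metrics known to lie in [-1, 1].
    if is_absolutely_normalized(metric):
        return (-1.0, 1.0)
    # Unbounded or unknown metric: let the plotting library autoscale.
    return None
```

With the package installed, `is_normalized_metric` and `is_absolutely_normalized_metric` could be passed as the second and third arguments.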

Should be maximized

Whether a metric should be maximized or minimized. Unknown metrics raise a NotImplementedError.

from sanitize_ml_labels import should_be_maximized

assert not should_be_maximized("MSE")
assert should_be_maximized("AUROC")
assert should_be_maximized("accuracy")
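
A direction check like this is typically useful when picking the best run among several. The following is a minimal sketch under stated assumptions: `best_run` is a hypothetical helper, `scores` is assumed to be a non-empty mapping from run name to metric value, and the direction checker is passed in as an argument so the snippet runs without the package installed.

```python
from typing import Callable, Dict, Optional


def best_run(
    scores: Dict[str, float],
    metric: str,
    should_be_maximized: Callable[[str], bool],
) -> Optional[str]:
    """Return the name of the best run, or None if the direction is unknown."""
    try:
        maximize = should_be_maximized(metric)
    except NotImplementedError:
        # Unknown metric: the direction cannot be decided.
        return None
    # Pick the run with the highest or lowest score, depending on direction.
    pick = max if maximize else min
    return pick(scores, key=scores.get)
```

Catching `NotImplementedError` mirrors the behavior described above for unknown metrics; whether to return `None` or re-raise is a design choice left to the caller.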

License

This software is released under the MIT license. See LICENSE.


Source distribution

sanitize_ml_labels-1.1.2.tar.gz (326.3 kB)
