
Easy hyperparameter optimization and automatic result saving across machine learning algorithms and libraries


HyperparameterHunter


Automatically save and learn from Experiment results, leading to long-lasting, persistent optimization that remembers all your tests.

HyperparameterHunter provides a wrapper for machine learning algorithms that saves all the important data. Simplify the experimentation and hyperparameter tuning process by letting HyperparameterHunter do the heavy lifting of recording, organizing, and learning from your tests — all while using the same libraries you already do. Don't let any of your experiments go to waste, and start doing hyperparameter optimization the way it was meant to be done.

Features

  • Automatically record Experiment results
  • Truly informed hyperparameter optimization that automatically uses past Experiments
  • Eliminate boilerplate for cross-validation loops, predicting, and scoring
  • Stop worrying about keeping track of hyperparameters, scores, or re-running the same Experiments
  • Use the libraries and utilities you already love

How to Use HyperparameterHunter

Don't think of HyperparameterHunter as just another optimization library that you only pull out when you need to do hyperparameter optimization. It certainly can do optimization, but it's better viewed as your own personalized machine learning toolbox/assistant.

The idea is to start using HyperparameterHunter immediately, and to run all of your benchmark and one-off experiments through it.

The more you use HyperparameterHunter, the better your results will be. If you only use it to optimize, sure, it'll do what you ask, but that misses the point of HyperparameterHunter.

If you've been using it for experimentation and optimization throughout your project, then by the time you decide to do hyperparameter optimization, HyperparameterHunter already knows everything you've done, and that's when it shines. It doesn't start optimizing from scratch like other libraries; it starts from all of the Experiments and previous optimization rounds you've already run through it.

Getting Started

1) Environment

Set up an Environment to organize Experiments and optimization results.
Any Experiments or optimization rounds we run will use our active Environment.

from hyperparameter_hunter import Environment, CVExperiment
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold

data = load_breast_cancer()
df = pd.DataFrame(data=data.data, columns=data.feature_names)
df['target'] = data.target

env = Environment(
    train_dataset=df,  # Add holdout/test dataframes, too
    results_path='path/to/results/directory',  # Where your result files will go
    metrics=['roc_auc_score'],  # Callables, or strings referring to `sklearn.metrics`
    cv_type=StratifiedKFold,  # Class, or string in `sklearn.model_selection`
    cv_params=dict(n_splits=5, shuffle=True, random_state=32)
)

2) Individual Experimentation

Conduct experiments with your favorite libraries simply by providing a model initializer and hyperparameters

Keras
# Same format used by `keras.wrappers.scikit_learn`. Nothing new to learn
def build_fn(input_shape):  # `input_shape` calculated for you
    model = Sequential([
        Dense(100, kernel_initializer='uniform', input_shape=input_shape, activation='relu'),
        Dropout(0.5),
        Dense(1, kernel_initializer='uniform', activation='sigmoid')
    ])  # All layer arguments saved (whether explicit or Keras default) for future use
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model

experiment = CVExperiment(
    model_initializer=KerasClassifier,
    model_init_params=build_fn,  # We interpret your build_fn to save hyperparameters in a useful, readable format
    model_extra_params=dict(
        callbacks=[ReduceLROnPlateau(patience=5)],  # Use Keras callbacks
        batch_size=32, epochs=10, verbose=0  # Fit/predict arguments
    )
)
SKLearn
experiment = CVExperiment(
    model_initializer=LinearSVC,  # (Or any of the dozens of other SK-Learn algorithms)
    model_init_params=dict(penalty='l1', C=0.9)  # Default values used and recorded for kwargs not given
)
XGBoost
experiment = CVExperiment(
    model_initializer=XGBClassifier,
    model_init_params=dict(objective='reg:linear', max_depth=3, n_estimators=100, subsample=0.5)
)
LightGBM
experiment = CVExperiment(
    model_initializer=LGBMClassifier,
    model_init_params=dict(boosting_type='gbdt', num_leaves=31, max_depth=-1, min_child_samples=5, subsample=0.5)
)
CatBoost
experiment = CVExperiment(
    model_initializer=CatBoostClassifier,
    model_init_params=dict(iterations=500, learning_rate=0.01, depth=7, allow_writing_files=False),
    model_extra_params=dict(fit=dict(verbose=True))  # Send kwargs to `fit` and other extra methods
)
RGF
experiment = CVExperiment(
    model_initializer=RGFClassifier,
    model_init_params=dict(max_leaf=1000, algorithm='RGF', min_samples_leaf=10)
)

3) Hyperparameter Optimization

Just like Experiments, but if you want to optimize a hyperparameter, use the classes imported below

from hyperparameter_hunter import Real, Integer, Categorical
from hyperparameter_hunter import optimization as opt
Keras
def build_fn(input_shape):
    model = Sequential([
        Dense(Integer(50, 150), input_shape=input_shape, activation='relu'),
        Dropout(Real(0.2, 0.7)),
        Dense(1, activation=Categorical(['sigmoid', 'softmax']))
    ])
    model.compile(
        optimizer=Categorical(['adam', 'rmsprop', 'sgd', 'adadelta']),
        loss='binary_crossentropy', metrics=['accuracy']
    )
    return model

optimizer = opt.RandomForestOptPro(iterations=7)
optimizer.forge_experiment(
    model_initializer=KerasClassifier,
    model_init_params=build_fn,
    model_extra_params=dict(
        callbacks=[ReduceLROnPlateau(patience=Integer(5, 10))],
        batch_size=Categorical([32, 64]),
        epochs=10, verbose=0
    )
)
optimizer.go()
SKLearn
optimizer = opt.DummyOptPro(iterations=42)
optimizer.forge_experiment(
    model_initializer=AdaBoostClassifier,  # (Or any of the dozens of other SKLearn algorithms)
    model_init_params=dict(
        n_estimators=Integer(75, 150),
        learning_rate=Real(0.8, 1.3),
        algorithm='SAMME.R'
    )
)
optimizer.go()
XGBoost
optimizer = opt.BayesianOptPro(iterations=10)
optimizer.forge_experiment(
    model_initializer=XGBClassifier,
    model_init_params=dict(
        max_depth=Integer(low=2, high=20),
        learning_rate=Real(0.0001, 0.5),
        n_estimators=200,
        subsample=0.5,
        booster=Categorical(['gbtree', 'gblinear', 'dart']),
    )
)
optimizer.go()
LightGBM
optimizer = opt.BayesianOptPro(iterations=100)
optimizer.forge_experiment(
    model_initializer=LGBMClassifier,
    model_init_params=dict(
        boosting_type=Categorical(['gbdt', 'dart']),
        num_leaves=Integer(5, 20),
        max_depth=-1,
        min_child_samples=5,
        subsample=0.5
    )
)
optimizer.go()
CatBoost
optimizer = opt.GradientBoostedRegressionTreeOptPro(iterations=32)
optimizer.forge_experiment(
    model_initializer=CatBoostClassifier,
    model_init_params=dict(
        iterations=100,
        eval_metric=Categorical(['Logloss', 'Accuracy', 'AUC']),
        learning_rate=Real(low=0.0001, high=0.5),
        depth=Integer(4, 7),
        allow_writing_files=False
    )
)
optimizer.go()
RGF
optimizer = opt.ExtraTreesOptPro(iterations=10)
optimizer.forge_experiment(
    model_initializer=RGFClassifier,
    model_init_params=dict(
        max_leaf=1000,
        algorithm=Categorical(['RGF', 'RGF_Opt', 'RGF_Sib']),
        l2=Real(0.01, 0.3),
        normalize=Categorical([True, False]),
        learning_rate=Real(0.3, 0.7),
        loss=Categorical(['LS', 'Expo', 'Log', 'Abs'])
    )
)
optimizer.go()

Output File Structure

This is a simple illustration of the file structure you can expect your Experiments to generate. For an in-depth description of the directory structure and the contents of the various files, see the File Structure Overview section of the documentation. The essentials, however, are as follows:

  1. Each Experiment adds a file to each of the HyperparameterHunterAssets/Experiments subdirectories, named by experiment_id
  2. Each Experiment also adds an entry to HyperparameterHunterAssets/Leaderboards/GlobalLeaderboard.csv
  3. Customize which files are created via `Environment`'s `file_blacklist` and `do_full_save` kwargs (see the documentation; a sketch follows the directory tree below)
HyperparameterHunterAssets
|   Heartbeat.log
|
└───Experiments
|   |
|   └───Descriptions
|   |   |   <Files describing Experiment results, conditions, etc.>.json
|   |
|   └───Predictions<OOF/Holdout/Test>
|   |   |   <Files containing Experiment predictions for the indicated dataset>.csv
|   |
|   └───Heartbeats
|   |   |   <Files containing the log produced by the Experiment>.log
|   |
|   └───ScriptBackups
|       |   <Files containing a copy of the script that created the Experiment>.py
|
└───Leaderboards
|   |   GlobalLeaderboard.csv
|   |   <Other leaderboards>.csv
|
└───TestedKeys
|   |   <Files named by Environment key, containing hyperparameter keys>.json
|
└───KeyAttributeLookup
    |   <Files linking complex objects used in Experiments to their hashes>
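
If you don't need some of these artifacts, the `file_blacklist` kwarg mentioned in point 3 above can suppress them. Below is a minimal sketch; the specific blacklist strings used here are assumptions, so check the documentation for the accepted values:

env = Environment(
    train_dataset=df,  # Reusing the DataFrame from the earlier Environment example
    results_path='path/to/results/directory',
    metrics=['roc_auc_score'],
    cv_type=StratifiedKFold,
    cv_params=dict(n_splits=5, shuffle=True, random_state=32),
    file_blacklist=['script_backup', 'heartbeat']  # Assumed values - skip saving these artifacts
)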

Installation

pip install hyperparameter-hunter

If you like to be on the cutting edge and want all the latest developments, run:

pip install git+https://github.com/HunterMcGushion/hyperparameter_hunter.git

If you want to contribute to HyperparameterHunter, get started here.

I Still Don't Get It

That's ok. Don't feel bad. It can be a bit tricky to grasp. Here's an example that illustrates how everything fits together:

from hyperparameter_hunter import Environment, CVExperiment, BayesianOptPro, Integer
from hyperparameter_hunter.utils.learning_utils import get_breast_cancer_data
from xgboost import XGBClassifier

# Start by creating an `Environment` - This is where you define how Experiments (and optimization) will be conducted
env = Environment(
    train_dataset=get_breast_cancer_data(target='target'),
    results_path='HyperparameterHunterAssets',
    metrics=['roc_auc_score'],
    cv_type='StratifiedKFold',
    cv_params=dict(n_splits=10, shuffle=True, random_state=32),
)

# Now, conduct an `Experiment`
# This tells HyperparameterHunter to use the settings in the active `Environment` to train a model with these hyperparameters
experiment = CVExperiment(
    model_initializer=XGBClassifier,
    model_init_params=dict(
        objective='reg:linear',
        max_depth=3
    )
)

# That's it. No annoying boilerplate code to fit models and record results
# Now, the `Environment`'s `results_path` directory will contain new files describing the Experiment just conducted

# Time for the fun part. We'll set up some hyperparameter optimization by first defining the `OptPro` (Optimization Protocol) we want
optimizer = BayesianOptPro(verbose=1)

# Now we're going to say which hyperparameters we want to optimize.
# Notice how this looks just like our `experiment` above
optimizer.forge_experiment(
    model_initializer=XGBClassifier,
    model_init_params=dict(
        objective='reg:linear',  # We're setting this as a constant guideline - Not one to optimize
        max_depth=Integer(2, 10)  # Instead of using an int like the `experiment` above, we provide a space to search
    )
)
# Notice that our range for `max_depth` includes the `max_depth=3` value we used in our `experiment` earlier

optimizer.go()  # Now, we go

assert experiment.experiment_id in [_[2] for _ in optimizer.similar_experiments]
# Here we're verifying that the `experiment` we conducted first was found by `optimizer` and used as learning material
# You can also see via the console that we found `experiment`'s saved files, and used it to start optimization

last_experiment_id = optimizer.current_experiment.experiment_id
# Let's save the id of the experiment that was just conducted by `optimizer`

optimizer.go()  # Now, we'll start up `optimizer` again...

# And we can see that this second optimization round learned from both our first `experiment` and our first optimization round
assert experiment.experiment_id in [_[2] for _ in optimizer.similar_experiments]
assert last_experiment_id in [_[2] for _ in optimizer.similar_experiments]
# It even did all this without us having to tell it what experiments to learn from

# Now think about how much better your hyperparameter optimization will be when it learns from:
# - All your past experiments, and
# - All your past optimization rounds
# And the best part: HyperparameterHunter figures out which experiments are compatible all on its own
# You don't have to worry about telling it that KFold=5 is different from KFold=10,
# Or that max_depth=12 is outside of max_depth=Integer(2, 10)

Tested Libraries

  • Keras
  • scikit-learn
  • XGBoost
  • LightGBM
  • CatBoost
  • RGF (rgf_python)

Caveats/FAQs

These are a few things that might catch you off guard:

General

  • Can't provide initial search points to an OptPro?
    • This is intentional. If you want your optimization rounds to start with a specific search point (that you haven't already recorded), run a CVExperiment before initializing your OptPro
    • Assuming the two use the same guiding hyperparameters, and the Experiment fits within the search space defined by your OptPro, the optimizer will locate and read in the Experiment's results (see the sketch after this list)
    • Note that you'll probably want to remove the Experiment after a single run, since its results have already been saved. Leaving it in place will just execute the same Experiment over and over
  • Everything stopped working after I changed things in my "HyperparameterHunterAssets" directory
    • Yeah, don't do that. Especially not to "Descriptions", "Leaderboards", or "TestedKeys"
    • HyperparameterHunter figures out what's going on by reading these files directly
    • Deleting them or changing their contents can break a lot of HyperparameterHunter's functionality
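
For example, here is a minimal sketch of seeding optimization with a specific starting point, reusing the XGBoost classes from the examples above (the particular hyperparameter values are arbitrary):

# Record the desired starting point as a normal Experiment first...
experiment = CVExperiment(
    model_initializer=XGBClassifier,
    model_init_params=dict(max_depth=3, n_estimators=200, subsample=0.5)
)

# ...then start an OptPro whose search space contains that point.
# The optimizer finds the saved Experiment and uses it as learning material
optimizer = opt.BayesianOptPro(iterations=10)
optimizer.forge_experiment(
    model_initializer=XGBClassifier,
    model_init_params=dict(
        max_depth=Integer(2, 10),  # Includes the max_depth=3 recorded above
        n_estimators=200,  # Guiding hyperparameters match the Experiment's
        subsample=0.5
    )
)
optimizer.go()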

Keras

  • Can't find similar Experiments for a simple Dense/Activation neural network?
    • This is likely caused by switching between using a separate Activation layer and providing the activation kwarg directly to the Dense layer
    • Each layer is treated as its own set of hyperparameters (as well as being a hyperparameter itself), which means the following two examples are NOT equivalent as far as HyperparameterHunter is concerned:
      • Dense(10, activation='sigmoid')
      • Dense(10); Activation('sigmoid')
    • We're working on this, but for now the solution is simply to be consistent in how you add activations to your models
      • Either use separate Activation layers, or provide the activation kwarg to the other layers, and stick with it!
  • Can't optimize both of the model.compile arguments `optimizer` and `optimizer_params` at the same time?
    • This is because Keras's optimizers expect different arguments
    • For example, when `optimizer=Categorical(['adam', 'rmsprop'])`, there are two different possible dicts of `optimizer_params`
    • For now, you can only optimize `optimizer` and `optimizer_params` separately
    • A good way to approach this might be to select a few optimizers you want to test and not provide an `optimizer_params` value, so that each optimizer uses its default parameters (a sketch of this two-step approach follows this list)
      • Then you can pick the best optimizer, set `optimizer=<best optimizer>`, and move on to tuning `optimizer_params` with arguments specific to the optimizer you selected
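
A rough sketch of that two-step approach, showing only the `model.compile` calls inside `build_fn` (treat the exact contents of `optimizer_params` as an assumption to verify against the documentation):

# Step 1: compare a few optimizers, each falling back to its default parameters
model.compile(
    optimizer=Categorical(['adam', 'rmsprop', 'sgd', 'adadelta']),  # No `optimizer_params` given
    loss='binary_crossentropy', metrics=['accuracy']
)

# Step 2 (a later run): fix the winning optimizer, then tune its parameters
model.compile(
    optimizer='adam',  # Suppose 'adam' won step 1
    optimizer_params=dict(lr=Real(0.0001, 0.01)),  # Illustrative only - check which kwargs your optimizer accepts
    loss='binary_crossentropy', metrics=['accuracy']
)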

CatBoost

  • Can't find similar Experiments for CatBoost?
    • This may be because the default values for the kwargs expected by CatBoost models' `__init__` methods are defined elsewhere, with placeholder values of None in their signatures
    • Consequently, if a value is not explicitly given for such a kwarg, HyperparameterHunter assumes the parameter's default really is None
    • This is obviously not the case, but I can't seem to find where CatBoost's actual default values live, so if anyone knows how to remedy this, I would love your help! (a workaround sketch follows this list)
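
Until this is resolved, a practical workaround is to pin the CatBoost kwargs you care about to explicit values, so the recorded hyperparameters are comparable across Experiments. A minimal sketch (the specific values are arbitrary):

experiment = CVExperiment(
    model_initializer=CatBoostClassifier,
    # Explicit values instead of CatBoost's hidden defaults, so Experiment matching works as expected
    model_init_params=dict(iterations=500, learning_rate=0.03, depth=7, allow_writing_files=False)
)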
