Easy hyperparameter optimization and automatic result saving across machine learning algorithms and libraries

Project Description

HyperparameterHunter

Automatically save and learn from Experiment results, leading to long-lasting, persistent optimization that remembers all your tests.

HyperparameterHunter provides a wrapper for machine learning algorithms that saves all the important data. Simplify the experimentation and hyperparameter tuning process by letting HyperparameterHunter do the hard work of recording, organizing, and learning from your tests, all while using the same libraries you already do. Don't let any of your experiments go to waste, and start doing hyperparameter optimization the way it was meant to be.
- Installation: pip install hyperparameter-hunter
- Source: https://github.com/HunterMcGushion/hyperparameter_hunter
- Documentation: https://hyperparameter-hunter.readthedocs.io
Features

- Automatically record Experiment results
- Truly informed hyperparameter optimization that automatically uses past Experiments
- Eliminate boilerplate code for cross-validation loops, predicting, and scoring
- Stop worrying about keeping track of hyperparameters, scores, or re-running the same Experiments
- Use the libraries and utilities you already love
How to Use HyperparameterHunter

Don't think of HyperparameterHunter as just another optimization library that you bring out only when it's time to do hyperparameter optimization. Of course it does optimization, but it's better viewed as your own personal machine learning toolbox/assistant.

The idea is to start using HyperparameterHunter immediately. Run all of your benchmark and one-off experiments through it.

The more you use HyperparameterHunter, the better your results will be. If you only use it for optimization, sure, it'll do what you ask, but that misses the point of HyperparameterHunter.

If you've been using it for experimentation and optimization throughout your project, then when you decide it's time for hyperparameter optimization, HyperparameterHunter is already aware of everything you've done, and that's when it shines. It doesn't start optimization from scratch like other libraries; it starts off with all the Experiments and previous optimization rounds you've already run through it.
Getting Started

1) Environment:

Set up an Environment to organize Experiments and optimization results.
Any Experiments or optimization rounds we perform will use our active Environment.
from hyperparameter_hunter import Environment, CVExperiment
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold
data = load_breast_cancer()
df = pd.DataFrame(data=data.data, columns=data.feature_names)
df['target'] = data.target
env = Environment(
    train_dataset=df,  # Add holdout/test dataframes, too
    results_path='path/to/results/directory',  # Where your result files will go
    metrics=['roc_auc_score'],  # Callables, or strings referring to `sklearn.metrics`
    cv_type=StratifiedKFold,  # Class, or string in `sklearn.model_selection`
    cv_params=dict(n_splits=5, shuffle=True, random_state=32)
)
2) Individual Experimentation:

Conduct Experiments with your favorite libraries simply by providing model initializers and hyperparameters
Keras
# Same format used by `keras.wrappers.scikit_learn`. Nothing new to learn
from keras.callbacks import ReduceLROnPlateau
from keras.layers import Dense, Dropout
from keras.models import Sequential
from keras.wrappers.scikit_learn import KerasClassifier

def build_fn(input_shape):  # `input_shape` calculated for you
    model = Sequential([
        Dense(100, kernel_initializer='uniform', input_shape=input_shape, activation='relu'),
        Dropout(0.5),
        Dense(1, kernel_initializer='uniform', activation='sigmoid')
    ])  # All layer arguments saved (whether explicit or Keras default) for future use
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model

experiment = CVExperiment(
    model_initializer=KerasClassifier,
    model_init_params=build_fn,  # We interpret your build_fn to save hyperparameters in a useful, readable format
    model_extra_params=dict(
        callbacks=[ReduceLROnPlateau(patience=5)],  # Use Keras callbacks
        batch_size=32, epochs=10, verbose=0  # Fit/predict arguments
    )
)
SKLearn
from sklearn.svm import LinearSVC

experiment = CVExperiment(
    model_initializer=LinearSVC,  # (Or any of the dozens of other SK-Learn algorithms)
    # `dual=False` added here: sklearn's LinearSVC does not support `penalty='l1'` with the default `dual=True`
    model_init_params=dict(penalty='l1', C=0.9, dual=False)  # Default values used and recorded for kwargs not given
)
XGBoost
from xgboost import XGBClassifier

experiment = CVExperiment(
    model_initializer=XGBClassifier,
    model_init_params=dict(objective='reg:linear', max_depth=3, n_estimators=100, subsample=0.5)
)
LightGBM
from lightgbm import LGBMClassifier

experiment = CVExperiment(
    model_initializer=LGBMClassifier,
    model_init_params=dict(boosting_type='gbdt', num_leaves=31, max_depth=-1, min_child_samples=5, subsample=0.5)
)
CatBoost
from catboost import CatBoostClassifier

experiment = CVExperiment(
    model_initializer=CatBoostClassifier,
    model_init_params=dict(iterations=500, learning_rate=0.01, depth=7, allow_writing_files=False),
    model_extra_params=dict(fit=dict(verbose=True))  # Send kwargs to `fit` and other extra methods
)
RGF
from rgf.sklearn import RGFClassifier

experiment = CVExperiment(
    model_initializer=RGFClassifier,
    model_init_params=dict(max_leaf=1000, algorithm='RGF', min_samples_leaf=10)
)
3) Hyperparameter Optimization:

Just like Experiments, but if you want to optimize a hyperparameter, use the classes imported below
from hyperparameter_hunter import Real, Integer, Categorical
from hyperparameter_hunter import optimization as opt
Keras
def build_fn(input_shape):
    model = Sequential([
        Dense(Integer(50, 150), input_shape=input_shape, activation='relu'),
        Dropout(Real(0.2, 0.7)),
        Dense(1, activation=Categorical(['sigmoid', 'softmax']))
    ])
    model.compile(
        optimizer=Categorical(['adam', 'rmsprop', 'sgd', 'adadelta']),
        loss='binary_crossentropy', metrics=['accuracy']
    )
    return model

optimizer = opt.RandomForestOptPro(iterations=7)
optimizer.forge_experiment(
    model_initializer=KerasClassifier,
    model_init_params=build_fn,
    model_extra_params=dict(
        callbacks=[ReduceLROnPlateau(patience=Integer(5, 10))],
        batch_size=Categorical([32, 64]),
        epochs=10, verbose=0
    )
)
optimizer.go()
SKLearn
from sklearn.ensemble import AdaBoostClassifier

optimizer = opt.DummyOptPro(iterations=42)
optimizer.forge_experiment(
    model_initializer=AdaBoostClassifier,  # (Or any of the dozens of other SKLearn algorithms)
    model_init_params=dict(
        n_estimators=Integer(75, 150),
        learning_rate=Real(0.8, 1.3),
        algorithm='SAMME.R'
    )
)
optimizer.go()
XGBoost
optimizer = opt.BayesianOptPro(iterations=10)
optimizer.forge_experiment(
    model_initializer=XGBClassifier,
    model_init_params=dict(
        max_depth=Integer(low=2, high=20),
        learning_rate=Real(0.0001, 0.5),
        n_estimators=200,
        subsample=0.5,
        booster=Categorical(['gbtree', 'gblinear', 'dart'])
    )
)
optimizer.go()
LightGBM
optimizer = opt.BayesianOptPro(iterations=100)
optimizer.forge_experiment(
    model_initializer=LGBMClassifier,
    model_init_params=dict(
        boosting_type=Categorical(['gbdt', 'dart']),
        num_leaves=Integer(5, 20),
        max_depth=-1,
        min_child_samples=5,
        subsample=0.5
    )
)
optimizer.go()
CatBoost
optimizer = opt.GradientBoostedRegressionTreeOptPro(iterations=32)
optimizer.forge_experiment(
    model_initializer=CatBoostClassifier,
    model_init_params=dict(
        iterations=100,
        eval_metric=Categorical(['Logloss', 'Accuracy', 'AUC']),
        learning_rate=Real(low=0.0001, high=0.5),
        depth=Integer(4, 7),
        allow_writing_files=False
    )
)
optimizer.go()
RGF
optimizer = opt.ExtraTreesOptPro(iterations=10)
optimizer.forge_experiment(
    model_initializer=RGFClassifier,
    model_init_params=dict(
        max_leaf=1000,
        algorithm=Categorical(['RGF', 'RGF_Opt', 'RGF_Sib']),
        l2=Real(0.01, 0.3),
        normalize=Categorical([True, False]),
        learning_rate=Real(0.3, 0.7),
        loss=Categorical(['LS', 'Expo', 'Log', 'Abs'])
    )
)
optimizer.go()
Output File Structure

This is a simple illustration of the file structure you can expect your Experiments to generate. For an in-depth description of the directory structure and the contents of the various files, see the File Structure Overview section in the documentation. However, the essentials are as follows:

- Each Experiment adds a file to each HyperparameterHunterAssets/Experiments subdirectory, named by experiment_id
- Each Experiment also adds an entry to HyperparameterHunterAssets/Leaderboards/GlobalLeaderboard.csv
- Customize which files are created via Environment's file_blacklist and do_full_save kwargs (documented here; a sketch follows the directory tree below)
HyperparameterHunterAssets
| Heartbeat.log
|
└───Experiments
| |
| └───Descriptions
| | | <Files describing Experiment results, conditions, etc.>.json
| |
| └───Predictions<OOF/Holdout/Test>
| | | <Files containing Experiment predictions for the indicated dataset>.csv
| |
| └───Heartbeats
| | | <Files containing the log produced by the Experiment>.log
| |
| └───ScriptBackups
| | <Files containing a copy of the script that created the Experiment>.py
|
└───Leaderboards
| | GlobalLeaderboard.csv
| | <Other leaderboards>.csv
|
└───TestedKeys
| | <Files named by Environment key, containing hyperparameter keys>.json
|
└───KeyAttributeLookup
| <Files linking complex objects used in Experiments to their hashes>
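For a rough idea of that customization, here is a minimal sketch. The `file_blacklist` value and the keys of the result dict passed to `do_full_save` are assumptions for illustration; check the documentation for the exact accepted values:

# Assumes `df` is a train DataFrame like the one built in "Getting Started" above
env = Environment(
    train_dataset=df,
    results_path='HyperparameterHunterAssets',
    metrics=['roc_auc_score'],
    cv_type='StratifiedKFold',
    cv_params=dict(n_splits=5, shuffle=True, random_state=32),
    file_blacklist=['script_backup'],  # Assumed value: skip creating ScriptBackups files
    # Assumed result-dict layout: only keep full result files for Experiments clearing a threshold
    do_full_save=lambda result: result['final_evaluations']['oof']['roc_auc_score'] > 0.75,
)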
Installation

pip install hyperparameter-hunter

If you like being on the cutting edge and want all the latest developments, run:

pip install git+https://github.com/HunterMcGushion/hyperparameter_hunter.git

If you want to contribute to HyperparameterHunter, start here.
I Still Don't Get It

That's ok. Don't feel bad. It's a bit weird to wrap your head around. Here's an example that illustrates how everything is related:
from hyperparameter_hunter import Environment, CVExperiment, BayesianOptPro, Integer
from hyperparameter_hunter.utils.learning_utils import get_breast_cancer_data
from xgboost import XGBClassifier
# Start by creating an `Environment` - This is where you define how Experiments (and optimization) will be conducted
env = Environment(
    train_dataset=get_breast_cancer_data(target='target'),
    results_path='HyperparameterHunterAssets',
    metrics=['roc_auc_score'],
    cv_type='StratifiedKFold',
    cv_params=dict(n_splits=10, shuffle=True, random_state=32),
)
# Now, conduct an `Experiment`
# This tells HyperparameterHunter to use the settings in the active `Environment` to train a model with these hyperparameters
experiment = CVExperiment(
    model_initializer=XGBClassifier,
    model_init_params=dict(
        objective='reg:linear',
        max_depth=3
    )
)
# That's it. No annoying boilerplate code to fit models and record results
# Now, the `Environment`'s `results_path` directory will contain new files describing the Experiment just conducted
# Time for the fun part. We'll set up some hyperparameter optimization by first defining the `OptPro` (Optimization Protocol) we want
optimizer = BayesianOptPro(verbose=1)
# Now we're going to say which hyperparameters we want to optimize.
# Notice how this looks just like our `experiment` above
optimizer.forge_experiment(
    model_initializer=XGBClassifier,
    model_init_params=dict(
        objective='reg:linear',  # We're setting this as a constant guideline - Not one to optimize
        max_depth=Integer(2, 10)  # Instead of using an int like the `experiment` above, we provide a space to search
    )
)
# Notice that our range for `max_depth` includes the `max_depth=3` value we used in our `experiment` earlier
optimizer.go() # Now, we go
assert experiment.experiment_id in [_[2] for _ in optimizer.similar_experiments]
# Here we're verifying that the `experiment` we conducted first was found by `optimizer` and used as learning material
# You can also see via the console that we found `experiment`'s saved files, and used it to start optimization
last_experiment_id = optimizer.current_experiment.experiment_id
# Let's save the id of the experiment that was just conducted by `optimizer`
optimizer.go() # Now, we'll start up `optimizer` again...
# And we can see that this second optimization round learned from both our first `experiment` and our first optimization round
assert experiment.experiment_id in [_[2] for _ in optimizer.similar_experiments]
assert last_experiment_id in [_[2] for _ in optimizer.similar_experiments]
# It even did all this without us having to tell it what experiments to learn from
# Now think about how much better your hyperparameter optimization will be when it learns from:
# - All your past experiments, and
# - All your past optimization rounds
# And the best part: HyperparameterHunter figures out which experiments are compatible all on its own
# You don't have to worry about telling it that KFold=5 is different from KFold=10,
# Or that max_depth=12 is outside of max_depth=Integer(2, 10)
Tested Libraries
- Keras
- scikit-learn
- LightGBM
- CatBoost
- XGBoost
- rgf_python
- ... More on the way
Gotchas/FAQs

These are some things that might "getcha"

General
- Can't provide initial search points to an OptPro?
  - This is intentional. If you want your optimization rounds to start with specific search points (that you haven't already recorded), conduct a CVExperiment before initializing your OptPro (see the sketch after this list)
  - Assuming the two have the same guideline hyperparameters, and the Experiment fits inside the search space defined by your OptPro, the optimizer will locate and read in the Experiment's results
  - Note that you may want to remove the CVExperiment after it has run once, since its results have already been saved. Leaving it there will just execute the same Experiment over and over
- Everything stopped working after I changed things in my "HyperparameterHunterAssets" directory
  - Yeah, don't do that. Especially not "Descriptions", "Leaderboards", or "TestedKeys"
  - HyperparameterHunter figures out what's going on by reading these files directly
  - Removing them or changing their contents can break much of HyperparameterHunter's functionality
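Here is a minimal sketch of the initial-search-point pattern described above, using XGBoost and assuming an active Environment like the ones in the earlier examples (the hyperparameter values are illustrative):

from hyperparameter_hunter import CVExperiment, Integer
from hyperparameter_hunter import optimization as opt
from xgboost import XGBClassifier

# Record the desired starting point as an ordinary Experiment first
CVExperiment(
    model_initializer=XGBClassifier,
    model_init_params=dict(objective='reg:linear', max_depth=5)  # Your "initial search point"
)

# Same guideline hyperparameters, and `max_depth=5` fits inside `Integer(2, 10)`,
# so the optimizer locates the saved Experiment and reads it in before searching
optimizer = opt.BayesianOptPro(iterations=10)
optimizer.forge_experiment(
    model_initializer=XGBClassifier,
    model_init_params=dict(objective='reg:linear', max_depth=Integer(2, 10))
)
optimizer.go()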
Keras
- Can't find similar Experiments for simple Dense/Activation neural networks?
  - This is likely caused by switching between using a separate `Activation` layer and providing the `activation` kwarg to a `Dense` layer
  - Each layer is treated as its own set of hyperparameters (as well as being a hyperparameter itself), which means that, as far as HyperparameterHunter is concerned, the following two examples are NOT equivalent:
    - `Dense(10, activation='sigmoid')`
    - `Dense(10); Activation('sigmoid')`
  - We're working on this, but for now the solution is just to be consistent in how you add activations to your models
    - Either use separate `Activation` layers, or provide `activation` kwargs to other layers, and stick with it!
- Can't optimize both of the `model.compile` arguments: `optimizer` and `optimizer_params`?
  - This is because Keras's `optimizers` expect different arguments
  - For example, when `optimizer=Categorical(['adam', 'rmsprop'])`, there are two different possible dicts of `optimizer_params`
  - For now, you can only optimize `optimizer` and `optimizer_params` separately
  - A good way to do this (sketched after this list) is to select a few `optimizer`s you want to test, and not provide an `optimizer_params` value, so that each `optimizer` uses its default parameters
    - Then you can select which `optimizer` was the best, set `optimizer=<best optimizer>`, and move on to tuning `optimizer_params`, with arguments specific to the `optimizer` you selected
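A minimal sketch of the first stage of that two-step approach, reusing the Keras optimization pattern from section 3 (Keras imports and an active Environment assumed; layer sizes and iteration counts are illustrative):

# Stage 1: compare a few optimizers, each using its default parameters
def build_fn(input_shape):
    model = Sequential([
        Dense(100, input_shape=input_shape, activation='relu'),
        Dense(1, activation='sigmoid')
    ])
    model.compile(
        optimizer=Categorical(['adam', 'rmsprop', 'sgd']),  # No optimizer params given, so defaults are used
        loss='binary_crossentropy', metrics=['accuracy']
    )
    return model

optimizer = opt.BayesianOptPro(iterations=9)
optimizer.forge_experiment(model_initializer=KerasClassifier, model_init_params=build_fn)
optimizer.go()

# Stage 2 (a separate run): hard-code the winner in `build_fn`, e.g. `optimizer='adam'`,
# then run a new optimization round tuning the parameters specific to that optimizer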
CatBoost
- Can't find similar Experiments for CatBoost?
  - This may be happening because the default values of the kwargs expected by CatBoost models' `__init__` methods are defined elsewhere, with placeholder values of `None` in their signatures
  - Because of this, if a value isn't explicitly provided for a parameter, HyperparameterHunter assumes its default value really is `None`
  - This is obviously not the case, but I can't seem to figure out where the default values actually used by CatBoost are located, so if anyone knows how to remedy this situation, I would love your help!
Project Details

Download Files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution: hyperparameter_hunter-3.0.0.tar.gz

Built Distribution: hyperparameter_hunter-3.0.0-py3-none-any.whl
Hashes for hyperparameter_hunter-3.0.0.tar.gz

Algorithm | Hash digest
---|---
SHA256 | 279baf496c31aa2542479e97db96dccd182d9ddc2e1a7da4ad7a570e83d135a2
MD5 | a029938f443d3930cbbcf84d7c03df8f
BLAKE2b-256 | 1ac89e703dba3866daf8255ac0fdb2b68c79d6e5567645d88aea59240c9c5966
Hashes for hyperparameter_hunter-3.0.0-py3-none-any.whl

Algorithm | Hash digest
---|---
SHA256 | 4fbe5c320cb29c6043f7532d14dcb05b42dc0d4c613dfa206b732302e985d2d8
MD5 | 48cc0da3cc8efc68c747cf43ec9a4e0c
BLAKE2b-256 | 7ae0e6e73f1bb07cc738746a5a5bfdbc575b4123c8a6aa388b4d64e4880c802f