Hyperparameter optimization with Dask¶
Every machine learning model has some values that are specified before training begins. These values help adapt the model to the data but must be given before any training data is seen. For example, this might be penalty or C in Scikit-learn’s LogisticRegression. These values, which are set before any training data is seen, are called “hyperparameters”. Typical usage looks something like:
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
X, y = make_classification()
est = LogisticRegression(C=10, penalty="l2")
est.fit(X, y)
These hyperparameters influence the quality of the prediction. For example, if C is too small in the example above, the output of the estimator will not fit the data well.
Determining the values of these hyperparameters is difficult. In fact, Scikit-learn has an entire documentation page on finding the best values: https://scikit-learn.org/stable/modules/grid_search.html
Dask enables some new techniques and opportunities for hyperparameter optimization. One of these opportunities involves stopping training early to limit computation. Naturally, this requires some way to stop and restart training (partial_fit or warm_start in Scikit-learn parlance).
This is especially useful when the search is complex and has many search parameters. Good examples are most deep learning models, which have specialized algorithms for handling large amounts of data but for which it is difficult to choose basic hyperparameters (e.g., “learning rate”, “momentum” or “weight decay”).
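To make the idea of stopping and restarting training concrete, here is a minimal sketch that uses Scikit-learn’s SGDClassifier and its partial_fit method. It is not how HyperbandSearchCV is implemented; the names max_calls and patience are hypothetical local variables chosen for this illustration, and the dataset is synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2_000, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

model = SGDClassifier(random_state=0)
classes = np.unique(y_train)

max_calls = 20  # hypothetical budget of partial_fit calls
patience = 3    # hypothetical: stop after this many calls without improvement
best_score, stale, scores = -np.inf, 0, []

for call in range(max_calls):
    # Each call continues training from the current weights
    model.partial_fit(X_train, y_train, classes=classes)
    score = model.score(X_val, y_val)
    scores.append(score)
    if score > best_score:
        best_score, stale = score, 0
    else:
        stale += 1
    if stale >= patience:
        break  # stop early; the remaining budget is never spent

print("partial_fit calls used:", len(scores), "best score:", best_score)
```

Because training can be resumed one partial_fit call at a time, the decision to continue or stop can be revisited after every call.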
This notebook will walk through
setting up a realistic example
how to use HyperbandSearchCV, including
understanding the input parameters to HyperbandSearchCV
running the hyperparameter optimization
how to access information from HyperbandSearchCV
This notebook will specifically not show a performance comparison motivating the use of HyperbandSearchCV. HyperbandSearchCV finds high scores with minimal training; however, this is a tutorial on how to use it. All performance comparisons are deferred to the Learn more section.
[1]:
%matplotlib inline
Setup Dask¶
[2]:
from distributed import Client
client = Client(processes=False, threads_per_worker=4,
n_workers=1, memory_limit='2GB')
client
Create Data¶
[3]:
from sklearn.datasets import make_circles
import numpy as np
import pandas as pd
X, y = make_circles(n_samples=30_000, random_state=0, noise=0.09)
pd.DataFrame({0: X[:, 0], 1: X[:, 1], "class": y}).sample(4_000).plot.scatter(
x=0, y=1, alpha=0.2, c="class", cmap="bwr"
);
Add random dimensions¶
[4]:
from sklearn.utils import check_random_state
rng = check_random_state(42)
random_feats = rng.uniform(-1, 1, size=(X.shape[0], 4))
X = np.hstack((X, random_feats))
X.shape
[4]:
(30000, 6)
Split and scale data¶
[5]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=5_000, random_state=42)
[6]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
[7]:
from dask.utils import format_bytes
for name, X in [("train", X_train), ("test", X_test)]:
    print("dataset =", name)
    print("shape =", X.shape)
    print("bytes =", format_bytes(X.nbytes))
    print("-" * 20)
dataset = train
shape = (25000, 6)
bytes = 1.20 MB
--------------------
dataset = test
shape = (5000, 6)
bytes = 240.00 kB
--------------------
Now we have our train and test sets.
Create model and search space¶
Let’s use Scikit-learn’s MLPClassifier as our model (for convenience). Let’s use this model with 24 neurons and tune some of the other basic hyperparameters.
[8]:
import numpy as np
from sklearn.neural_network import MLPClassifier
model = MLPClassifier()
Deep learning libraries can be used as well. In particular, PyTorch’s Scikit-learn wrapper, Skorch, works well with HyperbandSearchCV.
[9]:
params = {
    "hidden_layer_sizes": [
        (24, ),
        (12, 12),
        (6, 6, 6, 6),
        (4, 4, 4, 4, 4, 4),
        (12, 6, 3, 3),
    ],
    "activation": ["relu", "logistic", "tanh"],
    "alpha": np.logspace(-6, -3, num=1000),  # continuous (densely sampled)
    "batch_size": [16, 32, 64, 128, 256, 512],
}
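To get a feel for what one draw from a search space like this looks like, the sketch below uses Scikit-learn’s ParameterSampler on a shortened copy of the dictionary. This is only an illustration of random sampling from lists of candidate values; whether HyperbandSearchCV draws its parameters in exactly this way is an assumption that is not verified here.

```python
import numpy as np
from sklearn.model_selection import ParameterSampler

# Shortened copy of the search space, for illustration only
params = {
    "hidden_layer_sizes": [(24,), (12, 12), (6, 6, 6, 6)],
    "activation": ["relu", "logistic", "tanh"],
    "alpha": np.logspace(-6, -3, num=1000),
    "batch_size": [16, 32, 64, 128, 256, 512],
}

# Draw three random hyperparameter combinations
samples = list(ParameterSampler(params, n_iter=3, random_state=0))
for sample in samples:
    print(sample)
```

Each sample is a plain dictionary that could be passed to the model with set_params.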
Hyperparameter optimization¶
HyperbandSearchCV is Dask-ML’s meta-estimator to find the best hyperparameters. It can be used as an alternative to RandomizedSearchCV to find similar hyperparameters in less time by not wasting time on hyperparameters that are not promising. Specifically, it is almost guaranteed that it will find high performing models with minimal training.
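The idea of not spending time on unpromising hyperparameters can be sketched with a simplified successive-halving loop. This is a toy illustration, not Dask-ML’s implementation: the search space, the SGDClassifier model, and the values of n_models, calls_per_round and eta are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import ParameterSampler, train_test_split

X, y = make_classification(n_samples=2_000, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
classes = np.unique(y_train)

# Hypothetical search space and budget, chosen only for illustration
space = {"alpha": np.logspace(-6, -1, num=50), "penalty": ["l2", "l1"]}
n_models, calls_per_round, eta = 9, 3, 3

candidates = [
    {"params": p, "model": SGDClassifier(random_state=0, **p), "calls": 0}
    for p in ParameterSampler(space, n_iter=n_models, random_state=0)
]

while True:
    # Train every surviving model a little more, then score it
    for c in candidates:
        for _ in range(calls_per_round):
            c["model"].partial_fit(X_train, y_train, classes=classes)
            c["calls"] += 1
        c["score"] = c["model"].score(X_val, y_val)
    if len(candidates) == 1:
        break
    # Keep only the best 1/eta of the models; the rest stop training here
    candidates.sort(key=lambda c: c["score"], reverse=True)
    candidates = candidates[: max(1, len(candidates) // eta)]

best = candidates[0]
print(best["params"], best["score"], best["calls"])
```

In this sketch the nine models use 27 + 9 + 3 = 39 partial_fit calls in total, whereas training all nine for the full nine calls would take 81.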
This section will focus on
understanding the input parameters to HyperbandSearchCV
using HyperbandSearchCV to find the best hyperparameters
seeing other use cases of HyperbandSearchCV
[10]:
from dask_ml.model_selection import HyperbandSearchCV
Determining input parameters¶
A rule of thumb for determining HyperbandSearchCV’s input parameters requires knowing:
the number of examples the longest trained model will see
the number of hyperparameters to evaluate
Let’s write down what these should be for this example:
[11]:
# For quick response
n_examples = 4 * len(X_train)
n_params = 8
# In practice, HyperbandSearchCV is most useful for longer searches
# n_examples = 15 * len(X_train)
# n_params = 15
In this setup, the models that are trained the longest will see n_examples examples. This is how much data is required, and it is normally set by the problem difficulty. Simple problems may only need 10 passes through the dataset; more complex problems may need 100 passes through the dataset.
There will be n_params parameters sampled, so n_params models will be evaluated. Models with low scores will be terminated before they see n_examples examples. This helps conserve computation.
How can we use these values to determine the inputs for HyperbandSearchCV?
[12]:
max_iter = n_params # number of times partial_fit will be called
chunks = n_examples // n_params # number of examples each call sees
max_iter, chunks
[12]:
(8, 12500)
This means that the longest trained estimator will see about n_examples examples (specifically, n_params * (n_examples // n_params)).
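The arithmetic behind this rule of thumb can be wrapped in a small helper. The function name hyperband_inputs is hypothetical and is not part of Dask-ML; the two settings below are the quick setting and the commented-out longer setting from this notebook, with 25,000 training rows.

```python
def hyperband_inputs(n_examples, n_params):
    """Rule-of-thumb inputs (hypothetical helper, not a Dask-ML function)."""
    max_iter = n_params              # number of partial_fit calls for the longest-trained model
    chunks = n_examples // n_params  # number of examples each partial_fit call sees
    return max_iter, chunks


n_train = 25_000  # rows in the training set used in this notebook
for passes, n_params in [(4, 8), (15, 15)]:
    max_iter, chunks = hyperband_inputs(passes * n_train, n_params)
    seen = max_iter * chunks  # examples seen by the longest-trained model
    print(f"max_iter={max_iter}, chunks={chunks}, examples seen={seen}")
```

With the quick setting this gives max_iter=8 and chunks=12500, matching the output of the cell above.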
Applying input parameters¶
Let’s create a Dask array with this chunk size:
[13]:
import dask.array as da
X_train2 = da.from_array(X_train, chunks=chunks)
y_train2 = da.from_array(y_train, chunks=chunks)
X_train2
[13]:

Each partial_fit call will receive one chunk.
That means the number of examples in each chunk should be (about) the same, and n_examples and n_params should be chosen to make that happen (e.g., with 100 examples, shoot for chunks with (33, 33, 34) examples, not (48, 48, 4) examples).
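A quick way to check chunk balance before building the Dask array is to compute the sizes that a single integer chunk size produces. The helper below is a plain-Python sketch (the name uniform_chunk_sizes is hypothetical); it assumes rows are cut into consecutive blocks of the requested size with the remainder in the last block, which is how Dask treats an integer chunks argument.

```python
def uniform_chunk_sizes(n_rows, chunk):
    """Sizes obtained by cutting n_rows into consecutive blocks of `chunk` rows."""
    return tuple(min(chunk, n_rows - start) for start in range(0, n_rows, chunk))


print(uniform_chunk_sizes(100, 48))         # (48, 48, 4): badly unbalanced
print(uniform_chunk_sizes(100, 34))         # (34, 34, 32): roughly balanced
print(uniform_chunk_sizes(25_000, 12_500))  # (12500, 12500): this notebook's setting
```

A small last chunk means some partial_fit calls see far fewer examples than others, so it is worth adjusting n_examples or n_params until the sizes are close.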
Now let’s use max_iter to create our HyperbandSearchCV object:
[14]:
search = HyperbandSearchCV(
model,
params,
max_iter=max_iter,
patience=True,
)
How much computation will be performed?¶
It isn’t obvious from max_iter and chunks alone how much computation will be done. Luckily, HyperbandSearchCV has a metadata attribute to determine this beforehand:
[15]:
search.metadata["partial_fit_calls"]
[15]:
26
This shows how many partial_fit calls will be performed in the computation. metadata also includes information on the number of models created.
So far, all that’s been done is getting the search ready for computation (and seeing how much computation will be performed). Everything up to this point has been quick and easy.
Performing the computation¶
Now, let’s do the model selection search and find the best hyperparameters. This is the real core of this notebook. This computation will take place on all the hardware Dask has available.
[16]:
%%time
search.fit(X_train2, y_train2, classes=[0, 1, 2, 3])
CPU times: user 4.27 s, sys: 1.03 s, total: 5.3 s
Wall time: 3.78 s
[16]:
HyperbandSearchCV(aggressiveness=3,
estimator=MLPClassifier(activation='relu', alpha=0.0001,
batch_size='auto', beta_1=0.9,
beta_2=0.999, early_stopping=False,
epsilon=1e-08,
hidden_layer_sizes=(100,),
learning_rate='constant',
learning_rate_init=0.001,
max_iter=200, momentum=0.9,
n_iter_no_change=10,
nesterovs_momentum=True, power_t=0.5,
random_state=None, shuffle...
9.26759330e-04, 9.33189772e-04, 9.39664831e-04, 9.46184819e-04,
9.52750047e-04, 9.59360829e-04, 9.66017480e-04, 9.72720319e-04,
9.79469667e-04, 9.86265846e-04, 9.93109181e-04, 1.00000000e-03]),
'batch_size': [16, 32, 64, 128, 256, 512],
'hidden_layer_sizes': [(24,), (12, 12),
(6, 6, 6, 6),
(4, 4, 4, 4, 4, 4),
(12, 6, 3, 3)]},
patience=True, random_state=None, scoring=None,
test_size=None, tol=0.001)
The dashboard will be active while this is running. It will show which workers are running partial_fit and score calls. This takes about 10 seconds.
Integration¶
HyperbandSearchCV follows the Scikit-learn API and mirrors Scikit-learn’s RandomizedSearchCV. This means that it “just works”. All the Scikit-learn attributes and methods are available:
[17]:
search.best_score_
[17]:
0.8286
[18]:
search.best_estimator_
[18]:
MLPClassifier(activation='relu', alpha=0.0004242556430717777, batch_size=32,
beta_1=0.9, beta_2=0.999, early_stopping=False, epsilon=1e-08,
hidden_layer_sizes=(12, 12), learning_rate='constant',
learning_rate_init=0.001, max_iter=200, momentum=0.9,
n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
random_state=None, shuffle=True, solver='adam', tol=0.0001,
validation_fraction=0.1, verbose=False, warm_start=False)
[19]:
cv_results = pd.DataFrame(search.cv_results_)
cv_results.head()
[19]:
param_batch_size  param_alpha  model_id  param_activation  rank_test_score  bracket  params  param_hidden_layer_sizes  mean_partial_fit_time  partial_fit_calls  mean_score_time  std_partial_fit_time  test_score  std_score_time
0  128  0.000130  bracket=1-0  relu  2  1  {'hidden_layer_sizes': (24,), 'batch_size': 12...  (24,)  0.536319  2  0.010327  0.002661  0.5232  0.001270
1  64  0.000288  bracket=1-1  relu  1  1  {'hidden_layer_sizes': (24,), 'batch_size': 64...  (24,)  0.387257  6  0.014947  0.235277  0.7128  0.007558
2  512  0.000030  bracket=1-2  relu  3  1  {'hidden_layer_sizes': (12, 6, 3, 3), 'batch_s...  (12, 6, 3, 3)  0.440670  2  0.021061  0.036416  0.5074  0.005991
3  512  0.000008  bracket=0-0  logistic  2  0  {'hidden_layer_sizes': (4, 4, 4, 4, 4, 4), 'ba...  (4, 4, 4, 4, 4, 4)  0.397423  3  0.024672  0.181207  0.5074  0.005530
4  32  0.000424  bracket=0-1  relu  1  0  {'hidden_layer_sizes': (12, 12), 'batch_size':...  (12, 12)  0.307633  8  0.015034  0.042627  0.8286  0.009554
[20]:
search.score(X_test, y_test)
[20]:
0.819
[21]:
search.predict(X_test)
[21]:

[22]:
search.predict(X_test).compute()
[22]:
array([1, 0, 1, ..., 1, 0, 0])
It also has some other attributes.
[23]:
hist = pd.DataFrame(search.history_)
hist.head()
[23]:
model_id  params  partial_fit_calls  partial_fit_time  score  score_time  elapsed_wall_time  bracket
0  bracket=0-0  {'hidden_layer_sizes': (4, 4, 4, 4, 4, 4), 'ba...  1  0.578630  0.5074  0.019141  0.735826  0
1  bracket=0-1  {'hidden_layer_sizes': (12, 12), 'batch_size':...  1  0.297158  0.5238  0.016905  0.735829  0
2  bracket=1-0  {'hidden_layer_sizes': (24,), 'batch_size': 12...  1  0.538979  0.5010  0.009057  0.798567  1
3  bracket=1-1  {'hidden_layer_sizes': (24,), 'batch_size': 64...  1  0.282382  0.5178  0.017484  0.798569  1
4  bracket=1-2  {'hidden_layer_sizes': (12, 6, 3, 3), 'batch_s...  1  0.477087  0.5074  0.015070  0.798570  1
This illustrates the history after every partial_fit call. There’s also an attribute model_history_ that records the history for each model (it’s a reorganization of history_).
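The kind of reorganization involved can be sketched in plain Python by grouping history records by model. The records below copy three columns from the first five rows of the history table shown earlier, with model IDs written in an assumed `bracket=<bracket>-<index>` form; the grouping code is an illustration, not Dask-ML’s own implementation of model_history_.

```python
from collections import defaultdict

# Three columns from the first five rows of the history table
history = [
    {"model_id": "bracket=0-0", "partial_fit_calls": 1, "score": 0.5074},
    {"model_id": "bracket=0-1", "partial_fit_calls": 1, "score": 0.5238},
    {"model_id": "bracket=1-0", "partial_fit_calls": 1, "score": 0.5010},
    {"model_id": "bracket=1-1", "partial_fit_calls": 1, "score": 0.5178},
    {"model_id": "bracket=1-2", "partial_fit_calls": 1, "score": 0.5074},
]

# Group the flat list of records into one list per model
model_history = defaultdict(list)
for record in history:
    model_history[record["model_id"]].append(record)

for model_id, records in model_history.items():
    print(model_id, [r["score"] for r in records])
```

With the full history, each model’s list would grow by one record for every scored partial_fit call, which makes it easy to plot one learning curve per model.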
Learn more¶
This notebook covered basic usage of HyperbandSearchCV. The following documentation and resources might be useful to learn more about HyperbandSearchCV, including some of the finer use cases:
A talk introducing HyperbandSearchCV to the SciPy 2019 audience and the corresponding paper
Performance comparisons can be found in the SciPy 2019 talk/paper.