Automate Machine Learning with TPOT

This example shows how TPOT can be used with Dask.

TPOT is an automated machine learning library. It evaluates many scikit-learn pipelines and hyperparameter combinations to find a model that works well for your data. Evaluating that many candidates is computationally expensive, but amenable to parallelism. TPOT can use Dask to distribute these computations across a cluster of machines.
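
To make the cost concrete, here is a minimal sketch of the kind of work TPOT automates, written with plain scikit-learn rather than TPOT itself. The three candidate pipelines and their names are hypothetical, hand-picked for illustration; TPOT generates its candidates automatically. Each candidate is scored by cross-validation without reference to the others, which is why this kind of search parallelizes well.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)

# Hypothetical hand-written candidates (illustration only, not TPOT output).
candidates = {
    f"scaled_tree_depth_{depth}": make_pipeline(
        StandardScaler(),
        DecisionTreeClassifier(max_depth=depth, random_state=0),
    )
    for depth in (3, 6, 9)
}

# Each evaluation needs only its own pipeline and the data, so the
# evaluations could run in any order, or on different workers.
scores = {
    name: cross_val_score(pipeline, X, y, cv=2).mean()
    for name, pipeline in candidates.items()
}

best_name = max(scores, key=scores.get)
print(scores)
print("best candidate:", best_name)
```

A real search covers far more candidates than this loop does, so running the evaluations concurrently matters much more.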

This notebook can be run interactively on the dask examples binder. The following video shows a larger version of this notebook on a cluster.

[1]:

from IPython.display import YouTubeVideo

YouTubeVideo("uyx9nBuOYQQ")

[2]:

import tpot
from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

/usr/share/miniconda3/envs/dask-examples/lib/python3.9/site-packages/tpot/builtins/__init__.py:36: UserWarning: Warning: optional dependency `torch` is not available. - skipping import of NN models.
  warnings.warn("Warning: optional dependency `torch` is not available. - skipping import of NN models.")

Setup Dask

We first start a Dask client in order to get access to the Dask dashboard, which will provide progress and performance metrics.

You can view the dashboard by clicking on the dashboard link after you run the cell.

[3]:

from dask.distributed import Client
client = Client(n_workers=4, threads_per_worker=1)
client

[3]:

Client

Client-9d3a3f10-0de1-11ed-a59a-000d3a8f7959

Connection method: Cluster object	Cluster type: distributed.LocalCluster
Dashboard: http://127.0.0.1:8787/status

Cluster Info

LocalCluster

554e7aff

Dashboard: http://127.0.0.1:8787/status	Workers: 4
Total threads: 4	Total memory: 6.78 GiB
Status: running	Using processes: True

Scheduler Info

Scheduler

Scheduler-21db5fb4-9a69-4b11-a57a-880ad23c4052

Comm: tcp://127.0.0.1:40911	Workers: 4
Dashboard: http://127.0.0.1:8787/status	Total threads: 4
Started: Just now	Total memory: 6.78 GiB

Workers

Worker: 0

Comm: tcp://127.0.0.1:34711	Total threads: 1
Dashboard: http://127.0.0.1:42307/status	Memory: 1.70 GiB
Nanny: tcp://127.0.0.1:43647
Local directory: /home/runner/work/dask-examples/dask-examples/machine-learning/dask-worker-space/worker-m8hopet5

Worker: 1

Comm: tcp://127.0.0.1:46865	Total threads: 1
Dashboard: http://127.0.0.1:38167/status	Memory: 1.70 GiB
Nanny: tcp://127.0.0.1:42469
Local directory: /home/runner/work/dask-examples/dask-examples/machine-learning/dask-worker-space/worker-1d4jm_9g

Worker: 2

Comm: tcp://127.0.0.1:39607	Total threads: 1
Dashboard: http://127.0.0.1:46559/status	Memory: 1.70 GiB
Nanny: tcp://127.0.0.1:34247
Local directory: /home/runner/work/dask-examples/dask-examples/machine-learning/dask-worker-space/worker-_ojdgmeq

Worker: 3

Comm: tcp://127.0.0.1:37721	Total threads: 1
Dashboard: http://127.0.0.1:37659/status	Memory: 1.70 GiB
Nanny: tcp://127.0.0.1:44933
Local directory: /home/runner/work/dask-examples/dask-examples/machine-learning/dask-worker-space/worker-nlky1gwa

Create Data

We’ll use the digits dataset. To ensure the example runs quickly, we’ll train on only 5% of the samples and hold out the remaining 95%.

[4]:

digits = load_digits()

X_train, X_test, y_train, y_test = train_test_split(
    digits.data,
    digits.target,
    train_size=0.05,
    test_size=0.95,
)

These are just small, in-memory NumPy arrays. This example is not applicable to larger-than-memory Dask arrays.
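
As a quick sanity check on how small the training set is, the sketch below repeats the same split and prints the resulting shapes. The `random_state=0` argument is an addition for reproducibility and is not part of the original cell.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()

# Same proportions as above; random_state is added here (an assumption,
# not in the original cell) so the split is repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    digits.data,
    digits.target,
    train_size=0.05,
    test_size=0.95,
    random_state=0,
)

n_total = digits.data.shape[0]
print("total samples:  ", n_total)
print("training shape: ", X_train.shape)
print("test shape:     ", X_test.shape)
print("array type:     ", type(X_train).__name__)
```

Each row is a flattened 8x8 image, so both arrays have 64 feature columns.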

Using Dask

TPOT follows the scikit-learn API: we create a TPOTClassifier with a few hyperparameters and then fit it on some data. By default, TPOT trains on a single machine. To make sure your cluster is used, pass use_dask=True.

[5]:

# To scale up, increase TPOT parameters such as population_size and generations.
tp = TPOTClassifier(
    generations=2,
    population_size=10,
    cv=2,
    n_jobs=-1,
    random_state=0,
    verbosity=0,
    config_dict=tpot.config.classifier_config_dict_light,
    use_dask=True,
)
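
To get a feel for what scaling up costs, the following back-of-the-envelope sketch estimates the search budget. It relies on the rule stated in the TPOT documentation that a run evaluates population_size + generations × offspring_size pipelines, with offspring_size defaulting to population_size. The helper name `tpot_evaluation_budget` is hypothetical, and the result is best read as an upper bound, since TPOT may skip pipelines it has already evaluated.

```python
def tpot_evaluation_budget(population_size, generations, cv, offspring_size=None):
    """Estimate (pipelines evaluated, model fits) for one TPOT run.

    Hypothetical helper, based on the documented rule
    population_size + generations * offspring_size.
    """
    if offspring_size is None:
        # TPOT's documented default: offspring_size equals population_size.
        offspring_size = population_size
    pipelines = population_size + generations * offspring_size
    # Each pipeline is fit once per cross-validation fold.
    fits = pipelines * cv
    return pipelines, fits


# Settings used in this notebook.
pipelines, fits = tpot_evaluation_budget(population_size=10, generations=2, cv=2)
print(f"{pipelines} pipelines, {fits} cross-validation fits")
```

With these settings the estimate is 30 pipelines and 60 fits. Raising population_size and generations to 100 each with cv=5 gives 10,100 pipelines and 50,500 fits, which is where a cluster starts to pay off.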

[6]:

tp.fit(X_train, y_train)

[6]:

TPOTClassifier(config_dict={'sklearn.cluster.FeatureAgglomeration': {'affinity': ['euclidean',
                                                                                  'l1',
                                                                                  'l2',
                                                                                  'manhattan',
                                                                                  'cosine'],
                                                                     'linkage': ['ward',
                                                                                 'complete',
                                                                                 'average']},
                            'sklearn.decomposition.PCA': {'iterated_power': range(1, 11),
                                                          'svd_solver': ['randomized']},
                            'sklearn.feature_selection.SelectFwe': {'alpha': array([0.   , 0.001, 0.002, 0.003, 0.004, 0.005, 0.006, 0.007...
                                                                          'max']},
                            'sklearn.preprocessing.RobustScaler': {},
                            'sklearn.preprocessing.StandardScaler': {},
                            'sklearn.tree.DecisionTreeClassifier': {'criterion': ['gini',
                                                                                  'entropy'],
                                                                    'max_depth': range(1, 11),
                                                                    'min_samples_leaf': range(1, 21),
                                                                    'min_samples_split': range(2, 21)},
                            'tpot.builtins.ZeroCount': {}},
               cv=2, generations=2, n_jobs=-1, population_size=10,
               random_state=0, use_dask=True)
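
The notebook sets aside a test split but does not use it. A natural next step is to score the fitted model on that held-out data. The sketch below shows the pattern with a stand-in estimator so that it runs on its own: a DecisionTreeClassifier (one of the estimators in the light configuration shown above) replaces the fitted TPOTClassifier. The `max_depth=5` and `random_state=0` values are arbitrary choices for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.05, test_size=0.95, random_state=0
)

# Stand-in for the fitted TPOTClassifier `tp` (assumption: it follows the
# same scikit-learn fit/score interface).
model = DecisionTreeClassifier(max_depth=5, random_state=0)
model.fit(X_train, y_train)

# Mean accuracy on data the model never saw during training.
holdout_accuracy = model.score(X_test, y_test)
print(f"held-out accuracy on {len(X_test)} samples: {holdout_accuracy:.3f}")
```

With the objects from this notebook, the equivalent call would be `tp.score(X_test, y_test)`. TPOT can also write the best pipeline it found to a Python file with `tp.export(...)`.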

Learn More

See the Dask-ML and TPOT documentation for more information on using Dask and TPOT.
