Automate Machine Learning with TPOT

This example shows how TPOT can be used with Dask.

TPOT is an automated machine learning library. It evaluates many scikit-learn pipelines and hyperparameter combinations to find a model that works well for your data. Evaluating all of these candidates is computationally expensive, but amenable to parallelism. TPOT can use Dask to distribute these computations across a cluster of machines.
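
To make the idea concrete, here is a minimal sketch of the kind of work an automated search repeats many times: score several candidate pipelines with cross-validation and keep the best one. This is not TPOT's actual search procedure. The three candidates and their settings are hand-picked assumptions for illustration, and only scikit-learn is used.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)

# Hypothetical, hand-written stand-ins for the pipelines a search would propose.
candidates = {
    "scaled_knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "naive_bayes": make_pipeline(GaussianNB()),
    "decision_tree": make_pipeline(DecisionTreeClassifier(max_depth=8, random_state=0)),
}

# Each candidate is scored on its own, without reference to the others.
scores = {
    name: cross_val_score(pipeline, X, y, cv=2).mean()
    for name, pipeline in candidates.items()
}

best_name = max(scores, key=scores.get)
print(scores)
print("best candidate:", best_name)
```

Because no candidate's score depends on another's, the evaluations in the dictionary comprehension could be farmed out to separate workers, which is the property that makes this kind of search a good fit for a cluster.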

This notebook can be run interactively on the dask examples binder. The following video shows a larger version of this notebook on a cluster.

[1]:
from IPython.display import HTML

HTML('<div style="position:relative;height:0;padding-bottom:56.25%"><iframe src="https://www.youtube.com/embed/uyx9nBuOYQQ?ecver=2" width="640" height="360" frameborder="0" allow="autoplay; encrypted-media" style="position:absolute;width:100%;height:100%;left:0" allowfullscreen></iframe></div>')
[2]:
!pip install tpot
Successfully installed deap-1.3.0 stopit-1.1.2 tpot-0.11.0 tqdm-4.39.0 update-checker-0.16
[3]:
import tpot
from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

Setup Dask

We first start a Dask client to get access to the Dask dashboard, which shows progress and performance metrics.

You can view the dashboard by clicking on the dashboard link after you run the cell.

[4]:
from dask.distributed import Client
client = Client(n_workers=4, threads_per_worker=1)
client
[4]:

Client

Cluster

  • Workers: 4
  • Cores: 4
  • Memory: 7.84 GB

Create Data

We’ll use the digits dataset. To ensure the example runs quickly, we’ll make the training dataset relatively small.

[5]:
digits = load_digits()

X_train, X_test, y_train, y_test = train_test_split(
    digits.data,
    digits.target,
    train_size=0.05,
    test_size=0.95,
)

These are just small, in-memory NumPy arrays. TPOT uses Dask to parallelize the pipeline search, not to partition the data, so this example does not apply to larger-than-memory Dask arrays.
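
As a quick check of how small the training set is, the sketch below repeats the same split and prints the array type, shapes, and memory footprint. The variable names `X_small` and `X_rest` and the `random_state=0` argument are assumptions added here so the check is reproducible and does not overwrite the notebook's own `X_train` and `X_test`.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()

# Same proportions as the notebook's split; random_state is added for repeatability.
X_small, X_rest, y_small, y_rest = train_test_split(
    digits.data,
    digits.target,
    train_size=0.05,
    test_size=0.95,
    random_state=0,
)

print(type(X_small).__name__, X_small.shape, X_rest.shape)
print(f"training data size: {X_small.nbytes / 1e3:.1f} kB")
```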

Using Dask

TPOT follows the scikit-learn API: we create a TPOTClassifier with a few hyperparameters, and then fit it on some data. By default, TPOT trains on a single machine. To make sure your cluster is used, pass use_dask=True.

[6]:
# To scale up, increase TPOT parameters such as population_size and generations.
tp = TPOTClassifier(
    generations=2,
    population_size=10,
    cv=2,
    n_jobs=-1,
    random_state=0,
    verbosity=0,
    config_dict=tpot.config.classifier_config_dict_light,
    use_dask=True,
)
[7]:
tp.fit(X_train, y_train)
[7]:
TPOTClassifier(config_dict={'sklearn.cluster.FeatureAgglomeration': {'affinity': ['euclidean',
                                                                                  'l1',
                                                                                  'l2',
                                                                                  'manhattan',
                                                                                  'cosine'],
                                                                     'linkage': ['ward',
                                                                                 'complete',
                                                                                 'average']},
                            'sklearn.decomposition.PCA': {'iterated_power': range(1, 11),
                                                          'svd_solver': ['randomized']},
                            'sklearn.feature_selection.SelectFwe': {'alpha': array([0.   , 0.001, 0.002, 0.003, 0.004, 0.005, 0.006, 0.007...
                            'tpot.builtins.ZeroCount': {}},
               crossover_rate=0.1, cv=2, disable_update_check=False,
               early_stop=None, generations=2, max_eval_time_mins=5,
               max_time_mins=None, memory=None, mutation_rate=0.9, n_jobs=-1,
               offspring_size=None, periodic_checkpoint_folder=None,
               population_size=10, random_state=0, scoring=None, subsample=1.0,
               template=None, use_dask=True, verbosity=0, warm_start=False)

Learn More

See the Dask-ML and TPOT documentation for more information on using Dask and TPOT.