{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Use Voting Classifiers\n", "======================\n", "\n", "A [voting classifier](http://scikit-learn.org/stable/modules/ensemble.html#voting-classifier) model combines multiple different models (i.e., sub-estimators) into a single model, which is (ideally) stronger than any of the individual models alone.\n", "\n", "[Dask](http://ml.dask.org/joblib.html) provides the software to train individual sub-estimators on different machines in a cluster. This enables users to train more models in parallel than would have been possible on a single machine. Note that users will only observe this benefit if they have a distributed cluster with more resources than their single machine (because scikit-learn already enables users to parallelize training across cores on a single machine).\n", "\n", "What follows is an example of how one would train a voting classifier model with Dask (using a local cluster)." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "execution": { "iopub.execute_input": "2022-07-27T19:23:58.109186Z", "iopub.status.busy": "2022-07-27T19:23:58.108930Z", "iopub.status.idle": "2022-07-27T19:23:58.664462Z", "shell.execute_reply": "2022-07-27T19:23:58.663821Z" } }, "outputs": [], "source": [ "from sklearn.ensemble import VotingClassifier\n", "from sklearn.linear_model import SGDClassifier\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.svm import SVC\n", "\n", "import sklearn.datasets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We create a synthetic dataset (with 1000 rows and 20 columns) that we can give to the voting classifier model." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": { "execution": { "iopub.execute_input": "2022-07-27T19:23:58.668444Z", "iopub.status.busy": "2022-07-27T19:23:58.667883Z", "iopub.status.idle": "2022-07-27T19:23:58.673984Z", "shell.execute_reply": "2022-07-27T19:23:58.673154Z" } }, "outputs": [], "source": [ "X, y = sklearn.datasets.make_classification(n_samples=1_000, n_features=20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We specify the VotingClassifier as a list of (name, sub-estimator) tuples. Fitting the VotingClassifier on the data fits each of the sub-estimators. We set the ```n_jobs``` argument to -1, which instructs scikit-learn to use all available cores (notice that we haven't used Dask yet)." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "execution": { "iopub.execute_input": "2022-07-27T19:23:58.676868Z", "iopub.status.busy": "2022-07-27T19:23:58.676456Z", "iopub.status.idle": "2022-07-27T19:23:58.680413Z", "shell.execute_reply": "2022-07-27T19:23:58.679808Z" } }, "outputs": [], "source": [ "classifiers = [\n", "    ('sgd', SGDClassifier(max_iter=1000)),\n", "    ('logisticregression', LogisticRegression()),\n", "    ('svc', SVC(gamma='auto')),\n", "]\n", "clf = VotingClassifier(classifiers, n_jobs=-1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We call the classifier's ```fit``` method to train it." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": { "execution": { "iopub.execute_input": "2022-07-27T19:23:58.683335Z", "iopub.status.busy": "2022-07-27T19:23:58.682851Z", "iopub.status.idle": "2022-07-27T19:23:59.758585Z", "shell.execute_reply": "2022-07-27T19:23:59.758029Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 15.6 ms, sys: 28 ms, total: 43.6 ms\n", "Wall time: 1.05 s\n" ] }, { "data": { "text/plain": [ "VotingClassifier(estimators=[('sgd', SGDClassifier()),\n", " ('logisticregression', LogisticRegression()),\n", " ('svc', SVC(gamma='auto'))],\n", " n_jobs=-1)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%time clf.fit(X, y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Creating a Dask [client](https://distributed.readthedocs.io/en/latest/client.html) provides performance and progress metrics via the dashboard. Because ```Client``` is given no arguments, it creates and connects to a [local cluster](http://distributed.readthedocs.io/en/latest/local-cluster.html) (not a distributed cluster).\n", "\n", "We can view the dashboard by clicking the link after running the cell." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "execution": { "iopub.execute_input": "2022-07-27T19:23:59.762539Z", "iopub.status.busy": "2022-07-27T19:23:59.761927Z", "iopub.status.idle": "2022-07-27T19:24:01.914766Z", "shell.execute_reply": "2022-07-27T19:24:01.913930Z" } }, "outputs": [ { "data": { "text/html": [ "
Client: Client-aab9cc32-0de1-11ed-a68c-000d3a8f7959. Connection method: Cluster object. Cluster type: distributed.LocalCluster. Dashboard: http://127.0.0.1:8787/status.\n",
"LocalCluster: 0fb6a26e. Dashboard: http://127.0.0.1:8787/status. Workers: 2. Total threads: 2. Total memory: 6.78 GiB. Status: running. Using processes: True.\n",
"Scheduler: Scheduler-5cd77096-ccb9-489c-8f73-bb70bd3c0de6. Comm: tcp://127.0.0.1:37223. Dashboard: http://127.0.0.1:8787/status. Started: Just now. Workers: 2. Total threads: 2. Total memory: 6.78 GiB.\n",
"Worker: 0. Comm: tcp://127.0.0.1:37799. Dashboard: http://127.0.0.1:40739/status. Nanny: tcp://127.0.0.1:37995. Total threads: 1. Memory: 3.39 GiB. Local directory: /home/runner/work/dask-examples/dask-examples/machine-learning/dask-worker-space/worker-p2j5wsnz.\n",
"Worker: 1. Comm: tcp://127.0.0.1:33541. Dashboard: http://127.0.0.1:42027/status. Nanny: tcp://127.0.0.1:37335. Total threads: 1. Memory: 3.39 GiB. Local directory: /home/runner/work/dask-examples/dask-examples/machine-learning/dask-worker-space/worker-u_ezjujo.
\n" ], "text/plain": [ "" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import joblib\n", "from distributed import Client\n", "\n", "client = Client()\n", "client" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To train the voting classifier, we call the classifier's ```fit``` method, but enclosed in joblib's ```parallel_backend``` context manager. This distributes training of the sub-estimators across the cluster." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "execution": { "iopub.execute_input": "2022-07-27T19:24:01.918419Z", "iopub.status.busy": "2022-07-27T19:24:01.917854Z", "iopub.status.idle": "2022-07-27T19:24:03.043674Z", "shell.execute_reply": "2022-07-27T19:24:03.043124Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "VotingClassifier(estimators=[('sgd', SGDClassifier()),\n", " ('logisticregression', LogisticRegression()),\n", " ('svc', SVC(gamma='auto'))],\n", " n_jobs=-1)\n", "CPU times: user 203 ms, sys: 79.1 ms, total: 282 ms\n", "Wall time: 1.12 s\n" ] } ], "source": [ "%%time \n", "with joblib.parallel_backend(\"dask\"):\n", "    clf.fit(X, y)\n", "\n", "print(clf)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that we see no advantage from using Dask here, because we are using a local cluster rather than a distributed cluster and scikit-learn is already using all of the machine's cores. If we were using a distributed cluster, Dask would enable us to take advantage of the multiple machines and train sub-estimators across them." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12" } }, "nbformat": 4, "nbformat_minor": 4 }