Use Voting Classifiers
A voting classifier combines multiple different models (i.e., sub-estimators) into a single model, which is (ideally) stronger than any of the individual models alone.
Dask provides the software to train individual sub-estimators on different machines in a cluster. This enables users to train more models in parallel than would have been possible on a single machine. Note that users will only observe this benefit if they have a distributed cluster with more resources than their single machine (because sklearn already enables users to parallelize training across cores on a single machine).
What follows is an example of how one would deploy a voting classifier model in dask (using a local cluster).
[1]:
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
import sklearn.datasets
We create a synthetic dataset (with 1000 rows and 20 columns) that we can give to the voting classifier model.
[2]:
X, y = sklearn.datasets.make_classification(n_samples=1_000, n_features=20)
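As an optional sanity check (not part of the original notebook), we can confirm the shape of the generated arrays before handing them to the classifier.
[ ]:
# Optional check: 1000 samples, each with 20 features
print(X.shape)  # (1000, 20)
print(y.shape)  # (1000,)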
We specify the VotingClassifier as a list of (name, sub-estimator) tuples. Fitting the VotingClassifier on the data fits each of the sub-estimators in turn. We set the n_jobs argument to -1, which instructs sklearn to use all available cores on this machine (notice that we haven't used dask yet).
[3]:
classifiers = [
    ('sgd', SGDClassifier(max_iter=1000)),
    ('logisticregression', LogisticRegression()),
    ('svc', SVC(gamma='auto')),
]
clf = VotingClassifier(classifiers, n_jobs=-1)
We call the classifier’s fit method in order to train the classifier.
[4]:
%time clf.fit(X, y)
CPU times: user 15.6 ms, sys: 28 ms, total: 43.6 ms
Wall time: 1.05 s
[4]:
VotingClassifier(estimators=[('sgd', SGDClassifier()),
                             ('logisticregression', LogisticRegression()),
                             ('svc', SVC(gamma='auto'))],
                 n_jobs=-1)
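Once fitted, the ensemble behaves like any other scikit-learn estimator. As an optional follow-up (not part of the original notebook), we could check the training accuracy and inspect a few hard-voting predictions:
[ ]:
# Optional: use the fitted ensemble like any other sklearn classifier
print(clf.score(X, y))     # mean accuracy on the training data
print(clf.predict(X[:5]))  # majority-vote predictions for the first five rows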
Creating a Dask client provides performance and progress metrics via the dashboard. Because Client is given no arguments, it starts a local cluster (not a distributed cluster). We can view the dashboard by clicking the link after running the cell.
[5]:
import joblib
from distributed import Client
client = Client()
client
[5]:
Client: Client-aab9cc32-0de1-11ed-a68c-000d3a8f7959
Connection method: Cluster object | Cluster type: distributed.LocalCluster
Dashboard: http://127.0.0.1:8787/status

LocalCluster: 0fb6a26e
Status: running | Using processes: True | Workers: 2 | Total threads: 2 | Total memory: 6.78 GiB

Scheduler: Scheduler-5cd77096-ccb9-489c-8f73-bb70bd3c0de6
Comm: tcp://127.0.0.1:37223 | Started: Just now

Worker 0: Comm: tcp://127.0.0.1:37799 | Threads: 1 | Memory: 3.39 GiB
  Dashboard: http://127.0.0.1:40739/status | Nanny: tcp://127.0.0.1:37995
  Local directory: /home/runner/work/dask-examples/dask-examples/machine-learning/dask-worker-space/worker-p2j5wsnz

Worker 1: Comm: tcp://127.0.0.1:33541 | Threads: 1 | Memory: 3.39 GiB
  Dashboard: http://127.0.0.1:42027/status | Nanny: tcp://127.0.0.1:37335
  Local directory: /home/runner/work/dask-examples/dask-examples/machine-learning/dask-worker-space/worker-u_ezjujo
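Beyond clicking the dashboard link, the client object itself exposes the same information. As an optional sketch (not part of the original notebook), using standard distributed Client attributes:
[ ]:
# Optional: inspect the local cluster programmatically
print(client.dashboard_link)                    # dashboard URL, same as the link above
print(len(client.scheduler_info()["workers"]))  # number of connected workers (2 here)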
To train the voting classifier on the cluster, we call the classifier's fit method, but enclosed in joblib's parallel_backend context manager. This distributes training of the sub-estimators across the cluster.
[6]:
%%time
with joblib.parallel_backend("dask"):
    clf.fit(X, y)

print(clf)
VotingClassifier(estimators=[('sgd', SGDClassifier()),
                             ('logisticregression', LogisticRegression()),
                             ('svc', SVC(gamma='auto'))],
                 n_jobs=-1)
CPU times: user 203 ms, sys: 79.1 ms, total: 282 ms
Wall time: 1.12 s
Note that we see no advantage from using dask here because we are using a local cluster rather than a distributed cluster, and sklearn was already using all of the machine's cores. If we were using a distributed cluster, dask would enable us to take advantage of the multiple machines and train sub-estimators across them.
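For reference, connecting to a distributed cluster only changes how the Client is created; the training code stays the same. A minimal sketch, assuming a scheduler is already running at a (hypothetical) address:
[ ]:
# Sketch: point the client at an existing scheduler instead of starting a local cluster.
# The address below is hypothetical; substitute your own scheduler's address.
# client = Client("tcp://scheduler-address:8786")
#
# with joblib.parallel_backend("dask"):
#     clf.fit(X, y)  # sub-estimators now train on the cluster's workers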