Use Voting Classifiers
A voting classifier combines multiple different models (i.e., sub-estimators) into a single model, which is (ideally) stronger than any of the individual models alone.
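To make the idea of combining sub-estimators concrete, here is a minimal sketch of majority ("hard") voting, which is the default voting rule of sklearn's VotingClassifier. The helper name `majority_vote` and the toy predictions are illustrative assumptions, not part of sklearn or dask, and the sketch ignores details such as estimator weights.

```python
import numpy as np


def majority_vote(predictions):
    """Return the most frequent label per sample.

    `predictions` has shape (n_estimators, n_samples) and holds
    non-negative integer class labels. Ties go to the smallest label.
    """
    predictions = np.asarray(predictions)
    return np.apply_along_axis(
        lambda votes: np.bincount(votes).argmax(), axis=0, arr=predictions
    )


# Hypothetical predictions from three sub-estimators on four samples.
preds = np.array([
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
])
combined = majority_vote(preds)
print(combined)
```

Each column holds the three votes for one sample, and the combined prediction is whichever label receives the most votes.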
Dask provides the software to train individual sub-estimators on different machines in a cluster. This enables users to train more models in parallel than would have been possible on a single machine. Note that users will only observe this benefit if they have a distributed cluster with more resources than their single machine (because sklearn already enables users to parallelize training across cores on a single machine).
What follows is an example of how one would deploy a voting classifier model in dask (using a local cluster).
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
import sklearn.datasets
We create a synthetic dataset (with 1000 rows and 20 columns) that we can give to the voting classifier model.
X, y = sklearn.datasets.make_classification(n_samples=1_000, n_features=20)
We specify the VotingClassifier as a list of (name, sub-estimator) tuples. Fitting the VotingClassifier on the data fits each of the sub-estimators. We set the n_jobs argument to -1, which instructs sklearn to use all available cores (notice that we haven’t used dask yet).
classifiers = [
    ('sgd', SGDClassifier(max_iter=1000)),
    ('logisticregression', LogisticRegression()),
    ('svc', SVC(gamma='auto')),
]
clf = VotingClassifier(classifiers, n_jobs=-1)
We call the classifier’s fit method in order to train the classifier.
%time clf.fit(X, y)
CPU times: user 16.1 ms, sys: 20.1 ms, total: 36.2 ms
Wall time: 1.28 s
VotingClassifier(estimators=[('sgd', SGDClassifier()), ('logisticregression', LogisticRegression()), ('svc', SVC(gamma='auto'))], n_jobs=-1)
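Once fitted, the ensemble behaves like any other sklearn estimator. The following sketch, which is not part of the original example, compares the ensemble's held-out accuracy with that of each fitted sub-estimator. The train/test split and the `random_state` values are assumptions added for reproducibility, and `n_jobs` is left out because parallelism does not affect the scores.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Same synthetic data shape as above; random_state is an added assumption.
X, y = make_classification(n_samples=1_000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

clf = VotingClassifier([
    ("sgd", SGDClassifier(max_iter=1000, random_state=0)),
    ("logisticregression", LogisticRegression()),
    ("svc", SVC(gamma="auto")),
])
clf.fit(X_train, y_train)

# named_estimators_ maps each name to its fitted sub-estimator.
# The labels here are already 0/1, so the sub-estimators' internal
# label encoding coincides with the original labels.
scores = {
    name: est.score(X_test, y_test)
    for name, est in clf.named_estimators_.items()
}
scores["ensemble"] = clf.score(X_test, y_test)

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

The ensemble is not guaranteed to beat every sub-estimator on a given split, so it is worth checking rather than assuming.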
Next, we start a local dask cluster and connect a client to it. After running the cell, we can view the dashboard by clicking the link in the client's output.
import joblib
from distributed import Client

client = Client()
client
/usr/share/miniconda3/envs/dask-examples/lib/python3.8/site-packages/distributed/node.py:151: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 34895 instead
  warnings.warn(
To train the voting classifier, we call the classifier’s fit method, but enclosed in joblib’s parallel_backend context manager. This distributes training of sub-estimators across the cluster.
%%time
with joblib.parallel_backend("dask"):
    clf.fit(X, y)

print(clf)
VotingClassifier(estimators=[('sgd', SGDClassifier()), ('logisticregression', LogisticRegression()), ('svc', SVC(gamma='auto'))], n_jobs=-1)
CPU times: user 86 ms, sys: 16.2 ms, total: 102 ms
Wall time: 1.45 s
Note that we see no advantage from using dask here, because we are using a local cluster rather than a distributed cluster and sklearn is already using all of the machine’s cores. If we were using a distributed cluster, dask would enable us to take advantage of the multiple machines and train sub-estimators across them.
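For the distributed case, the only change is where the client connects. The sketch below assumes a dask scheduler is already running and reachable; the helper name `fit_on_cluster` and its `scheduler_address` argument are hypothetical, and the function is only defined here, not called, since no such cluster is part of this example.

```python
def fit_on_cluster(estimator, X, y, scheduler_address):
    """Fit `estimator` using joblib's dask backend on an existing cluster.

    `scheduler_address` is a placeholder for the address of a running
    dask scheduler (an assumption; this example does not provide one).
    """
    # Imported here so that defining the helper needs no running cluster.
    import joblib
    from distributed import Client

    # The client is closed automatically when the block exits.
    with Client(scheduler_address):
        with joblib.parallel_backend("dask"):
            estimator.fit(X, y)
    return estimator
```

A call would look like fit_on_cluster(clf, X, y, "tcp://<scheduler-host>:8786"), where the host is a placeholder and 8786 is the scheduler's default port.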