{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use Voting Classifiers\n",
"======================\n",
"\n",
"A [Voting classifier](http://scikit-learn.org/stable/modules/ensemble.html#voting-classifier) model combines multiple different models (i.e., sub-estimators) into a single model, which is (ideally) stronger than any of the individual models alone. \n",
"\n",
"[Dask](http://ml.dask.org/joblib.html) provides the software to train individual sub-estimators on different machines in a cluster. This enables users to train more models in parallel than would have been possible on a single machine. Note that users will only observe this benefit if they have a distributed cluster with more resources than their single machine (because sklearn already enables users to parallelize training across cores on a single machine).\n",
"\n",
"What follows is an example of how one would deploy a voting classifier model in dask (using a local cluster).\n",
"\n",
""
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"execution": {
"iopub.execute_input": "2021-01-14T10:51:12.329873Z",
"iopub.status.busy": "2021-01-14T10:51:12.329381Z",
"iopub.status.idle": "2021-01-14T10:51:13.498574Z",
"shell.execute_reply": "2021-01-14T10:51:13.499639Z"
}
},
"outputs": [],
"source": [
"from sklearn.ensemble import VotingClassifier\n",
"from sklearn.linear_model import SGDClassifier\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.svm import SVC\n",
"\n",
"import sklearn.datasets"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We create a synthetic dataset (with 1000 rows and 20 columns) that we can give to the voting classifier model."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"execution": {
"iopub.execute_input": "2021-01-14T10:51:13.504077Z",
"iopub.status.busy": "2021-01-14T10:51:13.501982Z",
"iopub.status.idle": "2021-01-14T10:51:13.512134Z",
"shell.execute_reply": "2021-01-14T10:51:13.512550Z"
}
},
"outputs": [],
"source": [
"X, y = sklearn.datasets.make_classification(n_samples=1_000, n_features=20)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We specify the VotingClassifier as a list of (name, sub-estimator) tuples. Fitting the VotingClassifier on the data fits each of the sub-estimators in turn. We set the ```n_jobs``` argument to be -1, which instructs sklearn to use all available cores (notice that we haven't used dask)."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"execution": {
"iopub.execute_input": "2021-01-14T10:51:13.515566Z",
"iopub.status.busy": "2021-01-14T10:51:13.514269Z",
"iopub.status.idle": "2021-01-14T10:51:13.519290Z",
"shell.execute_reply": "2021-01-14T10:51:13.520070Z"
}
},
"outputs": [],
"source": [
"classifiers = [\n",
" ('sgd', SGDClassifier(max_iter=1000)),\n",
" ('logisticregression', LogisticRegression()),\n",
" ('svc', SVC(gamma='auto')),\n",
"]\n",
"clf = VotingClassifier(classifiers, n_jobs=-1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We call the classifier's fit method in order to train the classifier."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"execution": {
"iopub.execute_input": "2021-01-14T10:51:13.523335Z",
"iopub.status.busy": "2021-01-14T10:51:13.522296Z",
"iopub.status.idle": "2021-01-14T10:51:15.897739Z",
"shell.execute_reply": "2021-01-14T10:51:15.897252Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 21.7 ms, sys: 7.2 ms, total: 28.9 ms\n",
"Wall time: 2.35 s\n"
]
},
{
"data": {
"text/plain": [
"VotingClassifier(estimators=[('sgd', SGDClassifier()),\n",
" ('logisticregression', LogisticRegression()),\n",
" ('svc', SVC(gamma='auto'))],\n",
" n_jobs=-1)"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%time clf.fit(X, y)"
]
},
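{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough sketch of what the fitted ensemble exposes, the snippet below rebuilds the same classifier so that it runs on its own, then scores each fitted sub-estimator and the ensemble. The ```random_state=0``` argument is an addition for reproducibility, and scoring on the training data only illustrates the API; it is not an estimate of how well the models generalize.\n",
"\n",
"```python\n",
"from sklearn.datasets import make_classification\n",
"from sklearn.ensemble import VotingClassifier\n",
"from sklearn.linear_model import LogisticRegression, SGDClassifier\n",
"from sklearn.svm import SVC\n",
"\n",
"# Same shape as the dataset above; random_state is an assumption added here.\n",
"X, y = make_classification(n_samples=1_000, n_features=20, random_state=0)\n",
"\n",
"clf = VotingClassifier([\n",
"    ('sgd', SGDClassifier(max_iter=1000)),\n",
"    ('logisticregression', LogisticRegression()),\n",
"    ('svc', SVC(gamma='auto')),\n",
"], n_jobs=-1)\n",
"clf.fit(X, y)\n",
"\n",
"# After fitting, each sub-estimator is available by the name it was given.\n",
"for name, estimator in clf.named_estimators_.items():\n",
"    print(name, estimator.score(X, y))\n",
"\n",
"# The ensemble predicts by majority vote (voting='hard' is the default).\n",
"print('ensemble', clf.score(X, y))\n",
"```"
]
},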
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Creating a Dask [client](https://distributed.readthedocs.io/en/latest/client.html) provides performance and progress metrics via the dashboard. Because ```Client``` is given no arugments, its output refers to a [local cluster](http://distributed.readthedocs.io/en/latest/local-cluster.html) (not a distributed cluster).\n",
"\n",
"We can view the dashboard by clicking the link after running the cell."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"execution": {
"iopub.execute_input": "2021-01-14T10:51:15.901236Z",
"iopub.status.busy": "2021-01-14T10:51:15.900826Z",
"iopub.status.idle": "2021-01-14T10:51:18.900652Z",
"shell.execute_reply": "2021-01-14T10:51:18.901416Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"Client\n", "
| \n",
"\n",
"Cluster\n", "
| \n",
"