{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Blockwise Ensemble Methods\n", "\n", "Dask-ML provides some [ensemble methods](https://ml.dask.org/modules/api.html#module-dask_ml.ensemble) that are tailored to `dask.array`'s and `dask.dataframe`'s blocked structure. The basic idea is to fit a copy of some sub-estimator to each block (or partition) of the dask Array or DataFrame. Because each block fits in memory, the sub-estimator only needs to handle in-memory data structures like a NumPy array or pandas DataFrame. Fitting will also be relatively fast, since each block fits in memory and we won't need to move large amounts of data between workers on a cluster. We end up with an ensemble of models: one per block in the training dataset.\n", "\n", "At prediction time, we combine the results from all the models in the ensemble. For regression problems, this means averaging the predictions from each sub-estimator. For classification problems, each sub-estimator votes and the results are combined. See https://scikit-learn.org/stable/modules/ensemble.html#voting-classifier for details on how the votes can be combined. See https://scikit-learn.org/stable/modules/ensemble.html for a general overview of why averaging ensemble methods can be useful.\n", "\n", "It's crucially important that the distribution of values in your dataset be relatively uniform across partitions. Otherwise the parameters learned on any given partition of the data will be poor for the dataset as a whole. This will be shown in detail later." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's randomly generate an example dataset. In practice, you would load the data from storage. We'll create a `dask.array` with 10 blocks." 
] }, { "cell_type": "code", "execution_count": 1, "metadata": { "execution": { "iopub.execute_input": "2022-07-27T19:21:24.260576Z", "iopub.status.busy": "2022-07-27T19:21:24.259856Z", "iopub.status.idle": "2022-07-27T19:21:28.359080Z", "shell.execute_reply": "2022-07-27T19:21:28.358324Z" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/usr/share/miniconda3/envs/dask-examples/lib/python3.9/site-packages/dask/base.py:1283: UserWarning: Running on a single-machine scheduler when a distributed client is active might lead to unexpected results.\n", " warnings.warn(\n" ] }, { "data": { "text/html": [ "