{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Scale XGBoost\n", "=============\n", "\n", "Dask and XGBoost can work together to train gradient boosted trees in parallel. This notebook shows how to use Dask and XGBoost together.\n", "\n", "XGBoost provides a powerful prediction framework, and it works well in practice. It wins Kaggle contests and is popular in industry because it performs well and is easy to interpret (i.e., it's easy to find the important features from an XGBoost model)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Set up Dask\n", "We set up a Dask client, which provides performance and progress metrics via the dashboard.\n", "\n", "You can view the dashboard by clicking the link after running the cell." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "execution": { "iopub.execute_input": "2022-07-27T19:24:05.696722Z", "iopub.status.busy": "2022-07-27T19:24:05.696078Z", "iopub.status.idle": "2022-07-27T19:24:09.100023Z", "shell.execute_reply": "2022-07-27T19:24:09.099239Z" } }, "outputs": [ { "data": { "text/html": [ "
Client: Client-ae48a4c8-0de1-11ed-a6d2-000d3a8f7959\n",
"Connection method: Cluster object. Cluster type: distributed.LocalCluster\n",
"Dashboard: http://127.0.0.1:8787/status\n",
"LocalCluster d17d0b16: 4 workers, 4 total threads, 6.78 GiB total memory, status running, using processes\n",
"Scheduler: Scheduler-63756a1e-88c9-43fb-9a77-fb66783417d3, comm tcp://127.0.0.1:36303\n",
"Workers 0 to 3: 1 thread and 1.70 GiB memory each
" ], "text/plain": [ "" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from dask.distributed import Client\n", "\n", "client = Client(n_workers=4, threads_per_worker=1)\n", "client" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First we create a bunch of synthetic data, with 100,000 examples and 20 features." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "execution": { "iopub.execute_input": "2022-07-27T19:24:09.103687Z", "iopub.status.busy": "2022-07-27T19:24:09.103117Z", "iopub.status.idle": "2022-07-27T19:24:09.910766Z", "shell.execute_reply": "2022-07-27T19:24:09.910187Z" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/usr/share/miniconda3/envs/dask-examples/lib/python3.9/site-packages/dask/base.py:1283: UserWarning: Running on a single-machine scheduler when a distributed client is active might lead to unexpected results.\n", " warnings.warn(\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", "
<table>\n",
"  <thead><tr><td></td><th>Array</th><th>Chunk</th></tr></thead>\n",
"  <tbody>\n",
"    <tr><th>Bytes</th><td>15.26 MiB</td><td>156.25 kiB</td></tr>\n",
"    <tr><th>Shape</th><td>(100000, 20)</td><td>(1000, 20)</td></tr>\n",
"    <tr><th>Count</th><td>100 Tasks</td><td>100 Chunks</td></tr>\n",
"    <tr><th>Type</th><td>float64</td><td>numpy.ndarray</td></tr>\n",
"  </tbody>\n",
"</table>
" ], "text/plain": [ "dask.array" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from dask_ml.datasets import make_classification\n", "\n", "X, y = make_classification(n_samples=100000, n_features=20,\n", " chunks=1000, n_informative=4,\n", " random_state=0)\n", "X" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Dask-XGBoost works with both arrays and dataframes. For more information on creating Dask arrays and dataframes from real data, see the documentation on [Dask arrays](https://dask.pydata.org/en/latest/array-creation.html) or [Dask dataframes](https://dask.pydata.org/en/latest/dataframe-create.html)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Split data for training and testing\n", "We split our dataset into training and testing sets so that the model is evaluated on data it did not see during training:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "execution": { "iopub.execute_input": "2022-07-27T19:24:09.913973Z", "iopub.status.busy": "2022-07-27T19:24:09.913564Z", "iopub.status.idle": "2022-07-27T19:24:10.150306Z", "shell.execute_reply": "2022-07-27T19:24:10.149670Z" } }, "outputs": [], "source": [ "from dask_ml.model_selection import train_test_split\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let's train a model on this data using [dask-xgboost][dxgb].\n", "\n", "[dxgb]:https://github.com/dask/dask-xgboost" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Train Dask-XGBoost" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "execution": { "iopub.execute_input": "2022-07-27T19:24:10.154009Z", "iopub.status.busy": "2022-07-27T19:24:10.153574Z", "iopub.status.idle": "2022-07-27T19:24:10.199907Z", "shell.execute_reply": "2022-07-27T19:24:10.199244Z" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ 
"/usr/share/miniconda3/envs/dask-examples/lib/python3.9/site-packages/xgboost/compat.py:36: FutureWarning: pandas.Int64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.\n", " from pandas import MultiIndex, Int64Index\n" ] } ], "source": [ "import dask\n", "import xgboost\n", "import dask_xgboost" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "dask-xgboost is a small wrapper around xgboost. Dask sets XGBoost up, gives XGBoost data, and lets XGBoost do its training in the background using all the workers Dask has available." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's do some training:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "execution": { "iopub.execute_input": "2022-07-27T19:24:10.203663Z", "iopub.status.busy": "2022-07-27T19:24:10.203158Z", "iopub.status.idle": "2022-07-27T19:24:15.697658Z", "shell.execute_reply": "2022-07-27T19:24:15.693295Z" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Exception in thread Thread-4:\n", "Traceback (most recent call last):\n", " File \"/usr/share/miniconda3/envs/dask-examples/lib/python3.9/threading.py\", line 973, in _bootstrap_inner\n", " self.run()\n", " File \"/usr/share/miniconda3/envs/dask-examples/lib/python3.9/threading.py\", line 910, in run\n", " self._target(*self._args, **self._kwargs)\n", " File \"/usr/share/miniconda3/envs/dask-examples/lib/python3.9/site-packages/dask_xgboost/tracker.py\", line 365, in join\n", " while self.thread.isAlive():\n", "AttributeError: 'Thread' object has no attribute 'isAlive'\n" ] } ], "source": [ "params = {'objective': 'binary:logistic',\n", " 'max_depth': 4, 'eta': 0.01, 'subsample': 0.5,\n", " 'min_child_weight': 0.5}\n", "\n", "bst = dask_xgboost.train(client, params, X_train, y_train, num_boost_round=10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Visualize results" ] }, { "cell_type": 
"markdown", "metadata": {}, "source": [ "The `bst` object is a regular `xgboost.Booster` object." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "execution": { "iopub.execute_input": "2022-07-27T19:24:15.705148Z", "iopub.status.busy": "2022-07-27T19:24:15.701284Z", "iopub.status.idle": "2022-07-27T19:24:15.712171Z", "shell.execute_reply": "2022-07-27T19:24:15.711623Z" } }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bst" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This means all the methods mentioned in the [XGBoost documentation][2] are available. We show two examples to illustrate this; note that they demonstrate XGBoost functionality rather than Dask functionality.\n", "\n", "[2]:https://xgboost.readthedocs.io/en/latest/python/python_intro.html#" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Plot feature importance" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "execution": { "iopub.execute_input": "2022-07-27T19:24:15.715948Z", "iopub.status.busy": "2022-07-27T19:24:15.714611Z", "iopub.status.idle": "2022-07-27T19:24:16.525209Z", "shell.execute_reply": "2022-07-27T19:24:16.524705Z" } }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "\n", "ax = xgboost.plot_importance(bst, height=0.8, max_num_features=9)\n", "ax.grid(False, axis=\"y\")\n", "ax.set_title('Estimated feature importance')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We specified that only 4 features were informative while creating our data, and only 3 features show up as important." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Plot the Receiver Operating Characteristic curve\n", "We can use a fancier metric to determine how well our classifier is doing by plotting the [Receiver Operating Characteristic (ROC) curve](https://en.wikipedia.org/wiki/Receiver_operating_characteristic):" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "execution": { "iopub.execute_input": "2022-07-27T19:24:16.528288Z", "iopub.status.busy": "2022-07-27T19:24:16.527835Z", "iopub.status.idle": "2022-07-27T19:24:16.596872Z", "shell.execute_reply": "2022-07-27T19:24:16.596102Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[19:24:16] WARNING: /home/conda/feedstock_root/build_artifacts/xgboost-split_1645117766796/work/src/learner.cc:1264: Empty dataset at worker: 0\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", "
<table>\n",
"  <thead><tr><td></td><th>Array</th><th>Chunk</th></tr></thead>\n",
"  <tbody>\n",
"    <tr><th>Bytes</th><td>58.59 kiB</td><td>600 B</td></tr>\n",
"    <tr><th>Shape</th><td>(15000,)</td><td>(150,)</td></tr>\n",
"    <tr><th>Count</th><td>100 Tasks</td><td>100 Chunks</td></tr>\n",
"    <tr><th>Type</th><td>float32</td><td>numpy.ndarray</td></tr>\n",
"  </tbody>\n",
"</table>
" ], "text/plain": [ "dask.array<_predict_part, shape=(15000,), dtype=float32, chunksize=(150,), chunktype=numpy.ndarray>" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_hat = dask_xgboost.predict(client, bst, X_test).persist()\n", "y_hat" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "execution": { "iopub.execute_input": "2022-07-27T19:24:16.599758Z", "iopub.status.busy": "2022-07-27T19:24:16.599424Z", "iopub.status.idle": "2022-07-27T19:24:18.592923Z", "shell.execute_reply": "2022-07-27T19:24:18.577114Z" } }, "outputs": [], "source": [ "from sklearn.metrics import roc_curve\n", "\n", "y_test, y_hat = dask.compute(y_test, y_hat)\n", "fpr, tpr, _ = roc_curve(y_test, y_hat)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "execution": { "iopub.execute_input": "2022-07-27T19:24:18.597157Z", "iopub.status.busy": "2022-07-27T19:24:18.596945Z", "iopub.status.idle": "2022-07-27T19:24:18.742262Z", "shell.execute_reply": "2022-07-27T19:24:18.741740Z" } }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "from sklearn.metrics import auc\n", "\n", "fig, ax = plt.subplots(figsize=(5, 5))\n", "ax.plot(fpr, tpr, lw=3,\n", " label='ROC Curve (area = {:.2f})'.format(auc(fpr, tpr)))\n", "ax.plot([0, 1], [0, 1], 'k--', lw=2)\n", "ax.set(\n", " xlim=(0, 1),\n", " ylim=(0, 1),\n", " title=\"ROC Curve\",\n", " xlabel=\"False Positive Rate\",\n", " ylabel=\"True Positive Rate\",\n", ")\n", "ax.legend();\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This Receiver Operating Characteristic (ROC) curve shows how well our classifier is doing. We can tell it's doing well by how far the curve bends toward the upper-left. The curve of a perfect classifier would pass through the upper-left corner, and that of a random classifier would follow the diagonal line.\n", "\n", "The area under this curve is `area = 0.76`. This is the probability that our classifier assigns a higher score to a randomly chosen positive example than to a randomly chosen negative example." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Learn more\n", "* Recorded screencast stepping through the real-world example above:\n", "* A blog post on dask-xgboost: http://matthewrocklin.com/blog/work/2017/03/28/dask-xgboost\n", "* XGBoost documentation: https://xgboost.readthedocs.io/en/latest/python/python_intro.html#\n", "* Dask-XGBoost documentation: http://ml.dask.org/xgboost.html" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12" } }, "nbformat": 4, "nbformat_minor": 4 }