{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Score and Predict Large Datasets\n",
    "================================"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Sometimes you'll train on a smaller dataset that fits in memory, but need to predict or score for a much larger (possibly larger than memory) dataset. Perhaps your [learning curve](http://scikit-learn.org/stable/modules/learning_curve.html) has leveled off, or you only have labels for a subset of the data.\n",
    "\n",
    "In this situation, you can use [ParallelPostFit](http://ml.dask.org/modules/generated/dask_ml.wrappers.ParallelPostFit.html) to parallelize and distribute the scoring or prediction steps."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-07-27T19:22:35.095475Z",
     "iopub.status.busy": "2022-07-27T19:22:35.095170Z",
     "iopub.status.idle": "2022-07-27T19:22:36.218649Z",
     "shell.execute_reply": "2022-07-27T19:22:36.218003Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "    <div style=\"width: 24px; height: 24px; background-color: #e1e1e1; border: 3px solid #9D9D9D; border-radius: 5px; position: absolute;\"> </div>\n",
       "    <div style=\"margin-left: 48px;\">\n",
       "        <h3 style=\"margin-bottom: 0px;\">Client</h3>\n",
       "        <p style=\"color: #9D9D9D; margin-bottom: 0px;\">Client-7847b57a-0de1-11ed-a43b-000d3a8f7959</p>\n",
       "        <table style=\"width: 100%; text-align: left;\">\n",
       "\n",
       "        <tr>\n",
       "        \n",
       "            <td style=\"text-align: left;\"><strong>Connection method:</strong> Cluster object</td>\n",
       "            <td style=\"text-align: left;\"><strong>Cluster type:</strong> distributed.LocalCluster</td>\n",
       "        \n",
       "        </tr>\n",
       "\n",
       "        \n",
       "            <tr>\n",
       "                <td style=\"text-align: left;\">\n",
       "                    <strong>Dashboard: </strong> <a href=\"http://10.1.1.64:8787/status\" target=\"_blank\">http://10.1.1.64:8787/status</a>\n",
       "                </td>\n",
       "                <td style=\"text-align: left;\"></td>\n",
       "            </tr>\n",
       "        \n",
       "\n",
       "        </table>\n",
       "\n",
       "        \n",
       "            <details>\n",
       "            <summary style=\"margin-bottom: 20px;\"><h3 style=\"display: inline;\">Cluster Info</h3></summary>\n",
       "            <div class=\"jp-RenderedHTMLCommon jp-RenderedHTML jp-mod-trusted jp-OutputArea-output\">\n",
       "    <div style=\"width: 24px; height: 24px; background-color: #e1e1e1; border: 3px solid #9D9D9D; border-radius: 5px; position: absolute;\">\n",
       "    </div>\n",
       "    <div style=\"margin-left: 48px;\">\n",
       "        <h3 style=\"margin-bottom: 0px; margin-top: 0px;\">LocalCluster</h3>\n",
       "        <p style=\"color: #9D9D9D; margin-bottom: 0px;\">0bbc581a</p>\n",
       "        <table style=\"width: 100%; text-align: left;\">\n",
       "            <tr>\n",
       "                <td style=\"text-align: left;\">\n",
       "                    <strong>Dashboard:</strong> <a href=\"http://10.1.1.64:8787/status\" target=\"_blank\">http://10.1.1.64:8787/status</a>\n",
       "                </td>\n",
       "                <td style=\"text-align: left;\">\n",
       "                    <strong>Workers:</strong> 1\n",
       "                </td>\n",
       "            </tr>\n",
       "            <tr>\n",
       "                <td style=\"text-align: left;\">\n",
       "                    <strong>Total threads:</strong> 4\n",
       "                </td>\n",
       "                <td style=\"text-align: left;\">\n",
       "                    <strong>Total memory:</strong> 1.86 GiB\n",
       "                </td>\n",
       "            </tr>\n",
       "            \n",
       "            <tr>\n",
       "    <td style=\"text-align: left;\"><strong>Status:</strong> running</td>\n",
       "    <td style=\"text-align: left;\"><strong>Using processes:</strong> False</td>\n",
       "</tr>\n",
       "\n",
       "            \n",
       "        </table>\n",
       "\n",
       "        <details>\n",
       "            <summary style=\"margin-bottom: 20px;\">\n",
       "                <h3 style=\"display: inline;\">Scheduler Info</h3>\n",
       "            </summary>\n",
       "\n",
       "            <div style=\"\">\n",
       "    <div>\n",
       "        <div style=\"width: 24px; height: 24px; background-color: #FFF7E5; border: 3px solid #FF6132; border-radius: 5px; position: absolute;\"> </div>\n",
       "        <div style=\"margin-left: 48px;\">\n",
       "            <h3 style=\"margin-bottom: 0px;\">Scheduler</h3>\n",
       "            <p style=\"color: #9D9D9D; margin-bottom: 0px;\">Scheduler-02b344ef-0a9b-4ca2-b806-8205e284cc50</p>\n",
       "            <table style=\"width: 100%; text-align: left;\">\n",
       "                <tr>\n",
       "                    <td style=\"text-align: left;\">\n",
       "                        <strong>Comm:</strong> inproc://10.1.1.64/9275/1\n",
       "                    </td>\n",
       "                    <td style=\"text-align: left;\">\n",
       "                        <strong>Workers:</strong> 1\n",
       "                    </td>\n",
       "                </tr>\n",
       "                <tr>\n",
       "                    <td style=\"text-align: left;\">\n",
       "                        <strong>Dashboard:</strong> <a href=\"http://10.1.1.64:8787/status\" target=\"_blank\">http://10.1.1.64:8787/status</a>\n",
       "                    </td>\n",
       "                    <td style=\"text-align: left;\">\n",
       "                        <strong>Total threads:</strong> 4\n",
       "                    </td>\n",
       "                </tr>\n",
       "                <tr>\n",
       "                    <td style=\"text-align: left;\">\n",
       "                        <strong>Started:</strong> Just now\n",
       "                    </td>\n",
       "                    <td style=\"text-align: left;\">\n",
       "                        <strong>Total memory:</strong> 1.86 GiB\n",
       "                    </td>\n",
       "                </tr>\n",
       "            </table>\n",
       "        </div>\n",
       "    </div>\n",
       "\n",
       "    <details style=\"margin-left: 48px;\">\n",
       "        <summary style=\"margin-bottom: 20px;\">\n",
       "            <h3 style=\"display: inline;\">Workers</h3>\n",
       "        </summary>\n",
       "\n",
       "        \n",
       "        <div style=\"margin-bottom: 20px;\">\n",
       "            <div style=\"width: 24px; height: 24px; background-color: #DBF5FF; border: 3px solid #4CC9FF; border-radius: 5px; position: absolute;\"> </div>\n",
       "            <div style=\"margin-left: 48px;\">\n",
       "            <details>\n",
       "                <summary>\n",
       "                    <h4 style=\"margin-bottom: 0px; display: inline;\">Worker: 0</h4>\n",
       "                </summary>\n",
       "                <table style=\"width: 100%; text-align: left;\">\n",
       "                    <tr>\n",
       "                        <td style=\"text-align: left;\">\n",
       "                            <strong>Comm: </strong> inproc://10.1.1.64/9275/4\n",
       "                        </td>\n",
       "                        <td style=\"text-align: left;\">\n",
       "                            <strong>Total threads: </strong> 4\n",
       "                        </td>\n",
       "                    </tr>\n",
       "                    <tr>\n",
       "                        <td style=\"text-align: left;\">\n",
       "                            <strong>Dashboard: </strong> <a href=\"http://10.1.1.64:39787/status\" target=\"_blank\">http://10.1.1.64:39787/status</a>\n",
       "                        </td>\n",
       "                        <td style=\"text-align: left;\">\n",
       "                            <strong>Memory: </strong> 1.86 GiB\n",
       "                        </td>\n",
       "                    </tr>\n",
       "                    <tr>\n",
       "                        <td style=\"text-align: left;\">\n",
       "                            <strong>Nanny: </strong> None\n",
       "                        </td>\n",
       "                        <td style=\"text-align: left;\"></td>\n",
       "                    </tr>\n",
       "                    <tr>\n",
       "                        <td colspan=\"2\" style=\"text-align: left;\">\n",
       "                            <strong>Local directory: </strong> /home/runner/work/dask-examples/dask-examples/machine-learning/dask-worker-space/worker-ocmx77q9\n",
       "                        </td>\n",
       "                    </tr>\n",
       "\n",
       "                    \n",
       "\n",
       "                    \n",
       "\n",
       "                </table>\n",
       "            </details>\n",
       "            </div>\n",
       "        </div>\n",
       "        \n",
       "\n",
       "    </details>\n",
       "</div>\n",
       "\n",
       "        </details>\n",
       "    </div>\n",
       "</div>\n",
       "            </details>\n",
       "        \n",
       "\n",
       "    </div>\n",
       "</div>"
      ],
      "text/plain": [
       "<Client: 'inproc://10.1.1.64/9275/1' processes=1 threads=4, memory=1.86 GiB>"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from dask.distributed import Client, progress\n",
    "\n",
    "# Scale up: connect to your own cluster with bmore resources\n",
    "# see http://dask.pydata.org/en/latest/setup.html\n",
    "client = Client(processes=False, threads_per_worker=4,\n",
    "                n_workers=1, memory_limit='2GB')\n",
    "client"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-07-27T19:22:36.222013Z",
     "iopub.status.busy": "2022-07-27T19:22:36.221476Z",
     "iopub.status.idle": "2022-07-27T19:22:36.798250Z",
     "shell.execute_reply": "2022-07-27T19:22:36.797609Z"
    }
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import dask.array as da\n",
    "from sklearn.datasets import make_classification"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We'll generate a small random dataset with scikit-learn."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-07-27T19:22:36.801826Z",
     "iopub.status.busy": "2022-07-27T19:22:36.801361Z",
     "iopub.status.idle": "2022-07-27T19:22:36.809530Z",
     "shell.execute_reply": "2022-07-27T19:22:36.808999Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ 1.53682958, -1.39869399],\n",
       "       [ 1.36917601, -0.63734411],\n",
       "       [ 0.50231787, -0.45910529],\n",
       "       [ 1.83319262, -1.29808229],\n",
       "       [ 1.04235568,  1.12152929]])"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X_train, y_train = make_classification(\n",
    "    n_features=2, n_redundant=0, n_informative=2,\n",
    "    random_state=1, n_clusters_per_class=1, n_samples=1000)\n",
    "X_train[:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And we'll clone that dataset many times with `dask.array`. `X_large` and `y_large` represent our larger than memory dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-07-27T19:22:36.812548Z",
     "iopub.status.busy": "2022-07-27T19:22:36.812219Z",
     "iopub.status.idle": "2022-07-27T19:22:36.865427Z",
     "shell.execute_reply": "2022-07-27T19:22:36.864750Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table>\n",
       "    <tr>\n",
       "        <td>\n",
       "            <table>\n",
       "                <thead>\n",
       "                    <tr>\n",
       "                        <td> </td>\n",
       "                        <th> Array </th>\n",
       "                        <th> Chunk </th>\n",
       "                    </tr>\n",
       "                </thead>\n",
       "                <tbody>\n",
       "                    \n",
       "                    <tr>\n",
       "                        <th> Bytes </th>\n",
       "                        <td> 1.53 MiB </td>\n",
       "                        <td> 15.62 kiB </td>\n",
       "                    </tr>\n",
       "                    \n",
       "                    <tr>\n",
       "                        <th> Shape </th>\n",
       "                        <td> (100000, 2) </td>\n",
       "                        <td> (1000, 2) </td>\n",
       "                    </tr>\n",
       "                    <tr>\n",
       "                        <th> Count </th>\n",
       "                        <td> 101 Tasks </td>\n",
       "                        <td> 100 Chunks </td>\n",
       "                    </tr>\n",
       "                    <tr>\n",
       "                    <th> Type </th>\n",
       "                    <td> float64 </td>\n",
       "                    <td> numpy.ndarray </td>\n",
       "                    </tr>\n",
       "                </tbody>\n",
       "            </table>\n",
       "        </td>\n",
       "        <td>\n",
       "        <svg width=\"75\" height=\"170\" style=\"stroke:rgb(0,0,0);stroke-width:1\" >\n",
       "\n",
       "  <!-- Horizontal lines -->\n",
       "  <line x1=\"0\" y1=\"0\" x2=\"25\" y2=\"0\" style=\"stroke-width:2\" />\n",
       "  <line x1=\"0\" y1=\"6\" x2=\"25\" y2=\"6\" />\n",
       "  <line x1=\"0\" y1=\"12\" x2=\"25\" y2=\"12\" />\n",
       "  <line x1=\"0\" y1=\"18\" x2=\"25\" y2=\"18\" />\n",
       "  <line x1=\"0\" y1=\"25\" x2=\"25\" y2=\"25\" />\n",
       "  <line x1=\"0\" y1=\"31\" x2=\"25\" y2=\"31\" />\n",
       "  <line x1=\"0\" y1=\"37\" x2=\"25\" y2=\"37\" />\n",
       "  <line x1=\"0\" y1=\"43\" x2=\"25\" y2=\"43\" />\n",
       "  <line x1=\"0\" y1=\"50\" x2=\"25\" y2=\"50\" />\n",
       "  <line x1=\"0\" y1=\"56\" x2=\"25\" y2=\"56\" />\n",
       "  <line x1=\"0\" y1=\"62\" x2=\"25\" y2=\"62\" />\n",
       "  <line x1=\"0\" y1=\"68\" x2=\"25\" y2=\"68\" />\n",
       "  <line x1=\"0\" y1=\"75\" x2=\"25\" y2=\"75\" />\n",
       "  <line x1=\"0\" y1=\"81\" x2=\"25\" y2=\"81\" />\n",
       "  <line x1=\"0\" y1=\"87\" x2=\"25\" y2=\"87\" />\n",
       "  <line x1=\"0\" y1=\"93\" x2=\"25\" y2=\"93\" />\n",
       "  <line x1=\"0\" y1=\"100\" x2=\"25\" y2=\"100\" />\n",
       "  <line x1=\"0\" y1=\"106\" x2=\"25\" y2=\"106\" />\n",
       "  <line x1=\"0\" y1=\"112\" x2=\"25\" y2=\"112\" />\n",
       "  <line x1=\"0\" y1=\"120\" x2=\"25\" y2=\"120\" style=\"stroke-width:2\" />\n",
       "\n",
       "  <!-- Vertical lines -->\n",
       "  <line x1=\"0\" y1=\"0\" x2=\"0\" y2=\"120\" style=\"stroke-width:2\" />\n",
       "  <line x1=\"25\" y1=\"0\" x2=\"25\" y2=\"120\" style=\"stroke-width:2\" />\n",
       "\n",
       "  <!-- Colored Rectangle -->\n",
       "  <polygon points=\"0.0,0.0 25.412616514582485,0.0 25.412616514582485,120.0 0.0,120.0\" style=\"fill:#8B4903A0;stroke-width:0\"/>\n",
       "\n",
       "  <!-- Text -->\n",
       "  <text x=\"12.706308\" y=\"140.000000\" font-size=\"1.0rem\" font-weight=\"100\" text-anchor=\"middle\" >2</text>\n",
       "  <text x=\"45.412617\" y=\"60.000000\" font-size=\"1.0rem\" font-weight=\"100\" text-anchor=\"middle\" transform=\"rotate(-90,45.412617,60.000000)\">100000</text>\n",
       "</svg>\n",
       "        </td>\n",
       "    </tr>\n",
       "</table>"
      ],
      "text/plain": [
       "dask.array<concatenate, shape=(100000, 2), dtype=float64, chunksize=(1000, 2), chunktype=numpy.ndarray>"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Scale up: increase N, the number of times we replicate the data.\n",
    "N = 100\n",
    "X_large = da.concatenate([da.from_array(X_train, chunks=X_train.shape)\n",
    "                          for _ in range(N)])\n",
    "y_large = da.concatenate([da.from_array(y_train, chunks=y_train.shape)\n",
    "                          for _ in range(N)])\n",
    "X_large"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since our training dataset fits in memory, we can use a scikit-learn estimator as the actual estimator fit during traning.\n",
    "But we know that we'll want to predict for a large dataset, so we'll wrap the scikit-learn estimator with `ParallelPostFit`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-07-27T19:22:36.869111Z",
     "iopub.status.busy": "2022-07-27T19:22:36.868532Z",
     "iopub.status.idle": "2022-07-27T19:22:37.108154Z",
     "shell.execute_reply": "2022-07-27T19:22:37.107407Z"
    }
   },
   "outputs": [],
   "source": [
    "from sklearn.linear_model import LogisticRegressionCV\n",
    "from dask_ml.wrappers import ParallelPostFit"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-07-27T19:22:37.111724Z",
     "iopub.status.busy": "2022-07-27T19:22:37.111335Z",
     "iopub.status.idle": "2022-07-27T19:22:37.118145Z",
     "shell.execute_reply": "2022-07-27T19:22:37.117693Z"
    }
   },
   "outputs": [],
   "source": [
    "clf = ParallelPostFit(LogisticRegressionCV(cv=3), scoring=\"r2\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "See the note in the `dask-ml`'s documentation about when and why a `scoring` parameter is needed: https://ml.dask.org/modules/generated/dask_ml.wrappers.ParallelPostFit.html#dask_ml.wrappers.ParallelPostFit."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we'll call `clf.fit`. Dask-ML does nothing here, so this step can only use datasets that fit in memory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-07-27T19:22:37.121297Z",
     "iopub.status.busy": "2022-07-27T19:22:37.120808Z",
     "iopub.status.idle": "2022-07-27T19:22:37.202502Z",
     "shell.execute_reply": "2022-07-27T19:22:37.201739Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "ParallelPostFit(estimator=LogisticRegressionCV(cv=3), scoring='r2')"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "clf.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that training is done, we'll turn to predicting for the full (larger than memory) dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-07-27T19:22:37.205665Z",
     "iopub.status.busy": "2022-07-27T19:22:37.205144Z",
     "iopub.status.idle": "2022-07-27T19:22:37.219261Z",
     "shell.execute_reply": "2022-07-27T19:22:37.218742Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table>\n",
       "    <tr>\n",
       "        <td>\n",
       "            <table>\n",
       "                <thead>\n",
       "                    <tr>\n",
       "                        <td> </td>\n",
       "                        <th> Array </th>\n",
       "                        <th> Chunk </th>\n",
       "                    </tr>\n",
       "                </thead>\n",
       "                <tbody>\n",
       "                    \n",
       "                    <tr>\n",
       "                        <th> Bytes </th>\n",
       "                        <td> 781.25 kiB </td>\n",
       "                        <td> 7.81 kiB </td>\n",
       "                    </tr>\n",
       "                    \n",
       "                    <tr>\n",
       "                        <th> Shape </th>\n",
       "                        <td> (100000,) </td>\n",
       "                        <td> (1000,) </td>\n",
       "                    </tr>\n",
       "                    <tr>\n",
       "                        <th> Count </th>\n",
       "                        <td> 201 Tasks </td>\n",
       "                        <td> 100 Chunks </td>\n",
       "                    </tr>\n",
       "                    <tr>\n",
       "                    <th> Type </th>\n",
       "                    <td> int64 </td>\n",
       "                    <td> numpy.ndarray </td>\n",
       "                    </tr>\n",
       "                </tbody>\n",
       "            </table>\n",
       "        </td>\n",
       "        <td>\n",
       "        <svg width=\"170\" height=\"75\" style=\"stroke:rgb(0,0,0);stroke-width:1\" >\n",
       "\n",
       "  <!-- Horizontal lines -->\n",
       "  <line x1=\"0\" y1=\"0\" x2=\"120\" y2=\"0\" style=\"stroke-width:2\" />\n",
       "  <line x1=\"0\" y1=\"25\" x2=\"120\" y2=\"25\" style=\"stroke-width:2\" />\n",
       "\n",
       "  <!-- Vertical lines -->\n",
       "  <line x1=\"0\" y1=\"0\" x2=\"0\" y2=\"25\" style=\"stroke-width:2\" />\n",
       "  <line x1=\"6\" y1=\"0\" x2=\"6\" y2=\"25\" />\n",
       "  <line x1=\"12\" y1=\"0\" x2=\"12\" y2=\"25\" />\n",
       "  <line x1=\"18\" y1=\"0\" x2=\"18\" y2=\"25\" />\n",
       "  <line x1=\"25\" y1=\"0\" x2=\"25\" y2=\"25\" />\n",
       "  <line x1=\"31\" y1=\"0\" x2=\"31\" y2=\"25\" />\n",
       "  <line x1=\"37\" y1=\"0\" x2=\"37\" y2=\"25\" />\n",
       "  <line x1=\"43\" y1=\"0\" x2=\"43\" y2=\"25\" />\n",
       "  <line x1=\"50\" y1=\"0\" x2=\"50\" y2=\"25\" />\n",
       "  <line x1=\"56\" y1=\"0\" x2=\"56\" y2=\"25\" />\n",
       "  <line x1=\"62\" y1=\"0\" x2=\"62\" y2=\"25\" />\n",
       "  <line x1=\"68\" y1=\"0\" x2=\"68\" y2=\"25\" />\n",
       "  <line x1=\"75\" y1=\"0\" x2=\"75\" y2=\"25\" />\n",
       "  <line x1=\"81\" y1=\"0\" x2=\"81\" y2=\"25\" />\n",
       "  <line x1=\"87\" y1=\"0\" x2=\"87\" y2=\"25\" />\n",
       "  <line x1=\"93\" y1=\"0\" x2=\"93\" y2=\"25\" />\n",
       "  <line x1=\"100\" y1=\"0\" x2=\"100\" y2=\"25\" />\n",
       "  <line x1=\"106\" y1=\"0\" x2=\"106\" y2=\"25\" />\n",
       "  <line x1=\"112\" y1=\"0\" x2=\"112\" y2=\"25\" />\n",
       "  <line x1=\"120\" y1=\"0\" x2=\"120\" y2=\"25\" style=\"stroke-width:2\" />\n",
       "\n",
       "  <!-- Colored Rectangle -->\n",
       "  <polygon points=\"0.0,0.0 120.0,0.0 120.0,25.412616514582485 0.0,25.412616514582485\" style=\"fill:#8B4903A0;stroke-width:0\"/>\n",
       "\n",
       "  <!-- Text -->\n",
       "  <text x=\"60.000000\" y=\"45.412617\" font-size=\"1.0rem\" font-weight=\"100\" text-anchor=\"middle\" >100000</text>\n",
       "  <text x=\"140.000000\" y=\"12.706308\" font-size=\"1.0rem\" font-weight=\"100\" text-anchor=\"middle\" transform=\"rotate(0,140.000000,12.706308)\">1</text>\n",
       "</svg>\n",
       "        </td>\n",
       "    </tr>\n",
       "</table>"
      ],
      "text/plain": [
       "dask.array<_predict, shape=(100000,), dtype=int64, chunksize=(1000,), chunktype=numpy.ndarray>"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "y_pred = clf.predict(X_large)\n",
    "y_pred"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "y_pred is Dask arary. Workers can write the predicted values to a shared file system, without ever having to collect the data on a single machine.\n",
    "\n",
    "Or we can check the models score on the entire large dataset. The computation will be done in parallel, and no single machine will have to hold all the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-07-27T19:22:37.222440Z",
     "iopub.status.busy": "2022-07-27T19:22:37.221942Z",
     "iopub.status.idle": "2022-07-27T19:22:38.322122Z",
     "shell.execute_reply": "2022-07-27T19:22:38.321406Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.596"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "clf.score(X_large, y_large)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}