{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Dask Bags\n",
    "\n",
    "\n",
    "Dask Bag implements operations like `map`, `filter`, `groupby` and aggregations on collections of Python objects. It does this in parallel and in small memory using Python iterators. It is similar to a parallel version of itertools or a Pythonic version of the PySpark RDD.\n",
    "\n",
    "Dask Bags are often used to do simple preprocessing on log files, JSON records, or other user defined Python objects.\n",
    "\n",
    "Full API documentation is available here: http://docs.dask.org/en/latest/bag-api.html"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Start Dask Client for Dashboard\n",
    "\n",
    "Starting the Dask Client is optional.  It will provide a dashboard which \n",
    "is useful to gain insight on the computation.  \n",
    "\n",
    "The link to the dashboard will become visible when you create the client below.  We recommend having it open on one side of your screen while using your notebook on the other side.  This can take some effort to arrange your windows, but seeing them both at the same is very useful when learning."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-07-27T19:11:01.928636Z",
     "iopub.status.busy": "2022-07-27T19:11:01.927990Z",
     "iopub.status.idle": "2022-07-27T19:11:05.062630Z",
     "shell.execute_reply": "2022-07-27T19:11:05.061872Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "    <div style=\"width: 24px; height: 24px; background-color: #e1e1e1; border: 3px solid #9D9D9D; border-radius: 5px; position: absolute;\"> </div>\n",
       "    <div style=\"margin-left: 48px;\">\n",
       "        <h3 style=\"margin-bottom: 0px;\">Client</h3>\n",
       "        <p style=\"color: #9D9D9D; margin-bottom: 0px;\">Client-db1fb37f-0ddf-11ed-9823-000d3a8f7959</p>\n",
       "        <table style=\"width: 100%; text-align: left;\">\n",
       "\n",
       "        <tr>\n",
       "        \n",
       "            <td style=\"text-align: left;\"><strong>Connection method:</strong> Cluster object</td>\n",
       "            <td style=\"text-align: left;\"><strong>Cluster type:</strong> distributed.LocalCluster</td>\n",
       "        \n",
       "        </tr>\n",
       "\n",
       "        \n",
       "            <tr>\n",
       "                <td style=\"text-align: left;\">\n",
       "                    <strong>Dashboard: </strong> <a href=\"http://127.0.0.1:8787/status\" target=\"_blank\">http://127.0.0.1:8787/status</a>\n",
       "                </td>\n",
       "                <td style=\"text-align: left;\"></td>\n",
       "            </tr>\n",
       "        \n",
       "\n",
       "        </table>\n",
       "\n",
       "        \n",
       "            <details>\n",
       "            <summary style=\"margin-bottom: 20px;\"><h3 style=\"display: inline;\">Cluster Info</h3></summary>\n",
       "            <div class=\"jp-RenderedHTMLCommon jp-RenderedHTML jp-mod-trusted jp-OutputArea-output\">\n",
       "    <div style=\"width: 24px; height: 24px; background-color: #e1e1e1; border: 3px solid #9D9D9D; border-radius: 5px; position: absolute;\">\n",
       "    </div>\n",
       "    <div style=\"margin-left: 48px;\">\n",
       "        <h3 style=\"margin-bottom: 0px; margin-top: 0px;\">LocalCluster</h3>\n",
       "        <p style=\"color: #9D9D9D; margin-bottom: 0px;\">fce50585</p>\n",
       "        <table style=\"width: 100%; text-align: left;\">\n",
       "            <tr>\n",
       "                <td style=\"text-align: left;\">\n",
       "                    <strong>Dashboard:</strong> <a href=\"http://127.0.0.1:8787/status\" target=\"_blank\">http://127.0.0.1:8787/status</a>\n",
       "                </td>\n",
       "                <td style=\"text-align: left;\">\n",
       "                    <strong>Workers:</strong> 4\n",
       "                </td>\n",
       "            </tr>\n",
       "            <tr>\n",
       "                <td style=\"text-align: left;\">\n",
       "                    <strong>Total threads:</strong> 4\n",
       "                </td>\n",
       "                <td style=\"text-align: left;\">\n",
       "                    <strong>Total memory:</strong> 6.78 GiB\n",
       "                </td>\n",
       "            </tr>\n",
       "            \n",
       "            <tr>\n",
       "    <td style=\"text-align: left;\"><strong>Status:</strong> running</td>\n",
       "    <td style=\"text-align: left;\"><strong>Using processes:</strong> True</td>\n",
       "</tr>\n",
       "\n",
       "            \n",
       "        </table>\n",
       "\n",
       "        <details>\n",
       "            <summary style=\"margin-bottom: 20px;\">\n",
       "                <h3 style=\"display: inline;\">Scheduler Info</h3>\n",
       "            </summary>\n",
       "\n",
       "            <div style=\"\">\n",
       "    <div>\n",
       "        <div style=\"width: 24px; height: 24px; background-color: #FFF7E5; border: 3px solid #FF6132; border-radius: 5px; position: absolute;\"> </div>\n",
       "        <div style=\"margin-left: 48px;\">\n",
       "            <h3 style=\"margin-bottom: 0px;\">Scheduler</h3>\n",
       "            <p style=\"color: #9D9D9D; margin-bottom: 0px;\">Scheduler-6db95d18-fecb-4e65-8bb9-c1ecfb644d25</p>\n",
       "            <table style=\"width: 100%; text-align: left;\">\n",
       "                <tr>\n",
       "                    <td style=\"text-align: left;\">\n",
       "                        <strong>Comm:</strong> tcp://127.0.0.1:37071\n",
       "                    </td>\n",
       "                    <td style=\"text-align: left;\">\n",
       "                        <strong>Workers:</strong> 4\n",
       "                    </td>\n",
       "                </tr>\n",
       "                <tr>\n",
       "                    <td style=\"text-align: left;\">\n",
       "                        <strong>Dashboard:</strong> <a href=\"http://127.0.0.1:8787/status\" target=\"_blank\">http://127.0.0.1:8787/status</a>\n",
       "                    </td>\n",
       "                    <td style=\"text-align: left;\">\n",
       "                        <strong>Total threads:</strong> 4\n",
       "                    </td>\n",
       "                </tr>\n",
       "                <tr>\n",
       "                    <td style=\"text-align: left;\">\n",
       "                        <strong>Started:</strong> Just now\n",
       "                    </td>\n",
       "                    <td style=\"text-align: left;\">\n",
       "                        <strong>Total memory:</strong> 6.78 GiB\n",
       "                    </td>\n",
       "                </tr>\n",
       "            </table>\n",
       "        </div>\n",
       "    </div>\n",
       "\n",
       "    <details style=\"margin-left: 48px;\">\n",
       "        <summary style=\"margin-bottom: 20px;\">\n",
       "            <h3 style=\"display: inline;\">Workers</h3>\n",
       "        </summary>\n",
       "\n",
       "        \n",
       "        <div style=\"margin-bottom: 20px;\">\n",
       "            <div style=\"width: 24px; height: 24px; background-color: #DBF5FF; border: 3px solid #4CC9FF; border-radius: 5px; position: absolute;\"> </div>\n",
       "            <div style=\"margin-left: 48px;\">\n",
       "            <details>\n",
       "                <summary>\n",
       "                    <h4 style=\"margin-bottom: 0px; display: inline;\">Worker: 0</h4>\n",
       "                </summary>\n",
       "                <table style=\"width: 100%; text-align: left;\">\n",
       "                    <tr>\n",
       "                        <td style=\"text-align: left;\">\n",
       "                            <strong>Comm: </strong> tcp://127.0.0.1:38523\n",
       "                        </td>\n",
       "                        <td style=\"text-align: left;\">\n",
       "                            <strong>Total threads: </strong> 1\n",
       "                        </td>\n",
       "                    </tr>\n",
       "                    <tr>\n",
       "                        <td style=\"text-align: left;\">\n",
       "                            <strong>Dashboard: </strong> <a href=\"http://127.0.0.1:33803/status\" target=\"_blank\">http://127.0.0.1:33803/status</a>\n",
       "                        </td>\n",
       "                        <td style=\"text-align: left;\">\n",
       "                            <strong>Memory: </strong> 1.70 GiB\n",
       "                        </td>\n",
       "                    </tr>\n",
       "                    <tr>\n",
       "                        <td style=\"text-align: left;\">\n",
       "                            <strong>Nanny: </strong> tcp://127.0.0.1:46261\n",
       "                        </td>\n",
       "                        <td style=\"text-align: left;\"></td>\n",
       "                    </tr>\n",
       "                    <tr>\n",
       "                        <td colspan=\"2\" style=\"text-align: left;\">\n",
       "                            <strong>Local directory: </strong> /home/runner/work/dask-examples/dask-examples/dask-worker-space/worker-z2ivdlv5\n",
       "                        </td>\n",
       "                    </tr>\n",
       "\n",
       "                    \n",
       "\n",
       "                    \n",
       "\n",
       "                </table>\n",
       "            </details>\n",
       "            </div>\n",
       "        </div>\n",
       "        \n",
       "        <div style=\"margin-bottom: 20px;\">\n",
       "            <div style=\"width: 24px; height: 24px; background-color: #DBF5FF; border: 3px solid #4CC9FF; border-radius: 5px; position: absolute;\"> </div>\n",
       "            <div style=\"margin-left: 48px;\">\n",
       "            <details>\n",
       "                <summary>\n",
       "                    <h4 style=\"margin-bottom: 0px; display: inline;\">Worker: 1</h4>\n",
       "                </summary>\n",
       "                <table style=\"width: 100%; text-align: left;\">\n",
       "                    <tr>\n",
       "                        <td style=\"text-align: left;\">\n",
       "                            <strong>Comm: </strong> tcp://127.0.0.1:38005\n",
       "                        </td>\n",
       "                        <td style=\"text-align: left;\">\n",
       "                            <strong>Total threads: </strong> 1\n",
       "                        </td>\n",
       "                    </tr>\n",
       "                    <tr>\n",
       "                        <td style=\"text-align: left;\">\n",
       "                            <strong>Dashboard: </strong> <a href=\"http://127.0.0.1:43833/status\" target=\"_blank\">http://127.0.0.1:43833/status</a>\n",
       "                        </td>\n",
       "                        <td style=\"text-align: left;\">\n",
       "                            <strong>Memory: </strong> 1.70 GiB\n",
       "                        </td>\n",
       "                    </tr>\n",
       "                    <tr>\n",
       "                        <td style=\"text-align: left;\">\n",
       "                            <strong>Nanny: </strong> tcp://127.0.0.1:36091\n",
       "                        </td>\n",
       "                        <td style=\"text-align: left;\"></td>\n",
       "                    </tr>\n",
       "                    <tr>\n",
       "                        <td colspan=\"2\" style=\"text-align: left;\">\n",
       "                            <strong>Local directory: </strong> /home/runner/work/dask-examples/dask-examples/dask-worker-space/worker-dknui8s8\n",
       "                        </td>\n",
       "                    </tr>\n",
       "\n",
       "                    \n",
       "\n",
       "                    \n",
       "\n",
       "                </table>\n",
       "            </details>\n",
       "            </div>\n",
       "        </div>\n",
       "        \n",
       "        <div style=\"margin-bottom: 20px;\">\n",
       "            <div style=\"width: 24px; height: 24px; background-color: #DBF5FF; border: 3px solid #4CC9FF; border-radius: 5px; position: absolute;\"> </div>\n",
       "            <div style=\"margin-left: 48px;\">\n",
       "            <details>\n",
       "                <summary>\n",
       "                    <h4 style=\"margin-bottom: 0px; display: inline;\">Worker: 2</h4>\n",
       "                </summary>\n",
       "                <table style=\"width: 100%; text-align: left;\">\n",
       "                    <tr>\n",
       "                        <td style=\"text-align: left;\">\n",
       "                            <strong>Comm: </strong> tcp://127.0.0.1:36089\n",
       "                        </td>\n",
       "                        <td style=\"text-align: left;\">\n",
       "                            <strong>Total threads: </strong> 1\n",
       "                        </td>\n",
       "                    </tr>\n",
       "                    <tr>\n",
       "                        <td style=\"text-align: left;\">\n",
       "                            <strong>Dashboard: </strong> <a href=\"http://127.0.0.1:46853/status\" target=\"_blank\">http://127.0.0.1:46853/status</a>\n",
       "                        </td>\n",
       "                        <td style=\"text-align: left;\">\n",
       "                            <strong>Memory: </strong> 1.70 GiB\n",
       "                        </td>\n",
       "                    </tr>\n",
       "                    <tr>\n",
       "                        <td style=\"text-align: left;\">\n",
       "                            <strong>Nanny: </strong> tcp://127.0.0.1:45553\n",
       "                        </td>\n",
       "                        <td style=\"text-align: left;\"></td>\n",
       "                    </tr>\n",
       "                    <tr>\n",
       "                        <td colspan=\"2\" style=\"text-align: left;\">\n",
       "                            <strong>Local directory: </strong> /home/runner/work/dask-examples/dask-examples/dask-worker-space/worker-ei436jch\n",
       "                        </td>\n",
       "                    </tr>\n",
       "\n",
       "                    \n",
       "\n",
       "                    \n",
       "\n",
       "                </table>\n",
       "            </details>\n",
       "            </div>\n",
       "        </div>\n",
       "        \n",
       "        <div style=\"margin-bottom: 20px;\">\n",
       "            <div style=\"width: 24px; height: 24px; background-color: #DBF5FF; border: 3px solid #4CC9FF; border-radius: 5px; position: absolute;\"> </div>\n",
       "            <div style=\"margin-left: 48px;\">\n",
       "            <details>\n",
       "                <summary>\n",
       "                    <h4 style=\"margin-bottom: 0px; display: inline;\">Worker: 3</h4>\n",
       "                </summary>\n",
       "                <table style=\"width: 100%; text-align: left;\">\n",
       "                    <tr>\n",
       "                        <td style=\"text-align: left;\">\n",
       "                            <strong>Comm: </strong> tcp://127.0.0.1:34709\n",
       "                        </td>\n",
       "                        <td style=\"text-align: left;\">\n",
       "                            <strong>Total threads: </strong> 1\n",
       "                        </td>\n",
       "                    </tr>\n",
       "                    <tr>\n",
       "                        <td style=\"text-align: left;\">\n",
       "                            <strong>Dashboard: </strong> <a href=\"http://127.0.0.1:35105/status\" target=\"_blank\">http://127.0.0.1:35105/status</a>\n",
       "                        </td>\n",
       "                        <td style=\"text-align: left;\">\n",
       "                            <strong>Memory: </strong> 1.70 GiB\n",
       "                        </td>\n",
       "                    </tr>\n",
       "                    <tr>\n",
       "                        <td style=\"text-align: left;\">\n",
       "                            <strong>Nanny: </strong> tcp://127.0.0.1:40747\n",
       "                        </td>\n",
       "                        <td style=\"text-align: left;\"></td>\n",
       "                    </tr>\n",
       "                    <tr>\n",
       "                        <td colspan=\"2\" style=\"text-align: left;\">\n",
       "                            <strong>Local directory: </strong> /home/runner/work/dask-examples/dask-examples/dask-worker-space/worker-uvqwrfkf\n",
       "                        </td>\n",
       "                    </tr>\n",
       "\n",
       "                    \n",
       "\n",
       "                    \n",
       "\n",
       "                </table>\n",
       "            </details>\n",
       "            </div>\n",
       "        </div>\n",
       "        \n",
       "\n",
       "    </details>\n",
       "</div>\n",
       "\n",
       "        </details>\n",
       "    </div>\n",
       "</div>\n",
       "            </details>\n",
       "        \n",
       "\n",
       "    </div>\n",
       "</div>"
      ],
      "text/plain": [
       "<Client: 'tcp://127.0.0.1:37071' processes=4 threads=4, memory=6.78 GiB>"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from dask.distributed import Client, progress\n",
    "client = Client(n_workers=4, threads_per_worker=1)\n",
    "client"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create Random Data\n",
    "\n",
    "We create a random set of record data and store it to disk as many JSON files.  This will serve as our data for this notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-07-27T19:11:05.066392Z",
     "iopub.status.busy": "2022-07-27T19:11:05.065869Z",
     "iopub.status.idle": "2022-07-27T19:11:06.045546Z",
     "shell.execute_reply": "2022-07-27T19:11:06.044659Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['/home/runner/work/dask-examples/dask-examples/data/0.json',\n",
       " '/home/runner/work/dask-examples/dask-examples/data/1.json',\n",
       " '/home/runner/work/dask-examples/dask-examples/data/2.json',\n",
       " '/home/runner/work/dask-examples/dask-examples/data/3.json',\n",
       " '/home/runner/work/dask-examples/dask-examples/data/4.json',\n",
       " '/home/runner/work/dask-examples/dask-examples/data/5.json',\n",
       " '/home/runner/work/dask-examples/dask-examples/data/6.json',\n",
       " '/home/runner/work/dask-examples/dask-examples/data/7.json',\n",
       " '/home/runner/work/dask-examples/dask-examples/data/8.json',\n",
       " '/home/runner/work/dask-examples/dask-examples/data/9.json']"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import dask\n",
    "import json\n",
    "import os\n",
    "\n",
    "os.makedirs('data', exist_ok=True)              # Create data/ directory\n",
    "\n",
    "b = dask.datasets.make_people()                 # Make records of people\n",
    "b.map(json.dumps).to_textfiles('data/*.json')   # Encode as JSON, write to disk"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Read JSON data\n",
    "\n",
    "Now that we have some JSON data in a file lets take a look at it with Dask Bag and Python JSON module."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-07-27T19:11:06.049185Z",
     "iopub.status.busy": "2022-07-27T19:11:06.048739Z",
     "iopub.status.idle": "2022-07-27T19:11:06.219200Z",
     "shell.execute_reply": "2022-07-27T19:11:06.218356Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\"age\": 61, \"name\": [\"Emiko\", \"Oliver\"], \"occupation\": \"Medical Student\", \"telephone\": \"166.814.5565\", \"address\": {\"address\": \"645 Drumm Line\", \"city\": \"Kennewick\"}, \"credit-card\": {\"number\": \"3792 459318 98518\", \"expiration-date\": \"12/23\"}}\r\n",
      "{\"age\": 54, \"name\": [\"Wendolyn\", \"Ortega\"], \"occupation\": \"Tractor Driver\", \"telephone\": \"1-975-090-1672\", \"address\": {\"address\": \"1274 Harbor Court\", \"city\": \"Mustang\"}, \"credit-card\": {\"number\": \"4600 5899 6829 6887\", \"expiration-date\": \"11/25\"}}\r\n"
     ]
    }
   ],
   "source": [
    "!head -n 2 data/0.json"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-07-27T19:11:06.223012Z",
     "iopub.status.busy": "2022-07-27T19:11:06.222660Z",
     "iopub.status.idle": "2022-07-27T19:11:06.231528Z",
     "shell.execute_reply": "2022-07-27T19:11:06.231076Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dask.bag<loads, npartitions=10>"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import dask.bag as db\n",
    "import json\n",
    "\n",
    "b = db.read_text('data/*.json').map(json.loads)\n",
    "b"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-07-27T19:11:06.234515Z",
     "iopub.status.busy": "2022-07-27T19:11:06.234184Z",
     "iopub.status.idle": "2022-07-27T19:11:06.270736Z",
     "shell.execute_reply": "2022-07-27T19:11:06.270111Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "({'age': 61,\n",
       "  'name': ['Emiko', 'Oliver'],\n",
       "  'occupation': 'Medical Student',\n",
       "  'telephone': '166.814.5565',\n",
       "  'address': {'address': '645 Drumm Line', 'city': 'Kennewick'},\n",
       "  'credit-card': {'number': '3792 459318 98518', 'expiration-date': '12/23'}},\n",
       " {'age': 54,\n",
       "  'name': ['Wendolyn', 'Ortega'],\n",
       "  'occupation': 'Tractor Driver',\n",
       "  'telephone': '1-975-090-1672',\n",
       "  'address': {'address': '1274 Harbor Court', 'city': 'Mustang'},\n",
       "  'credit-card': {'number': '4600 5899 6829 6887',\n",
       "   'expiration-date': '11/25'}})"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "b.take(2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Map, Filter, Aggregate\n",
    "\n",
    "We can process this data by filtering out only certain records of interest, mapping functions over it to process our data, and aggregating those results to a total value."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-07-27T19:11:06.274283Z",
     "iopub.status.busy": "2022-07-27T19:11:06.273886Z",
     "iopub.status.idle": "2022-07-27T19:11:06.303637Z",
     "shell.execute_reply": "2022-07-27T19:11:06.302652Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "({'age': 61,\n",
       "  'name': ['Emiko', 'Oliver'],\n",
       "  'occupation': 'Medical Student',\n",
       "  'telephone': '166.814.5565',\n",
       "  'address': {'address': '645 Drumm Line', 'city': 'Kennewick'},\n",
       "  'credit-card': {'number': '3792 459318 98518', 'expiration-date': '12/23'}},\n",
       " {'age': 54,\n",
       "  'name': ['Wendolyn', 'Ortega'],\n",
       "  'occupation': 'Tractor Driver',\n",
       "  'telephone': '1-975-090-1672',\n",
       "  'address': {'address': '1274 Harbor Court', 'city': 'Mustang'},\n",
       "  'credit-card': {'number': '4600 5899 6829 6887',\n",
       "   'expiration-date': '11/25'}})"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "b.filter(lambda record: record['age'] > 30).take(2)  # Select only people over 30"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-07-27T19:11:06.307595Z",
     "iopub.status.busy": "2022-07-27T19:11:06.307305Z",
     "iopub.status.idle": "2022-07-27T19:11:06.333420Z",
     "shell.execute_reply": "2022-07-27T19:11:06.332857Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "('Medical Student', 'Tractor Driver')"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "b.map(lambda record: record['occupation']).take(2)  # Select the occupation field"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-07-27T19:11:06.336503Z",
     "iopub.status.busy": "2022-07-27T19:11:06.335965Z",
     "iopub.status.idle": "2022-07-27T19:11:06.443307Z",
     "shell.execute_reply": "2022-07-27T19:11:06.442782Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "10000"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "b.count().compute()  # Count total number of records"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Chain computations\n",
    "\n",
    "It is common to do many of these steps in one pipeline, only calling `compute` or `take` at the end."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-07-27T19:11:06.446729Z",
     "iopub.status.busy": "2022-07-27T19:11:06.446300Z",
     "iopub.status.idle": "2022-07-27T19:11:06.454584Z",
     "shell.execute_reply": "2022-07-27T19:11:06.453916Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dask.bag<topk-aggregate, npartitions=1>"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result = (b.filter(lambda record: record['age'] > 30)\n",
    "           .map(lambda record: record['occupation'])\n",
    "           .frequencies(sort=True)\n",
    "           .topk(10, key=1))\n",
    "result"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As with all lazy Dask collections, we need to call `compute` to actually evaluate our result.  The `take` method used in earlier examples is also like `compute` and will also trigger computation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-07-27T19:11:06.458009Z",
     "iopub.status.busy": "2022-07-27T19:11:06.457452Z",
     "iopub.status.idle": "2022-07-27T19:11:06.613782Z",
     "shell.execute_reply": "2022-07-27T19:11:06.613162Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('Merchant', 16),\n",
       " ('Coroner', 14),\n",
       " ('Book Binder', 13),\n",
       " ('Medical Practitioner', 13),\n",
       " ('Payroll Supervisor', 13),\n",
       " ('Telecommunications', 13),\n",
       " ('Thermal Insulator', 13),\n",
       " ('Pattern Maker', 12),\n",
       " ('Advertising Executive', 12),\n",
       " ('Insurance Staff', 12)]"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result.compute()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Transform and Store\n",
    "\n",
    "Sometimes we want to compute aggregations as above, but sometimes we want to store results to disk for future analyses.  For that we can use methods like `to_textfiles` and `json.dumps`, or we can convert to Dask Dataframes and use their storage systems, which we'll see more of in the next section."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-07-27T19:11:06.617720Z",
     "iopub.status.busy": "2022-07-27T19:11:06.617294Z",
     "iopub.status.idle": "2022-07-27T19:11:06.755352Z",
     "shell.execute_reply": "2022-07-27T19:11:06.754614Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['/home/runner/work/dask-examples/dask-examples/data/processed.0.json',\n",
       " '/home/runner/work/dask-examples/dask-examples/data/processed.1.json',\n",
       " '/home/runner/work/dask-examples/dask-examples/data/processed.2.json',\n",
       " '/home/runner/work/dask-examples/dask-examples/data/processed.3.json',\n",
       " '/home/runner/work/dask-examples/dask-examples/data/processed.4.json',\n",
       " '/home/runner/work/dask-examples/dask-examples/data/processed.5.json',\n",
       " '/home/runner/work/dask-examples/dask-examples/data/processed.6.json',\n",
       " '/home/runner/work/dask-examples/dask-examples/data/processed.7.json',\n",
       " '/home/runner/work/dask-examples/dask-examples/data/processed.8.json',\n",
       " '/home/runner/work/dask-examples/dask-examples/data/processed.9.json']"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "(b.filter(lambda record: record['age'] > 30)  # Select records of interest\n",
    "  .map(json.dumps)                            # Convert Python objects to text\n",
    "  .to_textfiles('data/processed.*.json'))     # Write to local disk"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Convert to Dask Dataframes\n",
    "\n",
    "Dask Bags are good for reading in initial data, doing a bit of pre-processing, and then handing off to some other more efficient form like Dask Dataframes.  Dask Dataframes use Pandas internally, and so can be much faster on numeric data and also have more complex algorithms.  \n",
    "\n",
    "However, Dask Dataframes also expect data that is organized as flat columns.  It does not support nested JSON data very well (Bag is better for this).\n",
    "\n",
    "Here we make a function to flatten down our nested data structure, map that across our records, and then convert that to a Dask Dataframe."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-07-27T19:11:06.758945Z",
     "iopub.status.busy": "2022-07-27T19:11:06.758145Z",
     "iopub.status.idle": "2022-07-27T19:11:06.793464Z",
     "shell.execute_reply": "2022-07-27T19:11:06.791813Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "({'age': 61,\n",
       "  'name': ['Emiko', 'Oliver'],\n",
       "  'occupation': 'Medical Student',\n",
       "  'telephone': '166.814.5565',\n",
       "  'address': {'address': '645 Drumm Line', 'city': 'Kennewick'},\n",
       "  'credit-card': {'number': '3792 459318 98518', 'expiration-date': '12/23'}},)"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "b.take(1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-07-27T19:11:06.796992Z",
     "iopub.status.busy": "2022-07-27T19:11:06.796570Z",
     "iopub.status.idle": "2022-07-27T19:11:06.824927Z",
     "shell.execute_reply": "2022-07-27T19:11:06.824383Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "({'age': 61,\n",
       "  'occupation': 'Medical Student',\n",
       "  'telephone': '166.814.5565',\n",
       "  'credit-card-number': '3792 459318 98518',\n",
       "  'credit-card-expiration': '12/23',\n",
       "  'name': 'Emiko Oliver',\n",
       "  'street-address': '645 Drumm Line',\n",
       "  'city': 'Kennewick'},)"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "def flatten(record):\n",
    "    return {\n",
    "        'age': record['age'],\n",
    "        'occupation': record['occupation'],\n",
    "        'telephone': record['telephone'],\n",
    "        'credit-card-number': record['credit-card']['number'],\n",
    "        'credit-card-expiration': record['credit-card']['expiration-date'],\n",
    "        'name': ' '.join(record['name']),\n",
    "        'street-address': record['address']['address'],\n",
    "        'city': record['address']['city']   \n",
    "    }\n",
    "\n",
    "b.map(flatten).take(1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-07-27T19:11:06.828565Z",
     "iopub.status.busy": "2022-07-27T19:11:06.828007Z",
     "iopub.status.idle": "2022-07-27T19:11:07.428161Z",
     "shell.execute_reply": "2022-07-27T19:11:07.427562Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>age</th>\n",
       "      <th>occupation</th>\n",
       "      <th>telephone</th>\n",
       "      <th>credit-card-number</th>\n",
       "      <th>credit-card-expiration</th>\n",
       "      <th>name</th>\n",
       "      <th>street-address</th>\n",
       "      <th>city</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>61</td>\n",
       "      <td>Medical Student</td>\n",
       "      <td>166.814.5565</td>\n",
       "      <td>3792 459318 98518</td>\n",
       "      <td>12/23</td>\n",
       "      <td>Emiko Oliver</td>\n",
       "      <td>645 Drumm Line</td>\n",
       "      <td>Kennewick</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>54</td>\n",
       "      <td>Tractor Driver</td>\n",
       "      <td>1-975-090-1672</td>\n",
       "      <td>4600 5899 6829 6887</td>\n",
       "      <td>11/25</td>\n",
       "      <td>Wendolyn Ortega</td>\n",
       "      <td>1274 Harbor Court</td>\n",
       "      <td>Mustang</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>33</td>\n",
       "      <td>Doctor</td>\n",
       "      <td>107-044-4885</td>\n",
       "      <td>3464 081512 23342</td>\n",
       "      <td>03/20</td>\n",
       "      <td>Alvin Rich</td>\n",
       "      <td>1242 Vidal Plantation</td>\n",
       "      <td>Wyandotte</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>34</td>\n",
       "      <td>Counsellor</td>\n",
       "      <td>219-748-6795</td>\n",
       "      <td>4018 1801 8111 7757</td>\n",
       "      <td>08/23</td>\n",
       "      <td>Toccara Rogers</td>\n",
       "      <td>252 Sampson Drive</td>\n",
       "      <td>Parma Heights</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>34</td>\n",
       "      <td>Graphic Designer</td>\n",
       "      <td>1-509-313-7125</td>\n",
       "      <td>4886 7380 4681 0434</td>\n",
       "      <td>05/18</td>\n",
       "      <td>Randal Roberts</td>\n",
       "      <td>767 Telegraph Side road</td>\n",
       "      <td>New York</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   age        occupation       telephone   credit-card-number  \\\n",
       "0   61   Medical Student    166.814.5565    3792 459318 98518   \n",
       "1   54    Tractor Driver  1-975-090-1672  4600 5899 6829 6887   \n",
       "2   33            Doctor    107-044-4885    3464 081512 23342   \n",
       "3   34        Counsellor    219-748-6795  4018 1801 8111 7757   \n",
       "4   34  Graphic Designer  1-509-313-7125  4886 7380 4681 0434   \n",
       "\n",
       "  credit-card-expiration             name           street-address  \\\n",
       "0                  12/23     Emiko Oliver           645 Drumm Line   \n",
       "1                  11/25  Wendolyn Ortega        1274 Harbor Court   \n",
       "2                  03/20       Alvin Rich    1242 Vidal Plantation   \n",
       "3                  08/23   Toccara Rogers        252 Sampson Drive   \n",
       "4                  05/18   Randal Roberts  767 Telegraph Side road   \n",
       "\n",
       "            city  \n",
       "0      Kennewick  \n",
       "1        Mustang  \n",
       "2      Wyandotte  \n",
       "3  Parma Heights  \n",
       "4       New York  "
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = b.map(flatten).to_dataframe()\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can now perform the same computation as before, but now using Pandas and Dask dataframe."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-07-27T19:11:07.431326Z",
     "iopub.status.busy": "2022-07-27T19:11:07.430992Z",
     "iopub.status.idle": "2022-07-27T19:11:07.954480Z",
     "shell.execute_reply": "2022-07-27T19:11:07.953886Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Merchant                 16\n",
       "Coroner                  14\n",
       "Thermal Insulator        13\n",
       "Book Binder              13\n",
       "Payroll Supervisor       13\n",
       "Medical Practitioner     13\n",
       "Telecommunications       13\n",
       "Optometrist              12\n",
       "Advertising Assistant    12\n",
       "Care Manager             12\n",
       "Name: occupation, dtype: int64"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[df.age > 30].occupation.value_counts().nlargest(10).compute()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Learn More\n",
    "\n",
    "You may be interested in the following links:\n",
    "\n",
    "-  [Dask Bag Documentation](https://docs.dask.org/en/latest/bag.html)\n",
    "-  [API Documentation](http://docs.dask.org/en/latest/bag-api.html)\n",
    "-  [dask tutorial](https://github.com/dask/dask-tutorial), notebook 02, for a more in-depth introduction."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}