{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Dask Bags\n", "\n", "\n", "Dask Bag implements operations like `map`, `filter`, `groupby` and aggregations on collections of Python objects. It does this in parallel and with a small memory footprint using Python iterators. It is similar to a parallel version of `itertools` or a Pythonic version of the PySpark RDD.\n", "\n", "Dask Bags are often used to do simple preprocessing on log files, JSON records, or other user-defined Python objects.\n", "\n", "Full API documentation is available here: http://docs.dask.org/en/latest/bag-api.html" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Start Dask Client for Dashboard\n", "\n", "Starting the Dask Client is optional. It will provide a dashboard which \n", "is useful to gain insight into the computation. \n", "\n", "The link to the dashboard will become visible when you create the client below. We recommend having it open on one side of your screen while using your notebook on the other side. Arranging your windows can take some effort, but seeing both at the same time is very useful when learning." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "execution": { "iopub.execute_input": "2022-05-16T13:48:20.038460Z", "iopub.status.busy": "2022-05-16T13:48:20.037728Z", "iopub.status.idle": "2022-05-16T13:48:23.068040Z", "shell.execute_reply": "2022-05-16T13:48:23.067344Z" } }, "outputs": [ { "data": { "text/html": [ "
Client: Client-d8c74c6b-d51e-11ec-987d-000d3aeabb7a\n", "Connection method: Cluster object, Cluster type: distributed.LocalCluster\n", "Dashboard: http://127.0.0.1:8787/status\n", "LocalCluster ccf92864: 4 workers, 4 threads in total, 6.78 GiB total memory, status running, using processes\n", "Workers 0 to 3: 1 thread and 1.70 GiB memory each\n", "
" ], "text/plain": [ "" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from dask.distributed import Client, progress\n", "client = Client(n_workers=4, threads_per_worker=1)\n", "client" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create Random Data\n", "\n", "We create a random set of records and store them on disk as several JSON files. This will serve as our data for this notebook." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "execution": { "iopub.execute_input": "2022-05-16T13:48:23.071548Z", "iopub.status.busy": "2022-05-16T13:48:23.071156Z", "iopub.status.idle": "2022-05-16T13:48:24.044674Z", "shell.execute_reply": "2022-05-16T13:48:24.043836Z" } }, "outputs": [ { "data": { "text/plain": [ "['/home/runner/work/dask-examples/dask-examples/data/0.json',\n", " '/home/runner/work/dask-examples/dask-examples/data/1.json',\n", " '/home/runner/work/dask-examples/dask-examples/data/2.json',\n", " '/home/runner/work/dask-examples/dask-examples/data/3.json',\n", " '/home/runner/work/dask-examples/dask-examples/data/4.json',\n", " '/home/runner/work/dask-examples/dask-examples/data/5.json',\n", " '/home/runner/work/dask-examples/dask-examples/data/6.json',\n", " '/home/runner/work/dask-examples/dask-examples/data/7.json',\n", " '/home/runner/work/dask-examples/dask-examples/data/8.json',\n", " '/home/runner/work/dask-examples/dask-examples/data/9.json']" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import dask\n", "import json\n", "import os\n", "\n", "os.makedirs('data', exist_ok=True) # Create data/ directory\n", "\n", "b = dask.datasets.make_people() # Make records of people\n", "b.map(json.dumps).to_textfiles('data/*.json') # Encode as JSON, write to disk" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Read JSON data\n", "\n", "Now that we have some JSON data in files, let's take a look at it with Dask Bag and Python's 
JSON module." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "execution": { "iopub.execute_input": "2022-05-16T13:48:24.048220Z", "iopub.status.busy": "2022-05-16T13:48:24.047519Z", "iopub.status.idle": "2022-05-16T13:48:24.210720Z", "shell.execute_reply": "2022-05-16T13:48:24.210025Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\"age\": 38, \"name\": [\"Foster\", \"Mills\"], \"occupation\": \"Ice Cream Vendor\", \"telephone\": \"(797) 126-1527\", \"address\": {\"address\": \"718 Heritage Glen\", \"city\": \"Pompano Beach\"}, \"credit-card\": {\"number\": \"2603 3923 5402 6336\", \"expiration-date\": \"11/23\"}}\r\n", "{\"age\": 49, \"name\": [\"Hollis\", \"Carroll\"], \"occupation\": \"Publishing Manager\", \"telephone\": \"294.650.2561\", \"address\": {\"address\": \"384 Hillside Path\", \"city\": \"South Holland\"}, \"credit-card\": {\"number\": \"4839 9246 7834 2635\", \"expiration-date\": \"08/19\"}}\r\n" ] } ], "source": [ "!head -n 2 data/0.json" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "execution": { "iopub.execute_input": "2022-05-16T13:48:24.214334Z", "iopub.status.busy": "2022-05-16T13:48:24.213857Z", "iopub.status.idle": "2022-05-16T13:48:24.223566Z", "shell.execute_reply": "2022-05-16T13:48:24.223056Z" } }, "outputs": [ { "data": { "text/plain": [ "dask.bag" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import dask.bag as db\n", "import json\n", "\n", "b = db.read_text('data/*.json').map(json.loads)\n", "b" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "execution": { "iopub.execute_input": "2022-05-16T13:48:24.226847Z", "iopub.status.busy": "2022-05-16T13:48:24.226139Z", "iopub.status.idle": "2022-05-16T13:48:24.256269Z", "shell.execute_reply": "2022-05-16T13:48:24.254168Z" } }, "outputs": [ { "data": { "text/plain": [ "({'age': 38,\n", " 'name': ['Foster', 'Mills'],\n", " 'occupation': 'Ice Cream Vendor',\n", " 
'telephone': '(797) 126-1527',\n", " 'address': {'address': '718 Heritage Glen', 'city': 'Pompano Beach'},\n", " 'credit-card': {'number': '2603 3923 5402 6336',\n", " 'expiration-date': '11/23'}},\n", " {'age': 49,\n", " 'name': ['Hollis', 'Carroll'],\n", " 'occupation': 'Publishing Manager',\n", " 'telephone': '294.650.2561',\n", " 'address': {'address': '384 Hillside Path', 'city': 'South Holland'},\n", " 'credit-card': {'number': '4839 9246 7834 2635',\n", " 'expiration-date': '08/19'}})" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "b.take(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Map, Filter, Aggregate\n", "\n", "We can process this data by selecting only the records of interest, mapping functions over them, and aggregating the results into a total value." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "execution": { "iopub.execute_input": "2022-05-16T13:48:24.260230Z", "iopub.status.busy": "2022-05-16T13:48:24.259828Z", "iopub.status.idle": "2022-05-16T13:48:24.285797Z", "shell.execute_reply": "2022-05-16T13:48:24.285132Z" } }, "outputs": [ { "data": { "text/plain": [ "({'age': 38,\n", " 'name': ['Foster', 'Mills'],\n", " 'occupation': 'Ice Cream Vendor',\n", " 'telephone': '(797) 126-1527',\n", " 'address': {'address': '718 Heritage Glen', 'city': 'Pompano Beach'},\n", " 'credit-card': {'number': '2603 3923 5402 6336',\n", " 'expiration-date': '11/23'}},\n", " {'age': 49,\n", " 'name': ['Hollis', 'Carroll'],\n", " 'occupation': 'Publishing Manager',\n", " 'telephone': '294.650.2561',\n", " 'address': {'address': '384 Hillside Path', 'city': 'South Holland'},\n", " 'credit-card': {'number': '4839 9246 7834 2635',\n", " 'expiration-date': '08/19'}})" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "b.filter(lambda record: record['age'] > 30).take(2) # Select only people over 30" ] }, { "cell_type": 
"code", "execution_count": 7, "metadata": { "execution": { "iopub.execute_input": "2022-05-16T13:48:24.289370Z", "iopub.status.busy": "2022-05-16T13:48:24.289065Z", "iopub.status.idle": "2022-05-16T13:48:24.313829Z", "shell.execute_reply": "2022-05-16T13:48:24.313175Z" } }, "outputs": [ { "data": { "text/plain": [ "('Ice Cream Vendor', 'Publishing Manager')" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "b.map(lambda record: record['occupation']).take(2) # Select the occupation field" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "execution": { "iopub.execute_input": "2022-05-16T13:48:24.316446Z", "iopub.status.busy": "2022-05-16T13:48:24.316136Z", "iopub.status.idle": "2022-05-16T13:48:24.413977Z", "shell.execute_reply": "2022-05-16T13:48:24.413346Z" } }, "outputs": [ { "data": { "text/plain": [ "10000" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "b.count().compute() # Count total number of records" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Chain computations\n", "\n", "It is common to do many of these steps in one pipeline, only calling `compute` or `take` at the end." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "execution": { "iopub.execute_input": "2022-05-16T13:48:24.417019Z", "iopub.status.busy": "2022-05-16T13:48:24.416823Z", "iopub.status.idle": "2022-05-16T13:48:24.424247Z", "shell.execute_reply": "2022-05-16T13:48:24.423689Z" } }, "outputs": [ { "data": { "text/plain": [ "dask.bag" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "result = (b.filter(lambda record: record['age'] > 30)\n", " .map(lambda record: record['occupation'])\n", " .frequencies(sort=True)\n", " .topk(10, key=1))\n", "result" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As with all lazy Dask collections, we need to call `compute` to actually evaluate our result. 
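A minimal sketch of this laziness follows, assuming a small hypothetical in-memory list of records (loaded with `db.from_sequence`) in place of the JSON files used in this notebook, and the synchronous scheduler rather than the distributed client:

```python
import dask.bag as db

# Hypothetical records standing in for the JSON files used in this notebook.
records = [
    {'age': 25, 'occupation': 'Baker'},
    {'age': 41, 'occupation': 'Midwife'},
    {'age': 52, 'occupation': 'Midwife'},
    {'age': 67, 'occupation': 'Baker'},
    {'age': 33, 'occupation': 'Recorder'},
]
bag = db.from_sequence(records, npartitions=2)

# Building the pipeline only records a task graph; no record is processed yet.
lazy = (bag.filter(lambda record: record['age'] > 30)
           .map(lambda record: record['occupation'])
           .frequencies())

# compute() runs the graph and returns (value, count) pairs.
# The synchronous scheduler is an assumption made to keep the sketch single-process.
counts = dict(lazy.compute(scheduler='synchronous'))
print(counts)
```

Here `lazy` is still a Bag, and only the `compute` call turns it into concrete Python data.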
The `take` method used in earlier examples behaves like `compute` in that it also triggers computation." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "execution": { "iopub.execute_input": "2022-05-16T13:48:24.427098Z", "iopub.status.busy": "2022-05-16T13:48:24.426539Z", "iopub.status.idle": "2022-05-16T13:48:24.566971Z", "shell.execute_reply": "2022-05-16T13:48:24.566224Z" } }, "outputs": [ { "data": { "text/plain": [ "[('Midwife', 15),\n", " ('English Teacher', 14),\n", " ('Furniture Dealer', 14),\n", " ('Arbitrator', 14),\n", " ('Product Installer', 14),\n", " ('Aircraft Engineer', 13),\n", " ('Ledger Clerk', 13),\n", " ('Recorder', 13),\n", " ('Metal Worker', 13),\n", " ('Quality Inspector', 12)]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "result.compute()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Transform and Store\n", "\n", "Sometimes we want to compute aggregations as above, but sometimes we want to store results to disk for future analyses. For that we can combine `json.dumps` with the `to_textfiles` method, or we can convert to Dask Dataframes and use their storage systems, which we'll see more of in the next section." 
] }, { "cell_type": "code", "execution_count": 11, "metadata": { "execution": { "iopub.execute_input": "2022-05-16T13:48:24.570022Z", "iopub.status.busy": "2022-05-16T13:48:24.569820Z", "iopub.status.idle": "2022-05-16T13:48:24.723926Z", "shell.execute_reply": "2022-05-16T13:48:24.723354Z" } }, "outputs": [ { "data": { "text/plain": [ "['/home/runner/work/dask-examples/dask-examples/data/processed.0.json',\n", " '/home/runner/work/dask-examples/dask-examples/data/processed.1.json',\n", " '/home/runner/work/dask-examples/dask-examples/data/processed.2.json',\n", " '/home/runner/work/dask-examples/dask-examples/data/processed.3.json',\n", " '/home/runner/work/dask-examples/dask-examples/data/processed.4.json',\n", " '/home/runner/work/dask-examples/dask-examples/data/processed.5.json',\n", " '/home/runner/work/dask-examples/dask-examples/data/processed.6.json',\n", " '/home/runner/work/dask-examples/dask-examples/data/processed.7.json',\n", " '/home/runner/work/dask-examples/dask-examples/data/processed.8.json',\n", " '/home/runner/work/dask-examples/dask-examples/data/processed.9.json']" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(b.filter(lambda record: record['age'] > 30) # Select records of interest\n", " .map(json.dumps) # Convert Python objects to text\n", " .to_textfiles('data/processed.*.json')) # Write to local disk" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Convert to Dask Dataframes\n", "\n", "Dask Bags are good for reading in initial data, doing a bit of pre-processing, and then handing off to some other more efficient form like Dask Dataframes. Dask Dataframes use Pandas internally, and so can be much faster on numeric data and also have more complex algorithms. \n", "\n", "However, Dask Dataframes also expect data that is organized as flat columns. 
They do not support nested JSON data very well (Bag is better for this).\n", "\n", "Here we write a function that flattens our nested data structure, map it across our records, and then convert the result to a Dask Dataframe." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "execution": { "iopub.execute_input": "2022-05-16T13:48:24.727559Z", "iopub.status.busy": "2022-05-16T13:48:24.726962Z", "iopub.status.idle": "2022-05-16T13:48:24.752483Z", "shell.execute_reply": "2022-05-16T13:48:24.751975Z" } }, "outputs": [ { "data": { "text/plain": [ "({'age': 38,\n", " 'name': ['Foster', 'Mills'],\n", " 'occupation': 'Ice Cream Vendor',\n", " 'telephone': '(797) 126-1527',\n", " 'address': {'address': '718 Heritage Glen', 'city': 'Pompano Beach'},\n", " 'credit-card': {'number': '2603 3923 5402 6336',\n", " 'expiration-date': '11/23'}},)" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "b.take(1)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "execution": { "iopub.execute_input": "2022-05-16T13:48:24.755914Z", "iopub.status.busy": "2022-05-16T13:48:24.755410Z", "iopub.status.idle": "2022-05-16T13:48:24.780172Z", "shell.execute_reply": "2022-05-16T13:48:24.779599Z" } }, "outputs": [ { "data": { "text/plain": [ "({'age': 38,\n", " 'occupation': 'Ice Cream Vendor',\n", " 'telephone': '(797) 126-1527',\n", " 'credit-card-number': '2603 3923 5402 6336',\n", " 'credit-card-expiration': '11/23',\n", " 'name': 'Foster Mills',\n", " 'street-address': '718 Heritage Glen',\n", " 'city': 'Pompano Beach'},)" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def flatten(record):\n", "    return {\n", "        'age': record['age'],\n", "        'occupation': record['occupation'],\n", "        'telephone': record['telephone'],\n", "        'credit-card-number': record['credit-card']['number'],\n", "        'credit-card-expiration': record['credit-card']['expiration-date'],\n", "        'name': ' '.join(record['name']),\n", "        'street-address': record['address']['address'],\n", "        'city': record['address']['city']\n", "    }\n", "\n", "b.map(flatten).take(1)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "execution": { "iopub.execute_input": "2022-05-16T13:48:24.782975Z", "iopub.status.busy": "2022-05-16T13:48:24.782578Z", "iopub.status.idle": "2022-05-16T13:48:25.360762Z", "shell.execute_reply": "2022-05-16T13:48:25.360224Z" } }, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"<thead>\n",
"<tr><th></th><th>age</th><th>occupation</th><th>telephone</th><th>credit-card-number</th><th>credit-card-expiration</th><th>name</th><th>street-address</th><th>city</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>38</td><td>Ice Cream Vendor</td><td>(797) 126-1527</td><td>2603 3923 5402 6336</td><td>11/23</td><td>Foster Mills</td><td>718 Heritage Glen</td><td>Pompano Beach</td></tr>\n",
"<tr><th>1</th><td>49</td><td>Publishing Manager</td><td>294.650.2561</td><td>4839 9246 7834 2635</td><td>08/19</td><td>Hollis Carroll</td><td>384 Hillside Path</td><td>South Holland</td></tr>\n",
"<tr><th>2</th><td>18</td><td>School Inspector</td><td>941.109.0106</td><td>3770 881336 88192</td><td>01/24</td><td>Serita Mccray</td><td>619 Benton Drung</td><td>Littleton</td></tr>\n",
"<tr><th>3</th><td>61</td><td>Housing Supervisor</td><td>(940) 965-5931</td><td>3779 697762 12012</td><td>06/19</td><td>Leland Curry</td><td>488 Gladiolus Terrace</td><td>Citrus Heights</td></tr>\n",
"<tr><th>4</th><td>42</td><td>Ice Cream Vendor</td><td>(128) 669-6604</td><td>4088 2915 3269 5477</td><td>11/16</td><td>Janyce Good</td><td>1178 Cresta Vista Line</td><td>Lawrenceville</td></tr>\n",
"</tbody>\n",
"</table>\n", "
" ], "text/plain": [ " age occupation telephone credit-card-number \\\n", "0 38 Ice Cream Vendor (797) 126-1527 2603 3923 5402 6336 \n", "1 49 Publishing Manager 294.650.2561 4839 9246 7834 2635 \n", "2 18 School Inspector 941.109.0106 3770 881336 88192 \n", "3 61 Housing Supervisor (940) 965-5931 3779 697762 12012 \n", "4 42 Ice Cream Vendor (128) 669-6604 4088 2915 3269 5477 \n", "\n", " credit-card-expiration name street-address \\\n", "0 11/23 Foster Mills 718 Heritage Glen \n", "1 08/19 Hollis Carroll 384 Hillside Path \n", "2 01/24 Serita Mccray 619 Benton Drung \n", "3 06/19 Leland Curry 488 Gladiolus Terrace \n", "4 11/16 Janyce Good 1178 Cresta Vista Line \n", "\n", " city \n", "0 Pompano Beach \n", "1 South Holland \n", "2 Littleton \n", "3 Citrus Heights \n", "4 Lawrenceville " ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = b.map(flatten).to_dataframe()\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now perform the same computation as before, this time using Pandas and Dask Dataframe." 
] }, { "cell_type": "code", "execution_count": 15, "metadata": { "execution": { "iopub.execute_input": "2022-05-16T13:48:25.363490Z", "iopub.status.busy": "2022-05-16T13:48:25.363174Z", "iopub.status.idle": "2022-05-16T13:48:25.814260Z", "shell.execute_reply": "2022-05-16T13:48:25.813539Z" } }, "outputs": [ { "data": { "text/plain": [ "Midwife 15\n", "Arbitrator 14\n", "Furniture Dealer 14\n", "English Teacher 14\n", "Product Installer 14\n", "Recorder 13\n", "Aircraft Engineer 13\n", "Metal Worker 13\n", "Ledger Clerk 13\n", "History Teacher 12\n", "Name: occupation, dtype: int64" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[df.age > 30].occupation.value_counts().nlargest(10).compute()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Learn More\n", "\n", "You may be interested in the following links:\n", "\n", "- [Dask Bag Documentation](https://docs.dask.org/en/latest/bag.html)\n", "- [API Documentation](http://docs.dask.org/en/latest/bag-api.html)\n", "- [dask tutorial](https://github.com/dask/dask-tutorial), notebook 02, for a more in-depth introduction." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12" } }, "nbformat": 4, "nbformat_minor": 4 }