{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Dask Arrays\n", "\n", "\"Dask\n", " \n", "Dask arrays coordinate many Numpy arrays, arranged into chunks within a grid. They support a large subset of the Numpy API." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Start Dask Client for Dashboard\n", "\n", "Starting the Dask Client is optional. It will provide a dashboard which \n", "is useful to gain insight on the computation. \n", "\n", "The link to the dashboard will become visible when you create the client below. We recommend having it open on one side of your screen while using your notebook on the other side. This can take some effort to arrange your windows, but seeing them both at the same is very useful when learning." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "execution": { "iopub.execute_input": "2022-07-27T19:10:55.442672Z", "iopub.status.busy": "2022-07-27T19:10:55.442027Z", "iopub.status.idle": "2022-07-27T19:10:56.703070Z", "shell.execute_reply": "2022-07-27T19:10:56.702553Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "
\n", "
\n", "

Client

\n", "

Client-d7513320-0ddf-11ed-9808-000d3a8f7959

\n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", "
Connection method: Cluster objectCluster type: distributed.LocalCluster
\n", " Dashboard: http://10.1.1.64:8787/status\n", "
\n", "\n", " \n", "
\n", "

Cluster Info

\n", "
\n", "
\n", "
\n", "
\n", "

LocalCluster

\n", "

eced48ba

\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", "\n", " \n", "
\n", " Dashboard: http://10.1.1.64:8787/status\n", " \n", " Workers: 1\n", "
\n", " Total threads: 4\n", " \n", " Total memory: 1.86 GiB\n", "
Status: runningUsing processes: False
\n", "\n", "
\n", " \n", "

Scheduler Info

\n", "
\n", "\n", "
\n", "
\n", "
\n", "
\n", "

Scheduler

\n", "

Scheduler-245cbcab-5c52-43bc-bcad-524a2981a5bf

\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", " Comm: inproc://10.1.1.64/6152/1\n", " \n", " Workers: 1\n", "
\n", " Dashboard: http://10.1.1.64:8787/status\n", " \n", " Total threads: 4\n", "
\n", " Started: Just now\n", " \n", " Total memory: 1.86 GiB\n", "
\n", "
\n", "
\n", "\n", "
\n", " \n", "

Workers

\n", "
\n", "\n", " \n", "
\n", "
\n", "
\n", "
\n", " \n", "

Worker: 0

\n", "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", "\n", " \n", "\n", "
\n", " Comm: inproc://10.1.1.64/6152/4\n", " \n", " Total threads: 4\n", "
\n", " Dashboard: http://10.1.1.64:36121/status\n", " \n", " Memory: 1.86 GiB\n", "
\n", " Nanny: None\n", "
\n", " Local directory: /home/runner/work/dask-examples/dask-examples/dask-worker-space/worker-94bm6jfp\n", "
\n", "
\n", "
\n", "
\n", " \n", "\n", "
\n", "
\n", "\n", "
\n", "
\n", "
\n", "
\n", " \n", "\n", "
\n", "
" ], "text/plain": [ "" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from dask.distributed import Client, progress\n", "client = Client(processes=False, threads_per_worker=4,\n", " n_workers=1, memory_limit='2GB')\n", "client" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create Random array\n", "\n", "This creates a 10000x10000 array of random numbers, represented as many numpy arrays of size 1000x1000 (or smaller if the array cannot be divided evenly). In this case there are 100 (10x10) numpy arrays of size 1000x1000." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "execution": { "iopub.execute_input": "2022-07-27T19:10:56.733881Z", "iopub.status.busy": "2022-07-27T19:10:56.733296Z", "iopub.status.idle": "2022-07-27T19:10:56.974478Z", "shell.execute_reply": "2022-07-27T19:10:56.973800Z" } }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Array Chunk
Bytes 762.94 MiB 7.63 MiB
Shape (10000, 10000) (1000, 1000)
Count 100 Tasks 100 Chunks
Type float64 numpy.ndarray
\n", "
\n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", "\n", " \n", " 10000\n", " 10000\n", "\n", "
" ], "text/plain": [ "dask.array" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import dask.array as da\n", "x = da.random.random((10000, 10000), chunks=(1000, 1000))\n", "x" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use NumPy syntax as usual" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "execution": { "iopub.execute_input": "2022-07-27T19:10:56.978190Z", "iopub.status.busy": "2022-07-27T19:10:56.977693Z", "iopub.status.idle": "2022-07-27T19:10:56.992401Z", "shell.execute_reply": "2022-07-27T19:10:56.991890Z" } }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Array Chunk
Bytes 39.06 kiB 3.91 kiB
Shape (5000,) (500,)
Count 430 Tasks 10 Chunks
Type float64 numpy.ndarray
\n", "
\n", " \n", "\n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", "\n", " \n", " 5000\n", " 1\n", "\n", "
" ], "text/plain": [ "dask.array" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y = x + x.T\n", "z = y[::2, 5000:].mean(axis=1)\n", "z" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Call `.compute()` when you want your result as a NumPy array.\n", "\n", "If you started `Client()` above then you may want to watch the status page during computation." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "execution": { "iopub.execute_input": "2022-07-27T19:10:56.995860Z", "iopub.status.busy": "2022-07-27T19:10:56.995238Z", "iopub.status.idle": "2022-07-27T19:10:57.994768Z", "shell.execute_reply": "2022-07-27T19:10:57.994181Z" } }, "outputs": [ { "data": { "text/plain": [ "array([1.00226063, 1.01066798, 1.00353892, ..., 1.00020978, 1.00972641,\n", " 0.99609573])" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "z.compute()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Persist data in memory\n", "\n", "If you have the available RAM for your dataset then you can persist data in memory. \n", "\n", "This allows future computations to be much faster." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "execution": { "iopub.execute_input": "2022-07-27T19:10:57.998288Z", "iopub.status.busy": "2022-07-27T19:10:57.997795Z", "iopub.status.idle": "2022-07-27T19:10:58.054299Z", "shell.execute_reply": "2022-07-27T19:10:58.027749Z" } }, "outputs": [], "source": [ "y = y.persist()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "execution": { "iopub.execute_input": "2022-07-27T19:10:58.059088Z", "iopub.status.busy": "2022-07-27T19:10:58.057863Z", "iopub.status.idle": "2022-07-27T19:10:59.111061Z", "shell.execute_reply": "2022-07-27T19:10:59.110575Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 1.53 s, sys: 338 ms, total: 1.86 s\n", "Wall time: 1.04 s\n" ] }, { "data": { "text/plain": [ "0.6048766839597692" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%time y[0, 0].compute()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "execution": { "iopub.execute_input": "2022-07-27T19:10:59.117179Z", "iopub.status.busy": "2022-07-27T19:10:59.116525Z", "iopub.status.idle": "2022-07-27T19:10:59.424633Z", "shell.execute_reply": "2022-07-27T19:10:59.424059Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 399 ms, sys: 53.2 ms, total: 452 ms\n", "Wall time: 298 ms\n" ] }, { "data": { "text/plain": [ "99992368.08411336" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%time y.sum().compute()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Further Reading \n", "\n", "A more in-depth guide to working with Dask arrays can be found in the [dask tutorial](https://github.com/dask/dask-tutorial), notebook 03." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12" } }, "nbformat": 4, "nbformat_minor": 4 }