{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Xarray with Dask Arrays\n", "\n", "**[Xarray](http://xarray.pydata.org/en/stable/)** is an open-source project and Python package that extends the labeled data functionality of [Pandas](https://pandas.pydata.org/) to N-dimensional array-like datasets. Its API is similar to those of [NumPy](http://www.numpy.org/) and [Pandas](https://pandas.pydata.org/), and it supports both [Dask](https://dask.org/) and [NumPy](http://www.numpy.org/) arrays under the hood." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "execution": { "iopub.execute_input": "2022-07-27T19:14:14.990388Z", "iopub.status.busy": "2022-07-27T19:14:14.990139Z", "iopub.status.idle": "2022-07-27T19:14:16.610557Z", "shell.execute_reply": "2022-07-27T19:14:16.609763Z" } }, "outputs": [], "source": [ "%matplotlib inline\n", "\n", "from dask.distributed import Client\n", "import xarray as xr" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Start Dask Client for Dashboard\n", "\n", "Starting the Dask Client is optional. It will provide a dashboard that \n", "is useful for gaining insight into the computation. \n", "\n", "The link to the dashboard will become visible when you create the client below. We recommend having it open on one side of your screen while using your notebook on the other side. Arranging your windows can take some effort, but seeing both at the same time is very useful when learning." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "execution": { "iopub.execute_input": "2022-07-27T19:14:16.615961Z", "iopub.status.busy": "2022-07-27T19:14:16.615121Z", "iopub.status.idle": "2022-07-27T19:14:18.159546Z", "shell.execute_reply": "2022-07-27T19:14:18.159000Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "
\n", "
\n", "

Client

\n", "

Client-4eea680b-0de0-11ed-9d1a-000d3a8f7959

\n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", "
Connection method: Cluster objectCluster type: distributed.LocalCluster
\n", " Dashboard: http://127.0.0.1:8787/status\n", "
\n", "\n", " \n", "
\n", "

Cluster Info

\n", "
\n", "
\n", "
\n", "
\n", "

LocalCluster

\n", "

8c9bb588

\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", "\n", " \n", "
\n", " Dashboard: http://127.0.0.1:8787/status\n", " \n", " Workers: 2\n", "
\n", " Total threads: 4\n", " \n", " Total memory: 1.86 GiB\n", "
Status: runningUsing processes: True
\n", "\n", "
\n", " \n", "

Scheduler Info

\n", "
\n", "\n", "
\n", "
\n", "
\n", "
\n", "

Scheduler

\n", "

Scheduler-f26c7784-2ac7-471c-91ac-1b0f9e3135b1

\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", " Comm: tcp://127.0.0.1:36327\n", " \n", " Workers: 2\n", "
\n", " Dashboard: http://127.0.0.1:8787/status\n", " \n", " Total threads: 4\n", "
\n", " Started: Just now\n", " \n", " Total memory: 1.86 GiB\n", "
\n", "
\n", "
\n", "\n", "
\n", " \n", "

Workers

\n", "
\n", "\n", " \n", "
\n", "
\n", "
\n", "
\n", " \n", "

Worker: 0

\n", "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", "\n", " \n", "\n", "
\n", " Comm: tcp://127.0.0.1:34819\n", " \n", " Total threads: 2\n", "
\n", " Dashboard: http://127.0.0.1:39963/status\n", " \n", " Memory: 0.93 GiB\n", "
\n", " Nanny: tcp://127.0.0.1:38683\n", "
\n", " Local directory: /home/runner/work/dask-examples/dask-examples/dask-worker-space/worker-kg42o2xu\n", "
\n", "
\n", "
\n", "
\n", " \n", "
\n", "
\n", "
\n", "
\n", " \n", "

Worker: 1

\n", "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", "\n", " \n", "\n", "
\n", " Comm: tcp://127.0.0.1:39355\n", " \n", " Total threads: 2\n", "
\n", " Dashboard: http://127.0.0.1:43051/status\n", " \n", " Memory: 0.93 GiB\n", "
\n", " Nanny: tcp://127.0.0.1:43647\n", "
\n", " Local directory: /home/runner/work/dask-examples/dask-examples/dask-worker-space/worker-q4dslyjg\n", "
\n", "
\n", "
\n", "
\n", " \n", "\n", "
\n", "
\n", "\n", "
\n", "
\n", "
\n", "
\n", " \n", "\n", "
\n", "
" ], "text/plain": [ "" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "client = Client(n_workers=2, threads_per_worker=2, memory_limit='1GB')\n", "client" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Open a sample dataset\n", "\n", "We will use some of xarray's tutorial data for this example. When we specify the chunk shape, xarray automatically creates Dask arrays for each data variable in the `Dataset`. In xarray, a `Dataset` is a dict-like container of labeled arrays, analogous to a `pandas.DataFrame`. Note that we're taking advantage of xarray's dimension labels when specifying chunk shapes." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "execution": { "iopub.execute_input": "2022-07-27T19:14:18.162948Z", "iopub.status.busy": "2022-07-27T19:14:18.162479Z", "iopub.status.idle": "2022-07-27T19:14:18.703131Z", "shell.execute_reply": "2022-07-27T19:14:18.701890Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
<xarray.Dataset>\n",
       "Dimensions:  (lat: 25, time: 2920, lon: 53)\n",
       "Coordinates:\n",
       "  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0\n",
       "  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0\n",
       "  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00\n",
       "Data variables:\n",
       "    air      (time, lat, lon) float32 dask.array<chunksize=(2920, 25, 25), meta=np.ndarray>\n",
       "Attributes:\n",
       "    Conventions:  COARDS\n",
       "    title:        4x daily NMC reanalysis (1948)\n",
       "    description:  Data is from NMC initialized reanalysis\\n(4x/day).  These a...\n",
       "    platform:     Model\n",
       "    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
" ], "text/plain": [ "\n", "Dimensions: (lat: 25, time: 2920, lon: 53)\n", "Coordinates:\n", " * lat (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0\n", " * lon (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0\n", " * time (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00\n", "Data variables:\n", " air (time, lat, lon) float32 dask.array\n", "Attributes:\n", " Conventions: COARDS\n", " title: 4x daily NMC reanalysis (1948)\n", " description: Data is from NMC initialized reanalysis\\n(4x/day). These a...\n", " platform: Model\n", " references: http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly..." ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ds = xr.tutorial.open_dataset('air_temperature',\n", " chunks={'lat': 25, 'lon': 25, 'time': -1})\n", "ds" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Quickly inspecting the `Dataset` above, we'll note that this `Dataset` has three _dimensions_ akin to axes in NumPy (`lat`, `lon`, and `time`), three _coordinate variables_ akin to `pandas.Index` objects (also named `lat`, `lon`, and `time`), and one data variable (`air`). Xarray also holds Dataset specific metadata as _attributes_." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "execution": { "iopub.execute_input": "2022-07-27T19:14:18.706930Z", "iopub.status.busy": "2022-07-27T19:14:18.706390Z", "iopub.status.idle": "2022-07-27T19:14:18.729049Z", "shell.execute_reply": "2022-07-27T19:14:18.728535Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
<xarray.DataArray 'air' (time: 2920, lat: 25, lon: 53)>\n",
       "dask.array<open_dataset-ced301335a37488ca2d3a9447fa27157air, shape=(2920, 25, 53), dtype=float32, chunksize=(2920, 25, 25), chunktype=numpy.ndarray>\n",
       "Coordinates:\n",
       "  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0\n",
       "  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0\n",
       "  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00\n",
       "Attributes:\n",
       "    long_name:     4xDaily Air temperature at sigma level 995\n",
       "    units:         degK\n",
       "    precision:     2\n",
       "    GRIB_id:       11\n",
       "    GRIB_name:     TMP\n",
       "    var_desc:      Air temperature\n",
       "    dataset:       NMC Reanalysis\n",
       "    level_desc:    Surface\n",
       "    statistic:     Individual Obs\n",
       "    parent_stat:   Other\n",
       "    actual_range:  [185.16 322.1 ]
" ], "text/plain": [ "\n", "dask.array\n", "Coordinates:\n", " * lat (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0\n", " * lon (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0\n", " * time (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00\n", "Attributes:\n", " long_name: 4xDaily Air temperature at sigma level 995\n", " units: degK\n", " precision: 2\n", " GRIB_id: 11\n", " GRIB_name: TMP\n", " var_desc: Air temperature\n", " dataset: NMC Reanalysis\n", " level_desc: Surface\n", " statistic: Individual Obs\n", " parent_stat: Other\n", " actual_range: [185.16 322.1 ]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "da = ds['air']\n", "da" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each data variable in xarray is called a `DataArray`. These are the fundamental labeled array objects in xarray. Much like a `Dataset`, a `DataArray` also has _dimensions_ and _coordinates_ that support many of its label-based operations." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "execution": { "iopub.execute_input": "2022-07-27T19:14:18.732054Z", "iopub.status.busy": "2022-07-27T19:14:18.731747Z", "iopub.status.idle": "2022-07-27T19:14:18.742093Z", "shell.execute_reply": "2022-07-27T19:14:18.741568Z" } }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Array Chunk
Bytes 14.76 MiB 6.96 MiB
Shape (2920, 25, 53) (2920, 25, 25)
Count 4 Tasks 3 Chunks
Type float32 numpy.ndarray
\n", "
\n", " \n", "\n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", "\n", " \n", " \n", "\n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", "\n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", "\n", " \n", " 53\n", " 25\n", " 2920\n", "\n", "
" ], "text/plain": [ "dask.array" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "da.data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The underlying array of data is accessed via the `data` property. Here we can see that we have a Dask array. If this array were backed by a NumPy array, this property would point to the actual values in the array." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Use Standard Xarray Operations\n", "\n", "In almost all cases, operations using xarray objects are identical, regardless of whether the underlying data is stored as a Dask array or a NumPy array." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "execution": { "iopub.execute_input": "2022-07-27T19:14:18.744907Z", "iopub.status.busy": "2022-07-27T19:14:18.744498Z", "iopub.status.idle": "2022-07-27T19:14:18.810650Z", "shell.execute_reply": "2022-07-27T19:14:18.809976Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
<xarray.DataArray 'air' (time: 2920, lat: 25, lon: 53, month: 12)>\n",
       "dask.array<sub, shape=(2920, 25, 53, 12), dtype=float32, chunksize=(2920, 25, 25, 1), chunktype=numpy.ndarray>\n",
       "Coordinates:\n",
       "  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0\n",
       "  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0\n",
       "  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00\n",
       "  * month    (month) int64 1 2 3 4 5 6 7 8 9 10 11 12
" ], "text/plain": [ "\n", "dask.array\n", "Coordinates:\n", " * lat (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0\n", " * lon (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0\n", " * time (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00\n", " * month (month) int64 1 2 3 4 5 6 7 8 9 10 11 12" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "da2 = da.groupby('time.month').mean('time')\n", "da3 = da - da2\n", "da3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Call `.compute()` or `.load()` when you want your result as a `xarray.DataArray` with data stored as NumPy arrays.\n", "\n", "If you started `Client()` above then you may want to watch the status page during computation." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "execution": { "iopub.execute_input": "2022-07-27T19:14:18.813740Z", "iopub.status.busy": "2022-07-27T19:14:18.813301Z", "iopub.status.idle": "2022-07-27T19:14:20.374122Z", "shell.execute_reply": "2022-07-27T19:14:20.373492Z" } }, "outputs": [ { "data": { "text/plain": [ "numpy.ndarray" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "computed_da = da3.load()\n", "type(computed_da.data)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "execution": { "iopub.execute_input": "2022-07-27T19:14:20.377674Z", "iopub.status.busy": "2022-07-27T19:14:20.377307Z", "iopub.status.idle": "2022-07-27T19:14:20.429522Z", "shell.execute_reply": "2022-07-27T19:14:20.428835Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
<xarray.DataArray 'air' (time: 2920, lat: 25, lon: 53, month: 12)>\n",
       "array([[[[-5.14987183e+00, -5.47715759e+00, -9.83168030e+00, ...,\n",
       "          -2.06136017e+01, -1.25448456e+01, -6.77099609e+00],\n",
       "         [-3.88607788e+00, -3.90576172e+00, -8.17987061e+00, ...,\n",
       "          -1.87125549e+01, -1.11448669e+01, -5.52117920e+00],\n",
       "         [-2.71517944e+00, -2.44839478e+00, -6.68945312e+00, ...,\n",
       "          -1.70036011e+01, -9.99716187e+00, -4.41302490e+00],\n",
       "         ...,\n",
       "         [-1.02611389e+01, -9.05839539e+00, -9.39399719e+00, ...,\n",
       "          -1.53933716e+01, -1.01606750e+01, -6.97190857e+00],\n",
       "         [-8.58795166e+00, -7.50210571e+00, -7.61483765e+00, ...,\n",
       "          -1.35699463e+01, -8.43449402e+00, -5.52383423e+00],\n",
       "         [-7.04670715e+00, -5.84384155e+00, -5.70956421e+00, ...,\n",
       "          -1.18162537e+01, -6.54209900e+00, -4.02824402e+00]],\n",
       "\n",
       "        [[-5.05761719e+00, -4.00010681e+00, -9.17195129e+00, ...,\n",
       "          -2.52222595e+01, -1.53296814e+01, -5.93362427e+00],\n",
       "         [-4.40733337e+00, -3.25991821e+00, -8.36616516e+00, ...,\n",
       "          -2.44294434e+01, -1.41292725e+01, -5.66036987e+00],\n",
       "         [-4.01040649e+00, -2.77757263e+00, -7.87347412e+00, ...,\n",
       "          -2.40147858e+01, -1.34914398e+01, -5.78581238e+00],\n",
       "...\n",
       "          -3.56890869e+00, -2.47412109e+00, -1.16558838e+00],\n",
       "         [ 6.08795166e-01,  1.47219849e+00,  1.11965942e+00, ...,\n",
       "          -3.59872437e+00, -2.50396729e+00, -1.15667725e+00],\n",
       "         [ 6.59942627e-01,  1.48742676e+00,  1.03787231e+00, ...,\n",
       "          -3.84628296e+00, -2.71829224e+00, -1.33132935e+00]],\n",
       "\n",
       "        [[ 5.35827637e-01,  4.01092529e-01,  3.08258057e-01, ...,\n",
       "          -1.68054199e+00, -1.12142944e+00, -1.90887451e-01],\n",
       "         [ 8.51684570e-01,  8.73504639e-01,  6.26892090e-01, ...,\n",
       "          -1.33462524e+00, -7.66601562e-01,  1.03210449e-01],\n",
       "         [ 1.04107666e+00,  1.23202515e+00,  8.63311768e-01, ...,\n",
       "          -1.06607056e+00, -5.31036377e-01,  3.14453125e-01],\n",
       "         ...,\n",
       "         [ 4.72015381e-01,  1.32940674e+00,  1.15509033e+00, ...,\n",
       "          -3.23403931e+00, -2.23956299e+00, -1.11035156e+00],\n",
       "         [ 4.14459229e-01,  1.23419189e+00,  1.07876587e+00, ...,\n",
       "          -3.47311401e+00, -2.56188965e+00, -1.37548828e+00],\n",
       "         [ 5.35278320e-02,  8.10333252e-01,  6.73461914e-01, ...,\n",
       "          -4.07232666e+00, -3.12890625e+00, -1.84762573e+00]]]],\n",
       "      dtype=float32)\n",
       "Coordinates:\n",
       "  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0\n",
       "  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0\n",
       "  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00\n",
       "  * month    (month) int64 1 2 3 4 5 6 7 8 9 10 11 12
" ], "text/plain": [ "\n", "array([[[[-5.14987183e+00, -5.47715759e+00, -9.83168030e+00, ...,\n", " -2.06136017e+01, -1.25448456e+01, -6.77099609e+00],\n", " [-3.88607788e+00, -3.90576172e+00, -8.17987061e+00, ...,\n", " -1.87125549e+01, -1.11448669e+01, -5.52117920e+00],\n", " [-2.71517944e+00, -2.44839478e+00, -6.68945312e+00, ...,\n", " -1.70036011e+01, -9.99716187e+00, -4.41302490e+00],\n", " ...,\n", " [-1.02611389e+01, -9.05839539e+00, -9.39399719e+00, ...,\n", " -1.53933716e+01, -1.01606750e+01, -6.97190857e+00],\n", " [-8.58795166e+00, -7.50210571e+00, -7.61483765e+00, ...,\n", " -1.35699463e+01, -8.43449402e+00, -5.52383423e+00],\n", " [-7.04670715e+00, -5.84384155e+00, -5.70956421e+00, ...,\n", " -1.18162537e+01, -6.54209900e+00, -4.02824402e+00]],\n", "\n", " [[-5.05761719e+00, -4.00010681e+00, -9.17195129e+00, ...,\n", " -2.52222595e+01, -1.53296814e+01, -5.93362427e+00],\n", " [-4.40733337e+00, -3.25991821e+00, -8.36616516e+00, ...,\n", " -2.44294434e+01, -1.41292725e+01, -5.66036987e+00],\n", " [-4.01040649e+00, -2.77757263e+00, -7.87347412e+00, ...,\n", " -2.40147858e+01, -1.34914398e+01, -5.78581238e+00],\n", "...\n", " -3.56890869e+00, -2.47412109e+00, -1.16558838e+00],\n", " [ 6.08795166e-01, 1.47219849e+00, 1.11965942e+00, ...,\n", " -3.59872437e+00, -2.50396729e+00, -1.15667725e+00],\n", " [ 6.59942627e-01, 1.48742676e+00, 1.03787231e+00, ...,\n", " -3.84628296e+00, -2.71829224e+00, -1.33132935e+00]],\n", "\n", " [[ 5.35827637e-01, 4.01092529e-01, 3.08258057e-01, ...,\n", " -1.68054199e+00, -1.12142944e+00, -1.90887451e-01],\n", " [ 8.51684570e-01, 8.73504639e-01, 6.26892090e-01, ...,\n", " -1.33462524e+00, -7.66601562e-01, 1.03210449e-01],\n", " [ 1.04107666e+00, 1.23202515e+00, 8.63311768e-01, ...,\n", " -1.06607056e+00, -5.31036377e-01, 3.14453125e-01],\n", " ...,\n", " [ 4.72015381e-01, 1.32940674e+00, 1.15509033e+00, ...,\n", " -3.23403931e+00, -2.23956299e+00, -1.11035156e+00],\n", " [ 4.14459229e-01, 1.23419189e+00, 1.07876587e+00, 
...,\n", " -3.47311401e+00, -2.56188965e+00, -1.37548828e+00],\n", " [ 5.35278320e-02, 8.10333252e-01, 6.73461914e-01, ...,\n", " -4.07232666e+00, -3.12890625e+00, -1.84762573e+00]]]],\n", " dtype=float32)\n", "Coordinates:\n", " * lat (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0\n", " * lon (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0\n", " * time (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00\n", " * month (month) int64 1 2 3 4 5 6 7 8 9 10 11 12" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "computed_da" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Persist data in memory\n", "\n", "If you have enough RAM available for your dataset, you can persist the data in memory. \n", "\n", "This can make subsequent computations much faster, because the persisted chunks do not need to be loaded or recomputed again." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "execution": { "iopub.execute_input": "2022-07-27T19:14:20.432764Z", "iopub.status.busy": "2022-07-27T19:14:20.432251Z", "iopub.status.idle": "2022-07-27T19:14:20.449806Z", "shell.execute_reply": "2022-07-27T19:14:20.444113Z" } }, "outputs": [], "source": [ "da = da.persist()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Time Series Operations\n", "\n", "Because we have a datetime index, time-series operations work efficiently. Here we demonstrate the use of xarray's `resample` method:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "execution": { "iopub.execute_input": "2022-07-27T19:14:20.452861Z", "iopub.status.busy": "2022-07-27T19:14:20.452409Z", "iopub.status.idle": "2022-07-27T19:14:20.707216Z", "shell.execute_reply": "2022-07-27T19:14:20.706694Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
<xarray.DataArray 'air' (lat: 25, lon: 53)>\n",
       "dask.array<_sqrt, shape=(25, 53), dtype=float32, chunksize=(25, 25), chunktype=numpy.ndarray>\n",
       "Coordinates:\n",
       "  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0\n",
       "  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
" ], "text/plain": [ "\n", "dask.array<_sqrt, shape=(25, 53), dtype=float32, chunksize=(25, 25), chunktype=numpy.ndarray>\n", "Coordinates:\n", " * lat (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0\n", " * lon (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "da.resample(time='1w').mean('time').std('time')" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "execution": { "iopub.execute_input": "2022-07-27T19:14:20.718149Z", "iopub.status.busy": "2022-07-27T19:14:20.710650Z", "iopub.status.idle": "2022-07-27T19:14:22.778848Z", "shell.execute_reply": "2022-07-27T19:14:22.777972Z" } }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "da.resample(time='1w').mean('time').std('time').load().plot(figsize=(12, 8))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "and rolling window operations:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "execution": { "iopub.execute_input": "2022-07-27T19:14:22.782427Z", "iopub.status.busy": "2022-07-27T19:14:22.781914Z", "iopub.status.idle": "2022-07-27T19:14:22.887586Z", "shell.execute_reply": "2022-07-27T19:14:22.885001Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
<xarray.DataArray 'air' (time: 2920, lat: 25, lon: 53)>\n",
       "dask.array<truediv, shape=(2920, 25, 53), dtype=float64, chunksize=(2920, 25, 25), chunktype=numpy.ndarray>\n",
       "Coordinates:\n",
       "  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0\n",
       "  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0\n",
       "  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00\n",
       "Attributes:\n",
       "    long_name:     4xDaily Air temperature at sigma level 995\n",
       "    units:         degK\n",
       "    precision:     2\n",
       "    GRIB_id:       11\n",
       "    GRIB_name:     TMP\n",
       "    var_desc:      Air temperature\n",
       "    dataset:       NMC Reanalysis\n",
       "    level_desc:    Surface\n",
       "    statistic:     Individual Obs\n",
       "    parent_stat:   Other\n",
       "    actual_range:  [185.16 322.1 ]
" ], "text/plain": [ "\n", "dask.array\n", "Coordinates:\n", " * lat (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0\n", " * lon (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0\n", " * time (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00\n", "Attributes:\n", " long_name: 4xDaily Air temperature at sigma level 995\n", " units: degK\n", " precision: 2\n", " GRIB_id: 11\n", " GRIB_name: TMP\n", " var_desc: Air temperature\n", " dataset: NMC Reanalysis\n", " level_desc: Surface\n", " statistic: Individual Obs\n", " parent_stat: Other\n", " actual_range: [185.16 322.1 ]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "da_smooth = da.rolling(time=30).mean().persist()\n", "da_smooth" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since xarray stores each of its coordinate variables in memory, slicing by label is trivial and entirely lazy." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "execution": { "iopub.execute_input": "2022-07-27T19:14:22.897318Z", "iopub.status.busy": "2022-07-27T19:14:22.895132Z", "iopub.status.idle": "2022-07-27T19:14:22.941105Z", "shell.execute_reply": "2022-07-27T19:14:22.940345Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 1.05 ms, sys: 2.82 ms, total: 3.87 ms\n", "Wall time: 7.08 ms\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
<xarray.DataArray 'air' (lat: 25, lon: 53)>\n",
       "dask.array<getitem, shape=(25, 53), dtype=float32, chunksize=(25, 25), chunktype=numpy.ndarray>\n",
       "Coordinates:\n",
       "  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0\n",
       "  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0\n",
       "    time     datetime64[ns] 2013-01-01T18:00:00\n",
       "Attributes:\n",
       "    long_name:     4xDaily Air temperature at sigma level 995\n",
       "    units:         degK\n",
       "    precision:     2\n",
       "    GRIB_id:       11\n",
       "    GRIB_name:     TMP\n",
       "    var_desc:      Air temperature\n",
       "    dataset:       NMC Reanalysis\n",
       "    level_desc:    Surface\n",
       "    statistic:     Individual Obs\n",
       "    parent_stat:   Other\n",
       "    actual_range:  [185.16 322.1 ]
" ], "text/plain": [ "\n", "dask.array\n", "Coordinates:\n", " * lat (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0\n", " * lon (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0\n", " time datetime64[ns] 2013-01-01T18:00:00\n", "Attributes:\n", " long_name: 4xDaily Air temperature at sigma level 995\n", " units: degK\n", " precision: 2\n", " GRIB_id: 11\n", " GRIB_name: TMP\n", " var_desc: Air temperature\n", " dataset: NMC Reanalysis\n", " level_desc: Surface\n", " statistic: Individual Obs\n", " parent_stat: Other\n", " actual_range: [185.16 322.1 ]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%time da.sel(time='2013-01-01T18:00:00')" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "execution": { "iopub.execute_input": "2022-07-27T19:14:22.947428Z", "iopub.status.busy": "2022-07-27T19:14:22.946356Z", "iopub.status.idle": "2022-07-27T19:14:23.075228Z", "shell.execute_reply": "2022-07-27T19:14:23.074353Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 23.5 ms, sys: 7.2 ms, total: 30.7 ms\n", "Wall time: 91.1 ms\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
<xarray.DataArray 'air' (lat: 25, lon: 53)>\n",
       "array([[241.89   , 241.79999, 241.79999, ..., 234.39   , 235.5    ,\n",
       "        237.59999],\n",
       "       [246.29999, 245.29999, 244.2    , ..., 230.89   , 231.5    ,\n",
       "        234.5    ],\n",
       "       [256.6    , 254.7    , 252.09999, ..., 230.7    , 231.79999,\n",
       "        236.09999],\n",
       "       ...,\n",
       "       [296.6    , 296.4    , 296.     , ..., 296.5    , 295.79   ,\n",
       "        295.29   ],\n",
       "       [297.     , 297.5    , 297.1    , ..., 296.79   , 296.6    ,\n",
       "        296.29   ],\n",
       "       [297.5    , 297.69998, 297.5    , ..., 297.79   , 298.     ,\n",
       "        297.9    ]], dtype=float32)\n",
       "Coordinates:\n",
       "  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0\n",
       "  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0\n",
       "    time     datetime64[ns] 2013-01-01T18:00:00\n",
       "Attributes:\n",
       "    long_name:     4xDaily Air temperature at sigma level 995\n",
       "    units:         degK\n",
       "    precision:     2\n",
       "    GRIB_id:       11\n",
       "    GRIB_name:     TMP\n",
       "    var_desc:      Air temperature\n",
       "    dataset:       NMC Reanalysis\n",
       "    level_desc:    Surface\n",
       "    statistic:     Individual Obs\n",
       "    parent_stat:   Other\n",
       "    actual_range:  [185.16 322.1 ]
" ], "text/plain": [ "\n", "array([[241.89 , 241.79999, 241.79999, ..., 234.39 , 235.5 ,\n", " 237.59999],\n", " [246.29999, 245.29999, 244.2 , ..., 230.89 , 231.5 ,\n", " 234.5 ],\n", " [256.6 , 254.7 , 252.09999, ..., 230.7 , 231.79999,\n", " 236.09999],\n", " ...,\n", " [296.6 , 296.4 , 296. , ..., 296.5 , 295.79 ,\n", " 295.29 ],\n", " [297. , 297.5 , 297.1 , ..., 296.79 , 296.6 ,\n", " 296.29 ],\n", " [297.5 , 297.69998, 297.5 , ..., 297.79 , 298. ,\n", " 297.9 ]], dtype=float32)\n", "Coordinates:\n", " * lat (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0\n", " * lon (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0\n", " time datetime64[ns] 2013-01-01T18:00:00\n", "Attributes:\n", " long_name: 4xDaily Air temperature at sigma level 995\n", " units: degK\n", " precision: 2\n", " GRIB_id: 11\n", " GRIB_name: TMP\n", " var_desc: Air temperature\n", " dataset: NMC Reanalysis\n", " level_desc: Surface\n", " statistic: Individual Obs\n", " parent_stat: Other\n", " actual_range: [185.16 322.1 ]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%time da.sel(time='2013-01-01T18:00:00').load()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Custom workflows and automatic parallelization" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Almost all of xarray’s built-in operations work on Dask arrays. If you want to use a function that isn’t wrapped by xarray, one option is to extract Dask arrays from xarray objects (.data) and use Dask directly.\n", "\n", "Another option is to use xarray’s `apply_ufunc()` function, which can automate embarrassingly parallel “map” type operations where a function written for processing NumPy arrays should be repeatedly applied to xarray objects containing Dask arrays. 
It works similarly to `dask.array.map_blocks()` and `dask.array.blockwise()`, but without requiring an intermediate layer of abstraction.\n", "\n", "Here we show an example using NumPy operations and a fast function from `bottleneck`, which we use to calculate Spearman’s rank-correlation coefficient:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "execution": { "iopub.execute_input": "2022-07-27T19:14:23.081037Z", "iopub.status.busy": "2022-07-27T19:14:23.078773Z", "iopub.status.idle": "2022-07-27T19:14:23.095483Z", "shell.execute_reply": "2022-07-27T19:14:23.093372Z" } }, "outputs": [], "source": [ "import numpy as np\n", "import xarray as xr\n", "import bottleneck\n", "\n", "def covariance_gufunc(x, y):\n", "    return ((x - x.mean(axis=-1, keepdims=True))\n", "            * (y - y.mean(axis=-1, keepdims=True))).mean(axis=-1)\n", "\n", "def pearson_correlation_gufunc(x, y):\n", "    return covariance_gufunc(x, y) / (x.std(axis=-1) * y.std(axis=-1))\n", "\n", "def spearman_correlation_gufunc(x, y):\n", "    x_ranks = bottleneck.rankdata(x, axis=-1)\n", "    y_ranks = bottleneck.rankdata(y, axis=-1)\n", "    return pearson_correlation_gufunc(x_ranks, y_ranks)\n", "\n", "def spearman_correlation(x, y, dim):\n", "    return xr.apply_ufunc(\n", "        spearman_correlation_gufunc, x, y,\n", "        input_core_dims=[[dim], [dim]],\n", "        dask='parallelized',\n", "        output_dtypes=[float])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the examples above, we were working with some air temperature data. For this example, we'll calculate the Spearman correlation between the raw air temperature data and the smoothed version that we also created (`da_smooth`). Because `apply_ufunc` with `dask='parallelized'` needs the core dimension (`time`) to sit in a single chunk, we also rechunk the data ahead of time." 
] }, { "cell_type": "code", "execution_count": 16, "metadata": { "execution": { "iopub.execute_input": "2022-07-27T19:14:23.103207Z", "iopub.status.busy": "2022-07-27T19:14:23.101006Z", "iopub.status.idle": "2022-07-27T19:14:23.163083Z", "shell.execute_reply": "2022-07-27T19:14:23.162556Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
<xarray.DataArray 'air' (lat: 25, lon: 53)>\n",
       "dask.array<transpose, shape=(25, 53), dtype=float64, chunksize=(25, 25), chunktype=numpy.ndarray>\n",
       "Coordinates:\n",
       "  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0\n",
       "  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
" ], "text/plain": [ "\n", "dask.array\n", "Coordinates:\n", " * lat (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0\n", " * lon (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "corr = spearman_correlation(da.chunk({'time': -1}),\n", " da_smooth.chunk({'time': -1}),\n", " 'time')\n", "corr" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "execution": { "iopub.execute_input": "2022-07-27T19:14:23.167769Z", "iopub.status.busy": "2022-07-27T19:14:23.166590Z", "iopub.status.idle": "2022-07-27T19:14:23.849126Z", "shell.execute_reply": "2022-07-27T19:14:23.848425Z" } }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "corr.plot(figsize=(12, 8))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12" } }, "nbformat": 4, "nbformat_minor": 4 }