Stencil Computations with Numba
Contents
Live Notebook
You can run this notebook in a live session or view it on Github.
Stencil Computations with Numba¶
This notebook combines Numba, a high performance Python compiler, with Dask Arrays.
In particular we show off two Numba features, and how they compose with Dask:
Numba’s stencil decorator
NumPy’s Generalized Universal Functions
This was originally published as a blogposthere
Introduction to Numba Stencils¶
Many array computing functions operate only on a local region of the array. This is common in image processing, signals processing, simulation, the solution of differential equations, anomaly detection, time series analysis, and more. Typically we write code that looks like the following:
[1]:
def _smooth(x):
out = np.empty_like(x)
for i in range(1, x.shape[0]  1):
for j in range(1, x.shape[1]  1):
out[i, j] = (x[i + 1, j + 1] + x[i + 1, j + 0] + x[i + 1, j + 1] +
x[i + 0, j + 1] + x[i + 0, j + 0] + x[i + 0, j + 1] +
x[i + 1, j + 1] + x[i + 1, j + 0] + x[i + 1, j + 1]) // 9
return out
Or something similar. The numba.stencil
decorator makes this a bit easier to write down. You just write down what happens on every element, and Numba handles the rest.
[2]:
import numba
@numba.stencil
def _smooth(x):
return (x[1, 1] + x[1, 0] + x[1, 1] +
x[ 0, 1] + x[ 0, 0] + x[ 0, 1] +
x[ 1, 1] + x[ 1, 0] + x[ 1, 1]) // 9
When we run this function on a NumPy array, we find that it is slow, operating at Python speeds.
[3]:
import numpy as np
x = np.ones((100, 100))
%timeit _smooth(x)
368 ms ± 1.85 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
But if we JIT compile this function with Numba, then it runs more quickly.
[4]:
@numba.njit
def smooth(x):
return _smooth(x)
%timeit smooth(x)
175 µs ± 25.4 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
For those counting, that’s over 1000x faster!
Note: this function already exists as ``scipy.ndimage.uniform_filter``, which operates at the same speed.
Dask Array¶
In these applications people often have many such arrays and they want to apply this function over all of them. In principle they could do this with a for loop.
from glob import glob
import skimage.io
for fn in glob('/path/to/*.png'):
img = skimage.io.imread(fn)
out = smooth(img)
skimage.io.imsave(fn.replace('.png', '.out.png'), out)
If they wanted to then do this in parallel they would maybe use the multiprocessing or concurrent.futures modules. If they wanted to do this across a cluster then they could rewrite their code with PySpark or some other system.
Or, they could use Dask array, which will handle both the pipelining and the parallelism (single machine or on a cluster) all while still looking mostly like a NumPy array.
import dask_image
x = dask_image.imread('/path/to/*.png') # a large lazy array of all of our images
y = x.map_blocks(smooth, dtype='int8')
And then because each of the chunks of a Dask array are just NumPy arrays, we can use the map_blocks function to apply this function across all of our images, and then save them out.
This is fine, but lets go a bit further, and discuss generalized universal functions from NumPy.
Because we don’t have a stack of images nearby, we’re going to make a random array with similar structure.
[5]:
import dask.array as da
x = da.random.randint(0, 127, size=(10000, 1000, 1000), chunks=('64 MB', None, None), dtype='int8')
x
[5]:

Generalized Universal Functions¶
Numba Docs: https://numba.pydata.org/numbadoc/dev/user/vectorize.html
NumPy Docs: https://numpy.org/doc/stable/reference/capi/generalizedufuncs.html
A generalized universal function (gufunc) is a normal function that has been annotated with typing and dimension information. For example we can redefine our smooth
function as a gufunc as follows:
[6]:
@numba.guvectorize(
[(numba.int8[:, :], numba.int8[:, :])],
'(n, m) > (n, m)'
)
def smooth(x, out):
out[:] = _smooth(x)
This function knows that it consumes a 2d array of int8’s and produces a 2d array of int8’s of the same dimensions.
This sort of annotation is a small change, but it gives other systems like Dask enough information to use it intelligently. Rather than call functions like map_blocks
, we can just use the function directly, as if our Dask Array was just a very large NumPy array.
[7]:
# Before gufuncs
y = x.map_blocks(smooth, dtype='int8')
# After gufuncs
y = smooth(x)
y
[7]:

This is nice. If you write library code with gufunc semantics then that code just works with systems like Dask, without you having to build in explicit support for parallel computing. This makes the lives of users much easier.
Start Dask Client for Dashboard¶
Starting the Dask Client is optional. It will start the dashboard which is useful to gain insight on the computation.
[8]:
from dask.distributed import Client, progress
client = Client(threads_per_worker=4,
n_workers=1,
processes=False,
memory_limit='4GB')
client
[8]:
Client
Clientbb5807ea0de011eda0f1000d3a8f7959
Connection method: Cluster object  Cluster type: distributed.LocalCluster 
Dashboard: http://10.1.1.64:8787/status 
Cluster Info
LocalCluster
9d67a1da
Dashboard: http://10.1.1.64:8787/status  Workers: 1 
Total threads: 4  Total memory: 3.73 GiB 
Status: running  Using processes: False 
Scheduler Info
Scheduler
Scheduler0774d2abc8914a84bc8b4727c390aa3c
Comm: inproc://10.1.1.64/8433/1  Workers: 1 
Dashboard: http://10.1.1.64:8787/status  Total threads: 4 
Started: Just now  Total memory: 3.73 GiB 
Workers
Worker: 0
Comm: inproc://10.1.1.64/8433/4  Total threads: 4 
Dashboard: http://10.1.1.64:46077/status  Memory: 3.73 GiB 
Nanny: None  
Local directory: /home/runner/work/daskexamples/daskexamples/applications/daskworkerspace/workerebsbpsit 
[9]:
y.max().compute()
[9]:
122
GPU Version¶
Numba also supports a JIT compilation with CUDA on compatible GPU devices.
This gives about a 200x speedup over a CPU on a single V100 GPU using numba.cuda.jit.
import numba.cuda
@numba.cuda.jit
def smooth_gpu(x, out):
i, j = cuda.grid(2)
n, m = x.shape
if 1 <= i < n  1 and 1 <= j < m  1:
out[i, j] = (x[i  1, j  1] + x[i  1, j] + x[i  1, j + 1] +
x[i , j  1] + x[i , j] + x[i , j + 1] +
x[i + 1, j  1] + x[i + 1, j] + x[i + 1, j + 1]) // 9
import cupy, math
x_gpu = cupy.ones((10000, 10000), dtype='int8')
out_gpu = cupy.zeros((10000, 10000), dtype='int8')
# I copied the four lines below from the Numba docs
threadsperblock = (16, 16)
blockspergrid_x = math.ceil(x_gpu.shape[0] / threadsperblock[0])
blockspergrid_y = math.ceil(x_gpu.shape[1] / threadsperblock[1])
blockspergrid = (blockspergrid_x, blockspergrid_y)
smooth_gpu[blockspergrid, threadsperblock](x_gpu, out_gpu)
Full notebookhere