Handle Evolving Workflows

Live Notebook

You can run this notebook in a live session or view it on Github.

Handle Evolving Workflows¶

For some workflows we don’t know the extent of the computation at the outset. We need to do some computation in order to figure out the rest of the computation that we need to do. The computation grows and evolves as we do more work.

As an example, consider a situation where you need to read many files and then based on the contents of those files, fire off additional work. You would like to read the files in parallel, and then within each file expose more parallelism.

This example goes through three ways to handle this situation using Dask Futures

Using as_completed
Using async/await
Launching tasks from tasks

But first, lets run our code sequentially.

0: Sequential code¶

[1]:

filenames = ["file.{}.txt".format(i) for i in range(10)]

filenames[:3]

[1]:

['file.0.txt', 'file.1.txt', 'file.2.txt']

[2]:

import random, time


def parse_file(fn: str) -> list:
    """ Returns a list work items of unknown length """
    time.sleep(random.random())
    return [random.random() for _ in range(random.randint(1, 10))]

def process_item(x: float):
    """ Process each work item """
    time.sleep(random.random() / 4)
    return x + 1

[3]:

%%time

# This takes around 10-20s

results = []

for fn in filenames:
    L = parse_file(fn)
    for x in L:
        out = process_item(x)
        results.append(out)

CPU times: user 8.63 ms, sys: 1 ms, total: 9.63 ms
Wall time: 13.4 s

Start Dask Client¶

We’ll need a Dask client in order to manage dynamic workloads

[4]:

from dask.distributed import Client

client = Client(processes=False, n_workers=1, threads_per_worker=6)
client

[4]:

Client

Client-8cd18990-0de0-11ed-9f5a-000d3a8f7959

Connection method: Cluster object	Cluster type: distributed.LocalCluster
Dashboard: http://10.1.1.64:8787/status

Cluster Info

LocalCluster

136395a1

Dashboard: http://10.1.1.64:8787/status	Workers: 1
Total threads: 6	Total memory: 6.78 GiB
Status: running	Using processes: False

Scheduler Info

Scheduler

Scheduler-32e093ca-1ad3-4cda-a0bf-b68890befde7

Comm: inproc://10.1.1.64/8026/1	Workers: 1
Dashboard: http://10.1.1.64:8787/status	Total threads: 6
Started: Just now	Total memory: 6.78 GiB

Workers

Worker: 0

Comm: inproc://10.1.1.64/8026/4	Total threads: 6
Dashboard: http://10.1.1.64:39259/status	Memory: 6.78 GiB
Nanny: None
Local directory: /home/runner/work/dask-examples/dask-examples/applications/dask-worker-space/worker-v95r5hsj

1: Use as_completed¶

The as_completed iterator lets us handle futures as they complete. We can then submit more data on the fly.

We submit a task for each of our filenames
We also compute the length of each of the returned lists
As those lengths return, we submit off a new task to get each item of that list. We do this at higher priority, so that we process existing data before we collect new data.
We wait on all of the returned results

[5]:

%%time

from dask.distributed import as_completed
import operator

lists = client.map(parse_file, filenames, pure=False)
lengths = client.map(len, lists)

mapping = dict(zip(lengths, lists))

futures = []

for future in as_completed(lengths):
    n = future.result()
    L = mapping[future]
    for i in range(n):
        new = client.submit(operator.getitem, L, i, priority=1)
        new = client.submit(process_item, new, priority=1)
        futures.append(new)

client.gather(futures)

CPU times: user 764 ms, sys: 92.2 ms, total: 856 ms
Wall time: 2.6 s

[5]:

[1.855253279843388,
 1.7726329902181985,
 1.259836642564709,
 1.3256571039929872,
 1.5880998029929763,
 1.5138789437567746,
 1.5792286074896462,
 1.3147689951092234,
 1.3172076038757292,
 1.262171473735921,
 1.4714750086893678,
 1.201396011598346,
 1.0228797089919106,
 1.100749927496291,
 1.312524856134222,
 1.6853261123541268,
 1.7828341588461032,
 1.353256049336657,
 1.5560048785929426,
 1.909180985889757,
 1.3652870032952238,
 1.0200787324818752,
 1.6887029263541482,
 1.8419102255817514,
 1.7088922634312127,
 1.0272413357700962,
 1.9966520355684192,
 1.7515633492540368,
 1.1784571947389346,
 1.2626529965584385,
 1.776210436365392,
 1.1980741841717615,
 1.6535182981420282,
 1.652425843466223,
 1.8679564994356905,
 1.4829294673653362,
 1.5798547500541158,
 1.140733801463906,
 1.8522280296306772]

2: Use async/await to handle single file processing locally¶

We can also handle the concurrency here within our local process. This requires you to understand async/await syntax, but is generally powerful and arguably simpler than the as_completed approach above.

[6]:

import asyncio

async def f(fn):
    """ Handle the lifecycle of a single file """
    future = client.submit(parse_file, fn, pure=False)
    length_future = client.submit(len, future)
    length = await length_future

    futures = [client.submit(operator.getitem, future, i, priority=10)
               for i in range(length)]
    futures = client.map(process_item, futures, priority=10)
    return futures

async def run_all(filenames):
    list_of_list_of_futures = await asyncio.gather(*[f(fn) for fn in filenames])
    futures = sum(list_of_list_of_futures, [])
    return await client.gather(futures)

We now need to run this function in the same event loop as our client is running. If we had started our client asynchronously, then we could have done this:

client = await Client(asynchronous=True)

await run_all(filenames)

However, because we started our client without the asynchronous=True flag the event loop is actually running in a separate thread, so we’ll have to ask the client to run this for us.

[7]:

client.sync(run_all, filenames)

[7]:

[1.800289116905064,
 1.9871081844250762,
 1.4822374404826442,
 1.5398817452739677,
 1.7815421767702233,
 1.799086158704496,
 1.3684979870560656,
 1.7599086452147468,
 1.481710657277671,
 1.498441345100709,
 1.4189325403582265,
 1.4919506969852225,
 1.784110902042696,
 1.300637697987958,
 1.67384020477976,
 1.9299992134781654,
 1.5169156883991846,
 1.3547854189286748,
 1.0158519081255195,
 1.2905126990047529,
 1.5902047470140799,
 1.3370247692084685,
 1.8606232787906918,
 1.0593086235586204,
 1.2708399290512835,
 1.6014307072413518,
 1.719953843514872,
 1.811378106188461,
 1.196149275085125,
 1.1182782176944635,
 1.233133146375533,
 1.9828811027418132,
 1.0789456550316718,
 1.8491111490176726,
 1.895451013583132,
 1.3219157394238925,
 1.7080716916510053,
 1.3280857033684856,
 1.0402444828283657,
 1.6131248314712203,
 1.2100254915413697,
 1.092415131966796,
 1.93960521763009,
 1.793696097887303,
 1.6268308430745564,
 1.1389083858458733,
 1.6461270076543006,
 1.087262577270262,
 1.5348789887982262,
 1.7953259319984842]

3: Submit tasks from tasks¶

We can also submit tasks that themselves submit more tasks. See documentation here.

[8]:

%%time

from dask.distributed import get_client, secede, rejoin

def f(fn):
    L = parse_file(fn)
    client = get_client()

    futures = client.map(process_item, L, priority=10)
    secede()
    results = client.gather(futures)
    rejoin()
    return results

futures = client.map(f, filenames, pure=False)
results = client.gather(futures)

CPU times: user 357 ms, sys: 20.7 ms, total: 378 ms
Wall time: 2.14 s

Embarrassingly parallel Workloads

Image Processing

Dask Examples documentation