Custom Workloads with Futures
Dask futures provide fine-grained real-time execution for custom situations. This is the foundation for other APIs like Dask arrays and dataframes.
Start Dask Client
Unlike for arrays and dataframes, you need the Dask client to use the Futures interface. Additionally, the client provides a dashboard, which is useful for gaining insight into the computation.
The link to the dashboard will become visible when you create the client below. We recommend having it open on one side of your screen while using your notebook on the other side. It can take some effort to arrange your windows, but seeing both at the same time is very useful when learning.
from dask.distributed import Client, progress

client = Client(threads_per_worker=4, n_workers=1)
client
|Connection method: Cluster object||Cluster type: distributed.LocalCluster|
|Dashboard: http://127.0.0.1:8787/status||Workers: 1|
|Total threads: 4||Total memory: 6.78 GiB|
|Status: running||Using processes: True|
|Comm: tcp://127.0.0.1:38745||Workers: 1|
|Dashboard: http://127.0.0.1:8787/status||Total threads: 4|
|Started: Just now||Total memory: 6.78 GiB|
|Comm: tcp://127.0.0.1:37435||Total threads: 4|
|Dashboard: http://127.0.0.1:36253/status||Memory: 6.78 GiB|
|Local directory: /home/runner/work/dask-examples/dask-examples/dask-worker-space/worker-5e0um0el|
Create simple functions
These functions do simple operations, like adding two numbers together, but they sleep for a random amount of time to simulate real work.
import time
import random

def inc(x):
    time.sleep(random.random())
    return x + 1

def double(x):
    time.sleep(random.random())
    return 2 * x

def add(x, y):
    time.sleep(random.random())
    return x + y
We can run them locally
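A local call is an ordinary function call that blocks until the result is ready. The sketch below redefines the three functions with a much shorter sleep (an assumption made only so the block runs quickly on its own) and calls them directly.

```python
import random
import time


def inc(x):
    time.sleep(random.random() / 100)  # shortened sleep, for illustration only
    return x + 1


def double(x):
    time.sleep(random.random() / 100)
    return 2 * x


def add(x, y):
    time.sleep(random.random() / 100)
    return x + y


# Each call blocks the caller until the function returns.
local_results = (inc(1), double(2), add(1, 2))
print(local_results)
```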
Or we can submit them to run remotely with Dask. This immediately returns a future that points to the ongoing computation, and eventually to the stored result.
future = client.submit(inc, 1)  # returns immediately with pending future
future
If you wait a second, and then check on the future again, you’ll see that it has finished.
future # scheduler and client talk constantly
You can block on the computation and gather the result with the .result() method.
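Dask futures follow the same submit-then-result pattern as the standard library's concurrent.futures, so the blocking behavior can be sketched without a cluster. The block below uses ThreadPoolExecutor as a stand-in for the Dask client (an assumption made for portability, not the tutorial's own setup); with a Dask client the corresponding calls would be client.submit(inc, 1) and future.result().

```python
import time
from concurrent.futures import ThreadPoolExecutor


def inc(x):
    time.sleep(0.05)  # stand-in for real work
    return x + 1


# ThreadPoolExecutor stands in for the Dask client in this sketch.
with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(inc, 1)  # returns immediately with a pending future
    value = future.result()       # blocks until inc(1) has finished
    finished = future.done()      # True once the result is available

print(value, finished)
```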
You can submit tasks on other futures. This will create a dependency between the inputs and outputs. Dask will track the execution of all tasks, ensuring that downstream tasks are run at the proper time and place and with the proper data.
x = client.submit(inc, 1)
y = client.submit(double, 2)
z = client.submit(add, x, y)
z
Note that we never blocked on x or y, nor did we ever have to move their data back to our notebook.
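To check what z should eventually hold, the same three-task graph can be evaluated locally. This is a plain-Python sketch without the sleeps. It is not how Dask runs the tasks, only a way to confirm the expected value.

```python
def inc(x):
    return x + 1


def double(x):
    return 2 * x


def add(x, y):
    return x + y


x = inc(1)     # 2
y = double(2)  # 4
z = add(x, y)  # 6, the value the future z should resolve to
print(z)
```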
Submit many tasks
So we’ve learned how to run Python functions remotely. This becomes useful when we add two things:
We can submit thousands of tasks per second
Tasks can depend on each other by consuming futures as inputs
We submit many tasks that depend on each other in a normal Python for loop.
zs = []
%%time

for i in range(256):
    x = client.submit(inc, i)     # x = inc(i)
    y = client.submit(double, x)  # y = double(x)
    z = client.submit(add, x, y)  # z = add(x, y)
    zs.append(z)
CPU times: user 2.58 s, sys: 92.2 ms, total: 2.68 s Wall time: 2.57 s
total = client.submit(sum, zs)
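The value that total should resolve to can be worked out locally. For each i, x = i + 1, y = 2 * x, and z = x + y = 3 * (i + 1), so the sum over 256 inputs has a closed form. The helper name expected_total below is hypothetical and exists only for this check.

```python
def expected_total(n):
    """Sum of add(inc(i), double(inc(i))) for i in range(n), computed locally."""
    total = 0
    for i in range(n):
        x = i + 1    # inc(i)
        y = 2 * x    # double(x)
        z = x + y    # add(x, y)
        total += z
    return total


expected = expected_total(256)
closed_form = 3 * 256 * 257 // 2  # 3 * (1 + 2 + ... + 256)
print(expected, closed_form)
```

Calling total.result() on the Dask future should return the same number.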
To make this go faster, add additional workers with more cores (although we’re still only working on our local machine, this is more practical when using an actual cluster).
client.cluster.scale(10) # ask for ten 4-thread workers
Custom computation: Tree summation
As an example of a non-trivial algorithm, consider the classic tree reduction. We accomplish this with a nested for loop and a bit of normal Python logic.
finish           total             single output
    ^          /        \
    |        c1          c2        neighbors merge
    |       /  \        /  \
    |     b1    b2    b3    b4     neighbors merge
    ^    / \   / \   / \   / \
start   a1 a2 a3 a4 a5 a6 a7 a8    many inputs
L = zs
while len(L) > 1:
    new_L = []
    for i in range(0, len(L), 2):
        future = client.submit(add, L[i], L[i + 1])  # add neighbors
        new_L.append(future)
    L = new_L  # swap old list for new
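The loop above works because 256 is a power of two. If any level had an odd number of items, L[i + 1] would raise an IndexError. A minimal sketch of a variant that carries the odd item up unchanged follows. The names tree_reduce and combine are hypothetical. With a Dask client, combine could be lambda a, b: client.submit(add, a, b); plain integers are used here so the block runs without a cluster.

```python
from operator import add


def tree_reduce(items, combine):
    """Pairwise-combine neighbors, level by level, until one item remains."""
    level = list(items)
    if not level:
        raise ValueError("tree_reduce needs at least one item")
    while len(level) > 1:
        next_level = []
        # Combine complete neighbor pairs: (0, 1), (2, 3), ...
        for i in range(0, len(level) - 1, 2):
            next_level.append(combine(level[i], level[i + 1]))
        # An unpaired last item moves up to the next level unchanged.
        if len(level) % 2 == 1:
            next_level.append(level[-1])
        level = next_level
    return level[0]


result = tree_reduce(range(1, 8), add)  # seven inputs, an odd count
print(result)
```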
If you’re watching the dashboard’s status page then you may want to note two things:
The red bars are for inter-worker communication. They occur when different workers need to combine their intermediate values.
There is lots of parallelism at the beginning but less towards the end as we reach the top of the tree where there is less work to do.
Alternatively you may want to navigate to the dashboard’s graph page and then run the cell above again. You will be able to see the task graph evolve during the computation.
Building a computation dynamically
In the examples above we explicitly specify the task graph ahead of time. We know, for example, that the first two futures in the list L will be added together.
This isn’t always best, though. Sometimes you want to define a computation dynamically as it is happening. For example, we might want to sum these values based on whichever futures show up first, rather than the order in which they were placed in the list to start with.
For this, we can use operations like as_completed.
We recommend watching the dashboard’s graph page when running this computation. You should see the graph construct itself during execution.
del future, L, new_L, total # clear out some old work
from dask.distributed import as_completed

zs = client.map(inc, zs)
seq = as_completed(zs)

while seq.count() > 1:  # at least two futures left
    a = next(seq)
    b = next(seq)
    new = client.submit(add, a, b, priority=1)  # add them together
    seq.add(new)  # add new future back into loop
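The same "combine whatever finishes first" idea can be sketched with the standard library alone. This is a stand-in under stated assumptions, not Dask's implementation. It uses concurrent.futures.wait with FIRST_COMPLETED in place of Dask's as_completed, and it passes resolved values to add because standard-library executors do not accept futures as arguments. The names slow_inc and reduce_as_completed are hypothetical.

```python
import random
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def slow_inc(x):
    time.sleep(random.random() / 100)  # finish in an unpredictable order
    return x + 1


def add(x, y):
    return x + y


def reduce_as_completed(pool, futures):
    """Sum future results, pairing them in completion order."""
    pending = set(futures)
    if not pending:
        raise ValueError("need at least one future")
    ready = []  # finished futures not yet combined
    while pending or len(ready) > 1:
        # Pair up finished futures and submit their sum as a new task.
        while len(ready) >= 2:
            a, b = ready.pop(), ready.pop()
            pending.add(pool.submit(add, a.result(), b.result()))
        if pending:
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            ready.extend(done)
    return ready[0].result()


with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(slow_inc, i) for i in range(20)]
    total = reduce_as_completed(pool, futures)

print(total)
```

Because addition is associative and commutative, the total does not depend on the order in which the futures complete. Only the shape of the task graph changes from run to run.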