Dask Bags
Dask Bag implements operations like map, filter, groupby, and aggregations on collections of Python objects. It does this in parallel and in small memory using Python iterators. It is similar to a parallel version of itertools or a Pythonic version of the PySpark RDD.
Dask Bags are often used to do simple preprocessing on log files, JSON records, or other user-defined Python objects.
Full API documentation is available here: http://docs.dask.org/en/latest/bag-api.html
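Before diving into files on disk, here is a minimal sketch of these operations on a small in-memory sequence (toy data, purely for illustration):
import dask.bag as db

# Build a tiny bag and chain filter, map, and an aggregation
toy = db.from_sequence(range(10), npartitions=2)
result = (toy.filter(lambda x: x % 2 == 0)  # keep even numbers
             .map(lambda x: x ** 2)         # square them
             .sum())                        # aggregate to a single value
result.compute()                            # -> 120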
Start Dask Client for Dashboard
Starting the Dask Client is optional. It will provide a dashboard, which is useful to gain insight into the computation.
The link to the dashboard will become visible when you create the client below. We recommend having it open on one side of your screen while using your notebook on the other side. It can take some effort to arrange your windows, but seeing them both at the same time is very useful when learning.
[1]:
from dask.distributed import Client, progress
client = Client(n_workers=4, threads_per_worker=1)
client
[1]:
Client / Cluster summary (shows the scheduler address, dashboard link, and worker details)
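If the dashboard link is not visible in the output above, the Client object also exposes it directly; for example:
# The dashboard URL is also available programmatically on the client
client.dashboard_link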
Create Random Data
We create a random set of record data and store it to disk as many JSON files. This will serve as our data for this notebook.
[2]:
import dask
import json
import os
os.makedirs('data', exist_ok=True) # Create data/ directory
b = dask.datasets.make_people() # Make records of people
b.map(json.dumps).to_textfiles('data/*.json') # Encode as JSON, write to disk
[2]:
['/home/runner/work/dask-examples/dask-examples/data/0.json',
'/home/runner/work/dask-examples/dask-examples/data/1.json',
'/home/runner/work/dask-examples/dask-examples/data/2.json',
'/home/runner/work/dask-examples/dask-examples/data/3.json',
'/home/runner/work/dask-examples/dask-examples/data/4.json',
'/home/runner/work/dask-examples/dask-examples/data/5.json',
'/home/runner/work/dask-examples/dask-examples/data/6.json',
'/home/runner/work/dask-examples/dask-examples/data/7.json',
'/home/runner/work/dask-examples/dask-examples/data/8.json',
'/home/runner/work/dask-examples/dask-examples/data/9.json']
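As an aside, if you want a larger or smaller dataset, make_people accepts arguments such as npartitions and records_per_partition in recent Dask versions (check your version if this errors); a small sketch:
# Hedged sketch: control the size of the generated dataset
small = dask.datasets.make_people(npartitions=2, records_per_partition=100)
small.count().compute()   # -> 200 records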
Read JSON data
Now that we have some JSON data in files, let's take a look at it with Dask Bag and the Python json module.
[3]:
!head -n 2 data/0.json
{"age": 17, "name": ["Sang", "Harrington"], "occupation": "Acoustic Engineer", "telephone": "555.305.0242", "address": {"address": "60 Ocean Trail", "city": "La Mesa"}, "credit-card": {"number": "5564 8609 9238 8995", "expiration-date": "06/21"}}
{"age": 61, "name": ["Rolando", "Calhoun"], "occupation": "Aerial Erector", "telephone": "1-073-052-3379", "address": {"address": "6 Zoe Garden", "city": "Waukegan"}, "credit-card": {"number": "5406 0743 6038 6720", "expiration-date": "01/16"}}
[4]:
import dask.bag as db
import json
b = db.read_text('data/*.json').map(json.loads)
b
[4]:
dask.bag<loads, npartitions=10>
[5]:
b.take(2)
[5]:
({'age': 17,
'name': ['Sang', 'Harrington'],
'occupation': 'Acoustic Engineer',
'telephone': '555.305.0242',
'address': {'address': '60 Ocean Trail', 'city': 'La Mesa'},
'credit-card': {'number': '5564 8609 9238 8995',
'expiration-date': '06/21'}},
{'age': 61,
'name': ['Rolando', 'Calhoun'],
'occupation': 'Aerial Erector',
'telephone': '1-073-052-3379',
'address': {'address': '6 Zoe Garden', 'city': 'Waukegan'},
'credit-card': {'number': '5406 0743 6038 6720',
'expiration-date': '01/16'}})
Map, Filter, Aggregate
We can process this data by filtering for records of interest, mapping functions over them to transform our data, and aggregating those results into a total value.
[6]:
b.filter(lambda record: record['age'] > 30).take(2) # Select only people over 30
[6]:
({'age': 61,
'name': ['Rolando', 'Calhoun'],
'occupation': 'Aerial Erector',
'telephone': '1-073-052-3379',
'address': {'address': '6 Zoe Garden', 'city': 'Waukegan'},
'credit-card': {'number': '5406 0743 6038 6720',
'expiration-date': '01/16'}},
{'age': 51,
'name': ['Curt', 'Garza'],
'occupation': 'Ceiling Fixer',
'telephone': '(994) 213-9475',
'address': {'address': '1190 Mark Twain Plantation', 'city': 'Bettendorf'},
'credit-card': {'number': '3794 811438 69084', 'expiration-date': '05/17'}})
[7]:
b.map(lambda record: record['occupation']).take(2) # Select the occupation field
[7]:
('Acoustic Engineer', 'Aerial Erector')
[8]:
b.count().compute() # Count total number of records
[8]:
10000
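As an aside, selecting a single field is common enough that Bag provides a pluck method, which is equivalent to the map call above:
# pluck(field) is shorthand for map(lambda record: record[field])
b.pluck('occupation').take(2)   # -> ('Acoustic Engineer', 'Aerial Erector')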
Chain computations
It is common to do many of these steps in one pipeline, only calling compute or take at the end.
[9]:
result = (b.filter(lambda record: record['age'] > 30)
.map(lambda record: record['occupation'])
.frequencies(sort=True)
.topk(10, key=1))
result
[9]:
dask.bag<topk-aggregate, npartitions=1>
As with all lazy Dask collections, we need to call compute to actually evaluate our result. The take method used in earlier examples is similar to compute in that it also triggers computation.
[10]:
result.compute()
[10]:
[('Miner', 17),
('Gynaecologist', 15),
('Store Detective', 14),
('Barmaid', 14),
('Health Visitor', 14),
('Trinity House Pilot', 14),
('Tug Skipper', 13),
('Machine Minder', 13),
('Journalist', 12),
('Photographer', 12)]
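For grouped aggregations that frequencies does not cover, Bag also provides foldby, which accumulates partial results within each partition and then combines them. Here is a sketch computing the mean age per occupation; the binop and combine helpers are our own, not part of Dask:
def binop(acc, record):
    # accumulate a (count, total age) pair for one record
    count, total = acc
    return count + 1, total + record['age']

def combine(left, right):
    # merge partial (count, total) pairs from different partitions
    return left[0] + right[0], left[1] + right[1]

mean_age = (b.foldby(lambda record: record['occupation'],
                     binop, (0, 0), combine, (0, 0))
             .map(lambda kv: (kv[0], kv[1][1] / kv[1][0]))  # (occupation, mean age)
             .topk(5, key=1))
mean_age.compute()
foldby is usually much faster than groupby on a Bag because it avoids a full shuffle of the data.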
Transform and Store
Sometimes we want to compute aggregations as above, but other times we want to store results to disk for future analyses. For that we can use methods like to_textfiles and json.dumps, or we can convert to Dask Dataframes and use their storage systems, which we’ll see more of in the next section.
[11]:
(b.filter(lambda record: record['age'] > 30) # Select records of interest
.map(json.dumps) # Convert Python objects to text
.to_textfiles('data/processed.*.json')) # Write to local disk
[11]:
['/home/runner/work/dask-examples/dask-examples/data/processed.0.json',
'/home/runner/work/dask-examples/dask-examples/data/processed.1.json',
'/home/runner/work/dask-examples/dask-examples/data/processed.2.json',
'/home/runner/work/dask-examples/dask-examples/data/processed.3.json',
'/home/runner/work/dask-examples/dask-examples/data/processed.4.json',
'/home/runner/work/dask-examples/dask-examples/data/processed.5.json',
'/home/runner/work/dask-examples/dask-examples/data/processed.6.json',
'/home/runner/work/dask-examples/dask-examples/data/processed.7.json',
'/home/runner/work/dask-examples/dask-examples/data/processed.8.json',
'/home/runner/work/dask-examples/dask-examples/data/processed.9.json']
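to_textfiles chooses compression from the file extension, so (to the best of our knowledge) writing gzipped output only requires changing the name pattern; the path below is just an illustration:
# Hedged sketch: the .gz extension asks to_textfiles to gzip each output file
(b.filter(lambda record: record['age'] > 30)
  .map(json.dumps)
  .to_textfiles('data/processed.*.json.gz'))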
Convert to Dask Dataframes
Dask Bags are good for reading in initial data, doing a bit of preprocessing, and then handing off to some other, more efficient form like Dask Dataframes. Dask Dataframes use Pandas internally, and so can be much faster on numeric data and also provide more complex algorithms.
However, Dask Dataframes expect data that is organized as flat columns. They do not support nested JSON data very well (Bag is better for this).
Here we make a function to flatten down our nested data structure, map that across our records, and then convert that to a Dask Dataframe.
[12]:
b.take(1)
[12]:
({'age': 17,
'name': ['Sang', 'Harrington'],
'occupation': 'Acoustic Engineer',
'telephone': '555.305.0242',
'address': {'address': '60 Ocean Trail', 'city': 'La Mesa'},
'credit-card': {'number': '5564 8609 9238 8995',
'expiration-date': '06/21'}},)
[13]:
def flatten(record):
return {
'age': record['age'],
'occupation': record['occupation'],
'telephone': record['telephone'],
'credit-card-number': record['credit-card']['number'],
'credit-card-expiration': record['credit-card']['expiration-date'],
'name': ' '.join(record['name']),
'street-address': record['address']['address'],
'city': record['address']['city']
}
b.map(flatten).take(1)
[13]:
({'age': 17,
'occupation': 'Acoustic Engineer',
'telephone': '555.305.0242',
'credit-card-number': '5564 8609 9238 8995',
'credit-card-expiration': '06/21',
'name': 'Sang Harrington',
'street-address': '60 Ocean Trail',
'city': 'La Mesa'},)
[14]:
df = b.map(flatten).to_dataframe()
df.head()
[14]:
| | age | occupation | telephone | credit-card-number | credit-card-expiration | name | street-address | city |
|---|---|---|---|---|---|---|---|---|
| 0 | 17 | Acoustic Engineer | 555.305.0242 | 5564 8609 9238 8995 | 06/21 | Sang Harrington | 60 Ocean Trail | La Mesa |
| 1 | 61 | Aerial Erector | 1-073-052-3379 | 5406 0743 6038 6720 | 01/16 | Rolando Calhoun | 6 Zoe Garden | Waukegan |
| 2 | 16 | Medical Practitioner | (767) 023-7986 | 5231 1287 4952 5357 | 01/20 | Hugh Mcclure | 417 Ward Viaduct | Plymouth |
| 3 | 51 | Ceiling Fixer | (994) 213-9475 | 3794 811438 69084 | 05/17 | Curt Garza | 1190 Mark Twain Plantation | Bettendorf |
| 4 | 49 | Aircraft Designer | +1-(225)-721-7908 | 4095 2197 6352 6674 | 06/22 | Ying Robles | 485 Shaw Alley | Montgomery |
We can now perform the same computation as before, this time using Pandas syntax on the Dask Dataframe.
[15]:
df[df.age > 30].occupation.value_counts().nlargest(10).compute()
[15]:
Miner 17
Gynaecologist 15
Health Visitor 14
Store Detective 14
Trinity House Pilot 14
Barmaid 14
Machine Minder 13
Tug Skipper 13
Steel Worker 12
Aircraft Maintenance Engineer 12
Name: occupation, dtype: int64
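Once the data lives in a Dask Dataframe we can also use its columnar storage formats. A sketch, assuming a Parquet engine such as pyarrow or fastparquet is installed, and using an illustrative output path:
# Hedged sketch: write the flattened records to Parquet for faster columnar access
df.to_parquet('data/people.parquet')

import dask.dataframe as dd
dd.read_parquet('data/people.parquet').head()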
Learn More
You may be interested in the following links:
dask tutorial, notebook 02, for a more in-depth introduction.