Dask Bags

Dask Bag implements operations like map, filter, groupby, and aggregations on collections of Python objects. It runs these operations in parallel and with a small memory footprint by using Python iterators. It is similar to a parallel version of itertools or a Pythonic version of the PySpark RDD.

Dask Bags are often used to do simple preprocessing on log files, JSON records, or other user-defined Python objects.

Full API documentation is available here: http://docs.dask.org/en/latest/bag-api.html

Start Dask Client for Dashboard

Starting the Dask Client is optional. It provides a dashboard that is useful for gaining insight into the computation.

The link to the dashboard will become visible when you create the client below. We recommend having it open on one side of your screen while using your notebook on the other side. It can take some effort to arrange your windows, but seeing them both at the same time is very useful when learning.

[1]:
from dask.distributed import Client, progress
client = Client(n_workers=4, threads_per_worker=1)
client
[1]:

Client

Cluster

  • Workers: 4
  • Cores: 4
  • Memory: 7.84 GB

Create Random Data

We create a random set of record data and store it to disk as many JSON files. This will serve as our data for this notebook.

[2]:
import dask
import json
import os

os.makedirs('data', exist_ok=True)              # Create data/ directory

b = dask.datasets.make_people()                 # Make records of people
b.map(json.dumps).to_textfiles('data/*.json')   # Encode as JSON, write to disk
[2]:
['/home/travis/build/dask/dask-examples/data/0.json',
 '/home/travis/build/dask/dask-examples/data/1.json',
 '/home/travis/build/dask/dask-examples/data/2.json',
 '/home/travis/build/dask/dask-examples/data/3.json',
 '/home/travis/build/dask/dask-examples/data/4.json',
 '/home/travis/build/dask/dask-examples/data/5.json',
 '/home/travis/build/dask/dask-examples/data/6.json',
 '/home/travis/build/dask/dask-examples/data/7.json',
 '/home/travis/build/dask/dask-examples/data/8.json',
 '/home/travis/build/dask/dask-examples/data/9.json']

Read JSON data

Now that we have some JSON data in files, let's take a look at it with Dask Bag and the Python json module.

[3]:
!head -n 2 data/0.json
{"age": 52, "name": ["Nathan", "Mcmahon"], "occupation": "Cafe Worker", "telephone": "(225) 243-0732", "address": {"address": "1178 Patten Sideline", "city": "Peekskill"}, "credit-card": {"number": "3769 115384 38304", "expiration-date": "08/20"}}
{"age": 45, "name": ["Kelley", "Montoya"], "occupation": "Pathologist", "telephone": "1-049-469-0593", "address": {"address": "160 Congdon Park", "city": "East Point"}, "credit-card": {"number": "4410 0554 0759 5958", "expiration-date": "04/22"}}
[4]:
import dask.bag as db
import json

b = db.read_text('data/*.json').map(json.loads)
b
[4]:
dask.bag<loads-d..., npartitions=10>
[5]:
b.take(2)
[5]:
({'age': 52,
  'name': ['Nathan', 'Mcmahon'],
  'occupation': 'Cafe Worker',
  'telephone': '(225) 243-0732',
  'address': {'address': '1178 Patten Sideline', 'city': 'Peekskill'},
  'credit-card': {'number': '3769 115384 38304', 'expiration-date': '08/20'}},
 {'age': 45,
  'name': ['Kelley', 'Montoya'],
  'occupation': 'Pathologist',
  'telephone': '1-049-469-0593',
  'address': {'address': '160 Congdon Park', 'city': 'East Point'},
  'credit-card': {'number': '4410 0554 0759 5958',
   'expiration-date': '04/22'}})
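
The read step above works because each line of these files holds one complete JSON document, so decoding line by line recovers the records. A minimal standard-library sketch of that round trip follows. The two shortened records and the temporary file are assumptions made for illustration; the sketch does not use Dask.

```python
import json
import os
import tempfile

# Shortened, hypothetical records shaped like the ones shown above.
records = [
    {"age": 52, "name": ["Nathan", "Mcmahon"], "occupation": "Cafe Worker"},
    {"age": 45, "name": ["Kelley", "Montoya"], "occupation": "Pathologist"},
]

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "0.json")

    # Write one JSON document per line.
    with open(path, "w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

    # Read the file back, decoding each line independently.
    with open(path) as f:
        decoded = [json.loads(line) for line in f]

print(decoded == records)
```

Because every line can be decoded on its own, the files can be split into partitions and parsed independently.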

Map, Filter, Aggregate

We can process this data by filtering to keep only the records of interest, mapping functions over those records to transform them, and aggregating the results to a total value.

[6]:
b.filter(lambda record: record['age'] > 30).take(2)  # Select only people over 30
[6]:
({'age': 52,
  'name': ['Nathan', 'Mcmahon'],
  'occupation': 'Cafe Worker',
  'telephone': '(225) 243-0732',
  'address': {'address': '1178 Patten Sideline', 'city': 'Peekskill'},
  'credit-card': {'number': '3769 115384 38304', 'expiration-date': '08/20'}},
 {'age': 45,
  'name': ['Kelley', 'Montoya'],
  'occupation': 'Pathologist',
  'telephone': '1-049-469-0593',
  'address': {'address': '160 Congdon Park', 'city': 'East Point'},
  'credit-card': {'number': '4410 0554 0759 5958',
   'expiration-date': '04/22'}})
[7]:
b.map(lambda record: record['occupation']).take(2)  # Select the occupation field
[7]:
('Cafe Worker', 'Pathologist')
[8]:
b.count().compute()  # Count total number of records
[8]:
10000
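
For intuition, the three operations above correspond to plain Python iterator idioms. The sketch below is a sequential, in-memory analogue on a small hypothetical list of records. It does not use Dask and is not a description of how Bag is implemented, only of what each operation computes.

```python
# Hypothetical records for illustration only.
records = [
    {"age": 52, "occupation": "Cafe Worker"},
    {"age": 45, "occupation": "Pathologist"},
    {"age": 21, "occupation": "Forwarding Agent"},
]

# filter: keep the records for which the predicate is true (lazy generator).
over_30 = (r for r in records if r["age"] > 30)

# map: apply a function to every remaining record (still lazy).
occupations = (r["occupation"] for r in over_30)

# Consuming the generator is what actually runs the work.
selected = list(occupations)

# count: an aggregation that reduces the whole collection to one number.
total = sum(1 for _ in records)

print(selected)  # ['Cafe Worker', 'Pathologist']
print(total)     # 3
```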

Chain computations

It is common to do many of these steps in one pipeline, only calling compute or take at the end.

[9]:
result = (b.filter(lambda record: record['age'] > 30)
           .map(lambda record: record['occupation'])
           .frequencies(sort=True)
           .topk(10, key=1))
result
[9]:
dask.bag<topk-ag..., npartitions=1>
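
In this pipeline, frequencies produces (item, count) pairs, and topk(10, key=1) keeps the ten pairs whose element at index 1, the count, is largest. The sketch below shows the same idea sequentially with collections.Counter and heapq.nlargest on hypothetical data. It is an analogue for intuition, not Dask's implementation.

```python
from collections import Counter
import heapq

# Hypothetical occupations, standing in for the mapped records.
occupations = ["Astronomer", "Librarian", "Astronomer", "Pathologist",
               "Librarian", "Astronomer"]

# Analogue of .frequencies(): (item, count) pairs.
pairs = list(Counter(occupations).items())

# Analogue of .topk(2, key=1): the two pairs with the largest count,
# where the count is the element at index 1 of each pair.
top = heapq.nlargest(2, pairs, key=lambda pair: pair[1])

print(top)  # [('Astronomer', 3), ('Librarian', 2)]
```

When several items share the same count, it is safest not to rely on their relative order or on which of them lands at the cutoff.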

As with all lazy Dask collections, we need to call compute to actually evaluate our result. The take method used in earlier examples also triggers computation, but by default it only processes the first partition.

[10]:
result.compute()
[10]:
[('Stock Controller', 14),
 ('Cafe Staff', 14),
 ('Stockbroker', 14),
 ('Industrial Consultant', 14),
 ('Tax Consultant', 14),
 ('Book-Keeper', 13),
 ('Circus Worker', 13),
 ('Astronomer', 13),
 ('Glass Worker', 13),
 ('School Inspector', 13)]
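
The laziness described above can be seen on a tiny in-memory bag. This is a minimal sketch, assuming a handful of made-up records passed to db.from_sequence. The synchronous scheduler is chosen only to keep the example self-contained and is not what the notebook's Client uses.

```python
import dask
import dask.bag as db

# Hypothetical records for illustration only.
records = [
    {"age": 52, "occupation": "Cafe Worker"},
    {"age": 45, "occupation": "Pathologist"},
    {"age": 21, "occupation": "Forwarding Agent"},
    {"age": 40, "occupation": "Pathologist"},
]
bag = db.from_sequence(records, npartitions=2)

# Building the pipeline does no work yet; `lazy` is still a Bag.
lazy = (bag.filter(lambda record: record["age"] > 30)
           .map(lambda record: record["occupation"])
           .frequencies(sort=True))
print(type(lazy).__name__)

# Run everything in the current thread (an assumption made for simplicity).
with dask.config.set(scheduler="synchronous"):
    counts = lazy.compute()  # evaluates the whole pipeline
    first = bag.take(1)      # evaluates only the first partition

print(counts)
print(first)
```

Here compute returns a plain Python list of (occupation, count) pairs, while take returns a tuple holding the requested number of elements.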

Transform and Store

Sometimes we want to compute aggregations as above, but sometimes we want to store results to disk for future analyses. For that we can encode records with json.dumps and write them with methods like to_textfiles, or we can convert to Dask Dataframes and use their storage systems, which we’ll see more of in the next section.

[11]:
(b.filter(lambda record: record['age'] > 30)  # Select records of interest
  .map(json.dumps)                            # Convert Python objects to text
  .to_textfiles('data/processed.*.json'))     # Write to local disk
[11]:
['/home/travis/build/dask/dask-examples/data/processed.0.json',
 '/home/travis/build/dask/dask-examples/data/processed.1.json',
 '/home/travis/build/dask/dask-examples/data/processed.2.json',
 '/home/travis/build/dask/dask-examples/data/processed.3.json',
 '/home/travis/build/dask/dask-examples/data/processed.4.json',
 '/home/travis/build/dask/dask-examples/data/processed.5.json',
 '/home/travis/build/dask/dask-examples/data/processed.6.json',
 '/home/travis/build/dask/dask-examples/data/processed.7.json',
 '/home/travis/build/dask/dask-examples/data/processed.8.json',
 '/home/travis/build/dask/dask-examples/data/processed.9.json']
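
The output above shows the * in the path being replaced by the partition number, giving one file per partition. A minimal sketch of that naming, assuming six made-up records split into three partitions and written to a temporary directory. The synchronous scheduler is used only to keep the example simple.

```python
import json
import os
import tempfile

import dask
import dask.bag as db

# Hypothetical records: ages 20, 25, 30, 35, 40, 45.
records = [{"id": i, "age": 20 + 5 * i} for i in range(6)]
bag = db.from_sequence(records, npartitions=3)

with tempfile.TemporaryDirectory() as tmpdir:
    pattern = os.path.join(tmpdir, "processed.*.json")

    with dask.config.set(scheduler="synchronous"):
        (bag.filter(lambda record: record["age"] > 30)  # Keep ids 3, 4, 5
            .map(json.dumps)                            # Encode as text
            .to_textfiles(pattern))                     # One file per partition

    # A partition that is empty after filtering still gets a file.
    names = sorted(os.listdir(tmpdir))

    # Read the files back, skipping blank lines.
    restored = []
    for name in names:
        with open(os.path.join(tmpdir, name)) as f:
            restored.extend(json.loads(line) for line in f if line.strip())

print(names)
print([record["id"] for record in restored])
```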

Convert to Dask Dataframes

Dask Bags are good for reading in initial data, doing a bit of pre-processing, and then handing off to some other more efficient form like Dask Dataframes. Dask Dataframes use Pandas internally, and so can be much faster on numeric data and also have more complex algorithms.

However, Dask Dataframes expect data that is organized as flat columns. They do not handle nested JSON data very well (Bag is better for this).

Here we make a function to flatten down our nested data structure, map that across our records, and then convert that to a Dask Dataframe.

[12]:
b.take(1)
[12]:
({'age': 52,
  'name': ['Nathan', 'Mcmahon'],
  'occupation': 'Cafe Worker',
  'telephone': '(225) 243-0732',
  'address': {'address': '1178 Patten Sideline', 'city': 'Peekskill'},
  'credit-card': {'number': '3769 115384 38304', 'expiration-date': '08/20'}},)
[13]:
def flatten(record):
    return {
        'age': record['age'],
        'occupation': record['occupation'],
        'telephone': record['telephone'],
        'credit-card-number': record['credit-card']['number'],
        'credit-card-expiration': record['credit-card']['expiration-date'],
        'name': ' '.join(record['name']),
        'street-address': record['address']['address'],
        'city': record['address']['city']
    }

b.map(flatten).take(1)
[13]:
({'age': 52,
  'occupation': 'Cafe Worker',
  'telephone': '(225) 243-0732',
  'credit-card-number': '3769 115384 38304',
  'credit-card-expiration': '08/20',
  'name': 'Nathan Mcmahon',
  'street-address': '1178 Patten Sideline',
  'city': 'Peekskill'},)
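
The flatten function above names every field by hand, which is explicit and gives full control over column names. For records with many nested fields, a generic helper is one alternative. The sketch below is hypothetical: the name flatten_nested and the "-" separator are assumptions, and it produces different column names from the hand-written version. It recursively flattens nested dictionaries and joins list items into one string.

```python
def flatten_nested(record, parent_key="", sep="-"):
    """Flatten nested dicts into one flat dict with joined key names."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested dictionaries, prefixing their keys.
            flat.update(flatten_nested(value, full_key, sep))
        elif isinstance(value, list):
            # Join list items (e.g. first and last name) into one string.
            flat[full_key] = " ".join(str(item) for item in value)
        else:
            flat[full_key] = value
    return flat


# A shortened copy of the first record shown above.
record = {
    "age": 52,
    "name": ["Nathan", "Mcmahon"],
    "address": {"address": "1178 Patten Sideline", "city": "Peekskill"},
    "credit-card": {"number": "3769 115384 38304", "expiration-date": "08/20"},
}
flat = flatten_nested(record)
print(flat)
```

Such a helper could be mapped over a bag in the same way as flatten, at the cost of less readable column names such as address-address.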
[14]:
df = b.map(flatten).to_dataframe()
df.head()
[14]:
   age  occupation        telephone       credit-card-number   credit-card-expiration  name            street-address           city
0  52   Cafe Worker       (225) 243-0732  3769 115384 38304    08/20                   Nathan Mcmahon  1178 Patten Sideline     Peekskill
1  45   Pathologist       1-049-469-0593  4410 0554 0759 5958  04/22                   Kelley Montoya  160 Congdon Park         East Point
2  40   HGV Mechanic      1-238-822-9770  2642 7871 3815 5686  12/18                   Theresia Knapp  1143 John Maher Heights  Muskego
3  21   Forwarding Agent  1-749-291-8556  3720 245435 40371    05/19                   Garth Kelly     1368 Halibut Brae        Mason
4  43   Librarian         (474) 103-0185  2322 7372 5637 7304  03/23                   Wade Fowler     611 Mcnair Road          Atlanta

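We can now perform the same computation as before, this time using Pandas and Dask Dataframe. Because many occupations are tied at 13, the entries at the cutoff and their order can differ from the Bag result.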

[15]:
df[df.age > 30].occupation.value_counts().nlargest(10).compute()
[15]:
Cafe Staff               14
Tax Consultant           14
Industrial Consultant    14
Stock Controller         14
Stockbroker              14
Glass Worker             13
Pool Attendant           13
Astronomer               13
Circus Worker            13
Stone Cutter             13
Name: occupation, dtype: int64

Learn More

You may be interested in the following links: