DataFrames: Reading in messy data

In the 01-data-access example we show how Dask Dataframes can read and store data in many of the same formats as Pandas dataframes. One key difference when using Dask Dataframes is that instead of opening a single file with a function like pandas.read_csv, we typically open many files at once with dask.dataframe.read_csv. This lets us treat a collection of files as a single dataset. Most of the time this works really well. But real data is messy, and in this notebook we explore a more advanced technique for bringing messy datasets into a dask dataframe.

Start Dask Client for Dashboard

Starting the Dask Client is optional. It provides a dashboard, which is useful for gaining insight into the computation.

The link to the dashboard will become visible when you create the client below. We recommend having it open on one side of your screen while using your notebook on the other side. Arranging your windows can take some effort, but seeing them both at the same time is very useful when learning.

[1]:
from dask.distributed import Client
client = Client(n_workers=1, threads_per_worker=4, processes=True, memory_limit='2GB')
client
[1]:
Client: Client-26d54db6-d520-11ec-a232-000d3aeabb7a
Connection method: Cluster object
Cluster type: distributed.LocalCluster
Dashboard: http://127.0.0.1:8787/status

Create artificial dataset

First we create an artificial dataset and write it to many CSV files.

You don’t need to understand this section; we’re just creating a dataset for the rest of the notebook.

[2]:
import dask
df = dask.datasets.timeseries()
df
[2]:
Dask DataFrame Structure:
                     id    name        x        y
npartitions=30
2000-01-01        int64  object  float64  float64
2000-01-02          ...     ...      ...      ...
...                 ...     ...      ...      ...
2000-01-30          ...     ...      ...      ...
2000-01-31          ...     ...      ...      ...
Dask Name: make-timeseries, 30 tasks
[3]:
import os
import datetime

if not os.path.exists('data'):
    os.mkdir('data')

def name(i):
    """ Provide date for filename given index

    Examples
    --------
    >>> name(0)
    '2000-01-01'
    >>> name(10)
    '2000-01-11'
    """
    return str(datetime.date(2000, 1, 1) + i * datetime.timedelta(days=1))

df.to_csv('data/*.csv', name_function=name, index=False);

Read CSV files

We now have many CSV files in our data directory, one for each day in the month of January 2000. Each CSV file holds timeseries data for that day. We can read all of them as one logical dataframe using the dd.read_csv function with a glob string.

[4]:
!ls data/*.csv | head
data/2000-01-01.csv
data/2000-01-02.csv
data/2000-01-03.csv
data/2000-01-04.csv
data/2000-01-05.csv
data/2000-01-06.csv
data/2000-01-07.csv
data/2000-01-08.csv
data/2000-01-09.csv
data/2000-01-10.csv
[5]:
import dask.dataframe as dd

df = dd.read_csv('data/2000-*-*.csv')
df
[5]:
Dask DataFrame Structure:
                     id    name        x        y
npartitions=30
                  int64  object  float64  float64
                    ...     ...      ...      ...
...                 ...     ...      ...      ...
                    ...     ...      ...      ...
                    ...     ...      ...      ...
Dask Name: read-csv, 30 tasks
[6]:
df.head()
[6]:
id name x y
0 1035 Norbert -0.757208 0.410627
1 1088 Hannah 0.655573 -0.618467
2 979 Dan 0.112189 0.363218
3 983 Edith 0.547462 -0.595247
4 1039 Wendy 0.710590 0.695703

Let’s look at some statistics on the data

[7]:
df.describe().compute()
[7]:
id x y
count 2.592000e+06 2.592000e+06 2.592000e+06
mean 9.999959e+02 -2.440908e-04 2.330697e-06
std 3.162768e+01 5.772636e-01 5.773129e-01
min 8.550000e+02 -9.999995e-01 -9.999996e-01
25% 9.790000e+02 -4.962286e-01 -4.949900e-01
50% 1.000000e+03 5.109524e-03 6.483245e-03
75% 1.021000e+03 5.067333e-01 5.048855e-01
max 1.160000e+03 9.999983e-01 9.999985e-01

Make some messy data

This works great, and in most cases dd.read_csv, dd.read_parquet, and similar functions are the preferred way to read large collections of data files into a dask dataframe. But real-world data is often messy, and some files may be broken or badly formatted. To simulate this, we create some messy data by tweaking our example CSV files: we replace the contents of data/2000-01-05.csv with nothing, leaving an empty file, and we remove the y column from data/2000-01-07.csv.

[8]:
# corrupt the data in data/2000-01-05.csv
with open('data/2000-01-05.csv', 'w') as f:
    f.write('')
[9]:
# remove y column from data/2000-01-07.csv
import pandas as pd
df = pd.read_csv('data/2000-01-07.csv')
del df['y']
df.to_csv('data/2000-01-07.csv', index=False)
[10]:
!head data/2000-01-05.csv
[11]:
!head data/2000-01-07.csv
id,name,x
1024,Sarah,0.1864121065059618
1086,Michael,-0.1573116298587789
1031,Kevin,0.7762392026848937
1058,Charlie,-0.1641263675170783
929,Jerry,0.6898603625500546
1002,Yvonne,0.3046855447254017
1043,Dan,-0.654128931819649
1024,Dan,-0.4308229218905972
1026,Patricia,-0.0112543302085683

Reading the messy data

Let’s try reading in the collection of files again

[12]:
df = dd.read_csv('data/2000-*-*.csv')
[13]:
df.head()
[13]:
id name x y
0 1035 Norbert -0.757208 0.410627
1 1088 Hannah 0.655573 -0.618467
2 979 Dan 0.112189 0.363218
3 983 Edith 0.547462 -0.595247
4 1039 Wendy 0.710590 0.695703

OK, this looks like it worked. Let us calculate the dataset statistics again.

[14]:
df.describe().compute()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [14], in <cell line: 1>()
----> 1 df.describe().compute()

File /usr/share/miniconda3/envs/dask-examples/lib/python3.9/site-packages/dask/base.py:292, in DaskMethodsMixin.compute(self, **kwargs)
    268 def compute(self, **kwargs):
    269     """Compute this dask collection
    270
    271     This turns a lazy Dask collection into its in-memory equivalent.
   (...)
    290     dask.base.compute
    291     """
--> 292     (result,) = compute(self, traverse=False, **kwargs)
    293     return result

File /usr/share/miniconda3/envs/dask-examples/lib/python3.9/site-packages/dask/base.py:575, in compute(traverse, optimize_graph, scheduler, get, *args, **kwargs)
    572     keys.append(x.__dask_keys__())
    573     postcomputes.append(x.__dask_postcompute__())
--> 575 results = schedule(dsk, keys, **kwargs)
    576 return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])

File /usr/share/miniconda3/envs/dask-examples/lib/python3.9/site-packages/distributed/client.py:3018, in Client.get(self, dsk, keys, workers, allow_other_workers, resources, sync, asynchronous, direct, retries, priority, fifo_timeout, actors, **kwargs)
   3016         should_rejoin = False
   3017 try:
-> 3018     results = self.gather(packed, asynchronous=asynchronous, direct=direct)
   3019 finally:
   3020     for f in futures.values():

File /usr/share/miniconda3/envs/dask-examples/lib/python3.9/site-packages/distributed/client.py:2171, in Client.gather(self, futures, errors, direct, asynchronous)
   2169 else:
   2170     local_worker = None
-> 2171 return self.sync(
   2172     self._gather,
   2173     futures,
   2174     errors=errors,
   2175     direct=direct,
   2176     local_worker=local_worker,
   2177     asynchronous=asynchronous,
   2178 )

File /usr/share/miniconda3/envs/dask-examples/lib/python3.9/site-packages/distributed/utils.py:309, in SyncMethodMixin.sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
    307     return future
    308 else:
--> 309     return sync(
    310         self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
    311     )

File /usr/share/miniconda3/envs/dask-examples/lib/python3.9/site-packages/distributed/utils.py:376, in sync(loop, func, callback_timeout, *args, **kwargs)
    374 if error:
    375     typ, exc, tb = error
--> 376     raise exc.with_traceback(tb)
    377 else:
    378     return result

File /usr/share/miniconda3/envs/dask-examples/lib/python3.9/site-packages/distributed/utils.py:349, in sync.<locals>.f()
    347         future = asyncio.wait_for(future, callback_timeout)
    348     future = asyncio.ensure_future(future)
--> 349     result = yield future
    350 except Exception:
    351     error = sys.exc_info()

File /usr/share/miniconda3/envs/dask-examples/lib/python3.9/site-packages/tornado/gen.py:762, in Runner.run(self)
    759 exc_info = None
    761 try:
--> 762     value = future.result()
    763 except Exception:
    764     exc_info = sys.exc_info()

File /usr/share/miniconda3/envs/dask-examples/lib/python3.9/site-packages/distributed/client.py:2034, in Client._gather(self, futures, errors, direct, local_worker)
   2032         exc = CancelledError(key)
   2033     else:
-> 2034         raise exception.with_traceback(traceback)
   2035     raise exc
   2036 if errors == "skip":

File /usr/share/miniconda3/envs/dask-examples/lib/python3.9/site-packages/dask/optimization.py:990, in __call__()
    988 if not len(args) == len(self.inkeys):
    989     raise ValueError("Expected %d args, got %d" % (len(self.inkeys), len(args)))
--> 990 return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))

File /usr/share/miniconda3/envs/dask-examples/lib/python3.9/site-packages/dask/core.py:149, in get()
    147 for key in toposort(dsk):
    148     task = dsk[key]
--> 149     result = _execute_task(task, cache)
    150     cache[key] = result
    151 result = _execute_task(out, cache)

File /usr/share/miniconda3/envs/dask-examples/lib/python3.9/site-packages/dask/core.py:119, in _execute_task()
    115     func, args = arg[0], arg[1:]
    116     # Note: Don't assign the subtask results to a variable. numpy detects
    117     # temporaries by their reference count and can execute certain
    118     # operations in-place.
--> 119     return func(*(_execute_task(a, cache) for a in args))
    120 elif not ishashable(arg):
    121     return arg

File /usr/share/miniconda3/envs/dask-examples/lib/python3.9/site-packages/dask/utils.py:39, in apply()
     37 def apply(func, args, kwargs=None):
     38     if kwargs:
---> 39         return func(*args, **kwargs)
     40     else:
     41         return func(*args)

File /usr/share/miniconda3/envs/dask-examples/lib/python3.9/site-packages/dask/dataframe/core.py:6355, in apply_and_enforce()
   6353     return meta
   6354 if is_dataframe_like(df):
-> 6355     check_matching_columns(meta, df)
   6356     c = meta.columns
   6357 else:

File /usr/share/miniconda3/envs/dask-examples/lib/python3.9/site-packages/dask/dataframe/utils.py:415, in check_matching_columns()
    413 else:
    414     extra_info = "Order of columns does not match"
--> 415 raise ValueError(
    416     "The columns in the computed data do not match"
    417     " the columns in the provided metadata\n"
    418     f"{extra_info}"
    419 )

ValueError: The columns in the computed data do not match the columns in the provided metadata
  Extra:   []
  Missing: ['y']

So what happened?

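When creating a dask dataframe from a collection of files, dd.read_csv samples only the beginning of the dataset (by default, the start of the first file) to determine the columns and datatypes. Since it has not opened all the files, it does not know whether some of them are corrupt. Hence df.head() works, since it only looks at the first file, while df.describe().compute() fails because it has to read every file, including the bad ones. The traceback above reports a missing y column, which points to data/2000-01-07.csv; the empty file data/2000-01-05.csv is a second bad file that any robust reader also has to handle.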
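Before writing a custom reader, it can help to find out which files are actually bad. The following is a minimal sketch, assuming pandas only; `find_bad_files` is a hypothetical helper written for this illustration (it is not part of pandas or Dask), and the small files it checks are stand-ins created in a temporary directory rather than the notebook's data.

```python
import os
import tempfile

import pandas as pd


def find_bad_files(paths, expected_columns):
    """Map each problem file to a short description of what is wrong.

    Hypothetical helper for illustration; not part of pandas or Dask.
    """
    problems = {}
    for path in paths:
        try:
            # nrows=0 parses only the header row, so the check stays cheap.
            header = pd.read_csv(path, nrows=0)
        except pd.errors.EmptyDataError:
            problems[path] = "empty file"
            continue
        missing = [col for col in expected_columns if col not in header.columns]
        if missing:
            problems[path] = "missing columns: " + ", ".join(missing)
    return problems


# Small stand-in files written to a temporary directory for the demonstration.
tmpdir = tempfile.mkdtemp()
good = os.path.join(tmpdir, "good.csv")
empty = os.path.join(tmpdir, "empty.csv")
no_y = os.path.join(tmpdir, "no_y.csv")

pd.DataFrame({"id": [1], "name": ["a"], "x": [0.1], "y": [0.2]}).to_csv(good, index=False)
open(empty, "w").close()
pd.DataFrame({"id": [1], "name": ["a"], "x": [0.1]}).to_csv(no_y, index=False)

problems = find_bad_files([good, empty, no_y], ["id", "name", "x", "y"])
for path, reason in problems.items():
    print(os.path.basename(path), "->", reason)
```

In this notebook the same check could be run over glob('data/2000-*-*.csv') with the columns id, name, x and y.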

Building a delayed reader

To get around this problem we are going to use a more advanced technique to build our dask dataframe. This method can also be used any time custom logic is required when reading each file. Essentially, we build a function that uses pandas plus some error checking and always returns a pandas dataframe. If we find a bad data file, we either fix or clean the data, or we return an empty pandas dataframe with the same structure as the good data.

[15]:
import numpy as np
import io

def read_data(filename):

    # for this to work we need to explicitly set the datatypes of our pandas dataframe
    dtypes = {'id': int, 'name': str, 'x': float, 'y': float}
    try:
        # try reading in the data with pandas
        df = pd.read_csv(filename, dtype=dtypes)
    except Exception:
        # if this fails create an empty pandas dataframe with the same dtypes as the good data
        df = pd.read_csv(io.StringIO(''), names=dtypes.keys(), dtype=dtypes)

    # for the case with the missing column, add a column of data with NaN's
    if 'y' not in df.columns:
        df['y'] = np.nan

    return df

Let’s test this function on a good file and the two bad files

[16]:
# test function on a normal file
read_data('data/2000-01-01.csv').head()
[16]:
id name x y
0 1035 Norbert -0.757208 0.410627
1 1088 Hannah 0.655573 -0.618467
2 979 Dan 0.112189 0.363218
3 983 Edith 0.547462 -0.595247
4 1039 Wendy 0.710590 0.695703
[17]:
# test function on the empty file
read_data('data/2000-01-05.csv').head()
[17]:
id name x y
[18]:
# test function on the file missing the y column
read_data('data/2000-01-07.csv').head()
[18]:
id name x y
0 1024 Sarah 0.186412 NaN
1 1086 Michael -0.157312 NaN
2 1031 Kevin 0.776239 NaN
3 1058 Charlie -0.164126 NaN
4 929 Jerry 0.689860 NaN

Assembling the dask dataframe

First we take our read_data function and convert it to a dask delayed function

[19]:
from dask import delayed
read_data = delayed(read_data)

Let us look at what the function does now

[20]:
df = read_data('data/2000-01-01.csv')
df
[20]:
Delayed('read_data-5f0eccbc-3bde-4275-8ba0-2649154b0bb7')

It creates a delayed object. To actually read the file we need to call .compute()

[21]:
df.compute()
[21]:
id name x y
0 1035 Norbert -0.757208 0.410627
1 1088 Hannah 0.655573 -0.618467
2 979 Dan 0.112189 0.363218
3 983 Edith 0.547462 -0.595247
4 1039 Wendy 0.710590 0.695703
... ... ... ... ...
86395 1021 Kevin 0.125742 -0.711153
86396 965 Zelda -0.624159 0.855033
86397 1042 Frank 0.738092 0.464486
86398 1067 Oliver -0.325376 0.768031
86399 1003 Charlie -0.755295 -0.279816

86400 rows × 4 columns
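To see that a delayed call really is deferred, here is a small sketch that is independent of the CSV files. The function `load` and the list `calls` are made up for this illustration; the list simply records when the function body actually runs.

```python
from dask import delayed

calls = []  # records each time the function body actually runs


def load(label):
    """Stand-in for an expensive reader; illustrative only."""
    calls.append(label)
    return label.upper()


lazy = delayed(load)("part-a")   # builds a task, does not call load yet
calls_before = list(calls)

result = lazy.compute()          # now load is executed
calls_after = list(calls)

print(calls_before, result, calls_after)
```

Nothing is recorded in `calls` until `.compute()` is called, which is the same behaviour that lets Dask postpone reading each CSV file until the data is needed.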

Now let’s build a list of all the available csv files

[22]:
# list all the csv files that match the glob pattern
from glob import glob
files = glob('data/2000-*-*.csv')
files
[22]:
['data/2000-01-26.csv',
 'data/2000-01-09.csv',
 'data/2000-01-01.csv',
 'data/2000-01-11.csv',
 'data/2000-01-02.csv',
 'data/2000-01-22.csv',
 'data/2000-01-08.csv',
 'data/2000-01-07.csv',
 'data/2000-01-03.csv',
 'data/2000-01-30.csv',
 'data/2000-01-29.csv',
 'data/2000-01-12.csv',
 'data/2000-01-19.csv',
 'data/2000-01-20.csv',
 'data/2000-01-23.csv',
 'data/2000-01-04.csv',
 'data/2000-01-13.csv',
 'data/2000-01-06.csv',
 'data/2000-01-21.csv',
 'data/2000-01-10.csv',
 'data/2000-01-17.csv',
 'data/2000-01-14.csv',
 'data/2000-01-05.csv',
 'data/2000-01-16.csv',
 'data/2000-01-28.csv',
 'data/2000-01-25.csv',
 'data/2000-01-27.csv',
 'data/2000-01-18.csv',
 'data/2000-01-15.csv',
 'data/2000-01-24.csv']
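glob returns paths in arbitrary order, which is why the list above is not chronological. If the partitions should follow the dates, one option is to sort the list: because the file names are zero-padded ISO dates, lexicographic order matches chronological order. A minimal sketch, using empty stand-in files in a temporary directory rather than the notebook's data:

```python
import os
import tempfile
from glob import glob

# Create a few empty stand-in files named like the notebook's CSV files.
tmpdir = tempfile.mkdtemp()
for day in ["2000-01-10", "2000-01-02", "2000-01-01"]:
    open(os.path.join(tmpdir, day + ".csv"), "w").close()

# sorted() gives a deterministic, chronological order for ISO-dated names.
files = sorted(glob(os.path.join(tmpdir, "2000-*-*.csv")))
names = [os.path.basename(f) for f in files]
print(names)
```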

Now we run the delayed read_data function on each file in the list

[23]:
df = [read_data(file) for file in files]
df
[23]:
[Delayed('read_data-0f117b8f-3bf5-4a88-bb2e-7cf633df2d1f'),
 Delayed('read_data-046ac206-4de3-4777-8167-d0bc0b4979d0'),
 Delayed('read_data-2aa9072d-e97f-442f-8b9c-cdc2266e8ae3'),
 Delayed('read_data-ccf646ff-5709-4edc-aa4c-725c70181082'),
 Delayed('read_data-411c2f2c-e985-4b50-b52b-a41fcb8419f0'),
 Delayed('read_data-dbd0a7a9-fc3b-4848-8467-d7da73a4a585'),
 Delayed('read_data-f802b6f6-c99e-41d9-b011-7f7f76141341'),
 Delayed('read_data-fd3751d6-9add-4b27-a741-785b26eb90a9'),
 Delayed('read_data-0b3ed986-6857-49c2-abeb-223af0f8c1b1'),
 Delayed('read_data-3edf4bea-e554-47bd-9345-720c73bd1779'),
 Delayed('read_data-b2356f93-2073-40d6-90ae-3adc4643103f'),
 Delayed('read_data-25587a27-877a-4095-99ed-49c51cda9e8c'),
 Delayed('read_data-a493b21a-05d0-4fb1-bd10-fe9af7701445'),
 Delayed('read_data-03f39d68-650e-4bb2-8bcc-ab1d80351429'),
 Delayed('read_data-243e92f4-aabb-4598-ae28-dc85c4742202'),
 Delayed('read_data-572862e0-e437-4c83-bd01-9feaaabd3c50'),
 Delayed('read_data-f59e5f89-aa36-4da6-9c83-63749d7aa21c'),
 Delayed('read_data-29fb0fcc-80f4-4f06-8d34-fb8797494d99'),
 Delayed('read_data-2f38c117-6072-4b19-a46a-fbfc44c39e73'),
 Delayed('read_data-cbab1938-7f71-41fe-adb3-9674cf0edc2a'),
 Delayed('read_data-0abf24ff-3f78-44fd-ae07-214523714beb'),
 Delayed('read_data-923234e2-d058-4f37-9e5c-d61e93168585'),
 Delayed('read_data-e0bbcca9-9268-43b9-af80-64cf333119e4'),
 Delayed('read_data-c455609b-c62e-4953-a158-28f9e00ef5e2'),
 Delayed('read_data-35304cc6-6f22-4ab4-883b-c21ddb2b6f12'),
 Delayed('read_data-29e56dee-a38f-41e9-ad86-f4357b8015c5'),
 Delayed('read_data-c2341016-7ed3-41d5-bcbd-1ff397c9bf2d'),
 Delayed('read_data-54edd786-0f28-42f5-83f1-e09419608a35'),
 Delayed('read_data-67004d43-1987-48b9-b121-815dc4953d43'),
 Delayed('read_data-2e368572-51e3-488a-8b5d-54b852123414')]

Then we use dask.dataframe.from_delayed. This function creates a Dask DataFrame from a list of delayed objects as long as each delayed object returns a pandas dataframe. The structure of each individual dataframe returned must also be the same.

[24]:
df = dd.from_delayed(df, meta={'id': int, 'name': str, 'x': float, 'y': float})
df
[24]:
Dask DataFrame Structure:
                     id    name        x        y
npartitions=30
                  int64  object  float64  float64
                    ...     ...      ...      ...
...                 ...     ...      ...      ...
                    ...     ...      ...      ...
                    ...     ...      ...      ...
Dask Name: from-delayed, 60 tasks

Note: we provided the dtypes in the meta keyword to tell Dask Dataframe explicitly what kind of dataframe to expect. If we did not do this, Dask would infer it from the first delayed object, which could be slow if that object reads a large CSV file.
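Besides a dict of column names and dtypes, the meta argument of dd.from_delayed also accepts an empty pandas dataframe with the expected columns and dtypes. A minimal sketch of building one with pandas follows; the variable names `dtypes` and `meta` are illustrative.

```python
import pandas as pd

# Expected schema, written with explicit pandas dtype names.
dtypes = {"id": "int64", "name": "object", "x": "float64", "y": "float64"}

# An empty dataframe that carries only column names and dtypes.
meta = pd.DataFrame({col: pd.Series(dtype=dt) for col, dt in dtypes.items()})
print(meta.dtypes)
```

Such a frame could then be passed as meta=meta in place of the dict, and the same object can double as the fallback value a reader returns for an unreadable file.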

Now let’s see if this works

[25]:
df.head()
[25]:
id name x y
0 968 Wendy 0.427086 -0.267555
1 1011 Zelda -0.944234 -0.925922
2 1009 Ray -0.499415 0.092161
3 998 Dan 0.243237 0.754739
4 982 Xavier 0.300957 0.548789
[26]:
df.describe().compute()
[26]:
id x y
count 2.505600e+06 2.505600e+06 2.419200e+06
mean 1.000001e+03 -2.724378e-04 -3.678182e-06
std 3.162445e+01 5.772779e-01 5.772706e-01
min 8.550000e+02 -9.999995e-01 -9.999996e-01
25% 9.790000e+02 -4.962286e-01 -4.949900e-01
50% 1.000000e+03 5.109524e-03 6.483245e-03
75% 1.021000e+03 5.067333e-01 5.048855e-01
max 1.160000e+03 9.999983e-01 9.999985e-01

Success!

To recap, in this example, we looked at an approach to create a Dask Dataframe from a collection of many data files. Typically you would use built-in functions like dd.read_csv or dd.read_parquet to do this. Sometimes, this is not possible because of messy/corrupted files in your dataset or some custom processing that might need to be done.

In these cases, you can build a Dask DataFrame with the following steps.

  1. Create a regular Python function that reads the data, performs any transformations and error checking, and always returns a Pandas dataframe with the same structure

  2. Convert this read function to a delayed function using dask.delayed

  3. Call the delayed data reader on each file in your dataset and collect the outputs as a list of delayed objects

  4. Use dd.from_delayed to convert the list of delayed objects to a Dask Dataframe

This same technique can be used in other situations as well. Another example might be data files that require using a specialized reader, or several transformations before they can be converted to a pandas dataframe.