Score and Predict Large Datasets

Live Notebook

You can run this notebook in a live session or view it on Github.

Score and Predict Large Datasets¶

Sometimes you’ll train on a smaller dataset that fits in memory, but need to predict or score for a much larger (possibly larger than memory) dataset. Perhaps your learning curve has leveled off, or you only have labels for a subset of the data.

In this situation, you can use ParallelPostFit to parallelize and distribute the scoring or prediction steps.

[1]:

from dask.distributed import Client, progress

# Scale up: connect to your own cluster with bmore resources
# see http://dask.pydata.org/en/latest/setup.html
client = Client(processes=False, threads_per_worker=4,
                n_workers=1, memory_limit='2GB')
client

[1]:

Client

Client-7847b57a-0de1-11ed-a43b-000d3a8f7959

Connection method: Cluster object	Cluster type: distributed.LocalCluster
Dashboard: http://10.1.1.64:8787/status

Cluster Info

LocalCluster

0bbc581a

Dashboard: http://10.1.1.64:8787/status	Workers: 1
Total threads: 4	Total memory: 1.86 GiB
Status: running	Using processes: False

Scheduler Info

Scheduler

Scheduler-02b344ef-0a9b-4ca2-b806-8205e284cc50

Comm: inproc://10.1.1.64/9275/1	Workers: 1
Dashboard: http://10.1.1.64:8787/status	Total threads: 4
Started: Just now	Total memory: 1.86 GiB

Workers

Worker: 0

Comm: inproc://10.1.1.64/9275/4	Total threads: 4
Dashboard: http://10.1.1.64:39787/status	Memory: 1.86 GiB
Nanny: None
Local directory: /home/runner/work/dask-examples/dask-examples/machine-learning/dask-worker-space/worker-ocmx77q9

[2]:

import numpy as np
import dask.array as da
from sklearn.datasets import make_classification

We’ll generate a small random dataset with scikit-learn.

[3]:

X_train, y_train = make_classification(
    n_features=2, n_redundant=0, n_informative=2,
    random_state=1, n_clusters_per_class=1, n_samples=1000)
X_train[:5]

[3]:

array([[ 1.53682958, -1.39869399],
       [ 1.36917601, -0.63734411],
       [ 0.50231787, -0.45910529],
       [ 1.83319262, -1.29808229],
       [ 1.04235568,  1.12152929]])

And we’ll clone that dataset many times with dask.array. X_large and y_large represent our larger than memory dataset.

[4]:

# Scale up: increase N, the number of times we replicate the data.
N = 100
X_large = da.concatenate([da.from_array(X_train, chunks=X_train.shape)
                          for _ in range(N)])
y_large = da.concatenate([da.from_array(y_train, chunks=y_train.shape)
                          for _ in range(N)])
X_large

[4]:

	Array	Chunk
Bytes	1.53 MiB	15.62 kiB
Shape	(100000, 2)	(1000, 2)
Count	101 Tasks	100 Chunks
Type	float64	numpy.ndarray

Since our training dataset fits in memory, we can use a scikit-learn estimator as the actual estimator fit during traning. But we know that we’ll want to predict for a large dataset, so we’ll wrap the scikit-learn estimator with ParallelPostFit.

[5]:

from sklearn.linear_model import LogisticRegressionCV
from dask_ml.wrappers import ParallelPostFit

[6]:

clf = ParallelPostFit(LogisticRegressionCV(cv=3), scoring="r2")

See the note in the dask-ml’s documentation about when and why a scoring parameter is needed: https://ml.dask.org/modules/generated/dask_ml.wrappers.ParallelPostFit.html#dask_ml.wrappers.ParallelPostFit.

Now we’ll call clf.fit. Dask-ML does nothing here, so this step can only use datasets that fit in memory.

[7]:

clf.fit(X_train, y_train)

[7]:

ParallelPostFit(estimator=LogisticRegressionCV(cv=3), scoring='r2')

Now that training is done, we’ll turn to predicting for the full (larger than memory) dataset.

[8]:

y_pred = clf.predict(X_large)
y_pred

[8]:

	Array	Chunk
Bytes	781.25 kiB	7.81 kiB
Shape	(100000,)	(1000,)
Count	201 Tasks	100 Chunks
Type	int64	numpy.ndarray

y_pred is Dask arary. Workers can write the predicted values to a shared file system, without ever having to collect the data on a single machine.

Or we can check the models score on the entire large dataset. The computation will be done in parallel, and no single machine will have to hold all the data.

[9]:

clf.score(X_large, y_large)

[9]:

0.596

Scale Scikit-Learn for Small Data Problems

Batch Prediction with PyTorch

Dask Examples documentation