Incrementally Train Large Datasets

We can train models on large datasets one batch at a time. Many Scikit-Learn estimators implement a partial_fit method to enable incremental learning in batches.

est = SGDClassifier(...)
est.partial_fit(X_train_1, y_train_1)
est.partial_fit(X_train_2, y_train_2)
...

The Scikit-Learn documentation discusses this approach in more depth in their user guide.
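As a runnable version of the pattern above, here is a minimal sketch that uses only NumPy and Scikit-Learn. The synthetic data, the batch count, and the variable names are assumptions made for illustration; in practice each batch would be loaded from disk or another source.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Small synthetic dataset standing in for data too large to load at once.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
classes = np.unique(y)  # partial_fit needs the full label set on its first call

est = SGDClassifier(random_state=0)
n_batches = 5  # illustrative choice
for X_batch, y_batch in zip(np.array_split(X, n_batches),
                            np.array_split(y, n_batches)):
    # Each call updates the same model with one more batch.
    est.partial_fit(X_batch, y_batch, classes=classes)

train_accuracy = est.score(X, y)
```

Each call to partial_fit continues from the coefficients left by the previous call, so the model sees the whole dataset without ever holding more than one batch of it at a time.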

This notebook demonstrates the use of Dask-ML’s Incremental meta-estimator, which automates the use of Scikit-Learn’s partial_fit over Dask arrays and dataframes. Scikit-Learn handles all of the computation while Dask handles the data management, loading and moving batches of data as necessary. This allows scaling to large datasets distributed across many machines, or to datasets that do not fit in memory, all with a familiar workflow.

This example shows …

wrapping a Scikit-Learn estimator that implements partial_fit with the Dask-ML Incremental meta-estimator
training, predicting, and scoring on this wrapped estimator

Although this example uses Scikit-Learn’s SGDClassifier, the Incremental meta-estimator will work for any class that implements partial_fit and the Scikit-Learn base estimator API.
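To make "implements partial_fit and the base estimator API" concrete, the sketch below defines a toy classifier that keeps a running mean per class. The class RunningCentroidClassifier is hypothetical and exists only for illustration; it is not part of Scikit-Learn or Dask-ML, and wrapping it with Incremental is not shown here.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin


class RunningCentroidClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical toy estimator: nearest class centroid, updated per batch."""

    def partial_fit(self, X, y, classes=None):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        if not hasattr(self, "classes_"):
            if classes is None:
                raise ValueError("classes must be passed on the first call")
            self.classes_ = np.asarray(classes)
            self.sums_ = np.zeros((len(self.classes_), X.shape[1]))
            self.counts_ = np.zeros(len(self.classes_))
        # Accumulate per-class feature sums and sample counts.
        for i, c in enumerate(self.classes_):
            mask = y == c
            self.sums_[i] += X[mask].sum(axis=0)
            self.counts_[i] += mask.sum()
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        centroids = self.sums_ / np.maximum(self.counts_, 1)[:, None]
        # Squared distance from every sample to every centroid.
        dists = ((X[:, None, :] - centroids[None]) ** 2).sum(axis=2)
        return self.classes_[dists.argmin(axis=1)]


clf = RunningCentroidClassifier()
clf.partial_fit(np.array([[0.0, 0.0], [5.0, 5.0]]), np.array([0, 1]), classes=[0, 1])
clf.partial_fit(np.array([[1.0, 1.0], [6.0, 6.0]]), np.array([0, 1]))
pred = clf.predict(np.array([[0.2, 0.2], [5.9, 5.9]]))
```

Inheriting from BaseEstimator supplies get_params and set_params, and ClassifierMixin supplies a default score method, which together cover the base estimator API that the text refers to.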

Setup Dask¶

We first start a Dask client in order to get access to the Dask dashboard, which will provide progress and performance metrics.

You can view the dashboard by clicking on the dashboard link after you run the cell.

[1]:

from dask.distributed import Client
client = Client(n_workers=4, threads_per_worker=1)
client

[1]:

Client
Connection method: Cluster object
Cluster type: distributed.LocalCluster
Dashboard: http://127.0.0.1:8787/status

LocalCluster
Workers: 4
Total threads: 4
Total memory: 6.78 GiB
Status: running
Using processes: True

Workers 0 to 3: 1 thread and 1.70 GiB of memory each

Create Data¶

We create a synthetic dataset that is large enough to be interesting, but small enough to run quickly.

Our dataset has 100,000 examples and 100 features.

[2]:

import dask
import dask.array as da
from dask_ml.datasets import make_classification


n, d = 100000, 100

X, y = make_classification(n_samples=n, n_features=d,
                           chunks=n // 10, flip_y=0.2)
X

/usr/share/miniconda3/envs/dask-examples/lib/python3.9/site-packages/dask/base.py:1283: UserWarning: Running on a single-machine scheduler when a distributed client is active might lead to unexpected results.
  warnings.warn(

[2]:

	Array	Chunk
Bytes	76.29 MiB	7.63 MiB
Shape	(100000, 100)	(10000, 100)
Count	10 Tasks	10 Chunks
Type	float64	numpy.ndarray
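The sizes reported for X follow directly from the shape, the chunking, and the 8-byte float64 element size. As a quick check, the arithmetic can be reproduced in plain Python; the variable names below are ours, not part of Dask.

```python
n, d = 100_000, 100
chunk_rows = n // 10          # the value passed as chunks= above
itemsize = 8                  # bytes per float64 element

n_chunks = n // chunk_rows
total_mib = n * d * itemsize / 2**20
chunk_mib = chunk_rows * d * itemsize / 2**20

print(f"{n_chunks} chunks, {total_mib:.2f} MiB total, {chunk_mib:.2f} MiB per chunk")
# prints: 10 chunks, 76.29 MiB total, 7.63 MiB per chunk
```

Each chunk is what a single partial_fit call will later receive, so the chunk size also sets the batch size for training.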

For more information on creating Dask arrays and dataframes from real data, see the documentation on Dask arrays and Dask dataframes.

Split data for training and testing¶

We split our dataset into training and testing data so that the model is evaluated on examples it did not see during training:

[3]:

from dask_ml.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
X_train

[3]:

	Array	Chunk
Bytes	68.66 MiB	6.87 MiB
Shape	(90000, 100)	(9000, 100)
Count	70 Tasks	10 Chunks
Type	float64	numpy.ndarray

Persist data in memory¶

This dataset is small enough to fit in distributed memory, so we call dask.persist to ask Dask to execute the computations above and keep the results in memory.

[4]:

X_train, X_test, y_train, y_test = dask.persist(X_train, X_test, y_train, y_test)

If you are working in a situation where your dataset does not fit in memory then you should skip this step. Everything will still work, but will be slower and use less memory.

Calling dask.persist will preserve our data in memory, so no computation will be needed as we pass over our data many times. For example, if our data came from CSV files and was not persisted, then the CSV files would have to be re-read on each pass. This is desirable if the data does not fit in RAM, but slows down our computation otherwise.
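The difference between a lazy collection and a persisted one can be seen on a tiny array. This is a minimal sketch on made-up data, run here without a distributed client; the array and variable names are illustrative only.

```python
import dask
import dask.array as da

x = da.ones((1000, 10), chunks=(100, 10))
row_sums = (x * 2).sum(axis=1)   # lazy: only a task graph exists so far

# persist executes the graphs and returns Dask collections whose chunks
# are already computed, so later passes do not repeat the work.
x_mem, row_sums_mem = dask.persist(x, row_sums)

first_value = float(row_sums_mem[0].compute())
```

The persisted objects are still Dask arrays with the same shape and chunks, so the rest of the workflow does not need to change.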

Precompute classes¶

We pre-compute the classes from our training data, which is required for this classification example:

[5]:

classes = da.unique(y_train).compute()
classes

[5]:

array([0, 1])
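The classes are needed up front because an individual batch may not contain every label. The sketch below shows this with plain Scikit-Learn on a tiny hand-made dataset (the data are an assumption for illustration): the first batch holds only label 0, so the estimator has to be told that label 1 exists.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

X = np.array([[0.0], [0.1], [0.9], [1.0]])
y = np.array([0, 0, 1, 1])

# Without classes, the first partial_fit call is rejected.
first_call_error = ""
try:
    SGDClassifier().partial_fit(X[:2], y[:2])
except ValueError as err:
    first_call_error = str(err)

# With classes declared, a batch containing a single label is accepted.
est = SGDClassifier(random_state=0)
est.partial_fit(X[:2], y[:2], classes=np.array([0, 1]))
est.partial_fit(X[2:], y[2:])  # classes may be omitted on later calls
```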

Create Scikit-Learn model¶

We make the underlying Scikit-Learn estimator, an SGDClassifier:

[6]:

from sklearn.linear_model import SGDClassifier

est = SGDClassifier(loss='log', penalty='l2', tol=1e-3)

Here we use SGDClassifier, but any estimator that implements the partial_fit method will work. A list of Scikit-Learn models that implement this API is available here.

Wrap with Dask-ML’s Incremental meta-estimator¶

We now wrap our SGDClassifier with the dask_ml.wrappers.Incremental meta-estimator (http://ml.dask.org/modules/generated/dask_ml.wrappers.Incremental.html#dask_ml.wrappers.Incremental).

[7]:

from dask_ml.wrappers import Incremental

inc = Incremental(est, scoring='accuracy')

Recall that Incremental only does data management while leaving the actual algorithm to the underlying Scikit-Learn estimator.

Note: We set the scoring parameter above in the Dask estimator to tell it to handle scoring. This works better when using Dask arrays for test data.

Model training¶

Incremental implements a fit method, which will perform one loop over the dataset, calling partial_fit over each chunk in the Dask array.

You may want to watch the dashboard during this fit process to see the sequential fitting of many batches.
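Conceptually, one pass of fit resembles the loop below, which pulls one chunk at a time out of a Dask array and hands it to partial_fit. This is only a sketch of the idea under our own assumptions (small synthetic data, chunks materialized locally with compute); it is not Dask-ML's actual implementation, which schedules the work on the cluster.

```python
import dask.array as da
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X_np, y_np = make_classification(n_samples=1000, n_features=10, random_state=0)
X = da.from_array(X_np, chunks=(250, 10))
y = da.from_array(y_np, chunks=250)
classes = np.unique(y_np)

est = SGDClassifier(random_state=0)
n_calls = 0
for i in range(X.numblocks[0]):
    # Materialize chunk i as NumPy arrays, then update the model with it.
    est.partial_fit(X.blocks[i].compute(), y.blocks[i].compute(), classes=classes)
    n_calls += 1
```

Because each update depends on the model state left by the previous one, the chunks are processed sequentially, which matches the sequential pattern visible on the dashboard.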

[8]:

inc.fit(X_train, y_train, classes=classes)

[8]:

Incremental(estimator=SGDClassifier(loss='log'), scoring='accuracy')

[9]:

inc.score(X_test, y_test)

[9]:

0.5942

Pass over the training data many times¶

Calling .fit passes over all chunks of our data once. However, in many cases we may want to pass over the training data many times. To do this we can use the Incremental.partial_fit method and a for loop.

[10]:

est = SGDClassifier(loss='log', penalty='l2', tol=1e-3)
inc = Incremental(est, scoring='accuracy')

[11]:

for _ in range(10):
    inc.partial_fit(X_train, y_train, classes=classes)
    print('Score:', inc.score(X_test, y_test))

Score: 0.6102
Score: 0.5896
Score: 0.5897
Score: 0.6159
Score: 0.6154
Score: 0.62
Score: 0.6254
Score: 0.6394
Score: 0.6345
Score: 0.637

Predict and Score¶

Finally, we can also call Incremental.predict and Incremental.score on our testing data:

[12]:

inc.predict(X_test)  # Predict produces lazy dask arrays

[12]:

	Array	Chunk
Bytes	78.12 kiB	7.81 kiB
Shape	(10000,)	(1000,)
Count	20 Tasks	10 Chunks
Type	int64	numpy.ndarray

[13]:

inc.predict(X_test)[:100].compute()  # call compute to get results

[13]:

array([1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1,
       0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0,
       0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0,
       1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
       0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0])
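The same lazy-then-compute behaviour can be seen on any Dask array. In the sketch below, a synthetic array stands in for the predictions (an assumption for illustration); slicing before calling compute means only the chunks that overlap the slice are evaluated.

```python
import dask.array as da

# Stand-in for inc.predict(X_test): 10,000 lazy values in 10 chunks.
lazy = da.arange(10_000, chunks=1_000) % 2

# Slicing is also lazy; compute() turns the result into a NumPy array.
first = lazy[:100].compute()
```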

[14]:

inc.score(X_test, y_test)

[14]:

0.637

Learn more¶

In this notebook we went over using Dask-ML’s Incremental meta-estimator to automate the process of incremental training with Scikit-Learn estimators that implement the partial_fit method. If you want to learn more about this process you might want to investigate the following documentation:

https://scikit-learn.org/stable/computing/scaling_strategies.html
Dask-ML Incremental API documentation
List of Scikit-Learn estimators compatible with Dask-ML’s Incremental
For more info on the train-test split for model evaluation, see Hyperparameters and Model Validation.
