Incrementally Train Large Datasets¶
We can train models on large datasets one batch at a time. Many
Scikit-Learn estimators implement a
partial_fit method to enable
incremental learning in batches.
est = SGDClassifier(...) est.partial_fit(X_train_1, y_train_1) est.partial_fit(X_train_2, y_train_2) ...
The Scikit-Learn documentation discusses this approach in more depth in their user guide.
This notebook demonstrates the use of Dask-ML’s
meta-estimator, which automates the use of Scikit-Learn’s
partial_fit over Dask arrays and dataframes. Scikit-Learn handles
all of the computation while Dask handles the data management, loading
and moving batches of data as necessary. This allows scaling to large
datasets distributed across many machines, or to datasets that do not
fit in memory, all with a familiar workflow.
This example shows …
- wrapping a Scikit-Learn estimator that implements
partial_fitwith the Dask-ML Incremental meta-estimator
- training, predicting, and scoring on this wrapped estimator
Although this example uses Scikit-Learn’s SGDClassifer, the
Incremental meta-estimator will work for any class that implements
partial_fit and the scikit-learn base estimator
We first start a Dask client in order to get access to the Dask dashboard, which will provide progress and performance metrics.
You can view the dashboard by clicking on the dashboard link after you run the cell
from dask.distributed import Client client = Client(n_workers=4, threads_per_worker=1) client
We create a synthetic dataset that is large enough to be interesting, but small enough to run quickly.
Our dataset has 1,000,000 examples and 100 features.
import dask import dask.array as da from dask_ml.datasets import make_classification n, d = 100000, 100 X, y = make_classification(n_samples=n, n_features=d, chunks=n // 10, flip_y=0.2) X
dask.array<normal, shape=(100000, 100), dtype=float64, chunksize=(10000, 100)>
Split data for training and testing¶
We split our dataset into training and testing data to aid evaluation by making sure we have a fair test:
from dask_ml.model_selection import train_test_split X_train, X_test, y_train, y_test = train_test_split(X, y) X_train
dask.array<concatenate, shape=(90000, 100), dtype=float64, chunksize=(9000, 100)>
Persist data in memory¶
This dataset is small enough to fit in distributed memory, so we call
dask.persist to ask Dask to execute the computations above and keep
the results in memory.
X_train, X_test, y_train, y_test = dask.persist(X_train, X_test, y_train, y_test)
If you are working in a situation where your dataset does not fit in memory then you should skip this step. Everything will still work, but will be slower and use less memory.
dask.persist will preserve our data in memory, so no
computation will be needed as we pass over our data many times. For
example if our data came from CSV files and was not persisted, then the
CSV files would have to be re-read on each pass. This is desirable if
the data does not fit in RAM, but not slows down our computation
We pre-compute the classes from our training data, which is required for this classification example:
classes = da.unique(y_train).compute() classes
Create Scikit-Learn model¶
We make the underlying Scikit-Learn estimator, an
from sklearn.linear_model import SGDClassifier est = SGDClassifier(loss='log', penalty='l2', tol=1e-3)
Here we use
SGDClassifier, but any estimator that implements the
partial_fit method will work. A list of Scikit-Learn models that
implement this API is available
Wrap with Dask-ML’s Incremental meta-estimator¶
We now wrap our
SGDClassifer with the
from dask_ml.wrappers import Incremental inc = Incremental(est, scoring='accuracy')
Incremental only does data management while leaving the
actual algorithm to the underlying Scikit-Learn estimator.
Note: We set the scoring parameter above in the Dask estimator to tell it to handle scoring. This works better when using Dask arrays for test data.
Incremental implements a
fit method, which will perform one loop
over the dataset, calling
partial_fit over each chunk in the Dask
You may want to watch the dashboard during this fit process to see the sequential fitting of many batches.
inc.fit(X_train, y_train, classes=classes)
Incremental(estimator=SGDClassifier(alpha=0.0001, average=False, class_weight=None, early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True, l1_ratio=0.15, learning_rate='optimal', loss='log', max_iter=None, n_iter=None, n_iter_no_change=5, n_jobs=None, penalty='l2', power_t=0.5, random_state=None, shuffle=True, tol=0.001, validation_fraction=0.1, verbose=0, warm_start=False), random_state=None, scoring='accuracy', shuffle_blocks=True)
Pass over the training data many times¶
.fit passes over all chunks our data once. However, in many
cases we may want to pass over the training data many times. To do this
we can use the
Incremental.partial_fit method and a for loop.
est = SGDClassifier(loss='log', penalty='l2', tol=0e-3) inc = Incremental(est, scoring='accuracy')
for _ in range(10): inc.partial_fit(X_train, y_train, classes=classes) print('Score:', inc.score(X_test, y_test))
Score: 0.5741 Score: 0.5678 Score: 0.5799 Score: 0.5691 Score: 0.596 Score: 0.5843 Score: 0.594 Score: 0.6013 Score: 0.6054 Score: 0.5926
Predict and Score¶
Finally we can also call
Incremental.score on our testing data
inc.predict(X_test) # Predict produces lazy dask arrays
dask.array<predict, shape=(10000,), dtype=int64, chunksize=(1000,)>
inc.predict(X_test)[:100].compute() # call compute to get results
array([1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1])
In this notebook we went over using Dask-ML’s
meta-estimator to automate the process of incremental training with
Scikit-Learn estimators that implement the
partial_fit method. If
you want to learn more about this process you might want to investigate
the following documentation:
- Dask-ML Incremental API documentation
- List of Scikit-Learn estimators compatible with Dask-ML’s Incremental
- For more info on the train-test split for model evaluation, see Hyperparameters and Model Validation.