Parallelizing Prediction

This example demonstrates dask_ml.wrappers.ParallelPostFit. A sklearn.svm.SVC is fit on a small dataset that easily fits in memory.

After training, we predict for successively larger datasets. We compare

  • The serial prediction time using the regular SVC.predict method
  • The parallel prediction time using dask_ml.wrappers.ParallelPostFit.predict()

We see that the parallel version is faster, especially for larger datasets. Additionally, the ParallelPostFit version scales out to larger-than-memory datasets.
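
For instance, because ParallelPostFit.predict returns a lazy Dask array, predictions for a dataset that does not fit in memory can be written to disk chunk by chunk rather than materialized with .compute(). The following is a minimal sketch, not part of the timed example below; it assumes the zarr package is installed, and the output path and chunk sizes are illustrative:

import dask.array as da
import sklearn.datasets
from sklearn.svm import SVC

import dask_ml.datasets
from dask_ml.wrappers import ParallelPostFit

X_small, y_small = sklearn.datasets.make_classification(n_samples=1000)
clf = ParallelPostFit(SVC(gamma='scale')).fit(X_small, y_small)

# A chunked Dask array standing in for a dataset too large for memory.
X_big, _ = dask_ml.datasets.make_classification(n_samples=2_000_000,
                                                chunks=100_000)

yhat = clf.predict(X_big)              # lazy dask array; nothing is computed yet
da.to_zarr(yhat, 'predictions.zarr')   # predictions are written chunk by chunk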

While only predict is demonstrated here, wrappers.ParallelPostFit is equally useful for predict_proba and transform.
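
As a rough sketch of those other methods (the probabilistic SVC, the PCA transformer, and the data sizes here are illustrative choices, not part of the example above):

import sklearn.datasets
from sklearn.decomposition import PCA
from sklearn.svm import SVC

import dask_ml.datasets
from dask_ml.wrappers import ParallelPostFit

X_small, y_small = sklearn.datasets.make_classification(n_samples=1000)
X_big, _ = dask_ml.datasets.make_classification(n_samples=200_000, chunks=10_000)

# predict_proba: SVC only exposes it when probability=True
proba_clf = ParallelPostFit(SVC(gamma='scale', probability=True))
proba_clf.fit(X_small, y_small)
probabilities = proba_clf.predict_proba(X_big)   # lazy (n_samples, n_classes) dask array

# transform: the same wrapper parallelizes a fitted transformer
pca = ParallelPostFit(PCA(n_components=2))
pca.fit(X_small)
components = pca.transform(X_big)                # lazy dask array, evaluated block-wise
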

[Figure: predict time versus number of samples for the serial scikit-learn and parallel dask-ml versions (sphx_glr_plot_parallel_post_fit_scaling_001.png)]
from timeit import default_timer as tic

import pandas as pd
import seaborn as sns
import sklearn.datasets
from sklearn.svm import SVC

import dask_ml.datasets
from dask_ml.wrappers import ParallelPostFit

X, y = sklearn.datasets.make_classification(n_samples=1000)
clf = ParallelPostFit(SVC(gamma='scale'))
clf.fit(X, y)


Ns = [100_000, 200_000, 400_000, 800_000]
timings = []


for n in Ns:
    X, y = dask_ml.datasets.make_classification(n_samples=n,
                                                random_state=n,
                                                chunks=n // 20)
    t1 = tic()
    # Serial scikit-learn version
    clf.estimator.predict(X)
    timings.append(('Scikit-Learn', n, tic() - t1))

    t1 = tic()
    # Parallelized scikit-learn version
    clf.predict(X).compute()
    timings.append(('dask-ml', n, tic() - t1))


df = pd.DataFrame(timings,
                  columns=['method', 'Number of Samples', 'Predict Time'])
# seaborn renamed factorplot to catplot; kind='point' preserves the old default
ax = sns.catplot(x='Number of Samples', y='Predict Time', hue='method',
                 data=df, kind='point', aspect=1.5)
