diff --git a/machine-learning/parallel-prediction.ipynb b/machine-learning/parallel-prediction.ipynb
index ecbbfc7..8bb357f 100644
--- a/machine-learning/parallel-prediction.ipynb
+++ b/machine-learning/parallel-prediction.ipynb
@@ -12,7 +12,8 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Sometimes you'll train on a smaller dataset that fits in memory, but need to predict or score for a much larger (possibly larger than memory) dataset. Perhaps your [learning curve](http://scikit-learn.org/stable/modules/learning_curve.html) has leveled off, or you only have labels for a subset of the data.\n",
+    "Sometimes you'll train on a smaller dataset that fits in memory, but need to predict or score for a much larger (possibly larger than memory) dataset.\n",
+    "Perhaps your [learning curve](http://scikit-learn.org/stable/modules/learning_curve.html) has leveled off, or you only have labels for a subset of the data.\n",
     "\n",
     "In this situation, you can use [ParallelPostFit](http://ml.dask.org/modules/generated/dask_ml.wrappers.ParallelPostFit.html) to parallelize and distribute the scoring or prediction steps."
    ]
@@ -25,7 +26,7 @@
    "source": [
     "from dask.distributed import Client, progress\n",
     "\n",
-    "# Scale up: connect to your own cluster with bmore resources\n",
+    "# Scale up: connect to your own cluster with more resources\n",
     "# see http://dask.pydata.org/en/latest/setup.html\n",
     "client = Client(processes=False, threads_per_worker=4,\n",
     "                n_workers=1, memory_limit='2GB')\n",
@@ -155,9 +156,11 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "y_pred is Dask arary. Workers can write the predicted values to a shared file system, without ever having to collect the data on a single machine.\n",
+    "`y_pred` is a Dask array.\n",
+    "Workers can write the predicted values to a shared file system, without ever having to collect the data on a single machine.\n",
     "\n",
-    "Or we can check the models score on the entire large dataset. The computation will be done in parallel, and no single machine will have to hold all the data."
+    "Or we can check the model's score on the entire large dataset.\n",
+    "The computation will be done in parallel, and no single machine will have to hold all the data."
    ]
   },
   {
@@ -168,6 +171,13 @@
    "source": [
     "clf.score(X_large, y_large)"
    ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
   }
  ],
  "metadata": {
@@ -186,7 +196,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.9.12"
+   "version": "3.10.6"
   }
  },
  "nbformat": 4,
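
For readers of this patch, the end-to-end pattern the notebook walks through looks roughly like the sketch below. It is an illustration only, assuming dask, dask-ml, and scikit-learn are installed; the LogisticRegression estimator and the dataset sizes are stand-in choices rather than the notebook's exact ones, while the names clf, y_pred, X_large, and y_large mirror the cells quoted in the hunks above.

# Illustrative sketch of the ParallelPostFit workflow (assumed setup,
# not the notebook's exact code).
import dask.array as da
from dask_ml.wrappers import ParallelPostFit
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Train on a small, in-memory dataset.
X_train, y_train = make_classification(n_samples=1_000, random_state=0)

# Wrap any scikit-learn estimator; fit() runs on the small data as usual.
clf = ParallelPostFit(LogisticRegression(), scoring="accuracy")
clf.fit(X_train, y_train)

# Stand in for a larger-than-memory dataset by replicating the training
# data as a chunked Dask array.
N = 100
X_large = da.concatenate([da.from_array(X_train, chunks=X_train.shape)] * N)
y_large = da.concatenate([da.from_array(y_train, chunks=y_train.shape)] * N)

# predict() on a Dask array is lazy and returns a Dask array, so no single
# machine has to hold all the predictions at once.
y_pred = clf.predict(X_large)

# score() evaluates in parallel across the workers.
print(clf.score(X_large, y_large))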