From dccfa6ef906b4eca1b97d0608889e97fa1f8812c Mon Sep 17 00:00:00 2001 From: Tom Augspurger Date: Fri, 24 Jul 2020 16:09:32 -0500 Subject: [PATCH 1/6] Add example using CountVectorizer --- binder/environment.yml | 2 +- machine-learning/text-count-vectorizer.ipynb | 323 +++++++++++++++++++ 2 files changed, 324 insertions(+), 1 deletion(-) create mode 100644 machine-learning/text-count-vectorizer.ipynb diff --git a/binder/environment.yml b/binder/environment.yml index da731f37..61e5402d 100644 --- a/binder/environment.yml +++ b/binder/environment.yml @@ -5,7 +5,7 @@ dependencies: - bokeh=1.4.0 - dask=2.9.1 - dask-image=0.2.0 - - dask-ml=1.4.0 + - dask-ml=1.6.0 # TODO: Bump dask-labextension to >=2 once extensions have been made compatibility updates - dask-labextension=1.1.0 # TODO: Bump jupyterlab to 2.x once extensions have been made compatibility updates diff --git a/machine-learning/text-count-vectorizer.ipynb b/machine-learning/text-count-vectorizer.ipynb new file mode 100644 index 00000000..316be31b --- /dev/null +++ b/machine-learning/text-count-vectorizer.ipynb @@ -0,0 +1,323 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Using Dask-ML's CountVectorizer\n", + "\n", + "Dask-ML includes a `CountVectorizer` that's appropriate for parallel / distributed processing of large datasets." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Loading Data\n", + "\n", + "In this example, we'll work with the 20 newsgroups dataset from scikit-learn." + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[\"From: lerxst@wam.umd.edu (where's my thing)\\nSubject: WHAT car is this!?\\nNntp-Posting-Host: rac3.wam.umd.edu\\nOrganization: University of Maryland, College Park\\nLines: 15\\n\\n I was wondering if anyone out there could enlighten me on this car I saw\\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\\nthe front bumper was separate from the rest of the body. This is \\nall I know. If anyone can tellme a model name, engine specs, years\\nof production, where this car is made, history, or whatever info you\\nhave on this funky looking car, please e-mail.\\n\\nThanks,\\n- IL\\n ---- brought to you by your neighborhood Lerxst ----\\n\\n\\n\\n\\n\",\n", + " \"From: guykuo@carson.u.washington.edu (Guy Kuo)\\nSubject: SI Clock Poll - Final Call\\nSummary: Final call for SI clock reports\\nKeywords: SI,acceleration,clock,upgrade\\nArticle-I.D.: shelley.1qvfo9INNc3s\\nOrganization: University of Washington\\nLines: 11\\nNNTP-Posting-Host: carson.u.washington.edu\\n\\nA fair number of brave souls who upgraded their SI clock oscillator have\\nshared their experiences for this poll. Please send a brief message detailing\\nyour experiences with the procedure. Top speed attained, CPU rated speed,\\nadd on cards and adapters, heat sinks, hour of usage per day, floppy disk\\nfunctionality with 800 and 1.4 m floppies are especially requested.\\n\\nI will be summarizing in the next two days, so please add to the network\\nknowledge base if you have done the clock upgrade and haven't answered this\\npoll. 
Thanks.\\n\\nGuy Kuo \\n\"]" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import sklearn.datasets\n", + "\n", + "news = sklearn.datasets.fetch_20newsgroups()\n", + "news['data'][:2]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This returns a list of documents (strings). Dask-ML's `CountVectorizer` expects a `dask.bag.Bag` of documents. We'll use `dask.delayed` to load the 20 newsgroups in parallel, taking care to load the data on the workers and not place large values (like `news['data']`) in the the task graph. See https://docs.dask.org/en/latest/best-practices.html#load-data-with-dask and https://docs.dask.org/en/latest/delayed-best-practices.html#don-t-call-dask-delayed-on-other-dask-collections for more on these concepts.\n", + "\n", + "As we'll see later, Dask-ML's `CountVectorizer` benefits from using the `dask.distributed` scheduler, even on a single machine." + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "\n", + "\n", + "\n", + "\n", + "
\n", + "

Client

\n", + "\n", + "
\n", + "

Cluster

\n", + "
    \n", + "
  • Workers: 4
  • \n", + "
  • Cores: 4
  • \n", + "
  • Memory: 17.18 GB
  • \n", + "
\n", + "
" + ], + "text/plain": [ + "" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from dask.distributed import Client\n", + "\n", + "client = Client(n_workers=4, threads_per_worker=1)\n", + "client" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "dask.bag" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import dask\n", + "import numpy as np\n", + "import dask.bag as db\n", + "import toolz\n", + "\n", + "@dask.delayed\n", + "def load_news(slice_):\n", + " \"\"\"Load a slice of the 20 newsgroups dataset.\"\"\"\n", + " return sklearn.datasets.fetch_20newsgroups()['data'][slice_]\n", + "\n", + "npartitions = 10\n", + "partition_size = len(news['data']) // npartitions\n", + "\n", + "lengths = np.cumsum([partition_size] * npartitions)\n", + "lengths = [0] + list(lengths) + [None]\n", + "\n", + "slices = [slice(a, b) for a, b in\n", + " toolz.sliding_window(2, lengths)]\n", + "documents = db.from_delayed([load_news(x) for x in slices]).persist()\n", + "documents" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [], + "source": [ + "import dask_ml.feature_extraction.text" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "CPU times: user 340 ms, sys: 30.4 ms, total: 371 ms\n", + "Wall time: 2.47 s\n" + ] + } + ], + "source": [ + "vectorizer = dask_ml.feature_extraction.text.CountVectorizer()\n", + "%time result = vectorizer.fit_transform(documents)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The call to `fit_transform` did some work to discover the *vocabulary*, a mapping from terms in the documents to positions in the transformed result array." + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[('00', 0), ('000', 1), ('0000', 2), ('00000', 3), ('000000', 4)]" + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "list(vectorizer.vocabulary_.items())[:5]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Speaking of the result, it's a Dask `Array` backed by `scipy.sparse.csr_matrix` objects. We can bring it back to the client with `.compute()`" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([[0, 0, 0, ..., 0, 0, 0],\n", + " [0, 0, 0, ..., 0, 0, 0],\n", + " [0, 0, 0, ..., 0, 0, 0],\n", + " [0, 0, 0, ..., 0, 0, 0],\n", + " [0, 0, 0, ..., 0, 0, 0]])" + ] + }, + "execution_count": 20, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "local_result = result.compute()\n", + "local_result[:5].toarray()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Notice that we persisted `documents` earlier. If possible, persisting the input documents is preferable to avoid making two passes over the data. One to discover the vocabulary and a second to transform. If the dataset is larger than (distributed) memory, then two passes will be necessary." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## A note on vocabularies\n", + "\n", + "You can also provide a vocabulary ahead of time, which avoids the need for making two passes over the data. This makes operations like `vectorizer.transform` instantaneous, since no vocabulary needs to be discovered. However, vocabularies can become quite large. Consider persisting your data ahead of time to avoid bloating the size of the `CountVectorizer`." + ] + }, + { + "cell_type": "code", + "execution_count": 43, + "metadata": {}, + "outputs": [], + "source": [ + "vocabulary = vectorizer.vocabulary_\n", + "remote_vocabulary, = client.scatter([vocabulary], broadcast=True)\n", + "\n", + "vectorizer2 = dask_ml.feature_extraction.text.CountVectorizer(\n", + " vocabulary=remote_vocabulary\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 44, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "CPU times: user 7.15 ms, sys: 2.45 ms, total: 9.59 ms\n", + "Wall time: 8.54 ms\n" + ] + } + ], + "source": [ + "%time result = vectorizer2.transform(documents)" + ] + }, + { + "cell_type": "code", + "execution_count": 45, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "CPU times: user 162 ms, sys: 37.1 ms, total: 199 ms\n", + "Wall time: 1.08 s\n" + ] + }, + { + "data": { + "text/plain": [ + "<11314x130107 sparse matrix of type ''\n", + "\twith 1787565 stored elements in Compressed Sparse Row format>" + ] + }, + "execution_count": 45, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "%time result.compute()" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.6" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} From 68f348d61d270b311f72b71d3d8c0ef3e9cf0735 Mon Sep 17 00:00:00 2001 From: Tom Augspurger Date: Mon, 27 Jul 2020 08:43:58 -0500 Subject: [PATCH 2/6] update --- index.rst | 1 + 1 file changed, 1 insertion(+) diff --git a/index.rst b/index.rst index 138a7df7..3ae5cd6f 100644 --- a/index.rst +++ b/index.rst @@ -41,6 +41,7 @@ You can run these examples in a live session here: |Binder| machine-learning/torch-prediction machine-learning/training-on-large-datasets machine-learning/incremental + machine-learning/text-count-vectorizer machine-learning/text-vectorization machine-learning/hyperparam-opt.ipynb machine-learning/xgboost From 007ae478ef48d13348af6c12d282fdc04819cc01 Mon Sep 17 00:00:00 2001 From: Tom Augspurger Date: Mon, 27 Jul 2020 09:48:30 -0500 Subject: [PATCH 3/6] fixup --- machine-learning/text-count-vectorizer.ipynb | 72 +++++++------------- 1 file changed, 24 insertions(+), 48 deletions(-) diff --git a/machine-learning/text-count-vectorizer.ipynb b/machine-learning/text-count-vectorizer.ipynb index 316be31b..48f85ebb 100644 --- a/machine-learning/text-count-vectorizer.ipynb +++ b/machine-learning/text-count-vectorizer.ipynb @@ -6,7 +6,7 @@ "source": [ "# Using Dask-ML's CountVectorizer\n", "\n", - "Dask-ML includes a `CountVectorizer` that's appropriate for parallel / distributed processing of large datasets." 
+ "Dask-ML includes a [CountVectorizer](https://ml.dask.org/modules/generated/dask_ml.feature_extraction.text.CountVectorizer.html#dask_ml.feature_extraction.text.CountVectorizer) that's appropriate for parallel / distributed processing of large datasets." ] }, { @@ -15,6 +15,25 @@ "source": [ "## Loading Data\n", "\n", + "As we'll see later, Dask-ML's `CountVectorizer` benefits from using the `dask.distributed` scheduler, even on a single machine." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from dask.distributed import Client\n", + "\n", + "client = Client(n_workers=4, threads_per_worker=1)\n", + "client" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ "In this example, we'll work with the 20 newsgroups dataset from scikit-learn." ] }, @@ -48,51 +67,7 @@ "source": [ "This returns a list of documents (strings). Dask-ML's `CountVectorizer` expects a `dask.bag.Bag` of documents. We'll use `dask.delayed` to load the 20 newsgroups in parallel, taking care to load the data on the workers and not place large values (like `news['data']`) in the the task graph. See https://docs.dask.org/en/latest/best-practices.html#load-data-with-dask and https://docs.dask.org/en/latest/delayed-best-practices.html#don-t-call-dask-delayed-on-other-dask-collections for more on these concepts.\n", "\n", - "As we'll see later, Dask-ML's `CountVectorizer` benefits from using the `dask.distributed` scheduler, even on a single machine." - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - "\n", - "\n", - "\n", - "\n", - "
\n", - "

Client

\n", - "\n", - "
\n", - "

Cluster

\n", - "
    \n", - "
  • Workers: 4
  • \n", - "
  • Cores: 4
  • \n", - "
  • Memory: 17.18 GB
  • \n", - "
\n", - "
" - ], - "text/plain": [ - "" - ] - }, - "execution_count": 7, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from dask.distributed import Client\n", - "\n", - "client = Client(n_workers=4, threads_per_worker=1)\n", - "client" + "This example is a bit contrived to get a Bag with multiple partitions. Typically the full dataset would be partitioned into multiple files on disk, and you'd load one partition per file. In this case, we split the single file into multiple partitions by loading the data and then slicing." ] }, { @@ -130,13 +105,14 @@ "\n", "slices = [slice(a, b) for a, b in\n", " toolz.sliding_window(2, lengths)]\n", + "# Notice the persist here! More details later.\n", "documents = db.from_delayed([load_news(x) for x in slices]).persist()\n", "documents" ] }, { "cell_type": "code", - "execution_count": 9, + "execution_count": 46, "metadata": {}, "outputs": [], "source": [ @@ -234,7 +210,7 @@ "source": [ "## A note on vocabularies\n", "\n", - "You can also provide a vocabulary ahead of time, which avoids the need for making two passes over the data. This makes operations like `vectorizer.transform` instantaneous, since no vocabulary needs to be discovered. However, vocabularies can become quite large. Consider persisting your data ahead of time to avoid bloating the size of the `CountVectorizer`." + "You can also provide a vocabulary ahead of time, which avoids the need for making two passes over the data. This makes operations like `vectorizer.transform` instantaneous, since no vocabulary needs to be discovered. However, vocabularies can become quite large. Consider persisting your data ahead of time to avoid bloating the size of the `CountVectorizer` object. Dask-ML's `CountVectorizer` works just fine when the `vocabulary` is a pointer to a piece of data on the cluster." ] }, { From f257e40ed537ffa2fd2a851a673ed1581405588f Mon Sep 17 00:00:00 2001 From: Tom Augspurger Date: Mon, 27 Jul 2020 09:48:41 -0500 Subject: [PATCH 4/6] strip --- machine-learning/text-count-vectorizer.ipynb | 119 +++---------------- 1 file changed, 16 insertions(+), 103 deletions(-) diff --git a/machine-learning/text-count-vectorizer.ipynb b/machine-learning/text-count-vectorizer.ipynb index 48f85ebb..37eac98d 100644 --- a/machine-learning/text-count-vectorizer.ipynb +++ b/machine-learning/text-count-vectorizer.ipynb @@ -39,21 +39,9 @@ }, { "cell_type": "code", - "execution_count": 3, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "[\"From: lerxst@wam.umd.edu (where's my thing)\\nSubject: WHAT car is this!?\\nNntp-Posting-Host: rac3.wam.umd.edu\\nOrganization: University of Maryland, College Park\\nLines: 15\\n\\n I was wondering if anyone out there could enlighten me on this car I saw\\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\\nthe front bumper was separate from the rest of the body. This is \\nall I know. 
If anyone can tellme a model name, engine specs, years\\nof production, where this car is made, history, or whatever info you\\nhave on this funky looking car, please e-mail.\\n\\nThanks,\\n- IL\\n ---- brought to you by your neighborhood Lerxst ----\\n\\n\\n\\n\\n\",\n", - " \"From: guykuo@carson.u.washington.edu (Guy Kuo)\\nSubject: SI Clock Poll - Final Call\\nSummary: Final call for SI clock reports\\nKeywords: SI,acceleration,clock,upgrade\\nArticle-I.D.: shelley.1qvfo9INNc3s\\nOrganization: University of Washington\\nLines: 11\\nNNTP-Posting-Host: carson.u.washington.edu\\n\\nA fair number of brave souls who upgraded their SI clock oscillator have\\nshared their experiences for this poll. Please send a brief message detailing\\nyour experiences with the procedure. Top speed attained, CPU rated speed,\\nadd on cards and adapters, heat sinks, hour of usage per day, floppy disk\\nfunctionality with 800 and 1.4 m floppies are especially requested.\\n\\nI will be summarizing in the next two days, so please add to the network\\nknowledge base if you have done the clock upgrade and haven't answered this\\npoll. Thanks.\\n\\nGuy Kuo \\n\"]" - ] - }, - "execution_count": 3, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "import sklearn.datasets\n", "\n", @@ -72,20 +60,9 @@ }, { "cell_type": "code", - "execution_count": 8, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "dask.bag" - ] - }, - "execution_count": 8, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "import dask\n", "import numpy as np\n", @@ -112,7 +89,7 @@ }, { "cell_type": "code", - "execution_count": 46, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -121,18 +98,9 @@ }, { "cell_type": "code", - "execution_count": 12, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "CPU times: user 340 ms, sys: 30.4 ms, total: 371 ms\n", - "Wall time: 2.47 s\n" - ] - } - ], + "outputs": [], "source": [ "vectorizer = dask_ml.feature_extraction.text.CountVectorizer()\n", "%time result = vectorizer.fit_transform(documents)" @@ -147,20 +115,9 @@ }, { "cell_type": "code", - "execution_count": 16, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "[('00', 0), ('000', 1), ('0000', 2), ('00000', 3), ('000000', 4)]" - ] - }, - "execution_count": 16, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "list(vectorizer.vocabulary_.items())[:5]" ] @@ -174,24 +131,9 @@ }, { "cell_type": "code", - "execution_count": 20, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "array([[0, 0, 0, ..., 0, 0, 0],\n", - " [0, 0, 0, ..., 0, 0, 0],\n", - " [0, 0, 0, ..., 0, 0, 0],\n", - " [0, 0, 0, ..., 0, 0, 0],\n", - " [0, 0, 0, ..., 0, 0, 0]])" - ] - }, - "execution_count": 20, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "local_result = result.compute()\n", "local_result[:5].toarray()" @@ -215,7 +157,7 @@ }, { "cell_type": "code", - "execution_count": 43, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -229,47 +171,18 @@ }, { "cell_type": "code", - "execution_count": 44, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "CPU times: user 7.15 ms, sys: 2.45 ms, total: 9.59 ms\n", - 
"Wall time: 8.54 ms\n" - ] - } - ], + "outputs": [], "source": [ "%time result = vectorizer2.transform(documents)" ] }, { "cell_type": "code", - "execution_count": 45, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "CPU times: user 162 ms, sys: 37.1 ms, total: 199 ms\n", - "Wall time: 1.08 s\n" - ] - }, - { - "data": { - "text/plain": [ - "<11314x130107 sparse matrix of type ''\n", - "\twith 1787565 stored elements in Compressed Sparse Row format>" - ] - }, - "execution_count": 45, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "%time result.compute()" ] From 39cb3bdd75146370887f673a497f2acfda1fb25a Mon Sep 17 00:00:00 2001 From: Tom Augspurger Date: Mon, 27 Jul 2020 14:26:15 -0500 Subject: [PATCH 5/6] merge --- index.rst | 1 - machine-learning/text-count-vectorizer.ipynb | 212 ------------------- machine-learning/text-vectorization.ipynb | 192 +++++++++++++---- 3 files changed, 149 insertions(+), 256 deletions(-) delete mode 100644 machine-learning/text-count-vectorizer.ipynb diff --git a/index.rst b/index.rst index 3ae5cd6f..138a7df7 100644 --- a/index.rst +++ b/index.rst @@ -41,7 +41,6 @@ You can run these examples in a live session here: |Binder| machine-learning/torch-prediction machine-learning/training-on-large-datasets machine-learning/incremental - machine-learning/text-count-vectorizer machine-learning/text-vectorization machine-learning/hyperparam-opt.ipynb machine-learning/xgboost diff --git a/machine-learning/text-count-vectorizer.ipynb b/machine-learning/text-count-vectorizer.ipynb deleted file mode 100644 index 37eac98d..00000000 --- a/machine-learning/text-count-vectorizer.ipynb +++ /dev/null @@ -1,212 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Using Dask-ML's CountVectorizer\n", - "\n", - "Dask-ML includes a [CountVectorizer](https://ml.dask.org/modules/generated/dask_ml.feature_extraction.text.CountVectorizer.html#dask_ml.feature_extraction.text.CountVectorizer) that's appropriate for parallel / distributed processing of large datasets." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Loading Data\n", - "\n", - "As we'll see later, Dask-ML's `CountVectorizer` benefits from using the `dask.distributed` scheduler, even on a single machine." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from dask.distributed import Client\n", - "\n", - "client = Client(n_workers=4, threads_per_worker=1)\n", - "client" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "In this example, we'll work with the 20 newsgroups dataset from scikit-learn." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import sklearn.datasets\n", - "\n", - "news = sklearn.datasets.fetch_20newsgroups()\n", - "news['data'][:2]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "This returns a list of documents (strings). Dask-ML's `CountVectorizer` expects a `dask.bag.Bag` of documents. We'll use `dask.delayed` to load the 20 newsgroups in parallel, taking care to load the data on the workers and not place large values (like `news['data']`) in the the task graph. 
See https://docs.dask.org/en/latest/best-practices.html#load-data-with-dask and https://docs.dask.org/en/latest/delayed-best-practices.html#don-t-call-dask-delayed-on-other-dask-collections for more on these concepts.\n", - "\n", - "This example is a bit contrived to get a Bag with multiple partitions. Typically the full dataset would be partitioned into multiple files on disk, and you'd load one partition per file. In this case, we split the single file into multiple partitions by loading the data and then slicing." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import dask\n", - "import numpy as np\n", - "import dask.bag as db\n", - "import toolz\n", - "\n", - "@dask.delayed\n", - "def load_news(slice_):\n", - " \"\"\"Load a slice of the 20 newsgroups dataset.\"\"\"\n", - " return sklearn.datasets.fetch_20newsgroups()['data'][slice_]\n", - "\n", - "npartitions = 10\n", - "partition_size = len(news['data']) // npartitions\n", - "\n", - "lengths = np.cumsum([partition_size] * npartitions)\n", - "lengths = [0] + list(lengths) + [None]\n", - "\n", - "slices = [slice(a, b) for a, b in\n", - " toolz.sliding_window(2, lengths)]\n", - "# Notice the persist here! More details later.\n", - "documents = db.from_delayed([load_news(x) for x in slices]).persist()\n", - "documents" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import dask_ml.feature_extraction.text" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "vectorizer = dask_ml.feature_extraction.text.CountVectorizer()\n", - "%time result = vectorizer.fit_transform(documents)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The call to `fit_transform` did some work to discover the *vocabulary*, a mapping from terms in the documents to positions in the transformed result array." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "list(vectorizer.vocabulary_.items())[:5]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Speaking of the result, it's a Dask `Array` backed by `scipy.sparse.csr_matrix` objects. We can bring it back to the client with `.compute()`" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "local_result = result.compute()\n", - "local_result[:5].toarray()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Notice that we persisted `documents` earlier. If possible, persisting the input documents is preferable to avoid making two passes over the data. One to discover the vocabulary and a second to transform. If the dataset is larger than (distributed) memory, then two passes will be necessary." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## A note on vocabularies\n", - "\n", - "You can also provide a vocabulary ahead of time, which avoids the need for making two passes over the data. This makes operations like `vectorizer.transform` instantaneous, since no vocabulary needs to be discovered. However, vocabularies can become quite large. Consider persisting your data ahead of time to avoid bloating the size of the `CountVectorizer` object. Dask-ML's `CountVectorizer` works just fine when the `vocabulary` is a pointer to a piece of data on the cluster." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "vocabulary = vectorizer.vocabulary_\n", - "remote_vocabulary, = client.scatter([vocabulary], broadcast=True)\n", - "\n", - "vectorizer2 = dask_ml.feature_extraction.text.CountVectorizer(\n", - " vocabulary=remote_vocabulary\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "%time result = vectorizer2.transform(documents)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "%time result.compute()" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.7.6" - } - }, - "nbformat": 4, - "nbformat_minor": 2 -} diff --git a/machine-learning/text-vectorization.ipynb b/machine-learning/text-vectorization.ipynb index 9770699a..2bb665a6 100644 --- a/machine-learning/text-vectorization.ipynb +++ b/machine-learning/text-vectorization.ipynb @@ -4,16 +4,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Text Vectorization Pipeline\n", + "# Working with Text Data\n", "\n", - "This example illustrates how Dask-ML can be used to classify large textual datasets in parallel.\n", - "It is adapted from [this scikit-learn example](https://scikit-learn.org/stable/auto_examples/applications/plot_out_of_core_classification.html#sphx-glr-auto-examples-applications-plot-out-of-core-classification-py).\n", - "\n", - "The primary differences are that\n", - "\n", - "* We fit the entire model, including text vectorization, as a pipeline.\n", - "* We use dask collections like [Dask Bag](https://docs.dask.org/en/latest/bag.html), [Dask Dataframe](https://docs.dask.org/en/latest/dataframe.html), and [Dask Array](https://docs.dask.org/en/latest/array.html)\n", - " rather than generators to work with larger than memory datasets." + "Dask-ML includes several ways to [process text data](https://ml.dask.org/modules/api.html#dask-ml-feature-extraction-text-feature-extraction). Typically these work with the [`Dask DataFrame`](https://docs.dask.org/en/latest/dataframe.html) or [`Bag`](https://docs.dask.org/en/latest/bag.html) collections." ] }, { @@ -22,9 +15,9 @@ "metadata": {}, "outputs": [], "source": [ - "from dask.distributed import Client, progress\n", + "from dask.distributed import Client\n", "\n", - "client = Client(n_workers=2, threads_per_worker=2, memory_limit='2GB')\n", + "client = Client(n_workers=4, threads_per_worker=1, memory_limit='2GB')\n", "client" ] }, @@ -32,9 +25,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Fetch the data\n", - "\n", - "Scikit-Learn provides a utility to fetch the newsgroups dataset." + "In this example, we'll work with the 20 newsgroups dataset from scikit-learn. Each element in the dataset has a bit of metadata and the full text of a post." ] }, { @@ -45,20 +36,17 @@ "source": [ "import sklearn.datasets\n", "\n", - "bunch = sklearn.datasets.fetch_20newsgroups()" + "news = sklearn.datasets.fetch_20newsgroups()\n", + "print(news.data[0][:500])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "The data from scikit-learn isn't *too* large, so the data is just\n", - "returned in memory. 
Each document is a string. The target we're predicting\n", - "is an integer, which codes the topic of the post.\n", + "This returns a list of documents (strings). We'll use `dask.delayed` to load the 20 newsgroups in parallel, taking care to load the data on the workers and not place large values (like `news['data']`) in the the task graph. See https://docs.dask.org/en/latest/best-practices.html#load-data-with-dask and https://docs.dask.org/en/latest/delayed-best-practices.html#don-t-call-dask-delayed-on-other-dask-collections for more on these concepts.\n", "\n", - "We'll load the documents and targets directly into a dask DataFrame.\n", - "In practice, on a larger than memory dataset, you would likely load the\n", - "documents from disk or cloud storage using `dask.bag` or `dask.delayed`." + "This example is a bit contrived to get a Bag with multiple partitions. Typically the full dataset would be partitioned into multiple files on disk, and you'd load one partition per file. In this case, we split the single file into multiple partitions by loading the data and then slicing. Because the full dataset isn't too large, we `.persist()` it into memory. If it's larger than (distributed) memory, that would need to be omitted, and the data would need to be read from disk (possibly more than once, depending on the algorithm used later on)." ] }, { @@ -67,20 +55,38 @@ "metadata": {}, "outputs": [], "source": [ - "import dask.dataframe as dd\n", - "import pandas as pd\n", + "import dask\n", + "import numpy as np\n", + "import dask.bag as db\n", + "import toolz\n", "\n", - "df = dd.from_pandas(pd.DataFrame({\"text\": bunch.data, \"target\": bunch.target}),\n", - " npartitions=25)\n", + "@dask.delayed\n", + "def load_news(slice_):\n", + " \"\"\"Load a slice of the 20 newsgroups dataset.\"\"\"\n", + " return sklearn.datasets.fetch_20newsgroups()['data'][slice_]\n", "\n", - "df" + "npartitions = 10\n", + "partition_size = len(news['data']) // npartitions\n", + "\n", + "lengths = np.cumsum([partition_size] * npartitions)\n", + "lengths = [0] + list(lengths) + [None]\n", + "\n", + "slices = [slice(a, b) for a, b in\n", + " toolz.sliding_window(2, lengths)]\n", + "\n", + "documents = db.from_delayed([load_news(x) for x in slices]).persist()\n", + "documents" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Each row in the `text` column has a bit of metadata and the full text of a post." + "## Feature Extraction\n", + "\n", + "`HashingVectorizer` and `CountVectorizer` can be used to turn raw documents (strings) into sparse Dask Arrays.\n", + "\n", + "`HashingVectorizer` is stateless. Notice that the call to `fit_transform` is almost instant." ] }, { @@ -89,23 +95,77 @@ "metadata": {}, "outputs": [], "source": [ - "print(df.head().loc[0, 'text'][:500])" + "import dask_ml.feature_extraction\n", + "\n", + "hashing_vectorizer = dask_ml.feature_extraction.text.HashingVectorizer()\n", + "%time transformed = hashing_vectorizer.fit_transform(documents)\n", + "%time transformed.compute()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Feature Hashing" + "`CountVectorizer` is not stateless, unless you provide a `vocabulary` ahead of time. When no vocabulary is provided, `CountVectorizer.fit` or `CountVectorizer.fit_transform` will need to load data to discover the unique set of terms in the documents." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "vectorizer = dask_ml.feature_extraction.text.CountVectorizer()\n", + "%time result = vectorizer.fit_transform(documents)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Dask's [HashingVectorizer](https://ml.dask.org/modules/generated/dask_ml.feature_extraction.text.HashingVectorizer.html#dask_ml.feature_extraction.text.HashingVectorizer) provides a similar API to [scikit-learn's implementation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html). In fact, Dask-ML's implementation uses scikit-learn's, applying it to each partition of the input `dask.dataframe.Series` or `dask.bag.Bag`.\n", + "The call to `fit_transform` did some work to discover the *vocabulary*, a mapping from terms in the documents to positions in the transformed result array." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "list(vectorizer.vocabulary_.items())[:5]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Speaking of the result, it's again a Dask `Array` backed by `scipy.sparse.csr_matrix` objects. We can bring it back to the client with `.compute()`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "local_result = result.compute()\n", + "local_result[:5].toarray()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Notice that we persisted `documents` earlier. If possible, persisting the input documents is preferable to avoid making two passes over the data. One to discover the vocabulary and a second to transform. If the dataset is larger than (distributed) memory, then two passes will be necessary." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## A note on vocabularies\n", "\n", - "Transformation, once we actually compute the result, happens in parallel and returns a dask Array." + "You can also provide a vocabulary ahead of time, which avoids the need for making two passes over the data. This makes operations like `vectorizer.transform` instantaneous, since no vocabulary needs to be discovered. However, vocabularies can become quite large. Consider persisting your data ahead of time to avoid bloating the size of the `CountVectorizer` object. Dask-ML's `CountVectorizer` works just fine when the `vocabulary` is a pointer to a piece of data on the cluster." ] }, { @@ -114,20 +174,19 @@ "metadata": {}, "outputs": [], "source": [ - "import dask_ml.feature_extraction.text\n", + "vocabulary = vectorizer.vocabulary_\n", + "remote_vocabulary, = client.scatter([vocabulary], broadcast=True)\n", "\n", - "vect = dask_ml.feature_extraction.text.HashingVectorizer()\n", - "X = vect.fit_transform(df['text'])\n", - "X" + "vectorizer2 = dask_ml.feature_extraction.text.CountVectorizer(\n", + " vocabulary=remote_vocabulary\n", + ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "The output array `X` has unknown chunk sizes becase the input dask Series or Bags don't know their own length.\n", - "\n", - "Each block in `X` is a `scipy.sparse` matrix." + "`CountVectorizer.transform` doesn't need to do any real work now, so it's fast." 
] }, { @@ -136,14 +195,60 @@ "metadata": {}, "outputs": [], "source": [ - "X.blocks[0].compute()" + "%time result = vectorizer2.transform(documents)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%time result.compute()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Text Vectorization Pipeline" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The rest of this example is adapted from [this scikit-learn example](https://scikit-learn.org/stable/auto_examples/applications/plot_out_of_core_classification.html#sphx-glr-auto-examples-applications-plot-out-of-core-classification-py).\n", + "\n", + "The primary differences are that\n", + "\n", + "* We fit the entire model, including text vectorization, as a pipeline.\n", + "* We use dask collections like [Dask Bag](https://docs.dask.org/en/latest/bag.html), [Dask Dataframe](https://docs.dask.org/en/latest/dataframe.html), and [Dask Array](https://docs.dask.org/en/latest/array.html)\n", + " rather than generators to work with larger than memory datasets." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "This is a document-term matrix. Each row is the hashed representation of the original post." + "We'll load the documents and targets directly into a dask DataFrame.\n", + "In practice, on a larger than memory dataset, you would likely load the\n", + "documents from disk or cloud storage using `dask.bag` or `dask.delayed`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import dask.dataframe as dd\n", + "import pandas as pd\n", + "\n", + "df = dd.from_pandas(pd.DataFrame({\"text\": news.data, \"target\": news.target}),\n", + " npartitions=25)\n", + "\n", + "df" ] }, { @@ -164,7 +269,7 @@ "metadata": {}, "outputs": [], "source": [ - "bunch.target_names" + "news.target_names" ] }, { @@ -175,7 +280,7 @@ "source": [ "import numpy as np\n", "\n", - "positive = np.arange(len(bunch.target_names))[['comp' in x for x in bunch.target_names]]\n", + "positive = np.arange(len(news.target_names))[['comp' in x for x in news.target_names]]\n", "y = df['target'].isin(positive).astype(int)\n", "y" ] @@ -210,6 +315,7 @@ "sgd = sklearn.linear_model.SGDClassifier(\n", " tol=1e-3\n", ")\n", + "vect = dask_ml.feature_extraction.text.HashingVectorizer()\n", "clf = dask_ml.wrappers.Incremental(\n", " sgd, scoring='accuracy', assume_equal_chunks=True\n", ")\n", @@ -293,7 +399,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.7.4" + "version": "3.7.6" } }, "nbformat": 4, From 4a5934362c3fea18170883eaa080cbf0bb136edd Mon Sep 17 00:00:00 2001 From: Tom Augspurger Date: Wed, 5 Aug 2020 11:22:18 -0500 Subject: [PATCH 6/6] fixups --- machine-learning/text-vectorization.ipynb | 85 ++++++++++++++--------- 1 file changed, 52 insertions(+), 33 deletions(-) diff --git a/machine-learning/text-vectorization.ipynb b/machine-learning/text-vectorization.ipynb index 2bb665a6..6ca6d5e7 100644 --- a/machine-learning/text-vectorization.ipynb +++ b/machine-learning/text-vectorization.ipynb @@ -6,7 +6,7 @@ "source": [ "# Working with Text Data\n", "\n", - "Dask-ML includes several ways to [process text data](https://ml.dask.org/modules/api.html#dask-ml-feature-extraction-text-feature-extraction). 
Typically these work with the [`Dask DataFrame`](https://docs.dask.org/en/latest/dataframe.html) or [`Bag`](https://docs.dask.org/en/latest/bag.html) collections." + "Dask-ML includes several ways to [process text data](https://ml.dask.org/modules/api.html#dask-ml-feature-extraction-text-feature-extraction). Typically these work with the [`Dask DataFrame`](https://docs.dask.org/en/latest/dataframe.html) or [`Bag`](https://docs.dask.org/en/latest/bag.html) collections, which can reference larger-than-memory datasets stored on disk or in distributed memory on a Dask Cluster." ] }, { @@ -44,9 +44,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "This returns a list of documents (strings). We'll use `dask.delayed` to load the 20 newsgroups in parallel, taking care to load the data on the workers and not place large values (like `news['data']`) in the the task graph. See https://docs.dask.org/en/latest/best-practices.html#load-data-with-dask and https://docs.dask.org/en/latest/delayed-best-practices.html#don-t-call-dask-delayed-on-other-dask-collections for more on these concepts.\n", - "\n", - "This example is a bit contrived to get a Bag with multiple partitions. Typically the full dataset would be partitioned into multiple files on disk, and you'd load one partition per file. In this case, we split the single file into multiple partitions by loading the data and then slicing. Because the full dataset isn't too large, we `.persist()` it into memory. If it's larger than (distributed) memory, that would need to be omitted, and the data would need to be read from disk (possibly more than once, depending on the algorithm used later on)." + "This returns a list of documents (strings). We'll load the datset using `dask.bag.from_sequence`, but in practice you would want to load the data on the workers. See https://docs.dask.org/en/latest/best-practices.html#load-data-with-dask and https://docs.dask.org/en/latest/delayed-best-practices.html#don-t-call-dask-delayed-on-other-dask-collections." ] }, { @@ -58,23 +56,8 @@ "import dask\n", "import numpy as np\n", "import dask.bag as db\n", - "import toolz\n", - "\n", - "@dask.delayed\n", - "def load_news(slice_):\n", - " \"\"\"Load a slice of the 20 newsgroups dataset.\"\"\"\n", - " return sklearn.datasets.fetch_20newsgroups()['data'][slice_]\n", - "\n", - "npartitions = 10\n", - "partition_size = len(news['data']) // npartitions\n", - "\n", - "lengths = np.cumsum([partition_size] * npartitions)\n", - "lengths = [0] + list(lengths) + [None]\n", - "\n", - "slices = [slice(a, b) for a, b in\n", - " toolz.sliding_window(2, lengths)]\n", "\n", - "documents = db.from_delayed([load_news(x) for x in slices]).persist()\n", + "documents = db.from_sequence(news['data'], npartitions=10).persist()\n", "documents" ] }, @@ -84,9 +67,11 @@ "source": [ "## Feature Extraction\n", "\n", - "`HashingVectorizer` and `CountVectorizer` can be used to turn raw documents (strings) into sparse Dask Arrays.\n", + "Dask-ML's feature extractors turn Bags or DataFrames of raw documents (strings) into Dask Arrays backed by scipy.sparse matrices.\n", "\n", - "`HashingVectorizer` is stateless. Notice that the call to `fit_transform` is almost instant." + "If the limitations of `HashingVectorizer` (no inverse transform, no IDF weighting) are acceptable, then we strongly recommend using it over something like `CountVectorizer`. 
`HashingVectorizer` is completely stateless and so is much easier (and faster) to use in a distributed setting.\n",
+    "\n",
+    "Note that because `HashingVectorizer` is stateless, the calls to `fit` and `transform` are nearly instant."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
     "import dask_ml.feature_extraction\n",
     "\n",
     "hashing_vectorizer = dask_ml.feature_extraction.text.HashingVectorizer()\n",
-    "%time transformed = hashing_vectorizer.fit_transform(documents)\n",
-    "%time transformed.compute()"
+    "%time hashing_vectorizer.fit(documents)\n",
+    "%time transformed = hashing_vectorizer.transform(documents)"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "`CountVectorizer` is not stateless, unless you provide a `vocabulary` ahead of time. When no vocabulary is provided, `CountVectorizer.fit` or `CountVectorizer.fit_transform` will need to load data to discover the unique set of terms in the documents."
+    "It's only when you `.compute()` the result that the data is loaded and transformed."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
-    "vectorizer = dask_ml.feature_extraction.text.CountVectorizer()\n",
-    "%time result = vectorizer.fit_transform(documents)"
+    "%time transformed.compute()"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "The call to `fit_transform` did some work to discover the *vocabulary*, a mapping from terms in the documents to positions in the transformed result array."
+    "`CountVectorizer` is not stateless unless you provide a `vocabulary` ahead of time. When no vocabulary is provided, `CountVectorizer.fit` or `CountVectorizer.fit_transform` will need to load data to discover the unique set of terms in the documents."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
-    "list(vectorizer.vocabulary_.items())[:5]"
+    "vectorizer = dask_ml.feature_extraction.text.CountVectorizer()\n",
+    "%time result = vectorizer.fit_transform(documents)"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Speaking of the result, it's again a Dask `Array` backed by `scipy.sparse.csr_matrix` objects. We can bring it back to the client with `.compute()`"
+    "Now `.fit_transform` (and `fit` and `transform`) is much more expensive since all the documents must be loaded to determine the `vocabulary`.\n",
+    "\n",
+    "The result is again a Dask `Array` backed by `scipy.sparse.csr_matrix` objects. We can bring it back to the client with `.compute()`"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
-    "local_result = result.compute()\n",
-    "local_result[:5].toarray()"
+    "%time result.compute()"
    ]
   },
   {
@@ -174,6 +160,8 @@
    "metadata": {},
    "outputs": [],
    "source": [
+    "# reuse the vocabulary from the previously fitted estimator.\n",
+    "# In practice this would come from an external source.\n",
     "vocabulary = vectorizer.vocabulary_\n",
     "remote_vocabulary, = client.scatter([vocabulary], broadcast=True)\n",
     "\n",
     "vectorizer2 = dask_ml.feature_extraction.text.CountVectorizer(\n",
     "    vocabulary=remote_vocabulary\n",
     ")"
    ]
   },
@@ -207,6 +195,37 @@
    "%time result.compute()"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "See https://scikit-learn.org/stable/modules/feature_extraction.html#vectorizing-a-large-text-corpus-with-the-hashing-trick for more on problems with large vocabularies, which recommends \"feature hashing\" as a possible solution.\n",
+    "\n",
+    "## Feature Hashing\n",
+    "\n",
+    "Feature hashing transforms a DataFrame or Bag of inputs (mappings or strings) to a sparse array. It is completely stateless, and so doesn't suffer from the same issues as `CountVectorizer`.
See https://scikit-learn.org/stable/modules/feature_extraction.html#feature-hashing for more." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "hasher = dask_ml.feature_extraction.text.FeatureHasher(input_type=\"string\")\n", + "result = hasher.transform(documents)\n", + "result" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%time result.compute()" + ] + }, { "cell_type": "markdown", "metadata": {}, @@ -399,7 +418,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.7.6" + "version": "3.8.5" } }, "nbformat": 4,
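A minimal, self-contained sketch of the feature-hashing idea the final patch adds (the toy token lists and variable names here are hypothetical; the notebook applies the same transformer to the 20 newsgroups bag):

```python
# Feature hashing is stateless: no vocabulary pass is needed before transform.
import dask.bag as db
import dask_ml.feature_extraction.text

# With input_type="string", each sample is an iterable of string tokens.
samples = db.from_sequence(
    [["the", "first", "document"], ["and", "the", "second"]],
    npartitions=2,
)
hasher = dask_ml.feature_extraction.text.FeatureHasher(input_type="string")
result = hasher.transform(samples)  # a lazy Dask Array of sparse blocks
print(result.compute())
```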