From a510ad1c4c53ec31f4a790fd50be4aeda30ce1be Mon Sep 17 00:00:00 2001 From: deanm0000 Date: Tue, 2 Jan 2024 18:26:34 +0000 Subject: [PATCH 1/9] strong warning about map_batches --- docs/user-guide/expressions/user-defined-functions.md | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/docs/user-guide/expressions/user-defined-functions.md b/docs/user-guide/expressions/user-defined-functions.md index 882cc11c6ac1..3387be994cb0 100644 --- a/docs/user-guide/expressions/user-defined-functions.md +++ b/docs/user-guide/expressions/user-defined-functions.md @@ -18,8 +18,7 @@ These functions have an important distinction in how they operate and consequent A `map_batches` passes the `Series` backed by the `expression` as is. `map_batches` follows the same rules in both the `select` and the `group_by` context, this will -mean that the `Series` represents a column in a `DataFrame`. Note that in the `group_by` context, that column is not yet -aggregated! +mean that the `Series` represents a column in a `DataFrame`. To be clear, **using a `group_by` or `over` with `map_batches` will return results as though there was no group at all.** Use cases for `map_batches` are for instance passing the `Series` in an expression to a third party library. Below we show how we could use `map_batches` to pass an expression column to a neural network model. 
From 904a99938c2a98e9abe2efe46899a333ee7997f5 Mon Sep 17 00:00:00 2001 From: deanm0000 Date: Tue, 2 Jan 2024 19:48:01 +0000 Subject: [PATCH 2/9] docs(python): add numba info/example --- .../user-guide/expressions/numba-example.py | 17 ++++++++++++++++ docs/user-guide/expressions/numpy.md | 20 +++++++++++++++++-- py-polars/docs/requirements-docs.txt | 1 + 3 files changed, 36 insertions(+), 2 deletions(-) create mode 100644 docs/src/python/user-guide/expressions/numba-example.py diff --git a/docs/src/python/user-guide/expressions/numba-example.py b/docs/src/python/user-guide/expressions/numba-example.py new file mode 100644 index 000000000000..acd6c10c2b3b --- /dev/null +++ b/docs/src/python/user-guide/expressions/numba-example.py @@ -0,0 +1,17 @@ +import polars as pl +import numba as nb + +df = pl.DataFrame({"a": [10, 9, 8, 7]}) + + +@nb.guvectorize([(nb.int64[:], nb.int64, nb.int64[:])], "(n),()->(n)") +def cum_sum_reset(x, y, res): + res[0] = x[0] + for i in range(1, x.shape[0]): + res[i] = x[i] + res[i - 1] + if res[i] >= y: + res[i] = x[i] + + +out = df.select(cum_sum_reset(pl.all(), 5)) +print(out) diff --git a/docs/user-guide/expressions/numpy.md b/docs/user-guide/expressions/numpy.md index 6500e87b5207..3b0e92f9e1d0 100644 --- a/docs/user-guide/expressions/numpy.md +++ b/docs/user-guide/expressions/numpy.md @@ -1,7 +1,7 @@ -# Numpy +# Numpy ufuncs Polars expressions support NumPy [ufuncs](https://numpy.org/doc/stable/reference/ufuncs.html). See [here](https://numpy.org/doc/stable/reference/ufuncs.html#available-ufuncs) -for a list on all supported numpy functions. +for a list on all supported numpy functions. Additionally, SciPy offers a wide host of ufuncs. Specifically, the [scipy.special](https://docs.scipy.org/doc/scipy/reference/special.html#module-scipy.special) namespace has ufunc versions of many (possibly most) of what is available under stats. 
This means that if a function is not provided by Polars, we can use NumPy and we still have fast columnar operation through the NumPy API. @@ -13,6 +13,18 @@ This means that if a function is not provided by Polars, we can use NumPy and we --8<-- "python/user-guide/expressions/numpy-example.py" ``` +## Numba + +[NumBa](https://numba.pydata.org/) is an open source JIT compiler that allows you to create your own ufuncs entirely within python. The key is to use the [@guvectorize](https://numba.readthedocs.io/en/stable/user/vectorize.html#the-guvectorize-decorator) decorator. One popular use case is conditional cumulative functions. For example, suppose you want to take a cumulative sum but have it reset whenever it gets to a threshold. + +### Example + +{{code_block('user-guide/expressions/numba-example',api_functions=['DataFrame'])}} + +```python exec="on" result="text" session="user-guide/numpy" +--8<-- "python/user-guide/expressions/numba-example.py" +``` + ### Interoperability Polars `Series` have support for NumPy universal functions (ufuncs). Element-wise functions such as `np.exp()`, `np.cos()`, `np.div()`, etc. all work with almost zero overhead. @@ -20,3 +32,7 @@ Polars `Series` have support for NumPy universal functions (ufuncs). Element-wis However, as a Polars-specific remark: missing values are a separate bitmask and are not visible by NumPy. This can lead to a window function or a `np.convolve()` giving flawed or incomplete results. Convert a Polars `Series` to a NumPy array with the `.to_numpy()` method. Missing values will be replaced by `np.nan` during the conversion. + +### Note on Performance + +The speed of ufuncs comes from being vectorized, compiled, and their ability to automatically use and return a pl.Series. That said, there's no inherent benefit in avoiding the use of `map_batches`. In fact, when polars sees an object that is a ufunc, it conveniently calls `map_batches`. 
In other words, even if you're trying to avoid calling `map_batches`, it's being called under the hood anyways. \ No newline at end of file diff --git a/py-polars/docs/requirements-docs.txt b/py-polars/docs/requirements-docs.txt index dfc9cb34f0b0..a8a389802f42 100644 --- a/py-polars/docs/requirements-docs.txt +++ b/py-polars/docs/requirements-docs.txt @@ -3,6 +3,7 @@ numpy pandas pyarrow +numba hypothesis==6.92.1 From 1812f7fb994394317644b0b5edb1564de4cef378 Mon Sep 17 00:00:00 2001 From: Dean MacGregor Date: Tue, 13 Feb 2024 15:30:12 -0500 Subject: [PATCH 3/9] ufunc update --- docs/user-guide/expressions/numpy.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/user-guide/expressions/numpy.md b/docs/user-guide/expressions/numpy.md index 3b0e92f9e1d0..cef23d544e69 100644 --- a/docs/user-guide/expressions/numpy.md +++ b/docs/user-guide/expressions/numpy.md @@ -3,7 +3,7 @@ Polars expressions support NumPy [ufuncs](https://numpy.org/doc/stable/reference/ufuncs.html). See [here](https://numpy.org/doc/stable/reference/ufuncs.html#available-ufuncs) for a list on all supported numpy functions. Additionally, SciPy offers a wide host of ufuncs. Specifically, the [scipy.special](https://docs.scipy.org/doc/scipy/reference/special.html#module-scipy.special) namespace has ufunc versions of many (possibly most) of what is available under stats. -This means that if a function is not provided by Polars, we can use NumPy and we still have fast columnar operation through the NumPy API. +This means that if a function is not provided by Polars, we can use NumPy and we still have fast columnar operation through the NumPy API. ufuncs have a hook that diverts their own execution when one of its inputs is a class with the [__array_ufunc__](https://numpy.org/doc/stable/reference/arrays.classes.html#special-attributes-and-methods) method. 
Polars Expr class has this method which allows ufuncs to be input directly in a context (`select`, `with_columns`, `agg`) with relevant expressions as the input. This syntax extends even to multiple input functions. ### Example @@ -35,4 +35,4 @@ Convert a Polars `Series` to a NumPy array with the `.to_numpy()` method. Missin ### Note on Performance -The speed of ufuncs comes from being vectorized, compiled, and their ability to automatically use and return a pl.Series. That said, there's no inherent benefit in avoiding the use of `map_batches`. In fact, when polars sees an object that is a ufunc, it conveniently calls `map_batches`. In other words, even if you're trying to avoid calling `map_batches`, it's being called under the hood anyways. \ No newline at end of file +The speed of ufuncs comes from being vectorized, and compiled. That said, there's no inherent benefit in using ufuncs just to avoid the use of `map_batches`. As mentioned above, ufuncs use a hook which gives polars the opportunity to run its own code before the ufunc is executed. In that way polars is still executing the ufunc with `map_batches`. \ No newline at end of file From 8a29f2502852de75015ad50ab367032ae265dac8 Mon Sep 17 00:00:00 2001 From: Dean MacGregor Date: Tue, 13 Feb 2024 15:34:37 -0500 Subject: [PATCH 4/9] fmt --- docs/user-guide/expressions/numpy.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/user-guide/expressions/numpy.md b/docs/user-guide/expressions/numpy.md index cef23d544e69..e0b91b1fa8c0 100644 --- a/docs/user-guide/expressions/numpy.md +++ b/docs/user-guide/expressions/numpy.md @@ -3,7 +3,7 @@ Polars expressions support NumPy [ufuncs](https://numpy.org/doc/stable/reference/ufuncs.html). See [here](https://numpy.org/doc/stable/reference/ufuncs.html#available-ufuncs) for a list on all supported numpy functions. Additionally, SciPy offers a wide host of ufuncs. 
Specifically, the [scipy.special](https://docs.scipy.org/doc/scipy/reference/special.html#module-scipy.special) namespace has ufunc versions of many (possibly most) of what is available under stats. -This means that if a function is not provided by Polars, we can use NumPy and we still have fast columnar operation through the NumPy API. ufuncs have a hook that diverts their own execution when one of its inputs is a class with the [__array_ufunc__](https://numpy.org/doc/stable/reference/arrays.classes.html#special-attributes-and-methods) method. Polars Expr class has this method which allows ufuncs to be input directly in a context (`select`, `with_columns`, `agg`) with relevant expressions as the input. This syntax extends even to multiple input functions. +This means that if a function is not provided by Polars, we can use NumPy and we still have fast columnar operation through the NumPy API. ufuncs have a hook that diverts their own execution when one of its inputs is a class with the [`__array_ufunc__`](https://numpy.org/doc/stable/reference/arrays.classes.html#special-attributes-and-methods) method. Polars Expr class has this method which allows ufuncs to be input directly in a context (`select`, `with_columns`, `agg`) with relevant expressions as the input. This syntax extends even to multiple input functions. ### Example @@ -35,4 +35,4 @@ Convert a Polars `Series` to a NumPy array with the `.to_numpy()` method. Missin ### Note on Performance -The speed of ufuncs comes from being vectorized, and compiled. That said, there's no inherent benefit in using ufuncs just to avoid the use of `map_batches`. As mentioned above, ufuncs use a hook which gives polars the opportunity to run its own code before the ufunc is executed. In that way polars is still executing the ufunc with `map_batches`. \ No newline at end of file +The speed of ufuncs comes from being vectorized, and compiled. 
That said, there's no inherent benefit in using ufuncs just to avoid the use of `map_batches`. As mentioned above, ufuncs use a hook which gives polars the opportunity to run its own code before the ufunc is executed. In that way polars is still executing the ufunc with `map_batches`. \ No newline at end of file From ce36292c676b7f9fe67e4aee5a30046fc45a43f6 Mon Sep 17 00:00:00 2001 From: Dean MacGregor Date: Tue, 13 Feb 2024 15:45:19 -0500 Subject: [PATCH 5/9] more formatting --- docs/user-guide/expressions/numpy.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/user-guide/expressions/numpy.md b/docs/user-guide/expressions/numpy.md index e0b91b1fa8c0..2a16cab7e917 100644 --- a/docs/user-guide/expressions/numpy.md +++ b/docs/user-guide/expressions/numpy.md @@ -1,9 +1,9 @@ # Numpy ufuncs Polars expressions support NumPy [ufuncs](https://numpy.org/doc/stable/reference/ufuncs.html). See [here](https://numpy.org/doc/stable/reference/ufuncs.html#available-ufuncs) -for a list on all supported numpy functions. Additionally, SciPy offers a wide host of ufuncs. Specifically, the [scipy.special](https://docs.scipy.org/doc/scipy/reference/special.html#module-scipy.special) namespace has ufunc versions of many (possibly most) of what is available under stats. +for a list on all supported numpy functions. Additionally, SciPy offers a wide host of ufuncs. Specifically, the [scipy.special](https://docs.scipy.org/doc/scipy/reference/special.html#module-scipy.special) namespace has ufunc versions of many (possibly most) of what is available under stats. -This means that if a function is not provided by Polars, we can use NumPy and we still have fast columnar operation through the NumPy API. ufuncs have a hook that diverts their own execution when one of its inputs is a class with the [`__array_ufunc__`](https://numpy.org/doc/stable/reference/arrays.classes.html#special-attributes-and-methods) method. 
Polars Expr class has this method which allows ufuncs to be input directly in a context (`select`, `with_columns`, `agg`) with relevant expressions as the input. This syntax extends even to multiple input functions. +This means that if a function is not provided by Polars, we can use NumPy and we still have fast columnar operation through the NumPy API. ufuncs have a hook that diverts their own execution when one of its inputs is a class with the [`__array_ufunc__`](https://numpy.org/doc/stable/reference/arrays.classes.html#special-attributes-and-methods) method. Polars Expr class has this method which allows ufuncs to be input directly in a context (`select`, `with_columns`, `agg`) with relevant expressions as the input. This syntax extends even to multiple input functions. ### Example @@ -35,4 +35,4 @@ Convert a Polars `Series` to a NumPy array with the `.to_numpy()` method. Missin ### Note on Performance -The speed of ufuncs comes from being vectorized, and compiled. That said, there's no inherent benefit in using ufuncs just to avoid the use of `map_batches`. As mentioned above, ufuncs use a hook which gives polars the opportunity to run its own code before the ufunc is executed. In that way polars is still executing the ufunc with `map_batches`. \ No newline at end of file +The speed of ufuncs comes from being vectorized, and compiled. That said, there's no inherent benefit in using ufuncs just to avoid the use of `map_batches`. As mentioned above, ufuncs use a hook which gives polars the opportunity to run its own code before the ufunc is executed. In that way polars is still executing the ufunc with `map_batches`. 
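Review note on the `__array_ufunc__` wording in PATCH 3/9 through 5/9: the hook is easier to follow with a toy class. The sketch below is only an illustration of NumPy's override protocol. `Deferred` and its `apply` method are hypothetical names, not part of Polars, and the sketch does not show how Polars routes the call to `map_batches` internally. It assumes only that NumPy is installed.

```python
import numpy as np


class Deferred:
    """Hypothetical stand-in for an expression object (not Polars' Expr)."""

    def __init__(self, steps=()):
        # Ufuncs recorded so far, in the order they should run.
        self.steps = tuple(steps)

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # NumPy calls this instead of executing the ufunc itself,
        # because one of the inputs defines __array_ufunc__.
        if method != "__call__" or len(inputs) != 1:
            return NotImplemented
        # Record the ufunc rather than running it now.
        return Deferred(self.steps + (ufunc,))

    def apply(self, values):
        # Run the recorded ufuncs on real data, one after another.
        out = np.asarray(values, dtype=float)
        for ufunc in self.steps:
            out = ufunc(out)
        return out


# Calling ufuncs on the object builds a plan instead of computing anything.
expr = np.sqrt(np.abs(Deferred()))
result = expr.apply([-4.0, 9.0, -16.0])

print([u.__name__ for u in expr.steps])
print(result)
```

The point of the sketch is that the object, not NumPy, decides when the ufunc runs. That matches the patch's claim that writing `np.log(pl.col("a"))` is a syntactic convenience rather than a way to bypass the library's own execution path.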
From 583bd83bea6fdf718831b39fd2898cc20b4a6256 Mon Sep 17 00:00:00 2001 From: Dean MacGregor Date: Tue, 13 Feb 2024 16:00:33 -0500 Subject: [PATCH 6/9] requirements --- docs/requirements.txt | 2 +- py-polars/docs/requirements-docs.txt | 1 - 2 files changed, 1 insertion(+), 2 deletions(-) diff --git a/docs/requirements.txt b/docs/requirements.txt index e0416d67440b..e24c3641198c 100644 --- a/docs/requirements.txt +++ b/docs/requirements.txt @@ -2,7 +2,7 @@ pandas pyarrow graphviz matplotlib - +numba mkdocs-material==9.5.2 mkdocs-macros-plugin==1.0.4 material-plausible-plugin==0.2.0 diff --git a/py-polars/docs/requirements-docs.txt b/py-polars/docs/requirements-docs.txt index a8a389802f42..dfc9cb34f0b0 100644 --- a/py-polars/docs/requirements-docs.txt +++ b/py-polars/docs/requirements-docs.txt @@ -3,7 +3,6 @@ numpy pandas pyarrow -numba hypothesis==6.92.1 From b5ff69200d8fdf67b470426b9607bc3816b9c187 Mon Sep 17 00:00:00 2001 From: Dean MacGregor Date: Tue, 13 Feb 2024 16:12:59 -0500 Subject: [PATCH 7/9] both req --- py-polars/docs/requirements-docs.txt | 1 + 1 file changed, 1 insertion(+) diff --git a/py-polars/docs/requirements-docs.txt b/py-polars/docs/requirements-docs.txt index dfc9cb34f0b0..a8a389802f42 100644 --- a/py-polars/docs/requirements-docs.txt +++ b/py-polars/docs/requirements-docs.txt @@ -3,6 +3,7 @@ numpy pandas pyarrow +numba hypothesis==6.92.1 From 67141a986a337f6cee81b64d95d1141abb498dda Mon Sep 17 00:00:00 2001 From: Dean MacGregor Date: Tue, 13 Feb 2024 16:24:02 -0500 Subject: [PATCH 8/9] lowercase --- docs/user-guide/expressions/numpy.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/user-guide/expressions/numpy.md b/docs/user-guide/expressions/numpy.md index 2a16cab7e917..97b8a1b241e6 100644 --- a/docs/user-guide/expressions/numpy.md +++ b/docs/user-guide/expressions/numpy.md @@ -15,7 +15,7 @@ This means that if a function is not provided by Polars, we can use NumPy and we ## Numba -[NumBa](https://numba.pydata.org/) is 
an open source JIT compiler that allows you to create your own ufuncs entirely within python. The key is to use the [@guvectorize](https://numba.readthedocs.io/en/stable/user/vectorize.html#the-guvectorize-decorator) decorator. One popular use case is conditional cumulative functions. For example, suppose you want to take a cumulative sum but have it reset whenever it gets to a threshold. +[Numba](https://numba.pydata.org/) is an open source JIT compiler that allows you to create your own ufuncs entirely within python. The key is to use the [@guvectorize](https://numba.readthedocs.io/en/stable/user/vectorize.html#the-guvectorize-decorator) decorator. One popular use case is conditional cumulative functions. For example, suppose you want to take a cumulative sum but have it reset whenever it gets to a threshold. ### Example From 56e443cd5e5454a606d7c9fdd36f175047b29674 Mon Sep 17 00:00:00 2001 From: Dean MacGregor Date: Tue, 13 Feb 2024 16:49:21 -0500 Subject: [PATCH 9/9] more_req --- py-polars/requirements-dev.txt | 1 + 1 file changed, 1 insertion(+) diff --git a/py-polars/requirements-dev.txt b/py-polars/requirements-dev.txt index aeb9d3be53bd..be8c4ac15c59 100644 --- a/py-polars/requirements-dev.txt +++ b/py-polars/requirements-dev.txt @@ -60,6 +60,7 @@ hypothesis==6.97.4 pytest==8.0.0 pytest-cov==4.1.0 pytest-xdist==3.5.0 +numba # Need moto.server to mock s3fs - see: https://github.com/aio-libs/aiobotocore/issues/755 moto[s3]==5.0.0
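Review note on `numba-example.py` from PATCH 2/9: a plain-Python reference makes it easy to check what the `cum_sum_reset` kernel should return. The function below mirrors the loop in the patch line by line. The name `cum_sum_reset_reference` and the sample inputs are hypothetical and chosen only for illustration. It is a sketch for checking expectations, not a replacement for the Numba version.

```python
def cum_sum_reset_reference(values, threshold):
    """Plain-Python mirror of the loop in the patch's `cum_sum_reset` kernel."""
    if not values:
        return []
    # As in the patch, the first element is copied without a threshold check.
    result = [values[0]]
    for value in values[1:]:
        running = value + result[-1]
        # Restart from the current value once the running total
        # reaches the threshold; otherwise keep accumulating.
        result.append(value if running >= threshold else running)
    return result


# Small values relative to the threshold show both branches.
small = cum_sum_reset_reference([1, 2, 3, 1, 1, 4], 5)
print(small)

# The data and threshold used in the patch: column "a" with threshold 5.
patch_input = cum_sum_reset_reference([10, 9, 8, 7], 5)
print(patch_input)
```

With the patch's own data every running total is already at or above the threshold of 5, so the reference returns the input unchanged. A larger threshold or smaller values in the docs example would make the reset behaviour visible in the rendered output.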