diff --git a/docs/requirements.txt b/docs/requirements.txt
index dccf92dd62d1..d64ab525bedd 100644
--- a/docs/requirements.txt
+++ b/docs/requirements.txt
@@ -2,6 +2,7 @@ pandas
 pyarrow
 graphviz
 matplotlib
+numba
 seaborn
 plotly
 altair
diff --git a/docs/src/python/user-guide/expressions/numba-example.py b/docs/src/python/user-guide/expressions/numba-example.py
new file mode 100644
index 000000000000..acd6c10c2b3b
--- /dev/null
+++ b/docs/src/python/user-guide/expressions/numba-example.py
@@ -0,0 +1,17 @@
+import polars as pl
+import numba as nb
+
+df = pl.DataFrame({"a": [10, 9, 8, 7]})
+
+
+@nb.guvectorize([(nb.int64[:], nb.int64, nb.int64[:])], "(n),()->(n)")
+def cum_sum_reset(x, y, res):
+    res[0] = x[0]
+    for i in range(1, x.shape[0]):
+        res[i] = x[i] + res[i - 1]
+        if res[i] >= y:
+            res[i] = x[i]
+
+
+out = df.select(cum_sum_reset(pl.all(), 5))
+print(out)
diff --git a/docs/user-guide/expressions/numpy.md b/docs/user-guide/expressions/numpy.md
index 6500e87b5207..97b8a1b241e6 100644
--- a/docs/user-guide/expressions/numpy.md
+++ b/docs/user-guide/expressions/numpy.md
@@ -1,9 +1,9 @@
-# Numpy
+# NumPy ufuncs
 
 Polars expressions support NumPy [ufuncs](https://numpy.org/doc/stable/reference/ufuncs.html). See [here](https://numpy.org/doc/stable/reference/ufuncs.html#available-ufuncs)
-for a list on all supported numpy functions.
+for a list of all supported NumPy functions. Additionally, SciPy offers a wide range of ufuncs. Specifically, the [scipy.special](https://docs.scipy.org/doc/scipy/reference/special.html#module-scipy.special) namespace has ufunc versions of many (possibly most) of the functions available under `scipy.stats`.
 
-This means that if a function is not provided by Polars, we can use NumPy and we still have fast columnar operation through the NumPy API.
+This means that if a function is not provided by Polars, we can use NumPy and still get fast columnar operations through the NumPy API. Ufuncs have a hook that diverts their own execution when one of their inputs is an instance of a class with the [`__array_ufunc__`](https://numpy.org/doc/stable/reference/arrays.classes.html#special-attributes-and-methods) method. The Polars `Expr` class has this method, which allows ufuncs to be used directly in a context (`select`, `with_columns`, `agg`) with the relevant expressions as input. This syntax extends even to functions with multiple inputs.
 
 ### Example
 
@@ -13,6 +13,18 @@ This means that if a function is not provided by Polars, we can use NumPy and we
 --8<-- "python/user-guide/expressions/numpy-example.py"
 ```
 
+## Numba
+
+[Numba](https://numba.pydata.org/) is an open-source JIT compiler that allows you to create your own ufuncs entirely within Python. The key is to use the [@guvectorize](https://numba.readthedocs.io/en/stable/user/vectorize.html#the-guvectorize-decorator) decorator. One popular use case is conditional cumulative functions. For example, suppose you want to take a cumulative sum but have it reset whenever it reaches a threshold.
+
+### Example
+
+{{code_block('user-guide/expressions/numba-example',api_functions=['DataFrame'])}}
+
+```python exec="on" result="text" session="user-guide/numpy"
+--8<-- "python/user-guide/expressions/numba-example.py"
+```
+
 ### Interoperability
 
 Polars `Series` have support for NumPy universal functions (ufuncs). Element-wise functions such as `np.exp()`, `np.cos()`, `np.div()`, etc. all work with almost zero overhead.
@@ -20,3 +32,7 @@ Polars `Series` have support for NumPy universal functions (ufuncs). Element-wis
 However, as a Polars-specific remark: missing values are a separate bitmask and are not visible by NumPy. This can lead to a window function or a `np.convolve()` giving flawed or incomplete results.
 
 Convert a Polars `Series` to a NumPy array with the `.to_numpy()` method. Missing values will be replaced by `np.nan` during the conversion.
+
+### Note on Performance
+
+The speed of ufuncs comes from being vectorized and compiled. That said, there's no inherent benefit in using ufuncs just to avoid the use of `map_batches`. As mentioned above, ufuncs use a hook which gives Polars the opportunity to run its own code before the ufunc is executed. In that way, Polars is still executing the ufunc with `map_batches`.
diff --git a/docs/user-guide/expressions/user-defined-functions.md b/docs/user-guide/expressions/user-defined-functions.md
index 882cc11c6ac1..3387be994cb0 100644
--- a/docs/user-guide/expressions/user-defined-functions.md
+++ b/docs/user-guide/expressions/user-defined-functions.md
@@ -18,8 +18,7 @@ These functions have an important distinction in how they operate and consequent
 A `map_batches` passes the `Series` backed by the `expression` as is.
 
 `map_batches` follows the same rules in both the `select` and the `group_by` context, this will
-mean that the `Series` represents a column in a `DataFrame`. Note that in the `group_by` context, that column is not yet
-aggregated!
+mean that the `Series` represents a column in a `DataFrame`. To be clear, **using a `group_by` or `over` with `map_batches` will return results as though there were no groups at all.**
 
 Use cases for `map_batches` are for instance passing the `Series` in an expression to a third party library. Below we show how
 we could use `map_batches` to pass an expression column to a neural network model.
diff --git a/py-polars/docs/requirements-docs.txt b/py-polars/docs/requirements-docs.txt
index f1f88d7e2940..3efb78f29f87 100644
--- a/py-polars/docs/requirements-docs.txt
+++ b/py-polars/docs/requirements-docs.txt
@@ -3,6 +3,7 @@
 numpy
 pandas
 pyarrow
+numba
 
 hypothesis==6.97.4
 
diff --git a/py-polars/requirements-dev.txt b/py-polars/requirements-dev.txt
index aeb9d3be53bd..be8c4ac15c59 100644
--- a/py-polars/requirements-dev.txt
+++ b/py-polars/requirements-dev.txt
@@ -60,6 +60,7 @@ hypothesis==6.97.4
 pytest==8.0.0
 pytest-cov==4.1.0
 pytest-xdist==3.5.0
+numba
 # Need moto.server to mock s3fs - see: https://github.com/aio-libs/aiobotocore/issues/755
 moto[s3]==5.0.0