docs(python): More accurate and helpful docs for user defined functions #15194

Merged
merged 25 commits into from
Jun 25, 2024

Conversation

itamarst
Contributor

@itamarst itamarst commented Mar 20, 2024

Fixes #14699
Fixes #14508

  1. Focuses purely on Python, as in the title of the article, with Rust APIs moved elsewhere.
  2. Doesn't give erroneous information about the limits of map_batches(), given implementation changes.
  3. Doesn't mention map_elements(), for simplicity's sake. This can eventually be added in a new section, presuming I can figure out why map_elements() even exists.
  4. Adds demonstration of how to write fast user functions.
  5. Moved around some docs (using struct for multi-column support).
  6. Tiny updates to NumPy docs, bit redundant but that is fine probably.

This adds Numba as documentation requirement, so when e.g. Python 3.13 is released, docs won't be buildable on 3.13 until Numba has a new release.

@itamarst itamarst marked this pull request as ready for review March 20, 2024 18:47
@itamarst itamarst changed the title from "Better docs for user defined functions" to "docs(python): More accurate and helpful docs for user defined functions" Mar 20, 2024
@github-actions github-actions bot added the documentation and python labels Mar 20, 2024

codecov bot commented Mar 20, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 81.41%. Comparing base (332e40a) to head (ca8ba41).

Current head ca8ba41 differs from pull request most recent head 0778a44

Please upload reports for the commit 0778a44 to get more accurate results.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #15194      +/-   ##
==========================================
+ Coverage   80.81%   81.41%   +0.59%     
==========================================
  Files        1464     1425      -39     
  Lines      192019   187964    -4055     
  Branches     2743     2704      -39     
==========================================
- Hits       155185   153027    -2158     
+ Misses      36323    34440    -1883     
+ Partials      511      497      -14     


@MarcoGorelli
Collaborator

Thanks for working on this!

> Doesn't mention map_elements(), for simplicity's sake. This can eventually be added in a new section, presuming I can figure out why map_elements() even exists.

The purpose of map_elements is for elementwise UDFs. If you're calling a 3rd party library (e.g. libpostal) which works on one element at a time, then that's a good use case for map_elements. It's a last-resort, really, but I think it does need mentioning for those cases where nothing else will do. Polars can also (sometimes) warn the user when they could have done better 😎

Regarding Numba: given that it's quite dangerous to use it when the Series has missing values, I don't think it should currently be recommended in the user guide. I'd suggest either keeping it out for now, or sorting out Numba first

@itamarst itamarst requested a review from reswqa as a code owner May 29, 2024 14:16
@itamarst
Contributor Author

Merged forward, ready for review again @MarcoGorelli.

Collaborator

@MarcoGorelli MarcoGorelli left a comment

thanks, looking better

I think map_elements needs mentioning though. perhaps you could even start with this one? Something like:

  1. show that you can do pl.col('a').map_elements(lambda x: math.log(x))
  2. then, say that it's better to use vectorised operations, and if you have a numpy function you want to apply, you can pass the column all in at once with pl.col('a').map_batches(np.log)
  3. then, show how you can use numba to pass a custom vectorised function

@itamarst
Contributor Author

Thank you, I will address these.

@itamarst itamarst requested a review from MarcoGorelli June 5, 2024 14:13
@itamarst
Contributor Author

itamarst commented Jun 5, 2024

@MarcoGorelli back to you, either addressed or commented above.

Collaborator

@MarcoGorelli MarcoGorelli left a comment

Seeing

FAILED tests/docs/test_user_guide.py::test_run_python_snippets[path10] - polars.exceptions.PolarsInefficientMapWarning: 
Expr.map_elements is significantly slower than the native expressions API.
Only use if you absolutely CANNOT implement your logic otherwise.
Replace this expression...
  - pl.col("values").map_elements(my_log)
with this one instead:
  + pl.col("values").log()

pop up in the CI logs brought a smile to my face 😄

Collaborator

@MarcoGorelli MarcoGorelli left a comment

Thanks @itamarst , good one!

I made some minor comments, and made a commit to address them, as they're really minor

Leaving open a little in case others have objections, will merge if they don't


EDIT: I can't merge this one, but I think it's ready

@itamarst
Contributor Author

@MarcoGorelli I merged forward to deal with conflicts.

@itamarst
Contributor Author

@MarcoGorelli can we get this merged?

@ritchie46
Member

Can you do a rebase?

@itamarst
Contributor Author

Will do that now.

@itamarst
Contributor Author

My editor is having issues, so I still need to manually verify that the merge was correct.

@ritchie46 ritchie46 enabled auto-merge (squash) June 25, 2024 17:15
auto-merge was automatically disabled June 25, 2024 17:21

Head branch was pushed to by a user without write access

@itamarst
Contributor Author

OK, I think it's good now. Note that auto-merge won't happen because I edited something, so you'll need to re-enable it if you're happy.

@ritchie46 ritchie46 merged commit 7763bd4 into pola-rs:main Jun 25, 2024
18 of 19 checks passed
Labels
documentation, python
Projects
None yet
4 participants