Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(python): Add require_all parameter to the by_name column selector #15028

Merged
merged 2 commits into from
May 18, 2024

Conversation

alexander-beedie
Copy link
Collaborator

@alexander-beedie alexander-beedie commented Mar 13, 2024

Closes #15023.
Closes #16219.

Adds a new (opt-in) "match_any" param to the by_name selector.
Default behaviour remains unchanged.

Update: have changed the parameter name from "any" (and then "match_any"), to "require_all" and adjusted the default value accordingly 😎

Before

Default behaviour; expect all given columns to be present:

import polars as pl
import polars.selectors as cs

df = pl.DataFrame({
    "foo": ["x", "y"],
    "bar": [123, 456],
    "baz": [2.0, 5.5],
})
df.select( cs.by_name("stroopwafel","baz","!!") )
# ColumnNotFoundError: stroopwafel

After

Enable permissive matching on any of the given columns:

df.select( cs.by_name("stroopwafel","baz","!!", require_all=False) )
# shape: (2, 1)
# ┌─────┐
# │ baz │
# │ --- │
# │ f64 │
# ╞═════╡
# │ 2.0 │
# │ 5.5 │
# └─────┘

@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars labels Mar 13, 2024
@gab23r
Copy link
Contributor

gab23r commented Mar 13, 2024

any being a python keyword isn't an issue ?
I think the argument name has to be carefully chosen, because it will appear in other places like: #11913 (comment)

I was thinking of using strict, it is general and clear to user that it will not raise. But in the meantime strict is today link (only ?) to casting operations, so I am not really sure 🤔

@alexander-beedie
Copy link
Collaborator Author

alexander-beedie commented Mar 13, 2024

any being a python keyword isn't an issue ?

It actually isn't, as we don't overwrite the global "any", or use "any" in the function. However, I do usually use "any_" as per that comment, as it's better practice overall - should probably do so here too. I'll change that when I finish in the gym ;)

I was thinking of using strict

Hmm, I don't think it's as obvious in this context what "strict" would imply? Not as obvious (to me) as "any" (or "any_"), at least.

@gab23r
Copy link
Contributor

gab23r commented Mar 13, 2024

IMHO, any is not really clear. When I read it the first time, I really thought that it will select only one column ( like randomly ) within the list of columns, which is a weird use case I admit 😅. ignore_missing would be cristal clear, i guess. Have a good gym time 😁

@alexander-beedie alexander-beedie changed the title feat(python): add "any" param/option to the by_name selector feat(python): add "any_" param/option to the by_name selector Mar 13, 2024
@alexander-beedie
Copy link
Collaborator Author

alexander-beedie commented Mar 13, 2024

IMHO, any is not really clear. When I read it the first time, I really thought that it will select only one column ( like randomly )

Well now you're making me want to add a cs.random(<n>) selector... ;)

Copy link

codecov bot commented Mar 13, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 80.82%. Comparing base (98a2d9b) to head (8f40dd4).
Report is 11 commits behind head on main.

Current head 8f40dd4 differs from pull request most recent head 1639789

Please upload reports for the commit 1639789 to get more accurate results.

Additional details and impacted files
@@           Coverage Diff           @@
##             main   #15028   +/-   ##
=======================================
  Coverage   80.82%   80.82%           
=======================================
  Files        1393     1393           
  Lines      179341   179345    +4     
  Branches     2918     2920    +2     
=======================================
+ Hits       144955   144964    +9     
+ Misses      33884    33879    -5     
  Partials      502      502           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@mcrumiller
Copy link
Contributor

+1 for ignore_missing, I was going to suggest that anyway. Having any underscore after any_ is just...I don't like it.

@stinodego
Copy link
Contributor

This is similar to the discussion in #11913

I'm not sure there needs to be a parameter for this, but if there is one, I wouldn't want to call it any_ as it is not descriptive enough.

Copy link
Contributor

@stinodego stinodego left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment above (just tagging this with 'request changes' to organize my personal backlog).

@ritchie46
Copy link
Member

Any updates on this one? @alexander-beedie

@alexander-beedie alexander-beedie marked this pull request as draft May 14, 2024 13:28
@alexander-beedie
Copy link
Collaborator Author

alexander-beedie commented May 14, 2024

Any updates on this one? @alexander-beedie

@ritchie46: Yup, just updated it and rebased against the latest code; left this dangling a bit longer than planned! I believe we do want a parameter here as it enables a clean selection pattern - apparently it just took me two months to think of an improved parameter name :))

cs.by_name("baz","foo","zap", match_any=True)

This reads a lot better than before; you're matching by name, and it'll match any of the names. Final feels self-explanatory now? (any_ by itself was indeed too vague).

@alexander-beedie alexander-beedie marked this pull request as ready for review May 14, 2024 13:59
@alexander-beedie alexander-beedie requested a review from reswqa as a code owner May 14, 2024 13:59
@alexander-beedie alexander-beedie changed the title feat(python): add "any_" param/option to the by_name selector feat(python): add "match_any" param/option to the by_name selector May 14, 2024
@alexander-beedie alexander-beedie added the A-selectors Area: column selectors label May 14, 2024
Copy link
Contributor

@stinodego stinodego left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All right, I guess a parameter here could be helpful.

match_any is definitely more clear than any_, but it still doesn't make sense to me.

Without the flag, we're saying "Select the columns A, B, and C."

With the new flag, we're saying "Select columns A, B, and C, and if any of those don't exist, you can just ignore them.".

With that in mind, wouldn't ignore_missing make much more sense? match_any could mean anything. I would read it as "Select any column that matches these names", which is in fact the default behavior. It tells me nothing about what should happen if one of those columns doesn't exist.

py-polars/polars/selectors.py Outdated Show resolved Hide resolved
@dangotbanned
Copy link
Contributor

dangotbanned commented May 15, 2024

cs.by_name("baz","foo","zap", match_any=True)

This reads a lot better than before; you're matching by name, and it'll match any of the names. Final feels self-explanatory now? (any_

Really happy to see something like this @alexander-beedie

Seeing the new name, could this be generalised to the below?
I'd been using this under the name matches_or, but changed after this PR.

def matches_any(*names: str) -> SelectorType:
    """
    Select any columns with an exact match in `names`.

    Where no columns match, no operation is performed.
    """
    return cs.matches(f"^{"|".join(names)}$")

Where names could be either a full column name (like by_name) or a regular expression (like matches).

@alexander-beedie
Copy link
Collaborator Author

alexander-beedie commented May 15, 2024

With the new flag, we're saying "Select columns A, B, and C, and if any of those don't exist, you can just ignore them.".

No, we're saying "match any of A, B or C", which is even more straightforward.

With that in mind, wouldn't ignore_missing make much more sense?

Not to me; labelling columns that weren't matched as "missing" isn't correct, as "missing" implies that they were expected to be there. But if you have set "match_any" then you clearly don't expect some of them to be there, so they aren't missing - they simply aren't present, which isn't the same thing 😜

match_any could mean anything. I would read it as "Select any column that matches these names", which is in fact the default behavior.

That's not the default behaviour though; the default behaviour requires that you match all columns, not match any column. You only get to match any column if you set... match_any (which is why I think this parameter name is pretty clear ;)

@alexander-beedie
Copy link
Collaborator Author

alexander-beedie commented May 15, 2024

Really happy to see something like this @alexander-beedie

😎👍

Seeing the new name, could this be generalised to the below? I'd been using this under the name matches_or, but changed after this PR.

Not really, as it would tread on the existing cs.matches selector, which does regex patterns rather than column names - two different behaviours/use-cases. I don't think we should look to blend them together.

(Your pattern match looks like it would fail on names with special chars in them? Could tweak like so: .join(re.escape(nm) for nm in names)1).

Footnotes

  1. Or use:from polars._utils.various import re_escape so that the escapes better-match the underlying Rust regex crate. However, this is a private function so use judiciously... ;)

@dangotbanned
Copy link
Contributor

Seeing the new name, could this be generalised to the below? I'd been using this under the name matches_or, but changed after this PR.

Not really, as it would tread on the existing cs.matches selector, which does regex patterns rather than column names - two different behaviours/use-cases. I don't think we should look to blend them together.

That makes sense. Looking forward to using the match_any parameter.

(Your pattern match looks like it would fail on names with special chars in them? Could tweak like so: .join(re.escape(nm) for nm in names).

Ooh good find! Thanks I hadn't tried this yet with any cases containing special chars.
All the references I've found are using it exactly like by_name(..., match_any=True).

I'm only just seeing now that matches (quite rightly) doesn't already do this like contains, starts/ends_with.

@alexander-beedie
Copy link
Collaborator Author

alexander-beedie commented May 15, 2024

Rebased / resolved conflicts...
(Update: and again 😎)

@stinodego
Copy link
Contributor

I really don't want to merge this with the current parameter name. It's extremely unclear to me. What are you matching? Any of what? What would it mean to set it to False, or True?

I went ahead and asked ChatGPT for help:

ME: I have a Python function that selects columns by name. It accepts a list of column names and returns the matching columns. If any of the column names do not have a matching column, it raises an error. I want to add a boolean parameter, which defaults to False, that will change the functionality to ignore column names without matches, instead of raising an error. What should such a parameter be called?

ChatGPT: A suitable name for the boolean parameter could be ignore_missing. This name clearly indicates its purpose: to control whether the function should ignore missing columns or raise an error.

Of course the prompt is leading it in that direction - but if you can show me a prompt that returns match_any I might be able to better understand where you're coming from.

I propose we change it to ignore_missing and this can be merged.

@alexander-beedie
Copy link
Collaborator Author

alexander-beedie commented May 17, 2024

I propose we change it to ignore_missing and this can be merged.

I dislike the involvement of "missing" as it is incorrect. Unmatched columns are not missing, as that word implies an expectation that they should be there, which is wrong; if you expected them to be there then you wouldn't set a param that allows them not to be:

missing /ˈmɪsɪŋ/ (adjective)
(of a thing) not able to be found because it is not in its expected place.

In general l also prefer parameter names to indicate something active/explicit...

  • match_any
    "I want to match any of the names"

...rather than passive/implicit:

  • ignore_missing
    "if some names aren't present then I don't want to raise an error"

ChatGPT 4o result:

if you can show me a prompt that returns match_any I might be able to better understand where you're coming from.

chatgpt4o_prompt_param_name

What are you matching? Any of what?

It feels somewhat self-evident from the selector name; by_name is clearly matching by name, and settingmatch_any=True means I want to match any of those names 😅🤷

@etiennebacher
Copy link
Contributor

In case it helps, I have two additional suggestions for this argument:

  • ignore_absent: IMO absent doesn't give an "opinion" (I don't have a better word for this), it doesn't say that the column should be there (as missing implies), simply that it isn't in the data.

  • ignore_nonexistent: more or less the same explanation, "select those columns while ignoring those who don't exist"

(One could also replace ignore_ by allow_)

@mcrumiller
Copy link
Contributor

What about the somewhat vague but rather all-purpose strict, which almost universally means "don't coerce or ignore anything that isn't exactly specified"? We use it elsewhere and I think it's applicable here.

@alexander-beedie alexander-beedie changed the title feat(python): add "match_any" param/option to the by_name selector feat(python): add "requires_all" param/option to the by_name selector May 17, 2024
@alexander-beedie alexander-beedie changed the title feat(python): add "requires_all" param/option to the by_name selector feat(python): add "require_all" param/option to the by_name selector May 17, 2024
@alexander-beedie
Copy link
Collaborator Author

alexander-beedie commented May 17, 2024

@stinodego: ok, here we go!

Thanks for the ideas all; we've discussed it thoroughly and are going with require_all; it's active/explicit like "match_any", avoids the implications of "ignore_missing", but is (we believe!) more self-explanatory than either of those - you either require all of the names to be present or you don't✌️

@stinodego stinodego changed the title feat(python): add "require_all" param/option to the by_name selector feat(python): Add require_all parameter to the by_name column selector May 18, 2024
@stinodego stinodego merged commit e440b9d into pola-rs:main May 18, 2024
16 checks passed
@alexander-beedie alexander-beedie deleted the by-name-selector branch May 18, 2024 15:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-selectors Area: column selectors enhancement New feature or an improvement of an existing feature python Related to Python Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

non strict columns selection when specifying column names Add column selector equivalent to any_of
7 participants