Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add list.pad_start() #20674

Open
wants to merge 31 commits into
base: main
Choose a base branch
from
Open

Conversation

etiennebacher
Copy link
Contributor

@etiennebacher etiennebacher commented Jan 12, 2025

Fixes #10283

Questions:

  1. In the linked issue above, the workaround lets the user specify the final length of each sublist but I use the length of the longest sublist and pad all other sublist to match this length instead. Is this correct? -> added a length argument instead of automatically taking longest length
  2. I feel like there's too much code duplication in the match statement, but it's not obvious to me how to reduce it. Any idea?
  3. Are the failures in new-streaming expected? -> solved, related to point 1

@etiennebacher etiennebacher changed the title [WIP] feat: Add list.pad_start() feat: Add list.pad_start() Jan 15, 2025
@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars and removed title needs formatting labels Jan 15, 2025
Copy link

codecov bot commented Jan 15, 2025

Codecov Report

Attention: Patch coverage is 99.23664% with 1 line in your changes missing coverage. Please review.

Project coverage is 79.29%. Comparing base (084ddde) to head (5d7fe4d).
Report is 8 commits behind head on main.

Files with missing lines Patch % Lines
crates/polars-plan/src/dsl/function_expr/list.rs 88.88% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #20674      +/-   ##
==========================================
+ Coverage   79.10%   79.29%   +0.18%     
==========================================
  Files        1583     1583              
  Lines      225265   225676     +411     
  Branches     2586     2586              
==========================================
+ Hits       178188   178941     +753     
+ Misses      46487    46145     -342     
  Partials      590      590              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@etiennebacher
Copy link
Contributor Author

etiennebacher commented Jan 15, 2025

It should be possible to support more types (date, datetime, duration, maybe categorical and enum also?), but I'd like to have some feedback on the current implementation before, cf the questions in the first post.

In particular, for now it casts categoricals to string in this case:

pl.DataFrame({"a": [["a"], ["a", "b"]]}, schema={"a": pl.List(pl.Categorical)}).select(
    pl.col("a").list.pad_start("foo", length=2)
)

# shape: (2, 1)
# ┌──────────────┐
# │ a            │
# │ ---          │
# │ list[str]    │
# ╞══════════════╡
# │ ["foo", "a"] │
# │ ["a", "b"]   │
# └──────────────┘

which I'm not sure is correct.

@orlp
Copy link
Collaborator

orlp commented Jan 24, 2025

@etiennebacher One bit of feedback is that the proposal in the original issue, and Expr.str.pad_start/Expr.str.pad_end take a width to pad to. This is important so that the operation can execute in a streaming fashion instead of first having to find the maximum width.

@etiennebacher etiennebacher marked this pull request as draft January 24, 2025 19:53
@deanm0000
Copy link
Collaborator

wrt to making it a specified width, could that input be an Expr input so if I'm not streaming I can use .list.len().max()? @orlp would a lit satisfy the streaming engine in that case?

Also, does it make sense for this to return an Array instead of a List? I just assume that all the use cases for making everything a unified width would want an array as the next step.

@deanm0000
Copy link
Collaborator

deanm0000 commented Jan 24, 2025

I was playing around with how to avoid match and came up with this:

    let fill_s = fill_value.as_materialized_series();
    let out = ca.apply_values(|inner_series| {
        let inner_series = inner_series.explode().unwrap();
        if inner_series.len() >= width {
            inner_series.slice(0i64, width)  // is this right, should it truncate or just return as-is?
        } else {
            // this assumes fill can't be too small, need to repeat if scalar
            let need_fill = width - inner_series.len();
            let mut fill = fill_s.slice(0i64, need_fill).clone();
            fill.append(&inner_series).unwrap();
            if fill.len() != width {
                panic!("fill+original!=width")
            }
            let fill = fill; // to undo mut
            fill
        }
    });

I don't know the difference between all the apply methods so not sure if that should be apply_amortized or something else.

@orlp
Copy link
Collaborator

orlp commented Jan 25, 2025

@deanm0000

wrt to making it a specified width, could that input be an Expr input so if I'm not streaming I can use .list.len().max()?

Yes, and we should also allow that for Expr.str.pad_start and Expr.str.pad_end. PR's welcome :)

Also, does it make sense for this to return an Array instead of a List?

No, because lists longer than the specified length should stay that way, similar to Expr.str.pad_start.

@etiennebacher
Copy link
Contributor Author

No, because lists longer than the specified length should stay that way, similar to Expr.str.pad_start.

I hadn't understood that in the original issue. I thought all the sub-lists in the output were guaranteed to have the same length. I guess it makes sense if one can pass .list.len().max() as width.

length
};
let length = length.strict_cast(&DataType::UInt64)?;
let mut length = length.u64()?.into_iter();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't find a way to pass this in zip_and_apply_amortized(), which is why I make it an iterator here and call next() inside zip_and_apply_amortized(). It works but feels clunky.

let binding = s.unwrap();
let s: &Series = binding.as_ref();
let ca = s.i64().unwrap();
let length = length.next().unwrap().unwrap() as usize;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as before, works but feels clunky.

let mut fill_values;
match length.cmp(&ca.len()) {
Ordering::Equal | Ordering::Less => {
fill_values = ca.clone();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the clone() here ok or a no-go? If the latter, what is the alternative?

};
let fill_value = fill_value.cast(&super_type)?;

let out: ListChunked = match super_type {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This entire match should probably be reduced by calling one function per datatype, which will be also used for list.pad_end() (in another PR).

@@ -107,6 +109,8 @@ impl ListFunction {
NUnique => mapper.with_dtype(IDX_DTYPE),
#[cfg(feature = "list_to_struct")]
ToStruct(args) => mapper.try_map_dtype(|x| args.get_output_dtype(x)),
#[cfg(feature = "list_pad")]
PadStart => mapper.with_same_dtype(),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is wrong because the output dtype can change because the inner dtype and the padding type are cast to the supertype.

However, it does pass the tests. Why is that? Is there a test I can add for this?

@etiennebacher etiennebacher marked this pull request as ready for review February 3, 2025 16:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add padding method to List datatype
3 participants