feat: Add list.pad_start()
#20674
base: main
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #20674      +/-   ##
==========================================
+ Coverage   79.10%   79.29%   +0.18%
==========================================
  Files        1583     1583
  Lines      225265   225676     +411
  Branches     2586     2586
==========================================
+ Hits       178188   178941     +753
+ Misses      46487    46145     -342
  Partials      590      590
☔ View full report in Codecov by Sentry.
It should be possible to support more types (date, datetime, duration, maybe categorical and enum too?), but I'd like some feedback on the current implementation first; see the questions in the first post. In particular, for now it casts categoricals to string in this case:

pl.DataFrame({"a": [["a"], ["a", "b"]]}, schema={"a": pl.List(pl.Categorical)}).select(
    pl.col("a").list.pad_start("foo", length=2)
)
# shape: (2, 1)
# ┌──────────────┐
# │ a            │
# │ ---          │
# │ list[str]    │
# ╞══════════════╡
# │ ["foo", "a"] │
# │ ["a", "b"]   │
# └──────────────┘

which I'm not sure is correct.
@etiennebacher One bit of feedback is that the proposal in the original issue, and |
wrt making it a specified width, could that input be an Expr input so if I'm not streaming I can use
Also, does it make sense for this to return an Array instead of a List? I just assume that all the use cases for making everything a unified width would want an array as the next step.
I was playing around with how to avoid match and came up with this:

let fill_s = fill_value.as_materialized_series();
let out = ca.apply_values(|inner_series| {
    let inner_series = inner_series.explode().unwrap();
    if inner_series.len() >= width {
        inner_series.slice(0i64, width) // is this right, should it truncate or just return as-is?
    } else {
        // this assumes fill can't be too small, need to repeat if scalar
        let need_fill = width - inner_series.len();
        let mut fill = fill_s.slice(0i64, need_fill).clone();
        fill.append(&inner_series).unwrap();
        if fill.len() != width {
            panic!("fill+original!=width")
        }
        let fill = fill; // to undo mut
        fill
    }
});

I don't know the difference between all the apply methods so not sure if that should be apply_amortized or something else.
Yes, and we should also allow that for
No, because lists longer than the specified length should stay that way, similar to |
I hadn't understood that in the original issue. I thought all the sub-lists in the output were guaranteed to have the same length. I guess it makes sense if one can pass |
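The semantics discussed in this exchange (pad shorter sub-lists at the start, leave longer ones untouched instead of truncating them) can be modelled for a single sub-list in plain Python. This is only a reference sketch, not the PR's Rust implementation; the helper name `pad_start` and its signature are assumptions made for illustration.

```python
from typing import Any, List


def pad_start(values: List[Any], fill_value: Any, length: int) -> List[Any]:
    """Left-pad `values` with `fill_value` until it holds `length` items.

    Hypothetical helper: lists that already have `length` items or more
    are returned unchanged (no truncation).
    """
    missing = length - len(values)
    if missing <= 0:
        # Already long enough: return a copy, never truncate.
        return list(values)
    return [fill_value] * missing + list(values)


print(pad_start(["a"], "foo", 2))            # ['foo', 'a']
print(pad_start(["a", "b"], "foo", 2))       # ['a', 'b']
print(pad_start(["a", "b", "c"], "foo", 2))  # ['a', 'b', 'c']
```

Because the third call keeps all three items, the outputs are not guaranteed to share one width, which is consistent with returning a List rather than an Array.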
3b363c8 to 28b295d
length
};
let length = length.strict_cast(&DataType::UInt64)?;
let mut length = length.u64()?.into_iter();
I couldn't find a way to pass this in zip_and_apply_amortized(), which is why I make it an iterator here and call next() inside zip_and_apply_amortized(). It works but feels clunky.
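As a language-neutral picture of the per-row `length` handling described here, the plain-Python sketch below pairs every sub-list with its own target length up front instead of pulling from a side iterator inside the callback. It says nothing about what the Rust helpers allow; `pad_rows` and its parameters are hypothetical names, and null handling is left out.

```python
from typing import Any, List, Sequence, Union


def pad_rows(
    rows: List[List[Any]],
    fill_value: Any,
    length: Union[int, Sequence[int]],
) -> List[List[Any]]:
    """Left-pad each row to its target length (hypothetical helper)."""
    if isinstance(length, int):
        # A scalar length is broadcast to every row.
        lengths = [length] * len(rows)
    else:
        lengths = list(length)
        if len(lengths) != len(rows):
            raise ValueError("length must be a scalar or have one entry per row")

    out = []
    for row, target in zip(rows, lengths):
        # Rows that are already long enough get no padding and are not truncated.
        missing = max(target - len(row), 0)
        out.append([fill_value] * missing + list(row))
    return out


print(pad_rows([[1], [2, 3], [4, 5, 6]], 0, 2))  # [[0, 1], [2, 3], [4, 5, 6]]
print(pad_rows([[1], [2, 3]], 0, [3, 1]))        # [[0, 0, 1], [2, 3]]
```

Zipping the lengths with the rows keeps the two sequences in lockstep by construction, so a length mismatch surfaces as an explicit error rather than as an exhausted iterator part-way through.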
let binding = s.unwrap();
let s: &Series = binding.as_ref();
let ca = s.i64().unwrap();
let length = length.next().unwrap().unwrap() as usize;
Same as before, works but feels clunky.
let mut fill_values;
match length.cmp(&ca.len()) {
    Ordering::Equal | Ordering::Less => {
        fill_values = ca.clone();
Is the clone() here ok or a no-go? If the latter, what is the alternative?
};
let fill_value = fill_value.cast(&super_type)?;

let out: ListChunked = match super_type {
This entire match should probably be reduced by calling one function per datatype, which will also be used for list.pad_end() (in another PR).
@@ -107,6 +109,8 @@ impl ListFunction {
NUnique => mapper.with_dtype(IDX_DTYPE),
#[cfg(feature = "list_to_struct")]
ToStruct(args) => mapper.try_map_dtype(|x| args.get_output_dtype(x)),
#[cfg(feature = "list_pad")]
PadStart => mapper.with_same_dtype(),
I think this is wrong because the output dtype can change because the inner dtype and the padding type are cast to the supertype.
However, it does pass the tests. Why is that? Is there a test I can add for this?
Fixes #10283
Questions:
1. In the linked issue above, the workaround lets the user specify the final length of each sublist, but I use the length of the longest sublist and pad all other sublists to match this length instead. Is this correct? -> added a length argument instead of automatically taking the longest length
2. [...] match statement, but it's not obvious to me how to reduce it. Any idea?
3. Are the failures in new-streaming expected? -> solved, related to point 1