feat: Add `list.pad_start()` #20674

etiennebacher · 2025-01-12T16:41:36Z

Questions:

1. In the linked issue above, the workaround lets the user specify the final length of each sublist, whereas I used the length of the longest sublist and padded all other sublists to match it. Is this correct? -> Added a `length` argument instead of automatically taking the longest length.
2. I feel like there's too much code duplication in the match statement, but it's not obvious to me how to reduce it. Any ideas?
3. ~~Are the failures in new-streaming expected?~~ -> Solved, related to point 1.

codecov · 2025-01-15T11:51:42Z

Codecov Report

Attention: Patch coverage is 99.23664% with 1 line in your changes missing coverage. Please review.

Project coverage is 79.29%. Comparing base (084ddde) to head (5d7fe4d).
Report is 8 commits behind head on main.

Files with missing lines	Patch %	Lines
crates/polars-plan/src/dsl/function_expr/list.rs	88.88%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main   #20674      +/-   ##
==========================================
+ Coverage   79.10%   79.29%   +0.18%     
==========================================
  Files        1583     1583              
  Lines      225265   225676     +411     
  Branches     2586     2586              
==========================================
+ Hits       178188   178941     +753     
+ Misses      46487    46145     -342     
  Partials      590      590


etiennebacher · 2025-01-15T15:10:45Z

It should be possible to support more types (date, datetime, duration, maybe also categorical and enum), but I'd like some feedback on the current implementation before going further; cf. the questions in the first post.

In particular, for now it casts categoricals to string in this case:

pl.DataFrame({"a": [["a"], ["a", "b"]]}, schema={"a": pl.List(pl.Categorical)}).select(
    pl.col("a").list.pad_start("foo", length=2)
)

# shape: (2, 1)
# ┌──────────────┐
# │ a            │
# │ ---          │
# │ list[str]    │
# ╞══════════════╡
# │ ["foo", "a"] │
# │ ["a", "b"]   │
# └──────────────┘

which I'm not sure is correct.

orlp · 2025-01-24T14:52:10Z

@etiennebacher One bit of feedback: both the proposal in the original issue and Expr.str.pad_start/Expr.str.pad_end take a width to pad to. This is important so that the operation can execute in a streaming fashion instead of first having to find the maximum width.
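To illustrate the streaming point, here is a minimal pure-Python sketch. It is not Polars code, and `pad_start_stream` and `pad_to_longest` are hypothetical names. With a known width, each batch can be padded and emitted as it arrives; padding to the longest sublist has to consume all input before it can produce any output.

```python
from typing import Iterable, Iterator, List


def pad_start_stream(batches: Iterable[List[list]], fill, width: int) -> Iterator[List[list]]:
    """Pad each sublist at the start up to `width`, one batch at a time."""
    for batch in batches:
        # Each batch is handled independently; nothing is buffered.
        yield [[fill] * max(width - len(sub), 0) + sub for sub in batch]


def pad_to_longest(batches: Iterable[List[list]], fill) -> List[List[list]]:
    """Pad to the longest sublist: every batch is needed before padding starts."""
    materialized = list(batches)  # full pass just to find the maximum length
    width = max((len(sub) for batch in materialized for sub in batch), default=0)
    return list(pad_start_stream(materialized, fill, width))


batches = [[[1], [2, 3]], [[4, 5, 6]]]
streamed = list(pad_start_stream(iter(batches), 0, 2))
longest = pad_to_longest(iter(batches), 0)
```

In this sketch `streamed` keeps the three-element sublist as is, while `longest` pads every sublist to length 3 at the cost of materializing the input first.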

deanm0000 · 2025-01-24T20:14:58Z

wrt to making it a specified width, could that input be an Expr input so if I'm not streaming I can use .list.len().max()? @orlp would a lit satisfy the streaming engine in that case?

Also, does it make sense for this to return an Array instead of a List? I just assume that all the use cases for making everything a unified width would want an array as the next step.

deanm0000 · 2025-01-24T21:40:05Z

I was playing around with how to avoid match and came up with this:

    let fill_s = fill_value.as_materialized_series();
    let out = ca.apply_values(|inner_series| {
        let inner_series = inner_series.explode().unwrap();
        if inner_series.len() >= width {
            inner_series.slice(0i64, width)  // is this right, should it truncate or just return as-is?
        } else {
            // this assumes fill can't be too small, need to repeat if scalar
            let need_fill = width - inner_series.len();
            let mut fill = fill_s.slice(0i64, need_fill).clone();
            fill.append(&inner_series).unwrap();
            if fill.len() != width {
                panic!("fill+original!=width")
            }
            let fill = fill; // to undo mut
            fill
        }
    });

I don't know the difference between all the apply methods, so I'm not sure whether that should be apply_amortized or something else.

crates/polars-ops/src/chunked_array/list/namespace.rs

orlp · 2025-01-25T10:28:47Z

@deanm0000

> wrt to making it a specified width, could that input be an Expr input so if I'm not streaming I can use .list.len().max()?

Yes, and we should also allow that for Expr.str.pad_start and Expr.str.pad_end. PRs welcome :)

> Also, does it make sense for this to return an Array instead of a List?

No, because lists longer than the specified length should stay that way, similar to Expr.str.pad_start.

etiennebacher · 2025-02-01T10:51:51Z

> No, because lists longer than the specified length should stay that way, similar to Expr.str.pad_start.

I hadn't understood that in the original issue. I thought all the sub-lists in the output were guaranteed to have the same length. I guess it makes sense if one can pass .list.len().max() as width.
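A small pure-Python sketch of the semantics as I now understand them (the helper `pad_start` below is a hypothetical stand-in, not the Polars implementation): sublists shorter than the target are padded at the start, longer ones are left untouched, and passing the maximum length gives uniform output.

```python
def pad_start(sublists, fill, length):
    """Left-pad each sublist to `length`; longer sublists are returned unchanged."""
    return [[fill] * max(length - len(sub), 0) + list(sub) for sub in sublists]


data = [[1], [2, 3], [4, 5, 6]]

# A fixed length leaves the 3-element sublist as it is, so output lengths may differ.
fixed = pad_start(data, 0, 2)

# Passing the maximum length guarantees that all output sublists have equal length.
uniform = pad_start(data, 0, max(len(sub) for sub in data))
```

This is also why the result stays a List rather than an Array: only the second call produces sublists of a single, known length.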

etiennebacher · 2025-02-02T14:49:26Z

crates/polars-ops/src/chunked_array/list/namespace.rs

+            length
+        };
+        let length = length.strict_cast(&DataType::UInt64)?;
+        let mut length = length.u64()?.into_iter();


I couldn't find a way to pass this in zip_and_apply_amortized(), which is why I make it an iterator here and call next() inside zip_and_apply_amortized(). It works but feels clunky.
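For comparison, the shape I would have liked is a lockstep iteration over sublists and lengths. The pure-Python sketch below shows that idea only; `pad_start_rows` is a hypothetical name, and whether Polars offers an equivalent helper for amortized iteration is not something I have confirmed.

```python
from itertools import repeat


def pad_start_rows(sublists, fill, lengths):
    """Left-pad each sublist to its own target length.

    `lengths` is either one int (broadcast to every row) or one int per row.
    """
    per_row = repeat(lengths) if isinstance(lengths, int) else lengths
    # zip advances both iterators together, so no manual next() call is needed.
    return [
        [fill] * max(n - len(sub), 0) + list(sub)
        for sub, n in zip(sublists, per_row)
    ]


rows = [[1], [2, 3], [4]]
broadcast = pad_start_rows(rows, 0, 3)
per_row = pad_start_rows(rows, 0, [2, 1, 4])
```

Broadcasting a scalar length and supplying one length per row go through the same code path here, which is the property the current `next()` workaround achieves less directly.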

etiennebacher · 2025-02-02T14:49:50Z

crates/polars-ops/src/chunked_array/list/namespace.rs

+                    let binding = s.unwrap();
+                    let s: &Series = binding.as_ref();
+                    let ca = s.i64().unwrap();
+                    let length = length.next().unwrap().unwrap() as usize;


Same as before, works but feels clunky.

etiennebacher · 2025-02-02T14:50:32Z

crates/polars-ops/src/chunked_array/list/namespace.rs

+                    let mut fill_values;
+                    match length.cmp(&ca.len()) {
+                        Ordering::Equal | Ordering::Less => {
+                            fill_values = ca.clone();


Is the clone() here ok or a no-go? If the latter, what is the alternative?

etiennebacher · 2025-02-02T14:51:55Z

crates/polars-ops/src/chunked_array/list/namespace.rs

+        };
+        let fill_value = fill_value.cast(&super_type)?;
+
+        let out: ListChunked = match super_type {


This entire match should probably be reduced by calling one function per datatype, which will also be used for list.pad_end() (in another PR).

etiennebacher · 2025-02-02T14:54:19Z

crates/polars-plan/src/dsl/function_expr/list.rs

@@ -107,6 +109,8 @@ impl ListFunction {
            NUnique => mapper.with_dtype(IDX_DTYPE),
            #[cfg(feature = "list_to_struct")]
            ToStruct(args) => mapper.try_map_dtype(|x| args.get_output_dtype(x)),
+            #[cfg(feature = "list_pad")]
+            PadStart => mapper.with_same_dtype(),


I think this is wrong: the output dtype can change, since the inner dtype and the padding type are both cast to their supertype.

However, it does pass the tests. Why is that? Is there a test I can add for this?
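One way to probe this: a test can only tell a same-dtype mapping apart from a supertype-based one when the fill dtype differs from the inner dtype. The toy model below is pure Python with a deliberately simplified, hypothetical type lattice (`toy_supertype` is not a Polars function, and the real supertype rules are richer). If the existing tests only pad with values of the inner dtype, that could explain why they pass, though that is a guess and not something confirmed here.

```python
# Hypothetical, simplified type lattice; Polars' real supertype rules are richer.
_NUMERIC_RANK = {"i64": 0, "f64": 1}


def toy_supertype(inner: str, fill: str) -> str:
    """Return a common dtype name for the list's inner dtype and the fill dtype."""
    if inner == fill:
        return inner
    if inner in _NUMERIC_RANK and fill in _NUMERIC_RANK:
        return max(inner, fill, key=_NUMERIC_RANK.__getitem__)
    return "str"  # toy fallback for mixed numeric/string inputs


def output_dtype(inner: str, fill: str) -> str:
    """Dtype of the padded column under the supertype rule."""
    return f"list[{toy_supertype(inner, fill)}]"


same = output_dtype("i64", "i64")     # matches the input dtype
widened = output_dtype("i64", "f64")  # differs from the input list[i64]
```

A schema-level test along these lines (integer list, float fill, then checking the reported dtype before collecting) might be the kind of case that exposes the mismatch.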

github-actions bot added the title needs formatting label Jan 12, 2025

etiennebacher changed the title ~~[WIP] feat: Add list.pad_start()~~ feat: Add list.pad_start() Jan 15, 2025

github-actions bot added the enhancement, python, and rust labels and removed the title needs formatting label Jan 15, 2025

etiennebacher marked this pull request as ready for review January 15, 2025 15:07

etiennebacher requested review from ritchie46, c-peters, alexander-beedie, MarcoGorelli, reswqa and orlp as code owners January 15, 2025 15:07

etiennebacher marked this pull request as draft January 24, 2025 19:53

deanm0000 reviewed Jan 24, 2025

crates/polars-ops/src/chunked_array/list/namespace.rs

etiennebacher added 8 commits February 1, 2025 11:53

init

3e036c7

allow expr

34486da

minor [skip ci]

6b00202

minor [skip ci]

aa9a5c2

fix broadcasting

73966bf

minor [skip ci]

aa4fa4e

docs

2f5e959

remove some unwrap [skip ci]

1181d73

etiennebacher added 17 commits February 1, 2025 11:53

add feature gate [skip ci]

64dd07f

start tests [skip ci]

e3c71bd

more tests [skip ci]

b36433a

typo

b736ba6

do not use named arg in docstrings

1d022ab

fmt

4b897e5

mypy

823f5db

clippy

4de5381

fmt again

46d5a75

docs for series

7ed3aa4

some tests fail in new-streaming

92f55ab

fmt

352fbf6

add for Boolean

f0e9150

add arg width

3f7ca46

clippy

bad748f

ruff

b87cd9a

enable tests with new streaming

28b295d

etiennebacher force-pushed the list-pad branch from 3b363c8 to 28b295d Compare February 1, 2025 10:53

etiennebacher added 6 commits February 1, 2025 12:15

do not slice sublists larger than width

3cb1871

accept expression in width

56a6865

rename width to length

6b8e919

forgot to rename in tests

658347a

mistake in docs

5d7fe4d

typo [skip ci]

dbf9035

etiennebacher commented Feb 2, 2025


etiennebacher marked this pull request as ready for review February 3, 2025 16:51
