-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make it easier to create WindowFunctions with the Expr API #6747
Comments
Here is another example from #10345 / @timsaucer showing how non easy it is to create a window function via the expr API use datafusion::{logical_expr::{expr::WindowFunction, BuiltInWindowFunction, WindowFrame, WindowFunctionDefinition}, prelude::*};
#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
let ctx = SessionContext::new();
let mut df = ctx.read_csv("/Users/tsaucer/working/testing_ballista/lead_lag/example.csv", CsvReadOptions::default()).await?;
df = df.with_column("array_col", make_array(vec![col("a"), col("b"), col("c")]))?;
df.clone().show().await?;
let lag_expr = Expr::WindowFunction(WindowFunction::new(
WindowFunctionDefinition::BuiltInWindowFunction(
BuiltInWindowFunction::Lead,
),
vec![col("array_col")],
vec![],
vec![],
WindowFrame::new(None),
None,
));
df = df.select(vec![col("a"), col("b"), col("c"), col("array_col"), lag_expr.alias("lagged")])?;
df.show().await?;
Ok(())
} It would be great if instead of let lag_expr = Expr::WindowFunction(WindowFunction::new(
WindowFunctionDefinition::BuiltInWindowFunction(
BuiltInWindowFunction::Lead,
),
vec![col("array_col")],
vec![],
vec![],
WindowFrame::new(None),
None,
)); It looked more like let lag_expr = lead(
vec![col("array_col")],
vec![],
vec![],
WindowFrame::new(None),
None,
)); Maybe even better like a builder style let lag_expr = lead(col("array_col")).build() Which would permit adding the various let lag_expr = lead(col("array_col"))
.partition_by(vec![])
.order_by(vec![])
.build() Maybe there are some inspirations in the polars API too: https://docs.pola.rs/user-guide/expressions/window/#group-by-aggregations-in-selection |
🤔 it seems like spark's API is like
https://stackoverflow.com/questions/32769328/how-to-use-window-functions-in-pyspark-using-dataframes So maybe for DataFusion it could look like let w = Window::new()
.partition_by(col("id"))
.order_by(col("dt"));
let lag_expr = lag(col("array_col"))
.over(w) |
Note I have some code in #6746 that had some part of it (along with an example) |
I am willing to help with this task. |
Great! I've rebased @alamb 's branch and added the changes I suggested. I was about to start testing the code and then I was going to write up the unit tests. My work in progress is here: https://github.com/timsaucer/datafusion/tree/feature/easier_window_funcs There was a little bit of changes I needed to make around the null_options. I got distracted by a task in the |
Thanks for your update! I'll work on the tests. |
FYI, my work is in: https://github.com/shanretoo/datafusion/tree/feat-window-fn |
@timsaucer I have fixed the calls of |
Oh, great. Have you been able to run the example code above using the new easy interface? |
You can check it in the unit test: |
Thank you. I pulled your branch and many of the tests are failing for me even though the functions are returning correct values when I add additional debug statements. I think what's happening here is that because we have the partition_by there is no guarantee what order the results come back as. On my machine the unit tests are returning the partitions on column C in order 10 then 1. I'm guessing on yours it was the opposite. There are a couple of things I think we can do to resolve this. One way would be to make a new macro for testing these partitioned functions. I could do something like
And then the lead function test would become
I've added an |
The one thing I think we're missing is the other variants for these. I don't think it's covered in other unit tests that I can find. So for example, for lead we would want to validate:
What do you think? I might try to write a macro around all these variants. I'm now unblocked on the other task I was working on, so I can pick it up if you'd like or I'm happy to work on other things. Please let me know. |
Sorry, my fault. I haven't taken into account the ordering issue. Maybe we could add a following match arm in the macro to omit the macro_rules! assert_sorted_fn_batches {
($EXPR:expr, $EXPECTED: expr) => {
let sort_by = $EXPR
.iter()
.map(|e| {
let alias = e.name_for_alias().expect("failed to get an alias");
col(alias).sort(true, true)
})
.collect::<Vec<_>>();
assert_sorted_fn_batches!($EXPR, $EXPECTED, sort_by);
}; |
Have you checked tests in sqllogictest? You can take over this and I'm happy to help when needed. |
I think you're doing a great job, and good point on the sqllogictest. TBH I find those tests harder to wrap my head around than the rust tests, but that's more personal preference. About the test function, I realize we can probably make it simpler:
What do you think? |
Looks good. It is clearer to understand the results in this way. |
Update here is that @jayzhan211 and I have been working on a similar API for creating |
In case anyone is following along, @jayzhan211 added a really nice trait for working with aggregate functions. Maybe we can do something similar for window functions eventually datafusion/datafusion/expr/src/udaf.rs Lines 614 to 653 in e693ed7
|
I've started looking at this and coming up against one blocker that prevents just following the exact pattern. My first thought was to implement a trait like
The problem with this is that we would have two traits implemented on My current thinking is that instead of doing this, I should rename
Then the
And finally
(case statement remove from original comment) I haven't dug too much deeper into it, but these are my initial design ideas. @jayzhan211 and @alamb what do you think? |
I'm also wondering if instead of carrying the data around in the builder, we can just update the member within |
The overall idea looks good to me, but I'm not sure about the My current idea is specific to function only, but if there is other non-function expr, we could extend it like ExprExt, ExprBuidler with the related methods they need. pub trait FunctionExt {
/// Add `ORDER BY <order_by>`
///
/// Note: `order_by` must be [`Expr::Sort`]
fn order_by(self, order_by: Vec<Expr>) -> AggregateBuilder;
/// Add `FILTER <filter>`
fn filter(self, filter: Expr) -> AggregateBuilder;
/// Add `DISTINCT`
fn distinct(self) -> AggregateBuilder;
/// Add `RESPECT NULLS` or `IGNORE NULLS`
fn null_treatment(self, null_treatment: NullTreatment) -> AggregateBuilder;
fn partiion_by:
fn window_frame:
...
}
pub struct FuncBuilder {
fun: Option<FuncKind>,
order_by: Option<Vec<Expr>>,
filter: Option<Expr>,
distinct: bool,
null_treatment: Option<NullTreatment>,
partition_by,
window_frame,
....
}
pub enum FuncKind {
Aggregate(AggregateFunction),
Window(WindowFunction),
} |
Good point. I was just thinking about it and came on here to remove that from my comment! So we are well aligned. Great suggestions. I'll move forward on working on this tomorrow. |
I started a new branch off |
Is your feature request related to a problem or challenge?
Follow on to #5781
There are at least three things named
WindowFunction
in DataFusion --Expr::WindowFunction
,window_function::WindowFunction
andexpr::WindowFunction
https://docs.rs/datafusion-expr/26.0.0/datafusion_expr/index.html?search=WindowFunction
Constructing an Expr::WindowFunction to pass to
LogicalPlanBuilder::window
is quite challengingDescribe the solution you'd like
I would like to make this process easier with a builder style:
for
lead(foo) OVER(PARTITION BY bar)
for example:Describe alternatives you've considered
No response
Additional context
No response
The text was updated successfully, but these errors were encountered: