-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make builtin window function output datatype to be derived from schema #9686
Conversation
@mustafasrepo @alamb @viirya appreciate for your thoughts |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this @comphead, have a few questions
.map(|arg| arg.data_type(input_schema)) | ||
.collect::<Result<_>>()?; | ||
// derive the output datatype from incoming schema | ||
let out_data_type: &DataType = input_schema.field_with_name(&name)?.data_type(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure how this works (or makes sense), can you help me understand? How can you decide the output type of a function without knowing what the function is (i.e. using fun.return_type
)? Is it always guaranteed that you will find name
in input_schema
? If yes, no problem. But if no, shouldn't you fall back to using fun.return_type
?
I understand that some people may want to have ROW_NUMBER
output an Int
instead of an UInt
(which doesn't make sense but I can see how such an ask may arise for backwards compatibility reasons).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I think with these changes parameter input_schema: &Schema,
name is misleading in the create_built_in_window_expr
function. Similarly, logical_input_schema: &DFSchema,
parameter for the create_window_expr
is misleading. Now these parameters are the schema of the window function (not input schema). Hence, I think it is guaranteed to find name of the window function in the schema (As long as window function name calculations are consistent).
I checked, how LogicalPlan::Window
calculates its schema.
It calls exprlist_to_fields
util to calculate fields of window expressions.
- Which in turn uses
.to_field
method oninput_schema
- Which in turn uses
Expr::get_type
method oninput_schema
- which calculates output type using
fun.return_type(&data_types)
.
In this sense, I think this PR behaves same with current version, and removes duplication. However, we should change parameter/variable names to reflect which data it actually uses.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @mustafasrepo for clarification, I would also like to add the part were fun.report_type()
called for window builtin functions in logical planner https://github.com/apache/arrow-datafusion/blob/b87dd6143c2dc089b07f74780bd525c4369e68a3/datafusion/expr/src/expr_schema.rs#L184
So when physical layer gets called all the metadata(names, datatypes) is already known
@@ -742,7 +742,7 @@ impl DefaultPhysicalPlanner { | |||
); | |||
} | |||
|
|||
let logical_input_schema = input.schema(); | |||
let logical_input_schema = logical_plan.schema(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I can tell this schema is no longer input_schema
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there will always will be a tension between different desired semantics for window functions the same as there is a tension between different desired semantics for scalar functions.
DataFusion currently has several hard coded window function implementations, but it is hard to customize how they work without the code getting complicated quickly (e.g. here)
In my opinion, the "cleanest" way to handle this challenge is to remove "BuiltInWindowFunctions" entirely, and switch to only using WindowUDFImpl
API: #8709, the same way we are doing this for scalar functions.
Using a UDF for each function means we could move the rules for determining the output type of each function can be handled individually (or if needed a second implementation provided that returns a different type).
BuiltInWindowFunction::DenseRank => Arc::new(dense_rank(name, data_type)), | ||
BuiltInWindowFunction::PercentRank => Arc::new(percent_rank(name, data_type)), | ||
BuiltInWindowFunction::CumeDist => Arc::new(cume_dist(name, data_type)), | ||
BuiltInWindowFunction::RowNumber => Arc::new(RowNumber::new(name, out_data_type)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find it strange in general that we have to pass a "output type" to RowNumber
when it is (effectively) always the same.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right @alamb, It is always the same for DataFusion, but when building the hybrid system, like something on top of the DF, it happens expected external output type its not the same as hardcoded in DF.
Imho, derivation from schema improves DF extensibility as it reduces coupling between physical and logical layers and the schema is the contract. Thus external systems on top of DF may safely use physical layer just providing the expected schema.
UPD: Thus external systems on top of DF may safely use physical layer just providing the expected query plan(schema is the part of the query plan)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reg to trait unification this is also a great idea to improve extensibility.
I'm thinking if can we move on with this PR, and migrate all window funcs to unified trait in follow up activity?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking if can we move on with this PR, and migrate all window funcs to unified trait in follow up activity?
That seems fine to me (and the migration is already tracked, albiet at a high level) by another ticket
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!. Thanks @comphead
…om schema (apache#9686)" This reverts commit 1d0171a.
Which issue does this PR close?
Closes #.
Rationale for this change
There are scenarios when output datatype for builtin window functions are not the same as hardcoded in DataFusion.
For instance Apache Spark requires some functions like
ROW_NUMBER, RANK
, etc to beInt
rather thanUInt
which is hardcoded in DF. It would be great to derive builtin window function output type from the incoming schema rather than having the type hardcoded. Having this derivation from the schema increases DF extensibility to be used in other hybrid systems.What changes are included in this PR?
Changed physical planner for window builtin functions to derive the output data type from schema.
Are these changes tested?
Yes
Are there any user-facing changes?
No