-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Speed up struct
and named_struct
using invoke_with_args
#14276
Conversation
@@ -203,12 +137,19 @@ impl ScalarUDFImpl for NamedStructFunc { | |||
)))) | |||
} | |||
|
|||
fn invoke_batch( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If delete invoke_batch
function, the default invoke_batch
function will call invoke
when args is not empty, but invoke
function is left impl by Specific type, is it ok?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wasn't sure what the intention is for the deprecated functions from an API user point of view. I had interpreted ScalarUDF(Impl) as an extension point rather than as public API that library users would call directly. If that assumption is correct, then it doesn't matter if we remove the invoke_batch
implementation since ScalarFunctionExpr
never calls this method. The only usages of invoke_batch
are in benchmark and test code.
Perhaps @alamb, who logged #13515, can provide some insight here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
THe reason we kept around deprecated functions is to give downstream users of DataFusion time to adjust their code -- especially on upgrade having a deprecated function with guidance of what to change has been helpful
You can read more about this strategy here:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm coming from the Java world so I'll use the terminology from there; not sure what the equivalent is in Rust lingo. It's not explicitly stated which parts of the library are Service Provider Interface vs Application Programming Interface. My assumption was that ScalarUDFImpl
is SPI and invoke_batch
was kept around to not break all existing implementations and ScalarFunctionExpr
is the API side of things which doesn't expose invoke_batch
. Is that a correct interpretation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My assumption was that ScalarUDFImpl is SPI
Yes, that is accurate in my understanding
and invoke_batch was kept around to not break all existing implementations
Yes that is also my understanding
and ScalarFunctionExpr is the API side of things which doesn't expose invoke_batch. Is that a correct interpretation?
I would say ScalarFunctionExpr
is an implementation detail of how functions are invoked in the ExecutionPlan
(aka the physical execution)
The split between logical/physical plans is explained a bit in the API docs / intro videos in case you are interested:
https://docs.rs/datafusion/latest/datafusion/index.html#query-planning-and-execution-overview
By implementing `invoke_with_args` the fields derived in return_type(_from_args) can be reused (performance) and the duplicate derivation logic can be removed.
invoke_with_args
for struct
and named_struct
struct
and named_struct
using invoke_with_args
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @pepijnve
This is a really nice cleanup and a good example of using the ScalarUDFImpl API👌
"return type field count != argument count / 2" | ||
); | ||
|
||
let values: Vec<ColumnarValue> = args |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😍
I also took the liberty of merging up from main to get the CI to run again. |
Update here is we are very close to cutting the 45 release branch. See
I plan to merge this PR once that is done |
We have a release branch for 45 now -- see #14008 (comment) Let's keep the main moving! |
Thanks again @pepijnve |
…pache#14276) By implementing `invoke_with_args` the fields derived in return_type(_from_args) can be reused (performance) and the duplicate derivation logic can be removed. Co-authored-by: Andrew Lamb <[email protected]>
By implementing
invoke_with_args
the fields derived in return_type(_from_args) can be reused (performance) and the duplicate derivation logic can be removed.Which issue does this PR close?
struct
andnamed_struct
#14275.Rationale for this change
Reusing the already derived fields results in a small, but measurable, performance improvement according to the added micro benchmark.
What changes are included in this PR?
Replaces the
invoke_batch
implementations withinvoke_with_args
implementations.Are these changes tested?
Covered by the existing dataframe/dataframe_functions tests
Are there any user-facing changes?
Not at the SQL level. At the library level calls to
invoke_batch
for these two functions will result in a runtime error.