Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Speed up struct and named_struct using invoke_with_args #14276

Merged
merged 2 commits into from
Feb 1, 2025

Conversation

pepijnve
Copy link
Contributor

@pepijnve pepijnve commented Jan 24, 2025

By implementing invoke_with_args the fields derived in return_type(_from_args) can be reused (performance) and the duplicate derivation logic can be removed.

Which issue does this PR close?

Rationale for this change

Reusing the already derived fields results in a small, but measurable, performance improvement according to the added micro benchmark.

What changes are included in this PR?

Replaces the invoke_batch implementations with invoke_with_args implementations.

Are these changes tested?

Covered by the existing dataframe/dataframe_functions tests

Are there any user-facing changes?

Not at the SQL level. At the library level calls to invoke_batch for these two functions will result in a runtime error.

@github-actions github-actions bot added core Core DataFusion crate functions labels Jan 24, 2025
@@ -203,12 +137,19 @@ impl ScalarUDFImpl for NamedStructFunc {
))))
}

fn invoke_batch(
Copy link

@lichuang lichuang Jan 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If delete invoke_batch function, the default invoke_batch function will call invoke when args is not empty, but invoke function is left impl by Specific type, is it ok?

Copy link
Contributor Author

@pepijnve pepijnve Jan 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't sure what the intention is for the deprecated functions from an API user point of view. I had interpreted ScalarUDF(Impl) as an extension point rather than as public API that library users would call directly. If that assumption is correct, then it doesn't matter if we remove the invoke_batch implementation since ScalarFunctionExpr never calls this method. The only usages of invoke_batch are in benchmark and test code.
Perhaps @alamb, who logged #13515, can provide some insight here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

THe reason we kept around deprecated functions is to give downstream users of DataFusion time to adjust their code -- especially on upgrade having a deprecated function with guidance of what to change has been helpful

You can read more about this strategy here:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm coming from the Java world so I'll use the terminology from there; not sure what the equivalent is in Rust lingo. It's not explicitly stated which parts of the library are Service Provider Interface vs Application Programming Interface. My assumption was that ScalarUDFImpl is SPI and invoke_batch was kept around to not break all existing implementations and ScalarFunctionExpr is the API side of things which doesn't expose invoke_batch. Is that a correct interpretation?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My assumption was that ScalarUDFImpl is SPI

Yes, that is accurate in my understanding

and invoke_batch was kept around to not break all existing implementations

Yes that is also my understanding

and ScalarFunctionExpr is the API side of things which doesn't expose invoke_batch. Is that a correct interpretation?

I would say ScalarFunctionExpr is an implementation detail of how functions are invoked in the ExecutionPlan (aka the physical execution)

The split between logical/physical plans is explained a bit in the API docs / intro videos in case you are interested:
https://docs.rs/datafusion/latest/datafusion/index.html#query-planning-and-execution-overview

By implementing `invoke_with_args` the fields derived
in return_type(_from_args) can be reused (performance)
and the duplicate derivation logic can be removed.
@alamb alamb changed the title feat: implement invoke_with_args for struct and named_struct feat: Speed up struct and named_struct using invoke_with_args Jan 31, 2025
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @pepijnve

This is a really nice cleanup and a good example of using the ScalarUDFImpl API👌

"return type field count != argument count / 2"
);

let values: Vec<ColumnarValue> = args
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😍

@alamb
Copy link
Contributor

alamb commented Jan 31, 2025

I also took the liberty of merging up from main to get the CI to run again.

@alamb
Copy link
Contributor

alamb commented Feb 1, 2025

Update here is we are very close to cutting the 45 release branch. See

I plan to merge this PR once that is done

@alamb
Copy link
Contributor

alamb commented Feb 1, 2025

We have a release branch for 45 now -- see #14008 (comment)

Let's keep the main moving!

@alamb alamb merged commit 278cf9d into apache:main Feb 1, 2025
27 checks passed
@alamb
Copy link
Contributor

alamb commented Feb 1, 2025

Thanks again @pepijnve

cj-zhukov pushed a commit to cj-zhukov/datafusion that referenced this pull request Feb 3, 2025
…pache#14276)

By implementing `invoke_with_args` the fields derived
in return_type(_from_args) can be reused (performance)
and the duplicate derivation logic can be removed.

Co-authored-by: Andrew Lamb <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate functions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Avoid deriving Fields for each invocation of struct and named_struct
3 participants