-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add trait based ScalarUDF API #8578
Conversation
bdd8ca1
to
7e7dae7
Compare
use datafusion_expr::{ColumnarValue, ScalarUDF, ScalarUDFImpl, Signature}; | ||
use std::sync::Arc; | ||
|
||
/// This example shows how to use the full ScalarUDFImpl API to implement a user |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to create an example that shows how to make a more advanced UDF that special cases constant values.
This also shows how to create a ScalarUDF using a trait (rather than free functions and closures)
&return_type, | ||
&fun, | ||
)); | ||
struct TestScalarUDF { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This shows an example of the difference in trait based vs low level ScalarValue::new
API that I propose to deprecate
While the trait requires more lines, I think it is much easier to implement as it is simply a standard trait implementation which I believe is far more common than Arc'd closures
/// | ||
/// See [`ScalarUDFImpl`] for a more convenient way to create a | ||
/// `ScalarUDF` using trait objects | ||
#[deprecated(since = "34.0.0", note = "please implement ScalarUDFImpl instead")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this low level API is quite akward to use and very hard to extend in backwards compatible ways. The trait is easer to use and easier to extend.
Thus I propose marking this API as deprecated (note most of the examples in codebase use create_udf
rather than ScalarUDF:new()
directly) so I think the impact will be limited
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed that current low-level API looks awkward to use. Ideally a trait defining what a UDF should implement should be better solution.
where | ||
F: ScalarUDFImpl + Send + Sync + 'static, | ||
{ | ||
// TODO change the internal implementation to use the trait object |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I plan to improve the internal representation as a follow on PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#8713 is the follow on PR
} | ||
} | ||
|
||
/// Trait for implementing [`ScalarUDF`]. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is the proposed new trait. I think we can use this trait to add things such as "pre-compiling" arguments #8051 and adding better examples / documentation add examples and description to scalar/aggregate functions #8366.
cc @universalmind303 for your comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this looks much more intuitive than the current implementation. I actually just commented on an open issue about the api before reviewing this & my suggestion was nearly identical!
cc @2010YOUY01, @thinkharderdev, @viirya and @andygrove -- in case you have comments about the proposed way of implementing ScalarUDF. This PR doesn't make any API changes, but it does deprecate |
datafusion/expr/src/udf.rs
Outdated
/// Create a new `ScalarUDF` from a `[ScalarUDFImpl]` trait object | ||
/// | ||
/// Note this is the same as using the `From` impl (`ScalarUDF::from`) | ||
pub fn new_from_trait<F>(fun: F) -> ScalarUDF |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
new_from_impl
?
datafusion/expr/src/udf.rs
Outdated
/// can be used to implement any function. | ||
/// | ||
/// See [`advanced_udf.rs`] for a full example with implementation. See | ||
/// [`ScalarUDF`] for details on a simpler API. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For simpler API, do you mean create_udf
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I was trying to avoid replicating the same content (e.g. with links to create_udf, and simple example) all over the place (and just have it linked on ScalarUDF). I have tried to make this clearer
@@ -76,7 +76,8 @@ The challenge however is that DataFusion doesn't know about this function. We ne | |||
|
|||
### Registering a Scalar UDF |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to add advanced example advanced_udf.rs
link to this document and also document ScalarUDFImpl
there too? Maybe a follow up.
datafusion/expr/src/udf.rs
Outdated
/// # Performance | ||
/// Many functions can be optimized for the case when one or more of their | ||
/// arguments are constant values [`ColumnarValue::Scalar`]. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this performance section, does it mean the implementations should optimize the case or DataFusion will optimize the case? Looks a bit unclear to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It means that the implementations should optimize the case -- I have tried to clarify the comments in this regard.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the (as always) insightful review @viirya
datafusion/expr/src/udf.rs
Outdated
/// can be used to implement any function. | ||
/// | ||
/// See [`advanced_udf.rs`] for a full example with implementation. See | ||
/// [`ScalarUDF`] for details on a simpler API. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I was trying to avoid replicating the same content (e.g. with links to create_udf, and simple example) all over the place (and just have it linked on ScalarUDF). I have tried to make this clearer
datafusion/expr/src/udf.rs
Outdated
/// # Performance | ||
/// Many functions can be optimized for the case when one or more of their | ||
/// arguments are constant values [`ColumnarValue::Scalar`]. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It means that the implementations should optimize the case -- I have tried to clarify the comments in this regard.
// calculate the result for every row. The `unary` very | ||
// fast, "vectorized" code and handles things like null | ||
// values for us. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if I read it correctly:
// calculate the result for every row. The `unary` very | |
// fast, "vectorized" code and handles things like null | |
// values for us. | |
// calculate the result for every row. The `unary` is very | |
// fast "vectorized" code and handles things like null | |
// values for us. |
pub fn signature(&self) -> &Signature { | ||
&self.signature | ||
} | ||
|
||
/// Return the type of the function given its input types | ||
/// The datatype this function returns given the input argument input types |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe?
/// The datatype this function returns given the input argument input types | |
/// The datatype this function returns given the input argument types |
@@ -93,6 +95,11 @@ let udf = create_udf( | |||
); | |||
``` | |||
|
|||
[`scalarudf`]: https://docs.rs/datafusion/latest/datafusion/logical_expr/struct.ScalarUDF.html |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[`scalarudf`]: https://docs.rs/datafusion/latest/datafusion/logical_expr/struct.ScalarUDF.html | |
[`ScalarUDF`]: https://docs.rs/datafusion/latest/datafusion/logical_expr/struct.ScalarUDF.html |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For some reason this lower casing is done by prettier
so I can't impmement this suggestion without causing CI to fail 😬
andrewlamb@Andrews-MacBook-Pro:~/Software/arrow-datafusion$ git diff
diff --git a/docs/source/library-user-guide/adding-udfs.md b/docs/source/library-user-guide/adding-udfs.md
index c51e4de32..1d2cc0a12 100644
--- a/docs/source/library-user-guide/adding-udfs.md
+++ b/docs/source/library-user-guide/adding-udfs.md
@@ -95,7 +95,7 @@ let udf = create_udf(
);
-[`scalarudf`]: https://docs.rs/datafusion/latest/datafusion/logical_expr/struct.ScalarUDF.html
+[`ScalarUDF`]: https://docs.rs/datafusion/latest/datafusion/logical_expr/struct.ScalarUDF.html
[`create_udf`]: https://docs.rs/datafusion/latest/datafusion/logical_expr/fn.create_udf.html
[`make_scalar_function`]: https://docs.rs/datafusion/latest/datafusion/physical_expr/functions/fn.make_scalar_function.html
[`advanced_udf.rs`]: https://github.com/apache/arrow-datafusion/blob/main/datafusion-examples/examples/advanced_udf.rs
andrewlamb@Andrews-MacBook-Pro:~/Software/arrow-datafusion$ npx [email protected] --check '{datafusion,datafusion-cli,datafusion-examples,dev,docs}/**/*.md' '!datafusion/CHANGELOG.md' README.md CONTRIBUTING.md
Checking formatting...
[warn] docs/source/library-user-guide/adding-udfs.md
[warn] Code style issues found in the above file. Forgot to run Prettier?
andrewlamb@Andrews-MacBook-Pro:~/Software/arrow-datafusion$ git reset --hard
HEAD is now at 3ce1802df Improve docs for aliases
andrewlamb@Andrews-MacBook-Pro:~/Software/arrow-datafusion$ npx [email protected] --check '{datafusion,datafusion-cli,datafusion-examples,dev,docs}/**/*.md' '!datafusion/CHANGELOG.md' README.md CONTRIBUTING.md
Checking formatting...
All matched files use Prettier code style!
andrewlamb@Andrews-MacBook-Pro:~/Software/arrow-datafusion$
Co-authored-by: Liang-Chi Hsieh <[email protected]>
…fusion into alamb/better_scalar_api
Update here is I plan to merge this tomorrow unless anyone would like more time to review |
Nice! It would be very useful to be able to handle serde as well for custom implementations (perhaps in a different PR?). I think this could fit relatively easily into |
I have several follow on tasks I will do like shortly:
|
Filed #8706 |
/// | ||
/// 1. For simple (less performant) use cases, use [`create_udf`] and [`simple_udf.rs`]. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
less performant
Hi, is there anyone who would like to explain a bit about why create_udf()
is less performant than the UDFs created by ScalarUDFImpl
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason is that create_udf() always converts its arguments to ArrayRef
and thus you can't implement special cases for constant values (ScalarValue
) -- instead the scalar value is always converted into an array.
Update: this does not seem to be correct. I will do some more investigation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filed #9384 to clarify docs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using create_udf
create an extra indirection. Under the hood it's creating a
pub struct SimpleScalarUDF {
name: String,
signature: Signature,
return_type: DataType,
fun: ScalarFunctionImplementation,
}
impl ScalarUDFImpl for SimpleScalarUDF {
fn invoke(&self, args: &[ColumnarValue]) -> Result<ColumnarValue> {
(self.fun)(args)
}
}
so it adds an extra call for every batch processed through the UDF
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense 👍 -- I don't think the overhead of a single function call is worth calling out in the docs however (I think it is more confusing than helpfl), though please let me know if you disagree on #9384
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope. I agree it's not meaningful enough to call out tin docs
Which issue does this PR close?
Closes #8568
Rationale for this change
This PR is a step towards #8045:
What changes are included in this PR?
ScalarUDFImpl
-- better names welcomed)advanced_udf.rs
)If this PR is accepted, I plan to file tickets to track
AggregateUDF
andWindowUDF
, for the same reasonsAre these changes tested?
Yes, both new tests as well as updated existing tests
Are there any user-facing changes?
There is a new way to define ScalarUDFs and additional documentation.