-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(substrait): modular substrait consumer #13803
Conversation
rels: &[Rel], | ||
state: &dyn SubstraitPlanningState, | ||
extensions: &Extensions, | ||
is_all: bool, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Starting from here, a lot of the changes look like this. Replacing state
and extensions
with a single consumer
argument that subsumes them both, and then threading the consumer
into calls.
state: &dyn SubstraitPlanningState, | ||
rel: &Rel, | ||
extensions: &Extensions, | ||
pub async fn from_project_rel( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All of the from_*
are extractions of existing functionality. It is easier to see this if you hide whitespaces changes in the diff.
ad28bec
to
04491a6
Compare
@Blizzara I would appreciate if you could take a look and let me know if this approach seems reasonable, or if I've architecture astronauted myself. |
04491a6
to
b8e435f
Compare
Thanks for the ping - I took a look, and overall looks really good to me! I left couple thoughts (I see you already saw some), but nothing major. I already have usecase in mind for this internally, so looking forward to getting it merged 😄 One thing is we should consider splitting the consumer.rs into smaller files. It could be a separate PR to keep the diff smaller, but maybe moving all the static "consumer_X" functions into like |
let plan = self | ||
.state | ||
.serializer_registry() | ||
.deserialize_logical_plan(&ext_detail.type_url, &ext_detail.value)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One drawback of the current implementation is that it's abusing the type_url
field in proto::Any
(which is quite clearly specified). Working with Substrait objects directly would eliminate the need for hacks like encoding node names as Any::type_url.
As a side note, this may sound hypocritical, given that I made it even worse with my PR by setting it to table names as well. But I only did it for consistency. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a side note, this may sound hypocritical, given that I made it even worse #13772
I don't think it's hypocritical, you were aiming for consistency and I'm just being much more aggressive with my changes and breaking the API.
Include SerializerRegistry based handlers for Extension Relations in the DefaultSubstraitConsumer
I think it makes sense to do that, and I'm happy to own it, but I would prefer to do that as a followup. I structured the |
Yup, I totally agree! I looked through the latest changes, looks even better now, just some more comments re SubstraitPlanningState but otherwise this looks very good by me! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @vbarua and @Blizzara -- this looks like a great improvement to me
I found another bug while reviewing this PR 🤦
The only thing I think we should try and fix before merging this is needing to copy the session context into an Arc
I realize the issue is related to fighting the borrow checker -- I played around with it and came up with this which seemed to work:
While it will be potentially disruptive to anyone using the Substrait bindings I think this is necessary to get the APIs in shape to support user defined functionaly as you point out.
I really like the fact that the existing high level API to create subtrait doesn't change
/// This trait is used to consume Substrait plans, converting them into DataFusion Logical Plans. | ||
/// It can be implemented by users to allow for custom handling of relations, expressions, etc. | ||
/// | ||
/// # Example Usage |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😍
/// use async_trait::async_trait; | ||
/// use datafusion::catalog::TableProvider; | ||
/// use datafusion::common::{not_impl_err, substrait_err, DFSchema, ScalarValue, TableReference}; | ||
/// use datafusion::error::Result; | ||
/// use datafusion::execution::{FunctionRegistry, SessionState}; | ||
/// use datafusion::logical_expr::{Expr, LogicalPlan, LogicalPlanBuilder}; | ||
/// use std::sync::Arc; | ||
/// use substrait::proto; | ||
/// use substrait::proto::{ExtensionLeafRel, FilterRel, ProjectRel}; | ||
/// use datafusion::arrow::datatypes::DataType; | ||
/// use datafusion::logical_expr::expr::ScalarFunction; | ||
/// use datafusion_substrait::extensions::Extensions; | ||
/// use datafusion_substrait::logical_plan::consumer::{ | ||
/// from_project_rel, from_substrait_rel, from_substrait_rex, SubstraitConsumer | ||
/// }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can make examples like this potentially read more nicely by putting #
in front of the import lines so they aren't rendered int he output.
I think it is also fine the way it is too
/// use async_trait::async_trait; | |
/// use datafusion::catalog::TableProvider; | |
/// use datafusion::common::{not_impl_err, substrait_err, DFSchema, ScalarValue, TableReference}; | |
/// use datafusion::error::Result; | |
/// use datafusion::execution::{FunctionRegistry, SessionState}; | |
/// use datafusion::logical_expr::{Expr, LogicalPlan, LogicalPlanBuilder}; | |
/// use std::sync::Arc; | |
/// use substrait::proto; | |
/// use substrait::proto::{ExtensionLeafRel, FilterRel, ProjectRel}; | |
/// use datafusion::arrow::datatypes::DataType; | |
/// use datafusion::logical_expr::expr::ScalarFunction; | |
/// use datafusion_substrait::extensions::Extensions; | |
/// use datafusion_substrait::logical_plan::consumer::{ | |
/// from_project_rel, from_substrait_rel, from_substrait_rex, SubstraitConsumer | |
/// }; | |
/// # use async_trait::async_trait; | |
/// # use datafusion::catalog::TableProvider; | |
/// # use datafusion::common::{not_impl_err, substrait_err, DFSchema, ScalarValue, TableReference}; | |
/// # use datafusion::error::Result; | |
/// # use datafusion::execution::{FunctionRegistry, SessionState}; | |
/// # use datafusion::logical_expr::{Expr, LogicalPlan, LogicalPlanBuilder}; | |
/// # use std::sync::Arc; | |
/// # use substrait::proto; | |
/// # use substrait::proto::{ExtensionLeafRel, FilterRel, ProjectRel}; | |
/// # use datafusion::arrow::datatypes::DataType; | |
/// # use datafusion::logical_expr::expr::ScalarFunction; | |
/// # use datafusion_substrait::extensions::Extensions; | |
/// # use datafusion_substrait::logical_plan::consumer::{ | |
/// # from_project_rel, from_substrait_rel, from_substrait_rex, SubstraitConsumer | |
/// # }; |
fn get_extensions(&self) -> &Extensions; | ||
fn get_function_registry(&self) -> &impl FunctionRegistry; | ||
|
||
// Relation Methods |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since these //
comments aren't displayed in the documentationI think we should also add some documentation to them eventually in ///
style
https://docs.rs/datafusion-substrait/41.0.0/datafusion_substrait/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@/Blizzara also suggested splitting the code into modules. I've combined both of those ideas into a future issue #13864
apply_emit_kind(retrieve_rel_common(relation), plan?) | ||
} | ||
|
||
/// Can be used to consume standard Substrait without user-defined extensions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps you could create an example for this one or point at the existing documentation examples
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the docs, I've pointed users towards from_substrait_plan
which uses this.
This is really only pub
so it can be used in integration tests.
@@ -262,7 +724,25 @@ async fn except_rels( | |||
|
|||
/// Convert Substrait Plan to DataFusion LogicalPlan | |||
pub async fn from_substrait_plan( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💯 for not changing the API
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll plan to merge this tomorrow unless anyone would like some more time to review |
pub fn new(extensions: &'a Extensions, state: &'a SessionState) -> Self { | ||
DefaultSubstraitConsumer { extensions, state } | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In a prior versions this used to implement Default
. I've removed it as it was intended mostly for testing. External users should just call from_substrait_plan
as before.
pub fn new(extensions: &'a Extensions, state: &'a SessionState) -> Self { | ||
DefaultSubstraitConsumer { extensions, state } | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In a prior versions this used to implement Default
. I've removed it as it was intended mostly for testing. External users should just call from_substrait_plan
as before.
} | ||
|
||
let consumer = DefaultSubstraitConsumer { | ||
extensions: &extensions, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I started thinking whether it actually really make sense to pass the Extensions into the SubstraitConsumer, or if there should be some initialize(plan: &Plan)
method or something in the SubstraitConsumer that loads them. But I didn't think yet much about it, maybe there is a reason to rather have it like this, and anyways no need to change anything for this PR anymore in any case 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do want to streamline the API further, but in the future 😅
Initialising the consumer directly from the plan makes sense to me, because we always need the extensions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
FYI @ccciudatu |
* feat(substrait): modular substrait consumer * feat(substrait): include Extension Rel handlers in default consumer Include SerializerRegistry based handlers for Extension Relations in the DefaultSubstraitConsumer * refactor(substrait) _selection -> _field_reference * refactor(substrait): remove SubstraitPlannerState usage from consumer * refactor: get_state() -> get_function_registry() * docs: elide imports from example * test: simplify test * refactor: remove Arc from DefaultSubstraitConsumer * doc: add ticket for API improvements * doc: link DefaultSubstraitConsumer to from_subtrait_plan * refactor: remove redundant Extensions parsing
Note: When reviewing this PR I would recommend hiding whitespace changes.
Which issue does this PR close?
Initial work for #13318
Rationale for this change
Improves the reusability of the Substrait consumer for users that utilize user-defined extensions and types.
I should note that this PR represents an approach to the problem, but is not necessarily the only way we could handle it. I'm very open to feedback on the approach.
What changes are included in this PR?
from_substrait_*
functions (i.e.from_filter_rel
,from_scalar_function
) to aid re-use.SubstraitConsumer
trait has been introduced with default methods that handle relations and expression using the above functions.SubstraitConsumer
trait also includes methods for handling extension relations (i.econsume_extension_single
) along with user-defined types (consume_user_defined_type
,consume_user_defined_literal
).&impl SubstraitConsumer
.SubstraitPlanningState
has been subsumed into theSubstraitConsumer
.Are these changes tested?
These changes refactor existing code and leverage their tests.
A doc comment has been added to show how the
SubstraitConsumer
trait can be implemented by users.Are there any user-facing changes?
Yes.
The top-level
from_substrait_plan(state, plan)
function now takes a&SessionState
as its first argument, instead of&dyn SubstraitPlanningState
. The same is true forfrom_substrait_extended_expr
. These two functions are the primary external facing APIs for intiation a conversion.A number of functions like
from_substrait_rel
,from_substrait_rex
, etc have had their API changed to make the first argument&impl SubstraitConsumer
, and remove allstate: &dyn SubstraitPlanningState
andextensions: &Extensions
arguments.