Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(substrait): modular substrait consumer #13803

Merged
merged 12 commits into from
Dec 21, 2024

Conversation

vbarua
Copy link
Contributor

@vbarua vbarua commented Dec 17, 2024

Note: When reviewing this PR I would recommend hiding whitespace changes.

Which issue does this PR close?

Initial work for #13318

Rationale for this change

Improves the reusability of the Substrait consumer for users that utilize user-defined extensions and types.

I should note that this PR represents an approach to the problem, but is not necessarily the only way we could handle it. I'm very open to feedback on the approach.

What changes are included in this PR?

  • Relation and expression handling code has been extracted into a series of from_substrait_* functions (i.e. from_filter_rel, from_scalar_function) to aid re-use.
  • A SubstraitConsumer trait has been introduced with default methods that handle relations and expression using the above functions.
  • The SubstraitConsumer trait also includes methods for handling extension relations (i.e consume_extension_single) along with user-defined types (consume_user_defined_type, consume_user_defined_literal).
  • All conversion methods take as their first argument an &impl SubstraitConsumer.
  • The role of SubstraitPlanningState has been subsumed into the SubstraitConsumer.

Are these changes tested?

These changes refactor existing code and leverage their tests.

A doc comment has been added to show how the SubstraitConsumer trait can be implemented by users.

Are there any user-facing changes?

Yes.

The top-level from_substrait_plan(state, plan) function now takes a &SessionState as its first argument, instead of &dyn SubstraitPlanningState. The same is true for from_substrait_extended_expr. These two functions are the primary external facing APIs for intiation a conversion.

A number of functions like from_substrait_rel, from_substrait_rex, etc have had their API changed to make the first argument &impl SubstraitConsumer, and remove all state: &dyn SubstraitPlanningState and extensions: &Extensions arguments.

datafusion/substrait/tests/cases/roundtrip_logical_plan.rs Outdated Show resolved Hide resolved
datafusion/substrait/src/logical_plan/consumer.rs Outdated Show resolved Hide resolved
datafusion/substrait/src/logical_plan/consumer.rs Outdated Show resolved Hide resolved
rels: &[Rel],
state: &dyn SubstraitPlanningState,
extensions: &Extensions,
is_all: bool,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Starting from here, a lot of the changes look like this. Replacing state and extensions with a single consumer argument that subsumes them both, and then threading the consumer into calls.

state: &dyn SubstraitPlanningState,
rel: &Rel,
extensions: &Extensions,
pub async fn from_project_rel(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of the from_* are extractions of existing functionality. It is easier to see this if you hide whitespaces changes in the diff.

@vbarua vbarua force-pushed the vbarua/modular-substrait-consumer branch from ad28bec to 04491a6 Compare December 17, 2024 03:00
@vbarua vbarua marked this pull request as ready for review December 17, 2024 03:01
@vbarua
Copy link
Contributor Author

vbarua commented Dec 17, 2024

@Blizzara I would appreciate if you could take a look and let me know if this approach seems reasonable, or if I've architecture astronauted myself.

@vbarua vbarua force-pushed the vbarua/modular-substrait-consumer branch from 04491a6 to b8e435f Compare December 17, 2024 03:29
@Blizzara
Copy link
Contributor

@Blizzara I would appreciate if you could take a look and let me know if this approach seems reasonable, or if I've architecture astronauted myself.

Thanks for the ping - I took a look, and overall looks really good to me! I left couple thoughts (I see you already saw some), but nothing major. I already have usecase in mind for this internally, so looking forward to getting it merged 😄

One thing is we should consider splitting the consumer.rs into smaller files. It could be a separate PR to keep the diff smaller, but maybe moving all the static "consumer_X" functions into like consumer/relations.rs, consumer/expressions.rs, consumer/literals.rs and consumer/types.rs (and one more for utils etc) or similar? Feels like now could be a good moment to do that.

let plan = self
.state
.serializer_registry()
.deserialize_logical_plan(&ext_detail.type_url, &ext_detail.value)?;
Copy link
Contributor

@ccciudatu ccciudatu Dec 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One drawback of the current implementation is that it's abusing the type_url field in proto::Any (which is quite clearly specified). Working with Substrait objects directly would eliminate the need for hacks like encoding node names as Any::type_url.

As a side note, this may sound hypocritical, given that I made it even worse with my PR by setting it to table names as well. But I only did it for consistency. :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a side note, this may sound hypocritical, given that I made it even worse #13772

I don't think it's hypocritical, you were aiming for consistency and I'm just being much more aggressive with my changes and breaking the API.

Include SerializerRegistry based handlers for Extension Relations in the
DefaultSubstraitConsumer
@vbarua
Copy link
Contributor Author

vbarua commented Dec 17, 2024

@Blizzara

One thing is we should consider splitting the consumer.rs into smaller files. It could be a separate PR to keep the diff smaller, but maybe moving all the static "consumer_X" functions into like consumer/relations.rs, consumer/expressions.rs, consumer/literals.rs and consumer/types.rs (and one more for utils etc) or similar? Feels like now could be a good moment to do that.

I think it makes sense to do that, and I'm happy to own it, but I would prefer to do that as a followup. I structured the consumer.rs changes in order to make it easy to see that the functions I added were extracted from the big from_substrait_rel and from_substrait_rex switch statements.

@Blizzara
Copy link
Contributor

I think it makes sense to do that, and I'm happy to own it, but I would prefer to do that as a followup. I structured the consumer.rs changes in order to make it easy to see that the functions I added were extracted from the big from_substrait_rel and from_substrait_rex switch statements.

Yup, I totally agree!

I looked through the latest changes, looks even better now, just some more comments re SubstraitPlanningState but otherwise this looks very good by me!

@vbarua vbarua requested a review from Blizzara December 18, 2024 17:22
Copy link
Contributor

@Blizzara Blizzara left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🚀 Thanks @vbarua, and FYI @alamb :)

@alamb alamb added the api change Changes the API exposed to users of the crate label Dec 19, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @vbarua and @Blizzara -- this looks like a great improvement to me

I found another bug while reviewing this PR 🤦

The only thing I think we should try and fix before merging this is needing to copy the session context into an Arc

I realize the issue is related to fighting the borrow checker -- I played around with it and came up with this which seemed to work:

While it will be potentially disruptive to anyone using the Substrait bindings I think this is necessary to get the APIs in shape to support user defined functionaly as you point out.

I really like the fact that the existing high level API to create subtrait doesn't change

/// This trait is used to consume Substrait plans, converting them into DataFusion Logical Plans.
/// It can be implemented by users to allow for custom handling of relations, expressions, etc.
///
/// # Example Usage
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😍

Comment on lines 120 to 134
/// use async_trait::async_trait;
/// use datafusion::catalog::TableProvider;
/// use datafusion::common::{not_impl_err, substrait_err, DFSchema, ScalarValue, TableReference};
/// use datafusion::error::Result;
/// use datafusion::execution::{FunctionRegistry, SessionState};
/// use datafusion::logical_expr::{Expr, LogicalPlan, LogicalPlanBuilder};
/// use std::sync::Arc;
/// use substrait::proto;
/// use substrait::proto::{ExtensionLeafRel, FilterRel, ProjectRel};
/// use datafusion::arrow::datatypes::DataType;
/// use datafusion::logical_expr::expr::ScalarFunction;
/// use datafusion_substrait::extensions::Extensions;
/// use datafusion_substrait::logical_plan::consumer::{
/// from_project_rel, from_substrait_rel, from_substrait_rex, SubstraitConsumer
/// };
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can make examples like this potentially read more nicely by putting # in front of the import lines so they aren't rendered int he output.

I think it is also fine the way it is too

Suggested change
/// use async_trait::async_trait;
/// use datafusion::catalog::TableProvider;
/// use datafusion::common::{not_impl_err, substrait_err, DFSchema, ScalarValue, TableReference};
/// use datafusion::error::Result;
/// use datafusion::execution::{FunctionRegistry, SessionState};
/// use datafusion::logical_expr::{Expr, LogicalPlan, LogicalPlanBuilder};
/// use std::sync::Arc;
/// use substrait::proto;
/// use substrait::proto::{ExtensionLeafRel, FilterRel, ProjectRel};
/// use datafusion::arrow::datatypes::DataType;
/// use datafusion::logical_expr::expr::ScalarFunction;
/// use datafusion_substrait::extensions::Extensions;
/// use datafusion_substrait::logical_plan::consumer::{
/// from_project_rel, from_substrait_rel, from_substrait_rex, SubstraitConsumer
/// };
/// # use async_trait::async_trait;
/// # use datafusion::catalog::TableProvider;
/// # use datafusion::common::{not_impl_err, substrait_err, DFSchema, ScalarValue, TableReference};
/// # use datafusion::error::Result;
/// # use datafusion::execution::{FunctionRegistry, SessionState};
/// # use datafusion::logical_expr::{Expr, LogicalPlan, LogicalPlanBuilder};
/// # use std::sync::Arc;
/// # use substrait::proto;
/// # use substrait::proto::{ExtensionLeafRel, FilterRel, ProjectRel};
/// # use datafusion::arrow::datatypes::DataType;
/// # use datafusion::logical_expr::expr::ScalarFunction;
/// # use datafusion_substrait::extensions::Extensions;
/// # use datafusion_substrait::logical_plan::consumer::{
/// # from_project_rel, from_substrait_rel, from_substrait_rex, SubstraitConsumer
/// # };

fn get_extensions(&self) -> &Extensions;
fn get_function_registry(&self) -> &impl FunctionRegistry;

// Relation Methods
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since these // comments aren't displayed in the documentationI think we should also add some documentation to them eventually in /// style

https://docs.rs/datafusion-substrait/41.0.0/datafusion_substrait/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@/Blizzara also suggested splitting the code into modules. I've combined both of those ideas into a future issue #13864

apply_emit_kind(retrieve_rel_common(relation), plan?)
}

/// Can be used to consume standard Substrait without user-defined extensions
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps you could create an example for this one or point at the existing documentation examples

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the docs, I've pointed users towards from_substrait_plan which uses this.

This is really only pub so it can be used in integration tests.

datafusion/substrait/tests/cases/roundtrip_logical_plan.rs Outdated Show resolved Hide resolved
@@ -262,7 +724,25 @@ async fn except_rels(

/// Convert Substrait Plan to DataFusion LogicalPlan
pub async fn from_substrait_plan(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯 for not changing the API

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love it -- thank you @vbarua and thank you @Blizzara for your comments and feedback

@alamb
Copy link
Contributor

alamb commented Dec 20, 2024

I'll plan to merge this tomorrow unless anyone would like some more time to review

pub fn new(extensions: &'a Extensions, state: &'a SessionState) -> Self {
DefaultSubstraitConsumer { extensions, state }
}
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a prior versions this used to implement Default. I've removed it as it was intended mostly for testing. External users should just call from_substrait_plan as before.

pub fn new(extensions: &'a Extensions, state: &'a SessionState) -> Self {
DefaultSubstraitConsumer { extensions, state }
}
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a prior versions this used to implement Default. I've removed it as it was intended mostly for testing. External users should just call from_substrait_plan as before.

}

let consumer = DefaultSubstraitConsumer {
extensions: &extensions,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I started thinking whether it actually really make sense to pass the Extensions into the SubstraitConsumer, or if there should be some initialize(plan: &Plan) method or something in the SubstraitConsumer that loads them. But I didn't think yet much about it, maybe there is a reason to rather have it like this, and anyways no need to change anything for this PR anymore in any case 😄

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do want to streamline the API further, but in the future 😅

Initialising the consumer directly from the plan makes sense to me, because we always need the extensions.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@alamb
Copy link
Contributor

alamb commented Dec 21, 2024

FYI @ccciudatu

@alamb alamb merged commit b99400e into apache:main Dec 21, 2024
24 checks passed
zhuqi-lucas pushed a commit to zhuqi-lucas/arrow-datafusion that referenced this pull request Dec 23, 2024
* feat(substrait): modular substrait consumer

* feat(substrait): include Extension Rel handlers in default consumer

Include SerializerRegistry based handlers for Extension Relations in the
DefaultSubstraitConsumer

* refactor(substrait) _selection -> _field_reference

* refactor(substrait): remove SubstraitPlannerState usage from consumer

* refactor: get_state() -> get_function_registry()

* docs: elide imports from example

* test: simplify test

* refactor: remove Arc from DefaultSubstraitConsumer

* doc: add ticket for API improvements

* doc: link DefaultSubstraitConsumer to from_subtrait_plan

* refactor: remove redundant Extensions parsing
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api change Changes the API exposed to users of the crate substrait
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants