Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CORE] Pullout pre/post project for generate #4952

Merged
merged 1 commit into from
Mar 25, 2024

Conversation

marin-ma
Copy link
Contributor

@marin-ma marin-ma commented Mar 14, 2024

A follow-up of #4901

Pullout pre/post project for GenerateExec. The added projections are moved into Backends API. With this approach, Literal and CreateArray for Inline are also supported as they can be directly constructed by the expression transformer in the project.

Other fixes: 1. Fix coredump on array of null Struct Literal; fallback this case. 2. Add field name for constructing struct literal in SubstratiToVeloxExpr to avoid the downstream Velox operator from referencing unexpected struct fields by name.

Copy link

Thanks for opening a pull request!

Could you open an issue for this pull request on Github Issues?

https://github.com/apache/incubator-gluten/issues

Then could you also rename commit message and pull request title in the following format?

[GLUTEN-${ISSUES_ID}][COMPONENT]feat/fix: ${detailed message}

See also:

Copy link

Run Gluten Clickhouse CI

@marin-ma marin-ma force-pushed the pullout-proj-generate branch from 83eef04 to cea6263 Compare March 19, 2024 07:13
Copy link

Run Gluten Clickhouse CI

@marin-ma marin-ma force-pushed the pullout-proj-generate branch from cea6263 to c3529b8 Compare March 19, 2024 07:44
Copy link

Run Gluten Clickhouse CI

1 similar comment
Copy link

Run Gluten Clickhouse CI

@marin-ma marin-ma marked this pull request as ready for review March 19, 2024 10:23
@marin-ma
Copy link
Contributor Author

@liujiayi771 Could you help to review?

I also have concerns about the fallback mechanism after pulling out the pre/post project. As the rules are applied before TransformHintRule, do we have a framework or new rule to remove the pre/post project if this operator fails its validation?

@marin-ma
Copy link
Contributor Author

cc: @zhouyuan

@liujiayi771
Copy link
Contributor

@marin-ma RewriteSparkPlanRulesManager will fallback these pre/post project.

@liujiayi771
Copy link
Contributor

cc @ulysses-you

@@ -750,4 +750,8 @@ class CHSparkPlanExecApi extends SparkPlanExecApi {
): GenerateExecTransformerBase = {
CHGenerateExecTransformer(generator, requiredChildOutput, outer, generatorOutput, child)
}

override def genPreProjectForGenerate(generate: GenerateExec): SparkPlan = generate
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic of pull out should also apply to CH? It's just that CH hasn't encountered a scenario that requires pull out yet?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So far, CH doesn't need pre/post project for their already supported generator type. But not sure if it would be needed in the future with more Generator supported.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Explode and PosExplode are already supported in CH, and pre/post projection are not needed. Looks like we cannot unify the implementation.

Copy link
Contributor

@liujiayi771 liujiayi771 Mar 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least for pre-project, CH and Velox are consistent; they can only execute expressions within the project operator. I think that the logic of pre-project can also be applied to CH at present, it's just that CH has not yet started to adapt GeneratorExec that Generator.child include expressions. I think you can put GeneratorExec into needsPreProject for validation, and only apply the pull-out logic when the Generator.child is not an Attribute. This should not affect the current CH, and later, if CH wants to adapt GeneratorExec with expressions in Generator.child, it can be directly reused.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GenerateExec was initially implemented for CH, and they should have the native implementation for the generator expressions. Please check #574

Velox backend does not have the native function implementation to the generator expressions. But it has an UnnestNode operator that can help to generate the expected output. That's why a pre/post projection is added to adopt the input/output for the UnnestNode.

I believe CH doesn't require the pre/post project logic for the supported generator. cc @taiyang-li

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ch doesn't need it for now. thanks for reminding me

@@ -750,4 +750,8 @@ class CHSparkPlanExecApi extends SparkPlanExecApi {
): GenerateExecTransformerBase = {
CHGenerateExecTransformer(generator, requiredChildOutput, outer, generatorOutput, child)
}

override def genPreProjectForGenerate(generate: GenerateExec): SparkPlan = generate
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic of pull out should also apply to CH? It's just that CH hasn't encountered a scenario that requires pull out yet? If that is the case, then we can centralize these pieces of code within the pull out rule.


override def genPreProjectForGenerate(generate: GenerateExec): SparkPlan = {
if (supportsGenerate(generate)) {
val newGeneratorChild = generate.generator.asInstanceOf[UnaryExpression].child match {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some Generators are not UnaryExpression, use match case, and throw exception if it is not UnaryExpression.

val newGeneratorChild = generate.generator.asInstanceOf[UnaryExpression].child match {
case attr: Attribute => attr
case expr @ (Literal(_, _) | CreateMap(_, _) | CreateArray(_, _)) =>
Alias(expr, generatePreAliasName)()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be better to maintain consistency with the code patterns of other 'pull out' operations, using methods such as isNotAttribute and replaceExpressionWithAttribute from PullOutProjectHelper.

case expr @ (Literal(_, _) | CreateMap(_, _) | CreateArray(_, _)) =>
Alias(expr, generatePreAliasName)()
case other =>
throw new UnsupportedOperationException(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these the only three scenarios that require a 'pull out'? Is it the case that in other situations, as long as it's not an Attribute, a 'pull out' can be performed without necessarily throwing an exception?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are all kind of child expression I've found from the existing UTs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for non-Attribute cases, you can simply hand them over to the replaceExpressionWithAttribute method to handle, without the need to match the current cases only encountered in UTs. If other cases arise in the future, they can also be supported. If the new case is not yet supported by Gluten, it will only cause pre-project to fallback.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's the implementation in the initial commit. Could you help to check if that matches your expectation?
c3529b8#diff-c49dd56d8ac0d29278ad2e6a33cc0cb97d86a4c74d3e5c011bad4733ba1203feR186-R195

The reason of giving up using replaceExpressionWithAttribute is that this function also matches BouldReference, which is not the case used by Generator.child, and the output cannot be directly used as the input to pre-project. The new child should be either the original Attribute, or an Alias to other expressions.

I would think removing the type list @ (Literal(_, _) | CreateMap(_, _) | CreateArray(_, _)) here is a better way.

Copy link
Contributor

@liujiayi771 liujiayi771 Mar 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@marin-ma Do you mean to say that if Generator.child is a BoundReference, it also needs to be wrapped with an Alias?
The BoundReference within replaceExpressionWithAttribute is a special optimization case that was initially placed within the case other. I think if that's the only reason, we could add a parameter to replaceExpressionWithAttribute named replaceBoundReference: Boolean, with the default value of false. The Generator could set this to true. Then it could be changed to case e: BoundReference if !replaceBoundReference => e.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find that in your initial commit, the main logic codes are put in pull-out rule. Why do you move it to SparkPlanExecApi now? I think it’s better for SparkPlanExecApi to only control whether it's necessary to pull out.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@liujiayi771 This pull-out logic fully depends on the native implementation of Velox library. It's not a general case.

generate.copy(
generator =
generate.generator.withNewChildren(Seq(newGeneratorChild)).asInstanceOf[Generator],
child = ProjectExec(generate.child.output :+ newGeneratorChild, generate.child)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use eliminateProjectList to get the projectList of pre project.

Copy link
Contributor Author

@marin-ma marin-ma Mar 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean by using elimainateProjectList(generate.child.output, newGeneratorChild)? newGeneratorChild can be a duplicated Attribute in generate.child.output. It shouldn't be eliminated because the native backend identifies the last field of projection as generator's input.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add some comments for this explanation?


import scala.collection.JavaConverters._

case class GlutenGeneratorWrapper(unwrapped: Generator) extends Generator {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this wrapper ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This wrapper is used to stop applying the PullOutPostProject on the same GenerateExec once it has been applied.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do not need to worry about this. PullOutPostProject would not apply same operator twice. The RewriteSparkPlanRulesManager framework makes sure only transforming once for each operator.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After transforming, the GenerateExec becomes ProjectExec(newChild: GenerateExec), and the rule continues to apply on the newChild: GenerateExec, which results in infinite recursion. By adding this wrapper, the next time the supportsGenerate will return false, preventing it from infinitely applying on the GenerateExec https://github.com/apache/incubator-gluten/pull/4952/files#diff-4ebd3cbecb20dc83596c28d20936d40f48ed1b41f334525fd5b5bdca3ac3bf3fR730

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, shall we change transform to transformUp ?

case (plan, rule) =>
// Some rewrite rules may generate new parent plan node, we should use transform to
// rewrite the original plan. For example, PullOutPreProject and PullOutPostProject
// will generate post-project plan node.
plan.transform { case p => rule.apply(p) }

Copy link

Run Gluten Clickhouse CI

context,
operatorId,
child.output.size)
object PullOutGenerateProjectHelper extends PullOutProjectHelper {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@liujiayi771 Could you help to review again? I've made some code structure changes since your last review. This class is created for velox backend to generate the pre/post projection for GenerateExec. By extending PullOutProjectHelper here, the protected scope of the helper functions can be preserveed.

@@ -57,12 +57,13 @@ trait PullOutProjectHelper {

protected def replaceExpressionWithAttribute(
expr: Expression,
projectExprsMap: mutable.HashMap[Expression, NamedExpression]): Expression =
projectExprsMap: mutable.HashMap[Expression, NamedExpression],
replaceBoundReference: Boolean = true): Expression =
Copy link
Contributor

@liujiayi771 liujiayi771 Mar 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default value should be false, and GenerateExec set it to true. "replace" means using an Alias wrapper.

Copy link

Run Gluten Clickhouse CI

1 similar comment
Copy link

Run Gluten Clickhouse CI

@marin-ma
Copy link
Contributor Author

@liujiayi771 @ulysses-you Could you help to review again? Thanks!

@ulysses-you
Copy link
Contributor

It seems the failed tests are irrelevant, I tried to re-run the failed tests.

@ulysses-you ulysses-you changed the title [WIP] Pullout pre/post project for generate [CORE] Pullout pre/post project for generate Mar 21, 2024
@liujiayi771
Copy link
Contributor

@marin-ma LGTM. Thanks.

}

object SparkPlanExecApiImpl {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: unnecessary change

Copy link

Run Gluten Clickhouse CI

@marin-ma marin-ma force-pushed the pullout-proj-generate branch from 21f5b44 to 4e73c2c Compare March 25, 2024 06:53
Copy link

Run Gluten Clickhouse CI

@marin-ma marin-ma merged commit 59ff500 into apache:main Mar 25, 2024
30 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants