-
Notifications
You must be signed in to change notification settings - Fork 453
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[CORE] Pullout pre/post project for generate #4952
Conversation
Thanks for opening a pull request! Could you open an issue for this pull request on Github Issues? https://github.com/apache/incubator-gluten/issues Then could you also rename commit message and pull request title in the following format?
See also: |
Run Gluten Clickhouse CI |
83eef04
to
cea6263
Compare
Run Gluten Clickhouse CI |
cea6263
to
c3529b8
Compare
Run Gluten Clickhouse CI |
1 similar comment
Run Gluten Clickhouse CI |
@liujiayi771 Could you help to review? I also have concerns about the fallback mechanism after pulling out the pre/post project. As the rules are applied before |
cc: @zhouyuan |
@marin-ma |
cc @ulysses-you |
@@ -750,4 +750,8 @@ class CHSparkPlanExecApi extends SparkPlanExecApi { | |||
): GenerateExecTransformerBase = { | |||
CHGenerateExecTransformer(generator, requiredChildOutput, outer, generatorOutput, child) | |||
} | |||
|
|||
override def genPreProjectForGenerate(generate: GenerateExec): SparkPlan = generate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic of pull out should also apply to CH? It's just that CH hasn't encountered a scenario that requires pull out yet?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So far, CH doesn't need pre/post project for their already supported generator type. But not sure if it would be needed in the future with more Generator
supported.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Explode
and PosExplode
are already supported in CH, and pre/post projection are not needed. Looks like we cannot unify the implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At least for pre-project, CH and Velox are consistent; they can only execute expressions within the project operator. I think that the logic of pre-project can also be applied to CH at present, it's just that CH has not yet started to adapt GeneratorExec
that Generator.child
include expressions. I think you can put GeneratorExec
into needsPreProject
for validation, and only apply the pull-out logic when the Generator.child
is not an Attribute
. This should not affect the current CH, and later, if CH wants to adapt GeneratorExec
with expressions in Generator.child
, it can be directly reused.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GenerateExec
was initially implemented for CH, and they should have the native implementation for the generator expressions. Please check #574
Velox backend does not have the native function implementation to the generator expressions. But it has an UnnestNode
operator that can help to generate the expected output. That's why a pre/post projection is added to adopt the input/output for the UnnestNode
.
I believe CH doesn't require the pre/post project logic for the supported generator. cc @taiyang-li
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ch doesn't need it for now. thanks for reminding me
@@ -750,4 +750,8 @@ class CHSparkPlanExecApi extends SparkPlanExecApi { | |||
): GenerateExecTransformerBase = { | |||
CHGenerateExecTransformer(generator, requiredChildOutput, outer, generatorOutput, child) | |||
} | |||
|
|||
override def genPreProjectForGenerate(generate: GenerateExec): SparkPlan = generate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic of pull out should also apply to CH? It's just that CH hasn't encountered a scenario that requires pull out yet? If that is the case, then we can centralize these pieces of code within the pull out rule.
|
||
override def genPreProjectForGenerate(generate: GenerateExec): SparkPlan = { | ||
if (supportsGenerate(generate)) { | ||
val newGeneratorChild = generate.generator.asInstanceOf[UnaryExpression].child match { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some Generator
s are not UnaryExpression
, use match case, and throw exception if it is not UnaryExpression
.
val newGeneratorChild = generate.generator.asInstanceOf[UnaryExpression].child match { | ||
case attr: Attribute => attr | ||
case expr @ (Literal(_, _) | CreateMap(_, _) | CreateArray(_, _)) => | ||
Alias(expr, generatePreAliasName)() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be better to maintain consistency with the code patterns of other 'pull out' operations, using methods such as isNotAttribute
and replaceExpressionWithAttribute
from PullOutProjectHelper
.
case expr @ (Literal(_, _) | CreateMap(_, _) | CreateArray(_, _)) => | ||
Alias(expr, generatePreAliasName)() | ||
case other => | ||
throw new UnsupportedOperationException( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these the only three scenarios that require a 'pull out'? Is it the case that in other situations, as long as it's not an Attribute, a 'pull out' can be performed without necessarily throwing an exception?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are all kind of child expression I've found from the existing UTs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think for non-Attribute
cases, you can simply hand them over to the replaceExpressionWithAttribute
method to handle, without the need to match the current cases only encountered in UTs. If other cases arise in the future, they can also be supported. If the new case is not yet supported by Gluten, it will only cause pre-project to fallback.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's the implementation in the initial commit. Could you help to check if that matches your expectation?
c3529b8#diff-c49dd56d8ac0d29278ad2e6a33cc0cb97d86a4c74d3e5c011bad4733ba1203feR186-R195
The reason of giving up using replaceExpressionWithAttribute
is that this function also matches BouldReference
, which is not the case used by Generator.child
, and the output cannot be directly used as the input to pre-project. The new child should be either the original Attribute
, or an Alias
to other expressions.
I would think removing the type list @ (Literal(_, _) | CreateMap(_, _) | CreateArray(_, _))
here is a better way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@marin-ma Do you mean to say that if Generator.child
is a BoundReference
, it also needs to be wrapped with an Alias?
The BoundReference
within replaceExpressionWithAttribute
is a special optimization case that was initially placed within the case other
. I think if that's the only reason, we could add a parameter to replaceExpressionWithAttribute
named replaceBoundReference: Boolean
, with the default value of false. The Generator
could set this to true. Then it could be changed to case e: BoundReference if !replaceBoundReference => e
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find that in your initial commit, the main logic codes are put in pull-out rule. Why do you move it to SparkPlanExecApi
now? I think it’s better for SparkPlanExecApi
to only control whether it's necessary to pull out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@liujiayi771 This pull-out logic fully depends on the native implementation of Velox library. It's not a general case.
generate.copy( | ||
generator = | ||
generate.generator.withNewChildren(Seq(newGeneratorChild)).asInstanceOf[Generator], | ||
child = ProjectExec(generate.child.output :+ newGeneratorChild, generate.child) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use eliminateProjectList
to get the projectList
of pre project.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean by using elimainateProjectList(generate.child.output, newGeneratorChild)
? newGeneratorChild
can be a duplicated Attribute in generate.child.output
. It shouldn't be eliminated because the native backend identifies the last field of projection as generator's input.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add some comments for this explanation?
|
||
import scala.collection.JavaConverters._ | ||
|
||
case class GlutenGeneratorWrapper(unwrapped: Generator) extends Generator { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need this wrapper ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This wrapper is used to stop applying the PullOutPostProject
on the same GenerateExec
once it has been applied.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do not need to worry about this. PullOutPostProject
would not apply same operator twice. The RewriteSparkPlanRulesManager
framework makes sure only transforming once for each operator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After transforming, the GenerateExec
becomes ProjectExec(newChild: GenerateExec)
, and the rule continues to apply on the newChild: GenerateExec
, which results in infinite recursion. By adding this wrapper, the next time the supportsGenerate
will return false, preventing it from infinitely applying on the GenerateExec
https://github.com/apache/incubator-gluten/pull/4952/files#diff-4ebd3cbecb20dc83596c28d20936d40f48ed1b41f334525fd5b5bdca3ac3bf3fR730
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, shall we change transform
to transformUp
?
Lines 75 to 79 in bb04647
case (plan, rule) => | |
// Some rewrite rules may generate new parent plan node, we should use transform to | |
// rewrite the original plan. For example, PullOutPreProject and PullOutPostProject | |
// will generate post-project plan node. | |
plan.transform { case p => rule.apply(p) } |
Run Gluten Clickhouse CI |
context, | ||
operatorId, | ||
child.output.size) | ||
object PullOutGenerateProjectHelper extends PullOutProjectHelper { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@liujiayi771 Could you help to review again? I've made some code structure changes since your last review. This class is created for velox backend to generate the pre/post projection for GenerateExec. By extending PullOutProjectHelper
here, the protected scope of the helper functions can be preserveed.
@@ -57,12 +57,13 @@ trait PullOutProjectHelper { | |||
|
|||
protected def replaceExpressionWithAttribute( | |||
expr: Expression, | |||
projectExprsMap: mutable.HashMap[Expression, NamedExpression]): Expression = | |||
projectExprsMap: mutable.HashMap[Expression, NamedExpression], | |||
replaceBoundReference: Boolean = true): Expression = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The default value should be false, and GenerateExec
set it to true. "replace" means using an Alias
wrapper.
Run Gluten Clickhouse CI |
1 similar comment
Run Gluten Clickhouse CI |
@liujiayi771 @ulysses-you Could you help to review again? Thanks! |
It seems the failed tests are irrelevant, I tried to re-run the failed tests. |
@marin-ma LGTM. Thanks. |
} | ||
|
||
object SparkPlanExecApiImpl {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: unnecessary change
Run Gluten Clickhouse CI |
21f5b44
to
4e73c2c
Compare
Run Gluten Clickhouse CI |
A follow-up of #4901
Pullout pre/post project for
GenerateExec
. The added projections are moved into Backends API. With this approach,Literal
andCreateArray
forInline
are also supported as they can be directly constructed by the expression transformer in the project.Other fixes: 1. Fix coredump on array of null Struct Literal; fallback this case. 2. Add field name for constructing struct literal in
SubstratiToVeloxExpr
to avoid the downstream Velox operator from referencing unexpected struct fields by name.