Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove kyuubi dependency of the spark lineage plugin #3537

Closed
wants to merge 4 commits into from

Conversation

wForget
Copy link
Member

@wForget wForget commented Sep 21, 2022

Why are the changes needed?

Remove kyuubi dependency of the spark lineage plugin.

How was this patch tested?

  • Add some test cases that check the changes thoroughly including negative and positive cases if possible

  • Add screenshots for manual tests if appropriate
    image

  • Run test locally before make a pull request

@wForget
Copy link
Member Author

wForget commented Sep 21, 2022

cc @pan3793 @iodone

@codecov-commenter
Copy link

codecov-commenter commented Sep 21, 2022

Codecov Report

Merging #3537 (1cd2f54) into master (66f28ef) will increase coverage by 0.92%.
The diff coverage is 48.71%.

@@             Coverage Diff              @@
##             master    #3537      +/-   ##
============================================
+ Coverage     51.66%   52.58%   +0.92%     
  Complexity       13       13              
============================================
  Files           482      494      +12     
  Lines         26933    27737     +804     
  Branches       3760     3834      +74     
============================================
+ Hits          13914    14585     +671     
- Misses        11663    11758      +95     
- Partials       1356     1394      +38     
Impacted Files Coverage Δ
...bi/plugin/lineage/helper/SparkListenerHelper.scala 33.33% <ø> (ø)
...in/lineage/helper/SparkSQLLineageParseHelper.scala 62.70% <0.00%> (ø)
...kyuubi/plugin/lineage/helper/SemanticVersion.scala 39.28% <39.28%> (ø)
.../SparkOperationLineageQueryExecutionListener.scala 66.66% <50.00%> (ø)
.../plugin/lineage/events/OperationLineageEvent.scala 66.66% <100.00%> (+2.66%) ⬆️
...ache/spark/kyuubi/lineage/SparkContextHelper.scala 100.00% <100.00%> (ø)
...apache/kyuubi/service/TBinaryFrontendService.scala 47.77% <0.00%> (-39.98%) ⬇️
...rg/apache/kyuubi/ctl/cmd/log/LogBatchCommand.scala 63.93% <0.00%> (-14.07%) ⬇️
...kyuubi/engine/spark/session/SparkSessionImpl.scala 61.53% <0.00%> (-9.05%) ⬇️
...che/kyuubi/server/KyuubiTHttpFrontendService.scala 59.86% <0.00%> (-4.37%) ⬇️
... and 82 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@iodone
Copy link
Contributor

iodone commented Sep 22, 2022

cc @pan3793 @iodone

What is the reason for the removal? Is it possible to have the lineage plugin applied to the custom spark engine that is not Kyubi Engine?

@wForget
Copy link
Member Author

wForget commented Sep 22, 2022

Is it possible to have the lineage plugin applied to the custom spark engine that is not Kyubi Engine?

Yes, I would prefer it to remove kyuubi dependencies and be able to use it in any spark application.

@ulysses-you
Copy link
Contributor

is it possible to shade the kyuubi dependencies into lineage plugin ?

@yikf
Copy link
Contributor

yikf commented Sep 22, 2022

is it possible to shade the kyuubi dependencies into lineage plugin ?

Just a quesion, what's the reason for wanting shade? to reduce code redundancy?

I think the shade approach would have redundant multiple jars in the Kyuubi scenario, and given the background of not much redundant code in the current approach, I prefer to remove the dependencies directly, what do you think?

case e: SparkListenerEvent if SparkUtilsHelper.lineageClassesArePresent =>
e match {
case OperationLineageEventWrapper(lineageEvent) =>
updateStatementLineageStore(lineageEvent)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we can collect both operationEvent and lineageEvent, whether to consider adding statementId into the operationEvent with the same executionId

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we can collect both operationEvent and lineageEvent, whether to consider adding statementId into the operationEvent with the same executionId

If we want to get the statementId we may need to use SQLOperationListener and make larger change, should we do that?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about we get the statementId via qe.sparkSession.sparkContext.getLocalProperty("kyuubi.statement.id") in SparkOperationLineageQueryExecutionListener#onSuccess?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SparkOperationEvent was been stored in the same kvstore with LineageEvent , can we associate them here?

The listener is a separate thread from the execution thread of the spark session, and I'm not sure if we can get it correctly.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The KVStore seems to require statementId to get SparkOperationEvent.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, it seems to be difficult to correlate the two.

@@ -61,6 +61,14 @@ class EngineEventsStore(store: KVStore) {
}
}

def getStatementLineage(executionId: String): Option[SparkOperationLineageEvent] = {
try {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method has never been called?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method has never been called?

Yes, it's just for convenience we use it later such as displaying lineage in the Spark UI. do i need to remove it?

@wForget
Copy link
Member Author

wForget commented Sep 26, 2022

@iodone , thanks for your review, sorry for replying so late.

@iodone
Copy link
Contributor

iodone commented Sep 28, 2022

@wForget #3444, In order to implement the PlanOnly mode of lineage, imported SQLParserHelper with spark-lineage-plugin. This will cause the kyuubi spark engine to strongly depend on spark-lineage-plugin.
Are you in favor of this usage, and if that holds true, your implementation here could just refer directly to spark-lineage-plugin. WDYT?

@wForget
Copy link
Member Author

wForget commented Sep 28, 2022

@wForget #3444, In order to implement the PlanOnly mode of lineage, imported SQLParserHelper with spark-lineage-plugin. This will cause the kyuubi spark engine to strongly depend on spark-lineage-plugin. Are you in favor of this usage, and if that holds true, your implementation here could just refer directly to spark-lineage-plugin. WDYT?

I don't tend to add lineage plugin dependencies in kyuubi engine, because in the future we may make more adaptations in lineage plugins, such as we can send lineage events to kafla, atlas, etc. Maybe we can extract a lineage common module to make kyuubi engine depend on.

def lineageClassesArePresent: Boolean = {
try {
Utils.classForName(
"org.apache.kyuubi.plugin.lineage.SparkOperationLineageQueryExecutionListener")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yaooqinn #3444 Can we refer to this method, kyuubi-spark-engine depend on the lineage plugin using provided scope

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I'm not talking about the maven scope. kyuubi-spark-engine should not depend on an engine plugin。

@yaooqinn
Copy link
Member

yaooqinn commented Sep 29, 2022

A general question, why does a spark plugin have to rely on the kyuubi code path?

@wForget
Copy link
Member Author

wForget commented Sep 29, 2022

A general question, why does a spark plugin have to rely on the kyuubi code path?

The current lineage plugin is more like a kyuubi spark engine plugin, we rely on kyuubi-events module to produce kyuubi lineage event. This pr wants to change lineage to spark event, in order to remove kyuubi dependency.
However, I thought we might use lineage spark event in engine later such as show it on spark ui, so I introduced lineage plugin dependency in kyuubi engine again.

@yaooqinn
Copy link
Member

So we can not call it a plugin, it's a upstream package

@yaooqinn
Copy link
Member

oh, I mess this PR with @iodone's one

* Encapsulate a component (Kyuubi/Spark/Hive/Flink etc.) version
* for the convenience of version checks.
*/
case class SemanticVersion(majorVersion: Int, minorVersion: Int) {
Copy link
Member

@yaooqinn yaooqinn Sep 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, before introducing this, the version comparison seems to be very simple. Now, plenty of copies of this class.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, before introducing this, the version comparison seems to be very simple. Now, plenty of copies of this class.

In this plugin we only use isSparkVersionAtMost method, do I need to simply reimplement a method instead of copying SemanticVersion?

@pan3793
Copy link
Member

pan3793 commented Nov 1, 2022

Glancing at the comments, I think the author and reviewers reached a consensus the lineage plugin should be a pure Spark plugin and should not depend on kyuubi-common, kyuubi-event modules, but have some arguments on the relationship between kyuubi-spark-sql-engine and kyuubi-spark-lineage, so, can we narrow this PR scope into making kyuubi-spark-lineage independent w/ kyuubi-event?

Some users want to use this plugin along w/ the latest release kyuubi-spark-sql-engine but got class conflict error since there were incompatible changes that occurred in kyuubi-event.

@wForget
Copy link
Member Author

wForget commented Nov 2, 2022

Glancing at the comments, I think the author and reviewers reached a consensus the lineage plugin should be a pure Spark plugin and should not depend on kyuubi-common, kyuubi-event modules, but have some arguments on the relationship between kyuubi-spark-sql-engine and kyuubi-spark-lineage, so, can we narrow this PR scope into making kyuubi-spark-lineage independent w/ kyuubi-event?

Good idea. I will narrow this PR scope.

@wForget
Copy link
Member Author

wForget commented Nov 2, 2022

seems unrelated failure:

AdminResourceSuite:
- delete engine - user share level *** FAILED ***
  List("serviceUri=localhost:32971;version=1.7.0-SNAPSHOT;refId=8bd0f165-37a6-49cf-af67-d82f7a0b3cbb;sequence=0000000000") had size 1 instead of expected size 0 (AdminResourceSuite.scala:103)

@iodone
Copy link
Contributor

iodone commented Nov 2, 2022

seems unrelated failure:

AdminResourceSuite:
- delete engine - user share level *** FAILED ***
  List("serviceUri=localhost:32971;version=1.7.0-SNAPSHOT;refId=8bd0f165-37a6-49cf-af67-d82f7a0b3cbb;sequence=0000000000") had size 1 instead of expected size 0 (AdminResourceSuite.scala:103)

I have the same problem in #3558 , but other unrelated changes do not trigger this failure case.

@wForget
Copy link
Member Author

wForget commented Nov 2, 2022

seems unrelated failure:

AdminResourceSuite:
- delete engine - user share level *** FAILED ***
  List("serviceUri=localhost:32971;version=1.7.0-SNAPSHOT;refId=8bd0f165-37a6-49cf-af67-d82f7a0b3cbb;sequence=0000000000") had size 1 instead of expected size 0 (AdminResourceSuite.scala:103)

I have the same problem in #3558 , but other unrelated changes do not trigger this failure case.

This test case doesn't use the kyuubi-spark-lineage plugin, let me re-run the check.

@iodone
Copy link
Contributor

iodone commented Nov 3, 2022

seems unrelated failure:

AdminResourceSuite:
- delete engine - user share level *** FAILED ***
  List("serviceUri=localhost:32971;version=1.7.0-SNAPSHOT;refId=8bd0f165-37a6-49cf-af67-d82f7a0b3cbb;sequence=0000000000") had size 1 instead of expected size 0 (AdminResourceSuite.scala:103)

I have the same problem in #3558 , but other unrelated changes do not trigger this failure case.

This test case doesn't use the kyuubi-spark-lineage plugin, let me re-run the check.

In my case, this failure case still exists with re-running

@wForget
Copy link
Member Author

wForget commented Nov 3, 2022

In my case, this failure case still exists with re-running

It is being followed up in #3753, let's try again later.

@pan3793 pan3793 added this to the v1.7.0 milestone Nov 3, 2022
@pan3793
Copy link
Member

pan3793 commented Nov 3, 2022

Thanks, merging to master

@pan3793 pan3793 closed this in 7f83c45 Nov 3, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants