-
Notifications
You must be signed in to change notification settings - Fork 919
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove kyuubi dependency of the spark lineage plugin #3537
Conversation
Codecov Report
@@ Coverage Diff @@
## master #3537 +/- ##
============================================
+ Coverage 51.66% 52.58% +0.92%
Complexity 13 13
============================================
Files 482 494 +12
Lines 26933 27737 +804
Branches 3760 3834 +74
============================================
+ Hits 13914 14585 +671
- Misses 11663 11758 +95
- Partials 1356 1394 +38
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
Yes, I would prefer it to remove kyuubi dependencies and be able to use it in any spark application. |
is it possible to shade the kyuubi dependencies into lineage plugin ? |
Just a quesion, what's the reason for wanting shade? to reduce code redundancy? I think the shade approach would have redundant multiple jars in the Kyuubi scenario, and given the background of not much redundant code in the current approach, I prefer to remove the dependencies directly, what do you think? |
case e: SparkListenerEvent if SparkUtilsHelper.lineageClassesArePresent => | ||
e match { | ||
case OperationLineageEventWrapper(lineageEvent) => | ||
updateStatementLineageStore(lineageEvent) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we can collect both operationEvent
and lineageEvent
, whether to consider adding statementId
into the operationEvent
with the same executionId
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we can collect both
operationEvent
andlineageEvent
, whether to consider addingstatementId
into theoperationEvent
with the sameexecutionId
If we want to get the statementId we may need to use SQLOperationListener and make larger change, should we do that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about we get the statementId
via qe.sparkSession.sparkContext.getLocalProperty("kyuubi.statement.id")
in SparkOperationLineageQueryExecutionListener#onSuccess
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SparkOperationEvent
was been stored in the same kvstore
with LineageEvent
, can we associate them here?
The listener is a separate thread from the execution thread of the spark session, and I'm not sure if we can get it correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The KVStore seems to require statementId to get SparkOperationEvent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, it seems to be difficult to correlate the two.
@@ -61,6 +61,14 @@ class EngineEventsStore(store: KVStore) { | |||
} | |||
} | |||
|
|||
def getStatementLineage(executionId: String): Option[SparkOperationLineageEvent] = { | |||
try { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method has never been called?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method has never been called?
Yes, it's just for convenience we use it later such as displaying lineage in the Spark UI. do i need to remove it?
@iodone , thanks for your review, sorry for replying so late. |
@wForget #3444, In order to implement the |
I don't tend to add lineage plugin dependencies in kyuubi engine, because in the future we may make more adaptations in lineage plugins, such as we can send lineage events to kafla, atlas, etc. Maybe we can extract a lineage common module to make kyuubi engine depend on. |
def lineageClassesArePresent: Boolean = { | ||
try { | ||
Utils.classForName( | ||
"org.apache.kyuubi.plugin.lineage.SparkOperationLineageQueryExecutionListener") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I'm not talking about the maven scope. kyuubi-spark-engine
should not depend on an engine plugin。
A general question, why does a spark plugin have to rely on the kyuubi code path? |
The current lineage plugin is more like a kyuubi spark engine plugin, we rely on kyuubi-events module to produce kyuubi lineage event. This pr wants to change lineage to spark event, in order to remove kyuubi dependency. |
So we can not call it a plugin, it's a upstream package |
oh, I mess this PR with @iodone's one |
* Encapsulate a component (Kyuubi/Spark/Hive/Flink etc.) version | ||
* for the convenience of version checks. | ||
*/ | ||
case class SemanticVersion(majorVersion: Int, minorVersion: Int) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, before introducing this, the version comparison seems to be very simple. Now, plenty of copies of this class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, before introducing this, the version comparison seems to be very simple. Now, plenty of copies of this class.
In this plugin we only use isSparkVersionAtMost method, do I need to simply reimplement a method instead of copying SemanticVersion
?
Glancing at the comments, I think the author and reviewers reached a consensus the lineage plugin should be a pure Spark plugin and should not depend on Some users want to use this plugin along w/ the latest release |
Good idea. I will narrow this PR scope. |
seems unrelated failure:
|
I have the same problem in #3558 , but other unrelated changes do not trigger this failure case. |
This test case doesn't use the kyuubi-spark-lineage plugin, let me re-run the check. |
In my case, this failure case still exists with re-running |
It is being followed up in #3753, let's try again later. |
Thanks, merging to master |
Why are the changes needed?
Remove kyuubi dependency of the spark lineage plugin.
How was this patch tested?
Add some test cases that check the changes thoroughly including negative and positive cases if possible
Add screenshots for manual tests if appropriate
Run test locally before make a pull request