[CELEBORN-1601] Support revise lost shuffles #2746
base: main
@FMX, could you also support the corresponding CLI command for the HTTP endpoint to revise lost shuffles?
Sounds good. I'll add the CLI command.
@SteNicholas Thanks. I have added the CLI command and the API endpoint. Please review this PR when you have time.
if (masterOptions.addClusterAlias != null && masterOptions.addClusterAlias.nonEmpty)
  runAddClusterAlias
if (masterOptions.removeClusterAlias != null && masterOptions.removeClusterAlias.nonEmpty)
if (masterOptions.removeClusterAlias != null && masterOptions.removeClusterAlias.nonEmpty) {
Why did you change this?
@@ -107,6 +107,9 @@ enum MessageType {
REPORT_BARRIER_STAGE_ATTEMPT_FAILURE_RESPONSE = 84;
SEGMENT_START = 85;
NOTIFY_REQUIRED_SEGMENT = 86;

REVISE_LOST_SHUFFLES = 202;
Why does the number start from 202?
To stay compatible with our internal version.
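One way to read this reply: a protobuf enum value travels on the wire as its number, not its name, so two builds only agree on a message type if they assign it the same number. The sketch below mirrors the three numbers from the hunk above in plain Java; the class and method names are hypothetical and this is not Celeborn's generated protobuf code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

class MessageNumbers {
    // Numbers copied from the diff hunk; everything else is illustrative.
    enum Type {
        SEGMENT_START(85),
        NOTIFY_REQUIRED_SEGMENT(86),
        REVISE_LOST_SHUFFLES(202);

        final int number;

        Type(int number) {
            this.number = number;
        }
    }

    private static final Map<Integer, Type> BY_NUMBER = new HashMap<>();

    static {
        for (Type t : Type.values()) {
            BY_NUMBER.put(t.number, t);
        }
    }

    // A receiver resolves the type from the number alone.
    static Optional<Type> fromNumber(int number) {
        return Optional.ofNullable(BY_NUMBER.get(number));
    }

    public static void main(String[] args) {
        System.out.println(fromNumber(202).isPresent());
        System.out.println(fromNumber(87).isPresent());
    }
}
```

In this sketch, 87 through 201 stay unassigned, so later additions can continue from 87 without colliding with a build that already uses 202.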
@@ -433,6 +436,7 @@ message PbHeartbeatFromApplicationResponse {
repeated PbWorkerInfo excludedWorkers = 2;
repeated PbWorkerInfo unknownWorkers = 3;
repeated PbWorkerInfo shuttingWorkers = 4;
repeated int32 registeredShuffles = 6;
repeated int32 registeredShuffles = 6;
repeated int32 registeredShuffles = 5;
@@ -73,6 +75,7 @@ message ResourceRequest {
optional WorkerEventRequest workerEventRequest = 22;
optional ApplicationMetaRequest applicationMetaRequest = 23;
optional ReportWorkerDecommissionRequest reportWorkerDecommissionRequest = 24;
optional ReviseLostShufflesRequest reviseLostShufflesRequest = 102;
Why is the number 102?
@@ -1069,13 +1096,18 @@ private[celeborn] class Master(
if (shouldResponse) {
// UserResourceConsumption and DiskInfo are eliminated from WorkerInfo
// during serialization of HeartbeatFromApplicationResponse
var appRelatedShuffles = statusSystem.registeredAppAndShuffles.get(appId)
var appRelatedShuffles = statusSystem.registeredAppAndShuffles.get(appId)
val appRelatedShuffles = statusSystem.registeredAppAndShuffles.getOrDefault(appId, Collections.emptySet())
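The two lines in this suggestion differ in what they return for an application with no registered shuffles: `get` returns null, while `getOrDefault` returns the supplied empty set. A minimal Java sketch, assuming `registeredAppAndShuffles` behaves like a `Map<String, Set<Integer>>`; the class, field, and method below are stand-ins, not Celeborn's actual types.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class RegisteredShufflesLookup {
    // Hypothetical stand-in for statusSystem.registeredAppAndShuffles.
    static final Map<String, Set<Integer>> registeredAppAndShuffles = new ConcurrentHashMap<>();

    // Never returns null: unknown applications map to an immutable empty set.
    static Set<Integer> shufflesFor(String appId) {
        return registeredAppAndShuffles.getOrDefault(appId, Collections.emptySet());
    }

    public static void main(String[] args) {
        registeredAppAndShuffles.put("app-1", new HashSet<>(Arrays.asList(0, 1, 2)));
        System.out.println(shufflesFor("app-1").size());
        System.out.println(shufflesFor("app-unknown").isEmpty());
    }
}
```

Callers can then iterate the result directly without a null check, and the binding no longer needs to be reassigned, which is consistent with the switch from `var` to `val` in the suggestion.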
e71ac9e to 1daa172
This PR is stale because it has been open 20 days with no activity. Remove stale label or comment or this will be closed in 10 days.
What changes were proposed in this pull request?
Support revising lost shuffle IDs in long-running jobs, such as Flink batch jobs.
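The description does not spell out the mechanism, so the following is only a sketch of one plausible flow, assuming the `registeredShuffles` list in the heartbeat response lets a client spot shuffle IDs the master no longer knows and send them back in a revise request. All class, method, and variable names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class ReviseLostShufflesSketch {
    // Hypothetical master-side registry: appId -> registered shuffle IDs.
    static final Map<String, Set<Integer>> registered = new HashMap<>();

    // Client side: shuffle IDs the client knows but the master did not report.
    static List<Integer> findLost(Set<Integer> clientKnown, Collection<Integer> masterReported) {
        List<Integer> lost = new ArrayList<>();
        for (Integer id : new TreeSet<>(clientKnown)) { // sorted for stable output
            if (!masterReported.contains(id)) {
                lost.add(id);
            }
        }
        return lost;
    }

    // Master side: re-register the IDs carried by a revise request.
    static void revise(String appId, List<Integer> lostShuffles) {
        registered.computeIfAbsent(appId, k -> new HashSet<>()).addAll(lostShuffles);
    }

    public static void main(String[] args) {
        Set<Integer> clientKnown = new HashSet<>(Arrays.asList(0, 1, 2, 3));
        registered.put("app-1", new HashSet<>(Arrays.asList(0, 2)));

        List<Integer> lost = findLost(clientKnown, registered.get("app-1"));
        revise("app-1", lost);
        System.out.println(lost + " -> " + new TreeSet<>(registered.get("app-1")));
    }
}
```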
Why are the changes needed?
Does this PR introduce any user-facing change?
NO.
How was this patch tested?
Cluster tests.