-
Notifications
You must be signed in to change notification settings - Fork 204
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support dynamic job scaling rules #756
base: master
Are you sure you want to change the base?
Conversation
Test Results645 tests +23 635 ✅ +23 8m 36s ⏱️ +35s Results for commit 5cbfd7f. ± Comparison against base commit 1ffc3b9. This pull request removes 2 and adds 25 tests. Note that renamed tests count towards both.
♻️ This comment has been updated with latest results. |
c8f9ba9
to
c0744fe
Compare
Uploaded ArtifactsTo use these artifacts in your Gradle project, paste the following lines in your build.gradle.
|
bc4b4c7
to
4fd427a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At very high level the implementation follows the idea in the drawing. I don't see any major flaws to block you for integration testing. However for a better review experience I think there is opportunity to split this PR is smaller chunks.
public static final String TRIGGER_TYPE_SCHEDULE = "schedule"; | ||
public static final String TRIGGER_TYPE_PERPETUAL = "perpetual"; | ||
public static final String TRIGGER_TYPE_CUSTOM = "custom"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: maybe use enum?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tbh i found using enum on proto class and tests to be quite verbose and error prone comparing to const strings.
|
||
@Builder | ||
@Value | ||
public class JobScalingRule implements Serializable { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this model is quite flexible, but I'm concerned it could lead to confusion about which policies are currently playing a major role or influencing decisions. Some policies might even contradict each other, like "scaling up vs. scaling down," potentially causing the algorithm to oscillate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also added a new metric to report current active rule too. Will leave the logic as is in beta and revisit later from user feedback.
import io.mantisrx.server.core.NamedJobInfo; | ||
import io.mantisrx.server.core.PostJobStatusRequest; | ||
import io.mantisrx.server.core.Status; | ||
import io.mantisrx.server.core.*; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: don't use wildcard imports.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have a strong opinion on this and i would like to just apply current default profile.
@@ -745,6 +741,52 @@ public Observable<JobSchedulingInfo> schedulingChanges(final String jobId) { | |||
; | |||
} | |||
|
|||
public Observable<JobScalerRuleInfo> jobScalerRulesStream(final String jobId) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Consider adding docstrings here to provide a quick understanding of the purpose of this long function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Side note: These functions have similar verbose code for subscribing and streaming events from master (or at least that's my understanding of this code). Consider extracting that into a separate auxiliary function.
import io.mantisrx.server.core.JobSchedulingInfo; | ||
import io.mantisrx.server.core.NamedJobInfo; | ||
import io.mantisrx.server.core.Status; | ||
import io.mantisrx.server.core.*; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above.
String jobId; | ||
boolean jobCompleted; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting. I didn't expect these to be part of this message. Can you explain?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is in the same semantics as other stream APIs and this mechanism is used to gracefully close the connection if the current job is terminated to avoid connection leak.
@@ -308,6 +315,9 @@ private Receive getInitializingBehavior() { | |||
.match(ScaleStageRequest.class, (x) -> getSender().tell(new ScaleStageResponse(x.requestId, CLIENT_ERROR, genUnexpectedMsg(x.toString(), state), 0), getSelf())) | |||
.match(ResubmitWorkerRequest.class, (x) -> getSender().tell(new ResubmitWorkerResponse(x.requestId, CLIENT_ERROR, genUnexpectedMsg(x.toString(), state)), getSelf())) | |||
.match(WorkerEvent.class, (x) -> logger.warn(genUnexpectedMsg(x.toString(), state))) | |||
.match(JobClusterScalerRuleProto.CreateScalerRuleRequest.class, (x) -> getSender().tell(JobClusterScalerRuleProto.CreateScalerRuleResponse.builder().requestId(x.requestId).responseCode(CLIENT_ERROR_NOT_FOUND).message(genUnexpectedMsg(x.toString(), state)).build(), getSelf())) | |||
.match(JobClusterScalerRuleProto.DeleteScalerRuleRequest.class, (x) -> getSender().tell(JobClusterScalerRuleProto.DeleteScalerRuleResponse.builder().requestId(x.requestId).responseCode(CLIENT_ERROR_NOT_FOUND).message(genUnexpectedMsg(x.toString(), state)).build(), getSelf())) | |||
.match(JobClusterScalerRuleProto.GetScalerRulesRequest.class, (x) -> getSender().tell(JobClusterScalerRuleProto.GetScalerRulesResponse.builder().requestId(x.requestId).responseCode(CLIENT_ERROR_NOT_FOUND).message(genUnexpectedMsg(x.toString(), state)).build(), getSelf())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you accept GetJobScalerRuleStreamRequest
during initialization?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! updated.
6838080
to
a216116
Compare
[Reviewer Guidance]
The following is a summary of critical classes for review.
Runtime
Introduce new v2 job master service with refactor to current job scaler to allow ScalerControllerActor to recreate/shutdown running scaler. Support rule actors under coordinator actor.
API
New control plane API to support CRUD on scaling rules.
ControlPlane
Control plane implementation to support new APIs and the scaling changes stream API.
Persistency
E2E Test
Dynamic Job Scaling Rules Implementation
Problem Statement
In order to effectively manage job scaling in various scenarios, particularly during periods of bursty traffic, there is a need to dynamically update job auto scaler settings such as minimum, maximum, step-size, and burst-change. The goal is to consolidate the APIs required to modify a job's auto scaler and to allow changes to the job scaler without necessitating job resubmission.
Solution
This pull request introduces support for dynamic job scaling rules that override the default job scaler configuration. It allows multiple rules to coexist using different modes, providing flexibility in handling diverse scaling requirements.
Key Features:
JobScalingRuleCustomTrigger
interface to customize the triggering conditions and actions.Internals
ruleId
, with newer rules taking precedence over older ones./jobScalerRules/
stream API to listen for rule changes, ensuring that the system is responsive to updates.schedulingInfo
scaler config is treated as the default rule with an ID of "-1".JM v2
This implementation provides a robust framework for managing job scaling dynamically, enhancing the system's ability to handle varying traffic patterns efficiently. The introduction of multiple rule types and the ability to customize scaling behavior offers significant flexibility and control to users.
API Reference for Job Cluster Scaler Rules
This document provides details on the APIs available under
api/v1/jobClusters/{}/scalerRules
. These APIs allow you to manage scaler rules for job clusters, including creating, retrieving, and deleting scaler rules.Endpoints
1. Create Scaler Rule
POST /api/v1/jobClusters/{clusterName}/scalerRules
Request Schema
Sample Request Payload