Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature] Disable zero downtime upgrade for a RayService using RayServiceSpec #2468

Merged
merged 1 commit into from
Nov 20, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
14 changes: 14 additions & 0 deletions docs/reference/api.md
Original file line number Diff line number Diff line change
Expand Up @@ -207,12 +207,26 @@ _Appears in:_
| `serviceUnhealthySecondThreshold` _integer_ | Deprecated: This field is not used anymore. ref: https://github.com/ray-project/kuberay/issues/1685 | | |
| `deploymentUnhealthySecondThreshold` _integer_ | Deprecated: This field is not used anymore. ref: https://github.com/ray-project/kuberay/issues/1685 | | |
| `serveService` _[Service](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#service-v1-core)_ | ServeService is the Kubernetes service for head node and worker nodes who have healthy http proxy to serve traffics. | | |
| `upgradeStrategy` _[RayServiceUpgradeStrategy](#rayserviceupgradestrategy)_ | UpgradeStrategy represents the strategy used when upgrading the RayService. Currently supports `NewCluster` and `None` | | |
| `serveConfigV2` _string_ | Important: Run "make" to regenerate code after modifying this file<br />Defines the applications and deployments to deploy, should be a YAML multi-line scalar string. | | |
| `rayClusterConfig` _[RayClusterSpec](#rayclusterspec)_ | | | |




#### RayServiceUpgradeStrategy

_Underlying type:_ _string_





_Appears in:_
- [RayServiceSpec](#rayservicespec)



#### ScaleStrategy


Expand Down
2 changes: 2 additions & 0 deletions helm-chart/kuberay-operator/crds/ray.io_rayservices.yaml

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

11 changes: 11 additions & 0 deletions ray-operator/apis/ray/v1/rayservice_types.go
Original file line number Diff line number Diff line change
Expand Up @@ -20,6 +20,15 @@ const (
FailedToUpdateService ServiceStatus = "FailedToUpdateService"
)

type RayServiceUpgradeStrategy string

const (
// During upgrade, NewCluster strategy will create new upgraded cluster and switch to it when it becomes ready
NewCluster RayServiceUpgradeStrategy = "NewCluster"
chiayi marked this conversation as resolved.
Show resolved Hide resolved
// No new cluster will be created while the strategy is set to None
None RayServiceUpgradeStrategy = "None"
)

// These statuses should match Ray Serve's application statuses
// See `enum ApplicationStatus` in https://sourcegraph.com/github.com/ray-project/ray/-/blob/src/ray/protobuf/serve.proto for more details.
var ApplicationStatusEnum = struct {
Expand Down Expand Up @@ -57,6 +66,8 @@ type RayServiceSpec struct {
DeploymentUnhealthySecondThreshold *int32 `json:"deploymentUnhealthySecondThreshold,omitempty"`
// ServeService is the Kubernetes service for head node and worker nodes who have healthy http proxy to serve traffics.
ServeService *corev1.Service `json:"serveService,omitempty"`
// UpgradeStrategy represents the strategy used when upgrading the RayService. Currently supports `NewCluster` and `None`
UpgradeStrategy *RayServiceUpgradeStrategy `json:"upgradeStrategy,omitempty"`
// Important: Run "make" to regenerate code after modifying this file
// Defines the applications and deployments to deploy, should be a YAML multi-line scalar string.
ServeConfigV2 string `json:"serveConfigV2,omitempty"`
Expand Down
5 changes: 5 additions & 0 deletions ray-operator/apis/ray/v1/zz_generated.deepcopy.go

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

2 changes: 2 additions & 0 deletions ray-operator/config/crd/bases/ray.io_rayservices.yaml

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

19 changes: 17 additions & 2 deletions ray-operator/controllers/ray/rayservice_controller.go
Original file line number Diff line number Diff line change
Expand Up @@ -246,6 +246,12 @@ func validateRayServiceSpec(rayService *rayv1.RayService) error {
if headSvc := rayService.Spec.RayClusterSpec.HeadGroupSpec.HeadService; headSvc != nil && headSvc.Name != "" {
return fmt.Errorf("spec.rayClusterConfig.headGroupSpec.headService.metadata.name should not be set")
}

if upgradeStrategy := rayService.Spec.UpgradeStrategy; upgradeStrategy != nil {
if *upgradeStrategy != rayv1.NewCluster && *upgradeStrategy != rayv1.None {
return fmt.Errorf("spec.UpgradeStrategy value %s is invalid, valid options are %s or %s", *upgradeStrategy, rayv1.NewCluster, rayv1.None)
}
}
return nil
}

Expand Down Expand Up @@ -422,10 +428,19 @@ func (r *RayServiceReconciler) reconcileRayCluster(ctx context.Context, rayServi
if clusterAction == RolloutNew {
// For LLM serving, some users might not have sufficient GPU resources to run two RayClusters simultaneously.
// Therefore, KubeRay offers ENABLE_ZERO_DOWNTIME as a feature flag for zero-downtime upgrades.
zeroDowntimeEnvVar := os.Getenv(ENABLE_ZERO_DOWNTIME)
rayServiceSpecUpgradeStrategy := rayServiceInstance.Spec.UpgradeStrategy
// There are two ways to enable zero downtime upgrade. Through ENABLE_ZERO_DOWNTIME env var or setting Spec.UpgradeStrategy.
// If no fields are set, zero downtime upgrade by default is enabled.
// Spec.UpgradeStrategy takes precedence over ENABLE_ZERO_DOWNTIME.
enableZeroDowntime := true
if s := os.Getenv(ENABLE_ZERO_DOWNTIME); strings.ToLower(s) == "false" {
enableZeroDowntime = false
if zeroDowntimeEnvVar != "" {
kevin85421 marked this conversation as resolved.
Show resolved Hide resolved
enableZeroDowntime = strings.ToLower(zeroDowntimeEnvVar) != "false"
}
if rayServiceSpecUpgradeStrategy != nil {
enableZeroDowntime = *rayServiceSpecUpgradeStrategy == rayv1.NewCluster
}

if enableZeroDowntime || !enableZeroDowntime && activeRayCluster == nil {
// Add a pending cluster name. In the next reconcile loop, shouldPrepareNewRayCluster will return DoNothing and we will
// actually create the pending RayCluster instance.
Expand Down
67 changes: 59 additions & 8 deletions ray-operator/controllers/ray/rayservice_controller_unit_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -44,6 +44,14 @@ func TestValidateRayServiceSpec(t *testing.T) {
Spec: rayv1.RayServiceSpec{},
})
assert.NoError(t, err, "The RayService spec is valid.")

var upgradeStrat rayv1.RayServiceUpgradeStrategy = "invalidStrategy"
err = validateRayServiceSpec(&rayv1.RayService{
Spec: rayv1.RayServiceSpec{
UpgradeStrategy: &upgradeStrat,
},
})
assert.Error(t, err, "spec.UpgradeStrategy is invalid")
}

func TestGenerateHashWithoutReplicasAndWorkersToDelete(t *testing.T) {
Expand Down Expand Up @@ -762,12 +770,13 @@ func TestReconcileRayCluster(t *testing.T) {
}

tests := map[string]struct {
activeCluster *rayv1.RayCluster
kubeRayVersion string
updateRayClusterSpec bool
enableZeroDowntime bool
shouldPrepareNewCluster bool
updateKubeRayVersion bool
activeCluster *rayv1.RayCluster
rayServiceUpgradeStrategy rayv1.RayServiceUpgradeStrategy
kubeRayVersion string
updateRayClusterSpec bool
enableZeroDowntime bool
shouldPrepareNewCluster bool
updateKubeRayVersion bool
}{
// Test 1: Neither active nor pending clusters exist. The `markRestart` function will be called, so the `PendingServiceStatus.RayClusterName` should be set.
"Zero-downtime upgrade is enabled. Neither active nor pending clusters exist.": {
Expand All @@ -790,8 +799,8 @@ func TestReconcileRayCluster(t *testing.T) {
enableZeroDowntime: true,
shouldPrepareNewCluster: true,
},
// Test 4: The active cluster exists. Trigger the zero-downtime upgrade.
"Zero-downtime upgrade is disabled. The active cluster exists. Trigger the zero-downtime upgrade.": {
// Test 4: The active cluster exists. Zero-downtime upgrade is false, should not trigger zero-downtime upgrade.
"Zero-downtime upgrade is disabled. The active cluster exists. Does not trigger the zero-downtime upgrade.": {
activeCluster: activeCluster.DeepCopy(),
updateRayClusterSpec: true,
enableZeroDowntime: false,
Expand All @@ -815,6 +824,45 @@ func TestReconcileRayCluster(t *testing.T) {
updateKubeRayVersion: true,
kubeRayVersion: "new-version",
},
// Test 7: Zero downtime upgrade is enabled, but is enabled through the RayServiceSpec
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we add more tests? We have 4 combinations enableZeroDowntime (true, false) and rayServiceUpgradeStrategy (NewCluster, None).

Copy link
Contributor Author

@chiayi chiayi Nov 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the following combinations:
env var: false, spec: true
env var: true, spec: false
env var: false, spec: unset
env var: false, spec: false

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On second thought, I believe there are 6 cases based on enableZeroDowntime (true, false) and rayServiceUpgradeStrategy (NewCluster, None, "").

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe all 6 of the combinations are covered. The "" should be covered by the original tests and the ("NewCluster", "None") x (true, false) are covered by the new tests that are added.

"Zero-downtime upgrade enabled. The active cluster exist. Zero-downtime upgrade is triggered through RayServiceSpec.": {
activeCluster: activeCluster.DeepCopy(),
updateRayClusterSpec: true,
enableZeroDowntime: true,
shouldPrepareNewCluster: true,
rayServiceUpgradeStrategy: rayv1.NewCluster,
},
// Test 8: Zero downtime upgrade is enabled. Env var is set to false but RayServiceSpec is set to NewCluster. Trigger the zero-downtime upgrade.
"Zero-downtime upgrade is enabled through RayServiceSpec and not through env var. Active cluster exist. Trigger the zero-downtime upgrade.": {
activeCluster: activeCluster.DeepCopy(),
updateRayClusterSpec: true,
enableZeroDowntime: false,
shouldPrepareNewCluster: true,
rayServiceUpgradeStrategy: rayv1.NewCluster,
},
// Test 9: Zero downtime upgrade is disabled. Env var is set to true but RayServiceSpec is set to None.
"Zero-downtime upgrade is disabled. Env var is set to true but RayServiceSpec is set to None.": {
activeCluster: activeCluster.DeepCopy(),
updateRayClusterSpec: true,
enableZeroDowntime: true,
shouldPrepareNewCluster: false,
rayServiceUpgradeStrategy: rayv1.None,
},
// Test 10: Zero downtime upgrade is enabled. Neither the env var nor the RayServiceSpec is set. Trigger the zero-downtime upgrade.
"Zero-downtime upgrade is enabled. Neither the env var nor the RayServiceSpec is set.": {
activeCluster: nil,
updateRayClusterSpec: true,
shouldPrepareNewCluster: true,
rayServiceUpgradeStrategy: "",
},
// Test 11: Zero downtime upgrade is disabled. Both the env var and the RayServiceSpec is set to disable zero-downtime upgrade.
"Zero-downtime upgrade is disabled by both env var and RayServiceSpec.": {
activeCluster: activeCluster.DeepCopy(),
updateRayClusterSpec: true,
enableZeroDowntime: false,
shouldPrepareNewCluster: false,
rayServiceUpgradeStrategy: rayv1.None,
},
}

for name, tc := range tests {
Expand All @@ -838,6 +886,9 @@ func TestReconcileRayCluster(t *testing.T) {
Scheme: newScheme,
}
service := rayService.DeepCopy()
if tc.rayServiceUpgradeStrategy != "" {
service.Spec.UpgradeStrategy = &tc.rayServiceUpgradeStrategy
}
if tc.updateRayClusterSpec {
service.Spec.RayClusterSpec.RayVersion = "new-version"
}
Expand Down

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

Loading