support extended resources for Ray pods #2436

Merged
6 commits merged into ray-project:master on Oct 15, 2024

Conversation

@YQ-Wang (Contributor) commented Oct 10, 2024

Why are these changes needed?

This PR adds support for the extended resource type "vpc.amazonaws.com/efa" in the KubeRay APIServer. This enhancement allows the creation of Ray clusters that can utilize EFA resources in Pod request specifications, similar to how CPU and memory resources are specified.

The integration of EFA support is particularly beneficial for distributed training workloads in AWS Ray clusters. By leveraging EFA, we can reduce the time required to complete large-scale distributed training tasks in Ray.

  • Added support for the "vpc.amazonaws.com/efa" extended resource type in Pod request specifications
  • Updated the KubeRay APIServer to recognize and process EFA resource requests
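
For illustration, a minimal sketch (not the actual APIServer code; the container name and quantities are examples) of how an EFA request can sit alongside CPU and memory in a container's resource requests and limits:

package example

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// buildWorkerContainer shows an extended resource ("vpc.amazonaws.com/efa")
// requested next to CPU and memory on a Ray worker container.
func buildWorkerContainer() corev1.Container {
	efa := corev1.ResourceName("vpc.amazonaws.com/efa")
	return corev1.Container{
		Name: "ray-worker",
		Resources: corev1.ResourceRequirements{
			Requests: corev1.ResourceList{
				corev1.ResourceCPU:    resource.MustParse("4"),
				corev1.ResourceMemory: resource.MustParse("8Gi"),
				efa:                   resource.MustParse("32"),
			},
			Limits: corev1.ResourceList{
				efa: resource.MustParse("32"),
			},
		},
	}
}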

Related issue number

Closes #2435

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@YQ-Wang (Contributor, Author) commented Oct 11, 2024

@kevin85421 PTAL when you get a chance.

@@ -470,6 +470,8 @@ type ComputeTemplate struct {
GpuAccelerator string `protobuf:"bytes,6,opt,name=gpu_accelerator,json=gpuAccelerator,proto3" json:"gpu_accelerator,omitempty"`
// Optional pod tolerations
Tolerations []*PodToleration `protobuf:"bytes,7,rep,name=tolerations,proto3" json:"tolerations,omitempty"`
// Optional. Number of efas
Efa uint32 `protobuf:"varint,8,opt,name=efa,proto3" json:"efa,omitempty"`
Collaborator:

Is it worth generalizing this to a list of custom accelerators that are added to container resources? How long do we think this list will grow over time?

@YQ-Wang (Contributor, Author):

Sounds good, I generalized this as extended resources. PTAL.
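
For context, a rough sketch of the generalized shape implied by the later diff and config example (the field name matches runtime.ExtendedResources used below; proto tags and the other ComputeTemplate fields are omitted):

// Illustrative only: extended resources become a map from Kubernetes
// resource name to requested count instead of a single EFA field.
type ComputeTemplate struct {
	// Optional. Kubernetes extended resources, e.g. {"vpc.amazonaws.com/efa": 32}.
	ExtendedResources map[string]uint32
}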

@YQ-Wang YQ-Wang changed the title support aws efa support extended resources for Ray pods Oct 11, 2024
@YQ-Wang YQ-Wang requested a review from andrewsykim October 11, 2024 20:40
"gpu": "0",
"gpu_accelerator": "",
"memory": "8",
"extended_resources": "{\"vpc.amazonaws.com/efa\": 32}",
Collaborator:

Consider "custom_resources" instead, which is more aligned to Ray terminology: https://docs.ray.io/en/latest/ray-core/scheduling/resources.html#custom-resources

Collaborator:

On second thought, maybe custom_resources is misleading because this is never passed into the --resources flag in ray start. Do you need to include this as a custom resource in Ray, or is it enough to add it as a container resource?

@YQ-Wang (Contributor, Author):

This is not a Ray custom_resources setting; it is a Kubernetes extended resource on the container.

@@ -145,6 +145,15 @@ func buildNodeGroupAnnotations(computeTemplate *api.ComputeTemplate, image strin
return annotations
}

// Add resource to container
func addResourceToContainer(container *corev1.Container, resourceName string, quantity uint32) {
if quantity > 0 {
Collaborator:

nit:

if quantity == 0 {
  return
}

quantityStr := fmt.Sprint(quantity)
container.Resources.Requests[corev1.ResourceName(resourceName)] = resource.MustParse(quantityStr)
container.Resources.Limits[corev1.ResourceName(resourceName)] = resource.MustParse(quantityStr)

@YQ-Wang (Contributor, Author):

Done, thanks!
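
For reference, a self-contained sketch of the helper with the suggested early return applied (the imports, nil-map guards, and main function are added here so the example runs standalone; they are not part of the diff above):

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// addResourceToContainer sets both the request and the limit for an
// arbitrary (extended) resource on a container; a zero quantity is a no-op.
func addResourceToContainer(container *corev1.Container, resourceName string, quantity uint32) {
	if quantity == 0 {
		return
	}
	if container.Resources.Requests == nil {
		container.Resources.Requests = corev1.ResourceList{}
	}
	if container.Resources.Limits == nil {
		container.Resources.Limits = corev1.ResourceList{}
	}
	q := resource.MustParse(fmt.Sprint(quantity))
	container.Resources.Requests[corev1.ResourceName(resourceName)] = q
	container.Resources.Limits[corev1.ResourceName(resourceName)] = q
}

func main() {
	c := corev1.Container{Name: "ray-worker"}
	addResourceToContainer(&c, "vpc.amazonaws.com/efa", 32)
	fmt.Println(c.Resources.Requests)
}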

@YQ-Wang YQ-Wang requested a review from andrewsykim October 14, 2024 16:48
@andrewsykim (Collaborator) left a comment:

Just left one more small comment, otherwise LGTM.

@@ -800,14 +822,20 @@ func (c *RayCluster) SetAnnotationsToAllTemplates(key string, value string) {

// Build compute template
func NewComputeTemplate(runtime *api.ComputeTemplate) (*corev1.ConfigMap, error) {
extendedResourcesJSON, err := json.Marshal(runtime.ExtendedResources)
if err != nil {
return nil, fmt.Errorf("failed to marshal extended resources: %v", err)
Collaborator:

When we fail to marshal runtime.Tolerations on line 842, we log the error instead and leave tolerations unset. Should we consider something similar for extended resources?

Collaborator:

Returning is probably better here; we should consider updating line 841 below to also return an error in a follow-up PR.

@YQ-Wang (Contributor, Author):

In this change, I've modified the NewComputeTemplate function to return an error when marshaling runtime.Tolerations fails: #2444
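
As a self-contained sketch of the error-handling pattern discussed above (the struct and the config map keys here are illustrative stand-ins, not the real api.ComputeTemplate or NewComputeTemplate):

package main

import (
	"encoding/json"
	"fmt"
)

// computeTemplate stands in for the real api.ComputeTemplate, carrying only
// the two fields relevant to this thread.
type computeTemplate struct {
	Tolerations       []map[string]string
	ExtendedResources map[string]uint32
}

// buildTemplateData marshals both fields and returns an error on failure
// instead of logging and leaving the value unset.
func buildTemplateData(runtime *computeTemplate) (map[string]string, error) {
	tolerationsJSON, err := json.Marshal(runtime.Tolerations)
	if err != nil {
		return nil, fmt.Errorf("failed to marshal tolerations: %w", err)
	}
	extendedResourcesJSON, err := json.Marshal(runtime.ExtendedResources)
	if err != nil {
		return nil, fmt.Errorf("failed to marshal extended resources: %w", err)
	}
	return map[string]string{
		"tolerations":        string(tolerationsJSON),
		"extended_resources": string(extendedResourcesJSON),
	}, nil
}

func main() {
	data, err := buildTemplateData(&computeTemplate{
		ExtendedResources: map[string]uint32{"vpc.amazonaws.com/efa": 32},
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(data["extended_resources"])
}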

@andrewsykim andrewsykim merged commit 22d546a into ray-project:master Oct 15, 2024
26 of 27 checks passed
Linked issue: [Feature] Support for AWS Elastic Fabric Adapter (EFA) in KubeRay