support extended resources for Ray pods #2436

Merged
6 commits merged into ray-project:master on Oct 15, 2024

Conversation

@YQ-Wang (Contributor) commented Oct 10, 2024

Why are these changes needed?

This PR adds support for the extended resource type "vpc.amazonaws.com/efa" in the KubeRay APIServer. This enhancement allows the creation of Ray clusters that can utilize EFA resources in Pod request specifications, similar to how CPU and memory resources are specified.

The integration of EFA support is particularly beneficial for distributed training workloads in AWS Ray clusters. By leveraging EFA, we can reduce the time required to complete large-scale distributed training tasks in Ray.

  • Added support for the "vpc.amazonaws.com/efa" extended resource type in Pod request specifications
  • Updated the KubeRay APIServer to recognize and process EFA resource requests
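
For illustration, a minimal sketch (not the actual APIServer code; the container name and quantities are examples) of how an EFA request can sit alongside CPU and memory in a container's resource requests and limits:

package example

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// buildWorkerContainer shows an extended resource ("vpc.amazonaws.com/efa")
// requested next to CPU and memory on a Ray worker container.
func buildWorkerContainer() corev1.Container {
	efa := corev1.ResourceName("vpc.amazonaws.com/efa")
	return corev1.Container{
		Name: "ray-worker",
		Resources: corev1.ResourceRequirements{
			Requests: corev1.ResourceList{
				corev1.ResourceCPU:    resource.MustParse("4"),
				corev1.ResourceMemory: resource.MustParse("8Gi"),
				efa:                   resource.MustParse("32"),
			},
			Limits: corev1.ResourceList{
				efa: resource.MustParse("32"),
			},
		},
	}
}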

Related issue number

Closes #2435

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@YQ-Wang (Contributor, Author) commented Oct 11, 2024

@kevin85421 PTAL when you get a chance.

@@ -470,6 +470,8 @@ type ComputeTemplate struct {
GpuAccelerator string `protobuf:"bytes,6,opt,name=gpu_accelerator,json=gpuAccelerator,proto3" json:"gpu_accelerator,omitempty"`
// Optional pod tolerations
Tolerations []*PodToleration `protobuf:"bytes,7,rep,name=tolerations,proto3" json:"tolerations,omitempty"`
// Optional. Number of efas
Efa uint32 `protobuf:"varint,8,opt,name=efa,proto3" json:"efa,omitempty"`
Collaborator:

Is it worth generalizing this to a list of custom accelerators that are added to container resources? How long do we think this list will grow over time?

@YQ-Wang (Contributor, Author):

Sounds good, I generalized this as extended resources. PTAL.
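
For context, a rough sketch of the generalized shape implied by the later diff and config example (the field name matches runtime.ExtendedResources used below; proto tags and the other ComputeTemplate fields are omitted):

// Illustrative only: extended resources become a map from Kubernetes
// resource name to requested count instead of a single EFA field.
type ComputeTemplate struct {
	// Optional. Kubernetes extended resources, e.g. {"vpc.amazonaws.com/efa": 32}.
	ExtendedResources map[string]uint32
}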

@YQ-Wang YQ-Wang changed the title support aws efa support extended resources for Ray pods Oct 11, 2024
@YQ-Wang YQ-Wang requested a review from andrewsykim October 11, 2024 20:40
"gpu": "0",
"gpu_accelerator": "",
"memory": "8",
"extended_resources": "{\"vpc.amazonaws.com/efa\": 32}",
Collaborator:

Consider "custom_resources" instead, which is more aligned to Ray terminology: https://docs.ray.io/en/latest/ray-core/scheduling/resources.html#custom-resources

Collaborator:

On second thought, maybe custom_resources is misleading because this is never passed into the --resources flag in ray start. Do you need to include this as a custom resource in Ray, or is it enough to add it as a container resource?

@YQ-Wang (Contributor, Author):

This is not a Ray custom_resources setting; it is a Kubernetes extended resource on the container.

@@ -145,6 +145,15 @@ func buildNodeGroupAnnotations(computeTemplate *api.ComputeTemplate, image strin
return annotations
}

// Add resource to container
func addResourceToContainer(container *corev1.Container, resourceName string, quantity uint32) {
if quantity > 0 {
Collaborator:

nit:

if quantity == 0 {
  return
}

quantityStr := fmt.Sprint(quantity)
container.Resources.Requests[corev1.ResourceName(resourceName)] = resource.MustParse(quantityStr)
container.Resources.Limits[corev1.ResourceName(resourceName)] = resource.MustParse(quantityStr)

@YQ-Wang (Contributor, Author):

Done, thanks!
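
For reference, a self-contained sketch of the helper with the suggested early return applied (the imports, nil-map guards, and main function are added here so the example runs standalone; they are not part of the diff above):

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// addResourceToContainer sets both the request and the limit for an
// arbitrary (extended) resource on a container; a zero quantity is a no-op.
func addResourceToContainer(container *corev1.Container, resourceName string, quantity uint32) {
	if quantity == 0 {
		return
	}
	if container.Resources.Requests == nil {
		container.Resources.Requests = corev1.ResourceList{}
	}
	if container.Resources.Limits == nil {
		container.Resources.Limits = corev1.ResourceList{}
	}
	q := resource.MustParse(fmt.Sprint(quantity))
	container.Resources.Requests[corev1.ResourceName(resourceName)] = q
	container.Resources.Limits[corev1.ResourceName(resourceName)] = q
}

func main() {
	c := corev1.Container{Name: "ray-worker"}
	addResourceToContainer(&c, "vpc.amazonaws.com/efa", 32)
	fmt.Println(c.Resources.Requests)
}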

@YQ-Wang YQ-Wang requested a review from andrewsykim October 14, 2024 16:48
@andrewsykim (Collaborator) left a comment:

Just left one more small comment, otherwise LGTM.

@@ -800,14 +822,20 @@ func (c *RayCluster) SetAnnotationsToAllTemplates(key string, value string) {

// Build compute template
func NewComputeTemplate(runtime *api.ComputeTemplate) (*corev1.ConfigMap, error) {
extendedResourcesJSON, err := json.Marshal(runtime.ExtendedResources)
if err != nil {
return nil, fmt.Errorf("failed to marshal extended resources: %v", err)
Collaborator:

When we fail to marshal runtime.Tolerations on line 842, we log the error instead and leave tolerations unset. Should we consider something similar for extended resources?

Collaborator:

Returning is probably better here; we should consider updating line 841 below to also return an error in a follow-up PR.

@YQ-Wang (Contributor, Author):

In this change, I've modified the NewComputeTemplate function to return an error when marshaling runtime.Tolerations fails: #2444
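
As a self-contained sketch of the error-handling pattern discussed above (the struct and the config map keys here are illustrative stand-ins, not the real api.ComputeTemplate or NewComputeTemplate):

package main

import (
	"encoding/json"
	"fmt"
)

// computeTemplate stands in for the real api.ComputeTemplate, carrying only
// the two fields relevant to this thread.
type computeTemplate struct {
	Tolerations       []map[string]string
	ExtendedResources map[string]uint32
}

// buildTemplateData marshals both fields and returns an error on failure
// instead of logging and leaving the value unset.
func buildTemplateData(runtime *computeTemplate) (map[string]string, error) {
	tolerationsJSON, err := json.Marshal(runtime.Tolerations)
	if err != nil {
		return nil, fmt.Errorf("failed to marshal tolerations: %w", err)
	}
	extendedResourcesJSON, err := json.Marshal(runtime.ExtendedResources)
	if err != nil {
		return nil, fmt.Errorf("failed to marshal extended resources: %w", err)
	}
	return map[string]string{
		"tolerations":        string(tolerationsJSON),
		"extended_resources": string(extendedResourcesJSON),
	}, nil
}

func main() {
	data, err := buildTemplateData(&computeTemplate{
		ExtendedResources: map[string]uint32{"vpc.amazonaws.com/efa": 32},
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(data["extended_resources"])
}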

@andrewsykim andrewsykim merged commit 22d546a into ray-project:master Oct 15, 2024
26 of 27 checks passed
Linked issue: [Feature] Support for AWS Elastic Fabric Adapter (EFA) in KubeRay