-
Notifications
You must be signed in to change notification settings - Fork 430
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
support extended resources for Ray pods #2436
Conversation
@kevin85421 PTAL when you get a chance. |
proto/go_client/config.pb.go
Outdated
@@ -470,6 +470,8 @@ type ComputeTemplate struct { | |||
GpuAccelerator string `protobuf:"bytes,6,opt,name=gpu_accelerator,json=gpuAccelerator,proto3" json:"gpu_accelerator,omitempty"` | |||
// Optional pod tolerations | |||
Tolerations []*PodToleration `protobuf:"bytes,7,rep,name=tolerations,proto3" json:"tolerations,omitempty"` | |||
// Optional. Number of efas | |||
Efa uint32 `protobuf:"varint,8,opt,name=efa,proto3" json:"efa,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it worth generalizing this to a list of custom accelerators that are added to container resources? How long do we think this list will grow over time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, i generalized this as extended resources. PTAL
"gpu": "0", | ||
"gpu_accelerator": "", | ||
"memory": "8", | ||
"extended_resources": "{\"vpc.amazonaws.com/efa\": 32}", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider "custom_resources" instead, which is more aligned to Ray terminology: https://docs.ray.io/en/latest/ray-core/scheduling/resources.html#custom-resources
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On second thought, maybe custom_resources is misleading because this is never passed into the --resources
flag in ray start
. Do you need to include this as a custom resoruce in Ray or is it enough to add it as a container resource?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not for custom_resources, but the extended resource for kubernetes container.
apiserver/pkg/util/cluster.go
Outdated
@@ -145,6 +145,15 @@ func buildNodeGroupAnnotations(computeTemplate *api.ComputeTemplate, image strin | |||
return annotations | |||
} | |||
|
|||
// Add resource to container | |||
func addResourceToContainer(container *corev1.Container, resourceName string, quantity uint32) { | |||
if quantity > 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
if quantity == 0 {
return
}
quantityStr := fmt.Sprint(quantity)\
container.Resources.Requests[corev1.ResourceName(resourceName)] = resource.MustParse(quantityStr)
container.Resources.Limits[corev1.ResourceName(resourceName)] = resource.MustParse(quantityStr)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just left one more small comment, other LGTM
@@ -800,14 +822,20 @@ func (c *RayCluster) SetAnnotationsToAllTemplates(key string, value string) { | |||
|
|||
// Build compute template | |||
func NewComputeTemplate(runtime *api.ComputeTemplate) (*corev1.ConfigMap, error) { | |||
extendedResourcesJSON, err := json.Marshal(runtime.ExtendedResources) | |||
if err != nil { | |||
return nil, fmt.Errorf("failed to marshal extended resources: %v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When we fail to marshal runtime.Tolerations on line 842, we log error instead and leave tolerations unset. Should we consider something similar for extended resources?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Returning is probably better here, we should consider updating line 841 below to also return error in a follow-up PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this change, I've modified the NewComputeTemplate
function to return an error when marshaling runtime.Tolerations
fails: #2444
Why are these changes needed?
This PR adds support for the extended resource type "vpc.amazonaws.com/efa" in the KubeRay APIServer. This enhancement allows the creation of Ray clusters that can utilize EFA resources in Pod request specifications, similar to how CPU and memory resources are specified.
The integration of EFA support is particularly beneficial for distributed training workloads in AWS Ray clusters. By leveraging EFA, we can reduce the time required to complete large-scale distributed training tasks in Ray.
Related issue number
Closes #2435
Checks