What would you like to be added:
Allow injecting a unique nodeSelector and toleration for each LWS replica, so that the cluster autoscaler creates a dedicated placement group for each replica.
In the API, the user sets the key they would like to use; the value is the name of the replica (the leader pod name).
The result is a toleration injected on the pods of a group as follows:
- key: group
  operator: Equal
  value: <lws-leader-name>
  effect: NoSchedule
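The injection described above can be sketched in a few lines. This is only an illustration, not the LWS controller's implementation: the function names `placement_fields` and `inject`, the dict-based pod spec, and the example leader name `my-lws-0` are all hypothetical, and the sketch assumes the same user-chosen key is used for both the nodeSelector and the toleration.

```python
from __future__ import annotations

from typing import Any


def placement_fields(key: str, leader_name: str) -> dict[str, Any]:
    """Build the per-replica scheduling fields (hypothetical helper).

    `key` is the user-chosen key; `leader_name` is the leader pod name
    of the replica, which makes the value unique per replica.
    """
    return {
        "nodeSelector": {key: leader_name},
        "tolerations": [
            {
                "key": key,
                "operator": "Equal",
                "value": leader_name,
                "effect": "NoSchedule",
            }
        ],
    }


def inject(pod_spec: dict[str, Any], key: str, leader_name: str) -> dict[str, Any]:
    """Return a copy of pod_spec with the fields merged in.

    Existing nodeSelector entries and tolerations are kept; the input
    dict is not modified.
    """
    fields = placement_fields(key, leader_name)
    merged = dict(pod_spec)
    merged["nodeSelector"] = {
        **pod_spec.get("nodeSelector", {}),
        **fields["nodeSelector"],
    }
    merged["tolerations"] = list(pod_spec.get("tolerations", [])) + fields["tolerations"]
    return merged


# Example: every pod of the replica led by "my-lws-0" gets the same fields.
spec = inject({"nodeSelector": {"gpu": "true"}}, "group", "my-lws-0")
```

Because the value is the leader pod name, every pod in one group gets identical fields while different groups never match each other's nodes, which is what would push the autoscaler toward one node group per replica.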
Why is this needed:
To force the cluster autoscaler to create a node group per replica. This can be necessary to get compactly placed nodes (on the same rack, for example), which gives better network performance and can improve multi-host GPU inference.
Completion requirements:
This enhancement requires the following artifacts:
Design doc
API change
Docs update
The artifacts should be linked in subsequent comments.
Sorry, I don't quite understand how compact-placement-group is defined.
Does compact-placement-group mean the name of a leader pod or a user-defined field name?
The result is a nodeSelector injected as follows:
compact-placement-group: <lws-leader-name>
Similarly, a toleration with the leader pod name as its value is injected on the pods of the group.