Agent Groups #10
Conversation
Example of an installation I got on a dev cluster:
As you can see, with this implementation it's easier for any devops engineer to have granular control over different types of agent configurations and queues.
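As a sketch of what such a configuration could look like (the key names here are hypothetical, not the chart's actual schema), a per-group section in `values.yaml` might map each agent group to its own queue, replica count, and node placement:

```yaml
# Hypothetical values.yaml sketch -- key names are illustrative only,
# not the actual schema of this chart.
agentGroups:
  gpu:
    replicaCount: 2
    queues: "gpu_queue"        # queue served by this group
    nodeSelector:
      accelerator: nvidia-gpu  # pin to GPU nodes
    resources:
      limits:
        nvidia.com/gpu: 1
  cpu:
    replicaCount: 4
    queues: "cpu_queue"
    nodeSelector: {}           # schedule anywhere
```

The idea is simply that each group renders its own Deployment, so GPU and CPU workloads can be separated without maintaining two chart installations.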
Thanks @valeriano-manassero, that's a great idea!
I also removed the hostPath mount because it's not usable in production without tricks. Moreover, it can cause issues if more than one pod is scheduled on the same node, since both pods will try to write to the same path.
Regarding the hostPath mount that you removed, the
I sometimes got errors, especially when installing pip packages with many pods starting together.
Usually in k8s it's good to use a PVC that uses the default StorageClass, so the cache is not lost when a new pod is rolled out, and multiple pods that end up on the same node will each have their own PV to work with. Moreover, in this case I don't see the need for a mount at all, since it's cache data and it's perfectly fine to lose it when a new pod rolls out (the cache will be regenerated). Obviously all of this is just my opinion :)
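A minimal sketch of the PVC approach described above (the name and size are illustrative): a cache claim that picks up the cluster's default StorageClass, which each agent pod would mount at its pip/apt cache path:

```yaml
# Illustrative PVC for agent cache data.
# With no storageClassName set, the cluster's default StorageClass is used.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: clearml-agent-cache   # hypothetical name, per agent group
spec:
  accessModes:
    - ReadWriteOnce           # one node at a time; avoids shared-path clashes
  resources:
    requests:
      storage: 10Gi
```

Since it's only cache data, losing the volume on a rollout is acceptable: the cache just gets regenerated on the next run.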
Hi @valeriano-manassero, I think the best approach is to make the hostPath mount optional, so a user could mount a persistent cache folder if the need arises. What do you think? As a side note, we'll need to make sure the apt cache is unique if there are two agents running with the same cache folder (as would be the case if the hostPath is mounted) - if you saw any problems related to this behavior, please make sure there's an open issue for them so we can make sure this is taken care of.
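Making the mount optional could look something like this in `values.yaml` (the `enabled` flag, key names, and path are hypothetical, not the chart's real schema):

```yaml
# Hypothetical values.yaml toggle -- illustrative only.
hostPathCache:
  enabled: false                  # off by default; opt in for a persistent cache
  path: /var/cache/clearml-agent  # host directory to mount when enabled
```

When enabled with more than one agent on the same node, the cache directory would need to be made unique per agent (e.g. via a per-pod subdirectory) to avoid the write conflicts mentioned above.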
The k8s docs state:
If the apt cache needs to be unique, hostPath will not achieve this, since pods can potentially be scheduled on many different nodes. Does that make sense to you, or am I missing some specific behaviour?
Since I needed to rebase (and fix a small typo in CI), I restored the hostPath mount so we can eventually try to make it optional in another PR, alongside other StorageClass-based solutions.
Closing; I will propose these changes on the ClearML chart instead.
Sometimes having different groups of agents is useful for different queues or configurations.
This is especially true when the number of GPU nodes is limited and some CPU nodes are also needed for other queues.
P.S.
This is a proposal for discussion; still WIP.