This repository has been archived by the owner on Jan 24, 2023. It is now read-only.

Agent Groups #10

Closed
wants to merge 2 commits into from

Conversation

valeriano-manassero
Contributor

Sometimes having different groups of agents is useful for serving different queues or configurations.
This is especially true when the number of GPU nodes is limited and some CPU nodes are also needed to serve other queues.

P.S.
This is a proposal for discussion; still WIP.

@valeriano-manassero
Contributor Author

Here's an example from an installation on a dev cluster of mine:

```yaml
agentGroups:
  - name: agent-group0
    numberOfTrainsAgents: 2
    nvidiaGpusPerAgent: 1
    dockerMode: false
    queues: "huge_tasks_gpu default"
  - name: agent-group1
    numberOfTrainsAgents: 2
    nvidiaGpusPerAgent: 2
    dockerMode: false
    queues: "huge_tasks_gpu default"
  - name: agent-group2
    numberOfTrainsAgents: 2
    nvidiaGpusPerAgent: 0
    dockerMode: false
    queues: "cpu_only_stuff default"
```

As you can see, with this implementation any DevOps engineer can set up a granular configuration covering different types of agent configurations and queues.
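
For reference, here's a minimal sketch of how the chart templates could fan these values out into one Deployment per group. This is not the actual template from this PR: the image tag, label keys and the queue env var name are illustrative assumptions.

```yaml
# Sketch only: render one Deployment per entry in .Values.agentGroups.
{{- range .Values.agentGroups }}
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: trains-agent-{{ .name }}
spec:
  replicas: {{ .numberOfTrainsAgents }}
  selector:
    matchLabels:
      app: trains-agent-{{ .name }}
  template:
    metadata:
      labels:
        app: trains-agent-{{ .name }}
    spec:
      containers:
        - name: trains-agent
          image: allegroai/trains-agent:latest   # illustrative image/tag
          env:
            - name: TRAINS_AGENT_QUEUES          # hypothetical env var name
              value: {{ .queues | quote }}
          {{- if .nvidiaGpusPerAgent }}
          resources:
            limits:
              nvidia.com/gpu: {{ .nvidiaGpusPerAgent }}
          {{- end }}
{{- end }}
```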

@valeriano-manassero changed the title from [WIP] Agent Groups to Agent Groups on Dec 14, 2020
@sapir-allegro
Contributor

Thanks @valeriano-manassero, that's a great idea!
I'll take a look soon and will update here.

@valeriano-manassero
Contributor Author

I also removed the hostPath mount because it's not usable in production without tricks. Moreover, it can cause issues if more than one pod is scheduled on the same node, since both pods will try to write to the same path.

@sapir-allegro
Contributor

Regarding the hostPath mount that you removed: the /root/.trains folder is mounted because the agent pulls data from it (it contains some cache folders and more).
I don't think it would be a problem for multiple pods to use it, since it's fine for more than one agent to share this folder (just as you can run multiple agents on your local computer).
Do you think it should be mounted in another way?

@valeriano-manassero
Contributor Author

> Regarding the hostPath mount that you removed: the /root/.trains folder is mounted because the agent pulls data from it (it contains some cache folders and more).
> I don't think it would be a problem for multiple pods to use it, since it's fine for more than one agent to share this folder (just as you can run multiple agents on your local computer).

I sometimes got errors, especially when installing pip packages with many pods starting together.

> Do you think it should be mounted in another way?

Usually in k8s it's good to use a PVC backed by the default StorageClass, so the cache is not lost when a new pod is rolled out, and multiple pods that end up on the same node will each have their own specific PV to deal with.

Moreover, in this case I don't see the need for a mount at all, since it's cache data and it's more than OK to lose it when a new pod rolls out (the cache will be regenerated).
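
As an illustration, a minimal PVC along those lines could look like this; it relies on the cluster's default StorageClass (no storageClassName set), and the name and size are just placeholders:

```yaml
# Sketch: per-agent cache claim using the cluster's default StorageClass.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: trains-agent-cache   # placeholder name
spec:
  accessModes:
    - ReadWriteOnce          # mounted by one node at a time
  resources:
    requests:
      storage: 10Gi          # placeholder size for cache data
```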

Obv all of this imho :)

@jkhenning
Member

Hi @valeriano-manassero ,

I think the best approach is to make the hostPath mount optional, so a user could mount a persistent cache folder if the need arises. What do you think?

As a side note, we'll need to make sure the apt cache is unique if two agents are running with the same cache folder (as would be the case if the hostPath is mounted) - if you saw any problems related to this behavior, please make sure there's an open issue for them so we can make sure this is taken care of.
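
To make the idea concrete, here's a sketch of how the optional mount could be exposed; the values key and volume name are hypothetical, not the chart's actual ones:

```yaml
# Hypothetical values.yaml switch (key names are illustrative):
trainsHostPathMount:
  enabled: false            # opt in to persist /root/.trains on the node
  path: /opt/trains/agent   # node directory backing the cache
```

```yaml
# Matching template fragment (sketch): render the volume only when enabled.
{{- if .Values.trainsHostPathMount.enabled }}
      volumes:
        - name: trains-cache
          hostPath:
            path: {{ .Values.trainsHostPathMount.path }}
            type: DirectoryOrCreate
{{- end }}
```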

@valeriano-manassero
Contributor Author

> I think the best approach is to make the hostPath mount optional, so a user could mount a persistent cache folder if the need arises. What do you think?

The k8s docs state that hostPath is for development only, and I basically agree. If you have many agents, they will probably be scheduled across many different nodes; moreover, a pod can be rescheduled on another host, losing its cache or picking up an old one created in the past on that specific node.

> As a side note, we'll need to make sure the apt cache is unique if two agents are running with the same cache folder (as would be the case if the hostPath is mounted) - if you saw any problems related to this behavior, please make sure there's an open issue for them so we can make sure this is taken care of.

If the apt cache needs to be unique, hostPath will not achieve this, since pods may potentially be scheduled across many nodes.
While this would work in development environments with a single node, on real clusters the right way to go should be a PVC in RWX mode.
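
For completeness, a shared cache claim in RWX mode could look like the sketch below; it assumes an RWX-capable StorageClass exists in the cluster (the class name here is an assumption, e.g. an NFS-backed provisioner):

```yaml
# Sketch: shared cache claim; requires a ReadWriteMany-capable StorageClass.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: trains-agent-shared-cache   # placeholder name
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: nfs-client      # assumption: an RWX-capable class exists
  resources:
    requests:
      storage: 20Gi                 # placeholder size
```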

Does this make sense to you, or am I missing some specific behaviour?

@valeriano-manassero
Contributor Author

valeriano-manassero commented Dec 22, 2020

Since I needed to rebase (and fix a small typo in CI), I restored the hostPath mount, so we can eventually try to make it optional in another PR, alongside other StorageClass-based solutions.

@valeriano-manassero
Contributor Author

Closing; I will propose these changes on the ClearML chart instead.
