This repository has been archived by the owner on Jan 24, 2023. It is now read-only.

Agent Groups #10

Closed
wants to merge 2 commits into from

Conversation

valeriano-manassero
Contributor

Sometimes having different groups of agents is useful for serving different queues or configurations.
This is especially true when the number of GPU nodes is limited and some CPU nodes are also needed to serve other queues.

P.S.
This is a proposal for discussion; still WIP.

@valeriano-manassero
Contributor Author

Here's an example from an installation on a dev cluster of mine:

```yaml
agentGroups:
  - name: agent-group0
    numberOfTrainsAgents: 2
    nvidiaGpusPerAgent: 1
    dockerMode: false
    queues: "huge_tasks_gpu default"
  - name: agent-group1
    numberOfTrainsAgents: 2
    nvidiaGpusPerAgent: 2
    dockerMode: false
    queues: "huge_tasks_gpu default"
  - name: agent-group2
    numberOfTrainsAgents: 2
    nvidiaGpusPerAgent: 0
    dockerMode: false
    queues: "cpu_only_stuff default"
```

As you can see, with this implementation any DevOps engineer can set up a granular configuration covering different types of agent configurations and queues.
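
For reference, here's a minimal sketch of how the chart templates could fan these values out into one Deployment per group. This is not the actual template from this PR: the image tag, label keys and the queue env var name are illustrative assumptions.

```yaml
# Sketch only: render one Deployment per entry in .Values.agentGroups.
{{- range .Values.agentGroups }}
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: trains-agent-{{ .name }}
spec:
  replicas: {{ .numberOfTrainsAgents }}
  selector:
    matchLabels:
      app: trains-agent-{{ .name }}
  template:
    metadata:
      labels:
        app: trains-agent-{{ .name }}
    spec:
      containers:
        - name: trains-agent
          image: allegroai/trains-agent:latest   # illustrative image/tag
          env:
            - name: TRAINS_AGENT_QUEUES          # hypothetical env var name
              value: {{ .queues | quote }}
          {{- if .nvidiaGpusPerAgent }}
          resources:
            limits:
              nvidia.com/gpu: {{ .nvidiaGpusPerAgent }}
          {{- end }}
{{- end }}
```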

@valeriano-manassero changed the title from [WIP] Agent Groups to Agent Groups on Dec 14, 2020
@sapir-allegro
Contributor

Thanks @valeriano-manassero, that's a great idea!
I'll take a look soon and will update here.

@valeriano-manassero
Contributor Author

I also removed the hostPath mount because it's not usable in production without tricks. Moreover, it can cause issues if more than one pod is scheduled on the same node, since both pods will try to write to the same path.

@sapir-allegro
Contributor

Regarding the hostPath mount that you removed: the /root/.trains folder is mounted because the agent pulls data from it (it contains some cache folders and more).
I don't think it would be a problem for multiple pods to use it, since it's fine for more than one agent to share this folder (just as you can run multiple agents on your local computer).
Do you think it should be mounted in another way?

@valeriano-manassero
Contributor Author

> Regarding the hostPath mount that you removed: the /root/.trains folder is mounted because the agent pulls data from it (it contains some cache folders and more).
> I don't think it would be a problem for multiple pods to use it, since it's fine for more than one agent to share this folder (just as you can run multiple agents on your local computer).

I sometimes got errors, especially when installing pip packages with many pods starting together.

> Do you think it should be mounted in another way?

Usually in k8s it's good to use a PVC backed by the default StorageClass, so the cache is not lost when a new pod is rolled out, and multiple pods that end up on the same node will each have their own specific PV to deal with.

Moreover, in this case I don't see the need for a mount at all, since it's cache data and it's more than OK to lose it when a new pod rolls out (the cache will be regenerated).
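
As an illustration, a minimal PVC along those lines could look like this; it relies on the cluster's default StorageClass (no storageClassName set), and the name and size are just placeholders:

```yaml
# Sketch: per-agent cache claim using the cluster's default StorageClass.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: trains-agent-cache   # placeholder name
spec:
  accessModes:
    - ReadWriteOnce          # mounted by one node at a time
  resources:
    requests:
      storage: 10Gi          # placeholder size for cache data
```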

Obv all of this imho :)

@jkhenning
Member

Hi @valeriano-manassero ,

I think the best approach is to make the hostPath mount optional, so a user could mount a persistent cache folder if the need arises. What do you think?

As a side note, we'll need to make sure the apt cache is unique if two agents are running with the same cache folder (as would be the case if the hostPath is mounted) - if you saw any problems related to this behavior, please make sure there's an open issue for them so we can make sure this is taken care of.
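
To make the idea concrete, here's a sketch of how the optional mount could be exposed; the values key and volume name are hypothetical, not the chart's actual ones:

```yaml
# Hypothetical values.yaml switch (key names are illustrative):
trainsHostPathMount:
  enabled: false            # opt in to persist /root/.trains on the node
  path: /opt/trains/agent   # node directory backing the cache
```

```yaml
# Matching template fragment (sketch): render the volume only when enabled.
{{- if .Values.trainsHostPathMount.enabled }}
      volumes:
        - name: trains-cache
          hostPath:
            path: {{ .Values.trainsHostPathMount.path }}
            type: DirectoryOrCreate
{{- end }}
```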

@valeriano-manassero
Contributor Author

> I think the best approach is to make the hostPath mount optional, so a user could mount a persistent cache folder if the need arises. What do you think?

The k8s docs state that hostPath is for development only, and I basically agree. If you have many agents, they will probably be scheduled across many different nodes; moreover, a pod can be rescheduled on another host, losing its cache or picking up an old one created in the past on that specific node.

> As a side note, we'll need to make sure the apt cache is unique if two agents are running with the same cache folder (as would be the case if the hostPath is mounted) - if you saw any problems related to this behavior, please make sure there's an open issue for them so we can make sure this is taken care of.

If the apt cache needs to be unique, hostPath will not achieve this, since pods may potentially be scheduled across many nodes.
While this would work in development environments with a single node, on real clusters the right way to go should be a PVC in RWX mode.
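
For completeness, a shared cache claim in RWX mode could look like the sketch below; it assumes an RWX-capable StorageClass exists in the cluster (the class name here is an assumption, e.g. an NFS-backed provisioner):

```yaml
# Sketch: shared cache claim; requires a ReadWriteMany-capable StorageClass.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: trains-agent-shared-cache   # placeholder name
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: nfs-client      # assumption: an RWX-capable class exists
  resources:
    requests:
      storage: 20Gi                 # placeholder size
```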

Does this make sense to you, or am I missing some specific behaviour?

@valeriano-manassero
Contributor Author

valeriano-manassero commented Dec 22, 2020

Since I needed to rebase (and fix a small typo in CI), I restored the hostPath mount, so we can eventually try to make it optional in another PR, alongside other StorageClass-based solutions.

@valeriano-manassero
Contributor Author

Closing; I will propose these changes on the ClearML chart instead.
