Running same container on multiple mixed driver CUDA servers #25
-
Hi Min, thanks for your post about using Lxroot over NFS on multiple hosts with various CUDA versions. Quite impressive! I don't think I've ever run Lxroot on a networked filesystem, nor have I ever needed to tweak the ld cache like this. A couple of minor comments. You run Lxroot like this:
I suspect this can be simplified to:
Similarly, this:
Can probably be replaced with this:
And now on another topic ... If you've read my recent updates about Lxroot, you may be aware that I have, for over a year now, made various changes and improvements to Lxroot, but that I have not published those changes because I wasn't sure whether anyone would actually use any of the improvements. These unpublished improvements would allow your container to be created as follows:
So, to explain:
It might also be possible to use this new directory structure for your bind mounts. The goal of these improvements is twofold:
So this:
Might become this:
If you are just using Lxroot to solve one problem, then the benefit from these improvements is small. However, I personally manage many Lxroot containers, and I find these improvements greatly reduce my cognitive overhead when I create, examine, and modify my containers. Another unpublished improvement is that Lxroot uses tmpfs by default. So, by default, …
-
So here's the problem I have: there are multiple servers at the university, and they all run slightly different systems, with different NVIDIA drivers, different Python versions, and other annoying inconsistencies. Running an lxroot container on a shared NFS across multiple machines works great, but CUDA is a bit of a headache. So I want to share my solution.
The idea is that you want matching NVIDIA driver libraries inside the container, and these can be found on the host:
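The original listing isn't reproduced here, but something along these lines usually shows the relevant files (a sketch, not the poster's exact commands; library names vary by driver release):

```sh
# Illustrative only: list the NVIDIA driver's user-space libraries known to
# the host's loader. The exact set depends on the driver version and distro.
ldconfig -p | grep -E 'libcuda\.so|libnvidia-'
```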
These can be copied to some directory in the container, e.g. `/usr/lib/nvidia/$_NVVER`. In my case I created an Arch package that installs multiple versions in the container. Before I start, here are some variables:
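The variable definitions themselves aren't shown above; a plausible set, with illustrative names and paths, might look like this:

```sh
# Assumed variable definitions (names and paths are illustrative):
CONTAINER=/nfs/containers/arch               # container root on the shared NFS
_NVVER=$(cat /sys/module/nvidia/version)     # host's NVIDIA driver version, e.g. 535.154.05
_HOST=$(hostname -s)                         # short hostname, keys the per-host ld cache
_LDCACHE="$CONTAINER/etc/ld.so.cache-$_HOST" # this host's cache inside the container tree
```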
Each host stores its own ld cache for the container, which looks like this:
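With the hypothetical naming above, the container's `/etc` would accumulate one cache file per host, something like (hostnames are made up):

```sh
$ ls /nfs/containers/arch/etc/ld.so.cache-*
/nfs/containers/arch/etc/ld.so.cache-gpu01
/nfs/containers/arch/etc/ld.so.cache-gpu02
/nfs/containers/arch/etc/ld.so.cache-gpu03
```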
Initially, we build the ld cache in the container:
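The build step is omitted above; roughly, the idea is to point `ldconfig` at the versioned driver directory and write the cache to a host-specific file. A sketch, run inside the container (the exact lxroot invocation is left out, and the `ld.so.conf.d` entry is an assumption):

```sh
# Make the loader aware of the driver libs matching this host's version,
# then write the cache to a per-host file instead of /etc/ld.so.cache.
echo "/usr/lib/nvidia/$_NVVER" > /etc/ld.so.conf.d/nvidia.conf
ldconfig -C "/etc/ld.so.cache-$_HOST"
```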
If the NVIDIA driver version in the container does not match the host's, remove the ld cache and rebuild it with the code above.
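One way to express that check, assuming the hypothetical variables above plus a small marker file recording which driver version the cache was built against:

```sh
# Rebuild the per-host cache whenever the host's driver version changes.
host_ver=$(cat /sys/module/nvidia/version)
marker="$CONTAINER/etc/nvidia-version-$_HOST"   # assumed marker file
if [ ! -e "$marker" ] || [ "$(cat "$marker")" != "$host_ver" ]; then
    rm -f "$_LDCACHE"
    # ...rebuild the cache as in the previous step...
    echo "$host_ver" > "$marker"
fi
```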
Before running the container we always make sure the ld cache stored for this host is the one in use. The cache gets loaded into memory at startup, so it does not matter if another instance overwrites the file later.
Final bits to run the container:
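The launch commands are not shown above, but a rough wrapper along these lines captures the idea. The lxroot arguments are placeholders (they depend on the setup); the copy works because `ld.so` maps `/etc/ld.so.cache` at process start, so a later overwrite by another host's instance does not disturb programs that are already running:

```sh
#!/bin/sh
# Hedged sketch of a per-host launch wrapper; paths, variables, and the
# lxroot arguments are illustrative, not the poster's exact script.
set -eu
CONTAINER=/nfs/containers/arch
_LDCACHE="$CONTAINER/etc/ld.so.cache-$(hostname -s)"

cp -f "$_LDCACHE" "$CONTAINER/etc/ld.so.cache"   # make this host's cache the active one
exec lxroot "$CONTAINER" "$@"                    # remaining lxroot options and command go here
```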