Running same container on multiple mixed driver CUDA servers #25

zceemja · 2024-11-14T11:58:30Z

zceemja
Nov 14, 2024

So here's the problem I have: there are multiple servers in uni, and all run a bit different systems, with different nvidia drivers, different python versions, and other annoying inconsistencies. Running lxroot container on shared NFS across multiple machines works great, but CUDA is a bit of a headache. So I want to share my solution.

The idea is that you want to have matching nvidia driver libs, which can be found in host:

find  -L /usr/lib/ -type f \( -name "libnvidia-*.so*" -o -name "libcuda*.so*" -o -name "libnvcuvid*.so*" -o -name "libEGL_nvidia*.so*" -o -name "libGLESv1_CM_nvidia*.so*" -o -name "libGLESv2_nvidia*.so*" -o -name "libGLX_nv
idia*.so*" -o -name "libnvoptix*.so*" \)

These can be copied to some directory in container, e.g. /usr/lib/nvidia/$_NVVER. In my case I created a package in arch that installs multiple versions in the container.

Before I start here's some variables:

HERE=$(dirname -- "$( readlink -f -- "$0")")
ROOT="$HERE/root"  # container root directory

C0=$(tput -T xterm-256color sgr0)   
C1=$(tput -T xterm-256color setaf 1)
C2=$(tput -T xterm-256color setaf 2)
C3=$(tput -T xterm-256color setaf 3)
C4=$(tput -T xterm-256color setaf 4)
C5=$(tput -T xterm-256color setaf 5)

# Will pass this to lxroot later
BIND=(            
  bind $HOME $HOME
  bind /tmp /tmp  
)

Each host stores its own ld cache for the container and looks like this:

_LDDIR="$HERE/ld/$HOSTNAME"        
mkdir -p "$_LDDIR"                 
BIND+=(bind /usr/host/ld "$_LDDIR")

Initially we build ld cache on container

if [ ! -f "$_LDDIR/ld.so.cache" ]; then                                                      
  echo ${C3}Rebuilding ld cache..${C0}                                                       
  lxroot rw "$ROOT" -rw bind /usr/host/ld "$_LDDIR" -- ldconfig -C /usr/host/ld/ld.so.cache 
fi

If nvidia driver version in container and host do not match, remove ld cache and rebuild it with code above.

if [ -f "/sys/module/nvidia/version" ]; then                                                                  
  _NVVER="$(cat /sys/module/nvidia/version)"  # Host NV driver version                                        
                                                                                                              
  [[ -f "$_LDDIR/nvidia.conf" ]] && _CNVVER=$(cat "$_LDDIR/nvidia.conf") || _CNVVER="/${C5}*no version*${C0}" 
  _CNVVER=${_CNVVER##*/}                      # Container version                                             
  if [ ! -d "$ROOT/usr/lib/firmware/nvidia/$_NVVER" ]; then                                                   
    echo "${C1}CUDA will not work with server driver: container missing driver version ${C2}$_NVVER${C0}!"    
  else                                                                                                        
    if [ "$_NVVER" != "$_CNVVER" ]; then                                                                      
      echo "/usr/lib/nvidia/$_NVVER" > "$_LDDIR/nvidia.conf"                                                  
      echo "${C3}Updated nvidia driver version from ${C2}$_CNVVER${C3} to ${C2}$_NVVER${C0}"                  
      rm -fv "$_LDDIR/ld.so.cache"                                                                            
    fi                                                                                                        
  fi                                                                                                          
fi

Before running container we always make sure ld cache stored in host is used. This gets loaded in memory and does not matter if another instance overrides it.

# Always override                                       
rm -rf "$ROOT/etc/ld.so.cache"                          
ln -s "/usr/host/ld/ld.so.cache" "$ROOT/etc/ld.so.cache"

Final bits to run container:

# Always make sure bind directories exists                             
for (( i = 1; i < ${#BIND[*]}; ++ i )); do                      
  [[ ${BIND[$i-1]} == 'bind' ]] && mkdir -vp ${ROOT}/${BIND[$i]}
done 

lxroot rw "$ROOT" -nwe ${BIND[@]} -- /bin/bash

parke · 2024-11-14T21:43:18Z

parke
Nov 14, 2024
Maintainer

Hi, Min,

Thanks for your post about using Lxroot over NFS on multiple hosts with various CUDA versions. Quite impressive! I don't think I've ever run Lxroot on a networked filesystem. Nor have I ever needed to tweak /etc/ld.so.cache. Also, as far as I know, you are the first person (other than me) to regularly use Lxroot on a real problem. (If anyone else is using Lxroot regularly, I'd love to learn about the use case.)

A couple minor comments.

You run Lxroot like this:

lxroot rw "$ROOT" -rw bind /usr/host/ld "$_LDDIR" -- ldconfig -C /usr/host/ld/ld.so.cache

I suspect this can be simplified to:

lxroot "$ROOT" -r bind /usr/host/ld "$_LDDIR" -- ldconfig -C /usr/host/ld/ld.so.cache

-r implies -w, so -rw is redundant. Also, in most cases, rw "$ROOT and -w both have the same effect. (There may be some very rare edge cases where rw and -w are different, but I don't think your usage of Lxroot will hit any of those edge cases.)

Similarly, this:

lxroot rw "$ROOT" -nwe ${BIND[@]} -- /bin/bash

Can probably be replaced with this:

lxroot "$ROOT" -nwe ${BIND[@]} -- /bin/bash

And now on another topic ...

If you've read my recent updates about Lxroot, you may be aware that I have (for over a year now), made various changes and improvements to Lxroot, but that I have not published those changes because I wasn't sure if anyone would actually use any of the improvements.

These unpublished improvements would allow your container to be created as follows:

$ ls -ld container/
drwxr-xr-x 4 parke parke 100 Nov 14 12:53 container/

$ ls -l container/
total 0
drwxr-xr-x 2 parke parke 60 Nov 14 12:54 home.o
drwxr-xr-x 2 parke parke 40 Nov 14 12:53 newroot
lrwxrwxrwx 1 parke parke  4 Nov 14 12:53 tmp -> /tmp

$ ls -l container/home.o/
total 0
lrwxrwxrwx 1 parke parke 11 Nov 14 12:54 parke -> /home/parke

So, to explain:

container/ is a directory.
container/newroot/ is a directory that contains the chroot directory for the container.
container/home.o/parke is a symlink to /home/parke. When Lxroot runs, it will bind mount /home/parke (on the host) to /home/parke inside the container. (The .o at the end of container/home.o/ tells Lxroot that the directory home.o is an "overlay directory", i.e. that its children should be bind mounted into container/newroot/home/.)
container/tmp is a symlink to /tmp. This will cause Lxroot to bind /tmp on the host to container/newroot/tmp.
In addition to the .o suffix (which denotes an overlay), the suffixes .r, .w, and .a can be used to specify that any given bind mount should be read-only, read-write, or read-auto. The suffixes .or, .ow, and .oa are also possible.

It might also be possible to use this new directory structure to bind mount your /usr/host/ld into the container. (I guess it probably depends on whether you want to store the $_LDDIR files on each host versus on the NFS server itself.)

The goal of these improvements is two fold:

It makes it possible to create, examine, and modify containers composed of multiple bind mounts using the standard command line tools of mkdir, ls, ln, rm, and rmdir.
It shortens the command line invocation of Lxroot.

So this:

lxroot rw "$ROOT" -nwe ${BIND[@]} -- /bin/bash

Might become this:

lxroot "$ROOT" -nwe -- /bin/bash

If you are just using Lxroot to solve one problem, then the benefit from these improvements is small. However, I personally manage many Lxroot containers, and I find these improvement greatly reduce my cognitive overhead when I create, examine, and modify my containers.

Another unpublished improvement is that Lxroot uses tmpfs by default. So by default, /tmp, /run, /dev/shm, and /var/tmp are all backed by an ephemeral tmpfs filesystem. (There is also a way to make those directories persistent, and to share them so that multiple invocations of Lxroot on the same container directory will all share the same tmpfs filesystem.)

2 replies

zceemja Nov 14, 2024
Author

Thanks for the comments. A year ago (or two-ish) been looking how to run containers without root and I really liked your project because of how simple it is to use.

Just to note that everything in servers are mounted on NFS, different caches are just stored at ./ld/$HOSTNAME.

The new changes are neat, especially if you want to setup new containers for many small tasks, but at home I just end up using lxd for these things.

When I was testing this multi-server setup, one of the main problems was that I cannot easily bind driver libs, so like there are many .so and symlinks in /usr/lib e.g.:

$ ls /usr/lib/libcuda.so*
lrwxrwxrwx 1 root root  12 Aug 22 03:41 /usr/lib/libcuda.so -> libcuda.so.1
lrwxrwxrwx 1 root root  20 Aug 22 03:41 /usr/lib/libcuda.so.1 -> libcuda.so.560.35.03
-rwxr-xr-x 1 root root 34M Aug 22 03:41 /usr/lib/libcuda.so.560.35.03

And they differ on some servers. ld cache only cares about the first symlinks:

$ ldconfig -p | grep libcuda.so
	libcuda.so.1 (libc6,x86-64) => /usr/lib/libcuda.so.1
	libcuda.so (libc6,x86-64) => /usr/lib/libcuda.so

So it would make this script a lot simpler without need to rebuilding ld cache if only these .so with specific driver version and symlinks could be bind to the container, but as far as I understand only directories can be bind.

parke Nov 14, 2024
Maintainer

@zceemja wrote:

So it would make this script a lot simpler without need to rebuilding ld cache if only these .so with specific driver version and symlinks could be bind to the container, but as far as I understand only directories can be bind.

Linux does support bind mounting files.

In the early versions of Lxroot, I deliberately chose to forbid binding files. This was an intentional and opinionated choice on my part.

It appears that by August of 2021, I had reconsidered and decided to eventually add bind mounting of files. See here:

lxroot/changelog.txt

Line 53 in 223c160

In anticipation of binding files, '--' is now required before command.

My unpublished version of Lxroot can bind mount files. (For example, I know that I bind /etc/resolv.conf in at least one container.)

The published version of Lxroot might allow bind mounting of files if you commented out these two lines:

lxroot/lxroot.cpp

Line 679 in 223c160

Lib :: directory_require ( source, "source" );

But in your case, what is the downside with just replacing all the CUDA .so files with symlinks pointing to files in /opt/my_libs, and then binding in a single my_libs directory? It seems to me that this would be possible and would also avoid the need to regenerate /etc/ld.so.cache.

The new changes are neat, especially if you want to setup new containers for many small tasks, but at home I just end up using lxd for these things.

Do you run GUI programs in LXD? Or just headless programs?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Running same container on multiple mixed driver CUDA servers #25

{{title}}

Replies: 1 comment 2 replies

{{title}}

{{title}}

{{title}}

Select a reply

Running same container on multiple mixed driver CUDA servers #25

zceemja Nov 14, 2024

Replies: 1 comment · 2 replies

parke Nov 14, 2024 Maintainer

zceemja Nov 14, 2024 Author

parke Nov 14, 2024 Maintainer

zceemja
Nov 14, 2024

Replies: 1 comment 2 replies

parke
Nov 14, 2024
Maintainer

zceemja Nov 14, 2024
Author

parke Nov 14, 2024
Maintainer