Can't convert compose service with CDI device #107

Open
rany2 opened this issue Sep 15, 2024 · 8 comments

rany2 commented Sep 15, 2024

Consider the following service:

  jellyfin:
    image: docker.io/jellyfin/jellyfin:latest
    container_name: jellyfin
    restart: unless-stopped
    #user: 973:973  # media:media
    group_add:
      - video
    ports:
      - 127.0.0.1:8096:8096
    volumes:
      - ./jellyfin/config:/config
      - ./jellyfin/cache:/cache
      - /mnt/hdd/media:/data/media
    devices:
      - nvidia.com/gpu=all
    security_opt:
      - label=disable

Ignoring the fact that the user entry would fail with podlet due to #106, another validation failure is triggered by the devices entry.

Error: 
   0: error converting compose file
   1: error reading compose file
   2: File `/compose.yml` is not a valid compose file
   3: services.jellyfin.devices[0]: device must have a container path at line 45 column 9

Location:
   src/cli/compose.rs:203

Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.

rany2 commented Sep 16, 2024

For anyone facing this issue, the following workaround seems to work OK.

Define a new runtime at /etc/containers/containers.conf.d/50-nvidia-runtime.conf:

[engine.runtimes]
nvidia = ["/usr/bin/nvidia-container-runtime"]

Use runtime: nvidia in the compose service instead of the CDI device.

  jellyfin:
    image: docker.io/jellyfin/jellyfin:latest
    container_name: jellyfin
    restart: always
    #user: 973:973  # media:media
    runtime: nvidia
    group_add:
      - video
    ports:
      - 127.0.0.1:8096:8096
    volumes:
      - ./jellyfin/config:/config
      - ./jellyfin/cache:/cache
      - /mnt/hdd/media:/data/media
    security_opt:
      - label=disable
    labels:
      - io.containers.autoupdate=registry

I haven't tested the generated quadlet service, but podlet returns the following, which seems correct (ignore the volume paths; I didn't pass --absolute-host-paths):

# jellyfin.container
[Container]
AutoUpdate=registry
ContainerName=jellyfin
Image=docker.io/jellyfin/jellyfin:latest
PodmanArgs=--group-add video
PublishPort=127.0.0.1:8096:8096
SecurityLabelDisable=true
Volume=./jellyfin/config:/config
Volume=./jellyfin/cache:/cache
Volume=/mnt/hdd/media:/data/media
GlobalArgs=--runtime nvidia

[Service]
Restart=always
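
A quick, untested way to verify the runtime mapping works before relying on the quadlet could be something like the following (the CUDA image tag is only an illustrative example):

# Assumes the [engine.runtimes] entry above is in place
podman run --rm --runtime nvidia -e NVIDIA_VISIBLE_DEVICES=all docker.io/nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi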

k9withabone (Member) commented

According to the Compose Specification, devices must be in the form HOST_PATH:CONTAINER_PATH[:CGROUP_PERMISSIONS].
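
For illustration only (this example is not from the spec text itself), a devices entry in that form would look like:

    devices:
      - /dev/dri/renderD128:/dev/dri/renderD128:rwm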

k9withabone (Member) commented

Specifically for Podman, there is podman run --gpus (added in Podman v5.0.0), so you could add PodmanArgs=--gpus all to the generated .container Quadlet file.
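
As an untested sketch, applied to the generated jellyfin.container from the earlier comment, that would be an extra line under [Container]:

# jellyfin.container (excerpt)
[Container]
PodmanArgs=--group-add video
PodmanArgs=--gpus all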

rany2 commented Sep 21, 2024

According to the Compose Specification, devices must be in the form HOST_PATH:CONTAINER_PATH[:CGROUP_PERMISSIONS].

Shouldn't the spec be corrected given that CDI devices exist? CDI is a relatively recent standard (less than 5 years old), and it's only very recently that Nvidia started recommending it for Podman users. It seems like a case of the spec being out of date.

Docker also supports CDI devices, but I'm not sure whether docker-compose does this same type of validation.

IMO it should be valid given that both podman run and docker run accept it as valid.

rany2 commented Sep 21, 2024

Specifically for Podman, there is podman run --gpus (added in Podman v5.0.0), so you could add PodmanArgs=--gpus all to the generated .container Quadlet file.

I actually preferred the runtime approach as it doesn't require me to create some kind of package-update hook or systemd service to keep the CDI YAML file up to date. The issue with CDI is that the file needs to be regenerated every time CUDA or the Nvidia driver is updated.
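
For reference (I haven't automated this here), the NVIDIA Container Toolkit can regenerate that spec file with something along the lines of:

# Regenerate the CDI spec after a driver/CUDA update
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml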

Either way, this issue doesn't impact me anymore, but I kept it open since it seems simple to fix. Someone might need CDI devices for some other vendor and wouldn't be able to use the runtime workaround.

(Edit: --gpus=all just adds the Nvidia CDI devices behind the scenes. containers/podman#21180)

k9withabone (Member) commented

Thanks for the information! I haven't tried to use a GPU in a container myself and hadn't heard of CDI before.

Shouldn't the spec be corrected given that CDI devices exist?

Probably. You should create an issue in the compose-spec repo since you understand this better than I do.

IMO it should be valid given that both podman run and docker run accept it as valid.

Is there documentation on this? I can't find anything about CDI in the docker-run(1) or podman-run(1) man pages.

rany2 commented Sep 22, 2024

Is there documentation on this? I can't find anything about CDI in the docker-run(1) or podman-run(1) man pages.

In the podman-run man page, the reference to CDI devices is subtle:

--device=host-device[:container-device][:permissions]

With CDI devices, container-device and permissions need to be omitted. It's strange that it isn't mentioned more directly, though.
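
For example, an invocation like this is accepted, assuming Nvidia's generated CDI spec exists under /etc/cdi/ (a sketch, not taken from the man page):

# CDI device name as the host-device; no container path or permissions
podman run --rm --security-opt label=disable --device nvidia.com/gpu=all docker.io/library/ubuntu nvidia-smi -L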

rany2 commented Sep 22, 2024

I made a ticket here: compose-spec/compose-spec#532
