Driver upgrade is not possible #200

Open
adityapatadia opened this issue Jul 14, 2021 · 3 comments

Comments

@adityapatadia

We are using this guide to install drivers: https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#installing_drivers

The drivers for COS are now locked, and the installer always installs 450.119.04. We want to upgrade the driver to version 460.32.03 because https://github.com/FFmpeg/nv-codec-headers needs driver version 455.28 or newer.

How can we upgrade the driver version?

@andreasjansson

@adityapatadia Did you figure out a way to upgrade driver versions? I'm having the same issue.

@Endofunctor

For anyone who runs into this problem: you can decouple your CUDA driver version from your COS version with the following steps:

  1. Download daemonset-preloaded-latest.yaml.
  2. List the cos-tools bucket with gsutil ls gs://cos-tools/ and look for a COS version newer than the one in your cluster.
  3. List the drivers available for that version with gsutil ls gs://cos-tools/<newer COS version>/extensions/gpu. (Note: if the COS version is 16928.0.0 or newer, this folder does not appear to exist; in that case keep --version=latest in the next step.)
  4. Change the command in daemonset-preloaded-latest.yaml to command: ['/cos-gpu-installer', 'install', '--allow-unsigned-driver', '--version=<driver version found under extensions/gpu, just the #>', '--gcs-download-prefix=<newer COS version>']. In my case, COS version 16108.604.3 pinned CUDA driver 450.119.04, and I was able to use the 470.82.01 CUDA driver from COS version 16623.102.4.
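
As a rough sketch of step 4, here is what the edited command could look like with the versions reported above filled in. Only the command key is shown; the rest of the manifest stays as downloaded. Passing the bare COS build number to --gcs-download-prefix is an assumption on my part (the comment only says "<newer COS version>"), so check it against your installer's flag help before relying on it.

```yaml
# Fragment of daemonset-preloaded-latest.yaml: only the installer command is shown.
# Versions are the ones reported in this thread (cluster was on COS 16108.604.3).
command:
  - /cos-gpu-installer
  - install
  - --allow-unsigned-driver
  # Driver version listed under gs://cos-tools/16623.102.4/extensions/gpu
  - --version=470.82.01
  # Assumption: the value is the bare COS build number of the newer release
  - --gcs-download-prefix=16623.102.4
```

The block-style list is equivalent to the inline ['...', '...'] form used in the step, so either can be pasted into the manifest.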

Other notes: It may be possible to specify the driver URL directly, using versions found under gs://nvidia-drivers-us-public/tesla/. The command would be, for example: command: [ '/cos-gpu-installer', 'install', '--allow-unsigned-driver', '--nvidia-installer-url=https://storage.googleapis.com/nvidia-drivers-us-public/tesla/510.47.03/NVIDIA-Linux-x86_64-510.47.03.run' ]. I tested this once, but ran into an issue with the precompiled COS toolchain that gets downloaded. It may be possible to fix this, or the issue may not occur at all with a COS version different from mine. You could also try specifying --gcs-download-prefix for a different COS toolchain version and see if that works. I did not get a chance to confirm, as I timeboxed this driver upgrade to 2 hours.
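
For readability, the direct-URL variant from the note above might be laid out as follows. This is the unconfirmed path: it hit a toolchain issue in the one test described, and the commented-out --gcs-download-prefix line is only the suggested thing to try, with its placeholder deliberately left unfilled.

```yaml
# Unconfirmed variant: point the installer at a driver .run file directly.
command:
  - /cos-gpu-installer
  - install
  - --allow-unsigned-driver
  - --nvidia-installer-url=https://storage.googleapis.com/nvidia-drivers-us-public/tesla/510.47.03/NVIDIA-Linux-x86_64-510.47.03.run
  # Optional and untested: also fetch the toolchain of a different COS version.
  # - --gcs-download-prefix=<newer COS version>
```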

@DavraYoung

Where can we find information on how to control the driver and CUDA versions? This is really challenging, given the lack of information on deploying GPUs in GKE :(
