-
Notifications
You must be signed in to change notification settings - Fork 512
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[gpu] strict driver and cuda version assignment #1275
base: master
Are you sure you want to change the base?
[gpu] strict driver and cuda version assignment #1275
Conversation
/gcbrun |
1 similar comment
/gcbrun |
/gcbrun |
1 similar comment
/gcbrun |
Tests took 43 minutes to run this time. I think it's because the binary driver build cache was invalidated. Let's see if this one takes more like 14 minutes. |
the intention is for this PR to resolve issue #1268 |
cloudbuild/Dockerfile
Outdated
@@ -9,16 +9,23 @@ COPY --chown=ia-tests:ia-tests . /init-actions | |||
|
|||
# Install Bazel: | |||
# https://docs.bazel.build/versions/master/install-ubuntu.html | |||
ENV bazel_kr_path=/usr/share/keyrings/bazel-keyring.gpg | |||
ENV bazel_kr_path=/usr/share/keyrings/bazel-keyring.gpg \ | |||
bazel_version=7.4.0 \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pinning to 7.4.0 is necessary since the release of and default set to version 8
cloudbuild/Dockerfile
Outdated
RUN /usr/bin/curl https://bazel.build/bazel-release.pub.gpg | \ | ||
gpg --dearmor -o "${bazel_kr_path}" | ||
RUN echo "deb [arch=amd64 signed-by=${bazel_kr_path}] http://storage.googleapis.com/bazel-apt stable jdk1.8" | \ | ||
RUN /usr/bin/curl -s https://bazel.build/bazel-release.pub.gpg | \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reduce noise in docker build with the -s (silent) option
cloudbuild/Dockerfile
Outdated
apt-get clean | ||
|
||
# Set bazel-${bazel_version} as the default bazel alternative in this container | ||
RUN update-alternatives --install /usr/bin/bazel bazel /usr/bin/bazel-${bazel_version} 1 && \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The file /usr/bin/bazel does not exist in the bazel-${version} package ; set up an alternative and use it
cloudbuild/create-key-pair.sh
Outdated
@@ -0,0 +1,135 @@ | |||
#!/bin/bash |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this file copied from https://github.com/GoogleCloudDataproc/custom-images/tree/2024.11/examples/secure-boot
cloudbuild/presubmit.sh
Outdated
@@ -70,6 +70,7 @@ determine_tests_to_run() { | |||
changed_dir="${changed_dir%%/*}/" | |||
# Run all tests if common directories modified | |||
if [[ ${changed_dir} =~ ^(integration_tests|util|cloudbuild)/$ ]]; then | |||
continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
must remove this before merging
slightly longer than the 14 minutes I predicted, this run completed in 16:29 |
gpu/test_gpu.py
Outdated
("STANDARD", ["m", "w-0", "w-1"], GPU_T4, GPU_T4, "11.8", 'rocky', '2.0'), | ||
("STANDARD", ["m", "w-0", "w-1"], GPU_T4, GPU_T4, "12.4", 'rocky', '2.1'), | ||
("STANDARD", ["m", "w-0", "w-1"], GPU_T4, GPU_T4, "12.0", 'rocky', '2.2'), | ||
("KERBEROS", ["m", "w-0", "w-1"], GPU_T4, GPU_T4, "12.6", 'rocky', '2.2'), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We now exercise Kerberos in this test suite. True, it's only tested on 2.2-rocky9 with CUDA 12.6, but we could change any of the STANDARD definitions above to KERBEROS to improve that set.
/gcbrun |
10 similar comments
/gcbrun |
/gcbrun |
/gcbrun |
/gcbrun |
/gcbrun |
/gcbrun |
/gcbrun |
/gcbrun |
/gcbrun |
/gcbrun |
okay, taking a little break... |
* changing layout of gcs package folder * install_pytorch function created and called when cuDNN is being installed
…tadata check within the function ; create new function harden_sshd_config and call it
…ber for cuda 11.1.1 url generation
…he platforms call it
/gcbrun |
okay. it's a lot, but it all looks okay and has been thoroughly and repeatedly tested. Nota Bene that 'continue' in presubmit.sh ; I'll remove that once I receive an LGTM. |
/gcbrun |
… the templates/ directory from GoogleCloudDataproc#1282
/gcbrun |
/gcbrun |
/gcbrun |
/gcbrun |
/gcbrun |
…able spark rapids on images >= 2.1
/gcbrun |
Resolves Issues
gpu/install_gpu_driver.sh
Driver version defaults to version in driver .run file if specified
CUDA version defaults to version in cuda .run file if specified
exclusively using .run file installation method for cuda and driver installation
build nccl from source, since that is the only mechanism which supports all Dataproc OSs
wrap expensive functions in completion checks to reduce re-run time when testing manually
cache build results in GCS
waiting on apt lock when it exists
only install build dependencies if build is necessary
fix problem with ops agent not installing ; using venv
Installing gcc-12 on ubuntu22 to fix kernel driver FTBFS
mark task completion by creating a file rather than setting a variable
added functions to check and report secure-boot and os version details
if the correct metadata attributes are specified, pytorch will be installed, but not by default
gpu/manual-test-runner.sh
gpu/test_gpu.py