Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[gpu] strict driver and cuda version assignment #1275

Open
wants to merge 105 commits into
base: master
Choose a base branch
from

Conversation

cjac
Copy link
Contributor

@cjac cjac commented Dec 13, 2024

Resolves Issues

gpu/install_gpu_driver.sh

  • Driver version defaults to version in driver .run file if specified

  • CUDA version defaults to version in cuda .run file if specified

  • exclusively using .run file installation method for cuda and driver installation

    • Installing non-open driver from .run file on rocky8
  • build nccl from source, since that is the only mechanism which supports all Dataproc OSs

  • wrap expensive functions in completion checks to reduce re-run time when testing manually

  • cache build results in GCS

  • waiting on apt lock when it exists

  • only install build dependencies if build is necessary

  • fix problem with ops agent not installing ; using venv

  • Installing gcc-12 on ubuntu22 to fix kernel driver FTBFS

  • mark task completion by creating a file rather than setting a variable

  • added functions to check and report secure-boot and os version details

  • if the correct metadata attributes are specified, pytorch will be installed, but not by default

gpu/manual-test-runner.sh

  • order commands correctly
  • point to origin rather than staging repo

gpu/test_gpu.py

  • clearer test skipping logic

@cjac cjac self-assigned this Dec 13, 2024
@cjac
Copy link
Contributor Author

cjac commented Dec 13, 2024

/gcbrun

1 similar comment
@cjac
Copy link
Contributor Author

cjac commented Dec 13, 2024

/gcbrun

@cjac
Copy link
Contributor Author

cjac commented Dec 14, 2024

/gcbrun

1 similar comment
@cjac
Copy link
Contributor Author

cjac commented Dec 14, 2024

/gcbrun

@cjac
Copy link
Contributor Author

cjac commented Dec 14, 2024

Tests took 43 minutes to run this time. I think it's because the binary driver build cache was invalidated. Let's see if this one takes more like 14 minutes.

@cjac
Copy link
Contributor Author

cjac commented Dec 14, 2024

the intention is for this PR to resolve issue #1268

@@ -9,16 +9,23 @@ COPY --chown=ia-tests:ia-tests . /init-actions

# Install Bazel:
# https://docs.bazel.build/versions/master/install-ubuntu.html
ENV bazel_kr_path=/usr/share/keyrings/bazel-keyring.gpg
ENV bazel_kr_path=/usr/share/keyrings/bazel-keyring.gpg \
bazel_version=7.4.0 \
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pinning to 7.4.0 is necessary since the release of and default set to version 8

RUN /usr/bin/curl https://bazel.build/bazel-release.pub.gpg | \
gpg --dearmor -o "${bazel_kr_path}"
RUN echo "deb [arch=amd64 signed-by=${bazel_kr_path}] http://storage.googleapis.com/bazel-apt stable jdk1.8" | \
RUN /usr/bin/curl -s https://bazel.build/bazel-release.pub.gpg | \
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reduce noise in docker build with the -s (silent) option

apt-get clean

# Set bazel-${bazel_version} as the default bazel alternative in this container
RUN update-alternatives --install /usr/bin/bazel bazel /usr/bin/bazel-${bazel_version} 1 && \
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The file /usr/bin/bazel does not exist in the bazel-${version} package ; set up an alternative and use it

@@ -0,0 +1,135 @@
#!/bin/bash
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -70,6 +70,7 @@ determine_tests_to_run() {
changed_dir="${changed_dir%%/*}/"
# Run all tests if common directories modified
if [[ ${changed_dir} =~ ^(integration_tests|util|cloudbuild)/$ ]]; then
continue
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

must remove this before merging

@cjac
Copy link
Contributor Author

cjac commented Dec 14, 2024

slightly longer than the 14 minutes I predicted, this run completed in 16:29

gpu/test_gpu.py Outdated
("STANDARD", ["m", "w-0", "w-1"], GPU_T4, GPU_T4, "11.8", 'rocky', '2.0'),
("STANDARD", ["m", "w-0", "w-1"], GPU_T4, GPU_T4, "12.4", 'rocky', '2.1'),
("STANDARD", ["m", "w-0", "w-1"], GPU_T4, GPU_T4, "12.0", 'rocky', '2.2'),
("KERBEROS", ["m", "w-0", "w-1"], GPU_T4, GPU_T4, "12.6", 'rocky', '2.2'),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We now exercise Kerberos in this test suite. True, it's only tested on 2.2-rocky9 with CUDA 12.6, but we could change any of the STANDARD definitions above to KERBEROS to improve that set.

@cjac
Copy link
Contributor Author

cjac commented Dec 14, 2024

/gcbrun

10 similar comments
@cjac
Copy link
Contributor Author

cjac commented Dec 14, 2024

/gcbrun

@cjac
Copy link
Contributor Author

cjac commented Dec 14, 2024

/gcbrun

@cjac
Copy link
Contributor Author

cjac commented Dec 14, 2024

/gcbrun

@cjac
Copy link
Contributor Author

cjac commented Dec 14, 2024

/gcbrun

@cjac
Copy link
Contributor Author

cjac commented Dec 14, 2024

/gcbrun

@cjac
Copy link
Contributor Author

cjac commented Dec 14, 2024

/gcbrun

@cjac
Copy link
Contributor Author

cjac commented Dec 14, 2024

/gcbrun

@cjac
Copy link
Contributor Author

cjac commented Dec 15, 2024

/gcbrun

@cjac
Copy link
Contributor Author

cjac commented Dec 15, 2024

/gcbrun

@cjac
Copy link
Contributor Author

cjac commented Dec 15, 2024

/gcbrun

@cjac
Copy link
Contributor Author

cjac commented Dec 15, 2024

okay, taking a little break...

@cjac
Copy link
Contributor Author

cjac commented Jan 22, 2025

/gcbrun

@cjac
Copy link
Contributor Author

cjac commented Jan 22, 2025

okay. it's a lot, but it all looks okay and has been thoroughly and repeatedly tested. Nota Bene that 'continue' in presubmit.sh ; I'll remove that once I receive an LGTM.

@cjac
Copy link
Contributor Author

cjac commented Jan 23, 2025

/gcbrun

cjac added a commit to LLC-Technologies-Collier/initialization-actions that referenced this pull request Jan 23, 2025
@cjac
Copy link
Contributor Author

cjac commented Jan 23, 2025

/gcbrun

@cjac
Copy link
Contributor Author

cjac commented Jan 23, 2025

/gcbrun

@cjac
Copy link
Contributor Author

cjac commented Jan 23, 2025

/gcbrun

@cjac
Copy link
Contributor Author

cjac commented Jan 23, 2025

/gcbrun

@cjac
Copy link
Contributor Author

cjac commented Jan 23, 2025

/gcbrun

@cjac
Copy link
Contributor Author

cjac commented Jan 23, 2025

/gcbrun

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[gpu] versions installed by gpu/install_gpu_driver.sh do not match requested versions
4 participants