Amazon linux support #127

Open
wants to merge 1 commit into base: main


shivakunv
Contributor

No description provided.

@shivakunv shivakunv self-assigned this Sep 26, 2024
@shivakunv shivakunv force-pushed the amazonlinuxsupport branch 2 times, most recently from f6096c6 to 57fdf0b Compare October 4, 2024 06:46
@shivakunv shivakunv marked this pull request as ready for review October 7, 2024 16:31
copy-pr-bot bot commented Oct 26, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@shivakunv shivakunv force-pushed the amazonlinuxsupport branch 2 times, most recently from dedd781 to 31a2314 Compare October 26, 2024 18:35
@shivakunv shivakunv force-pushed the amazonlinuxsupport branch 2 times, most recently from 6d553ac to bc5998b Compare October 26, 2024 19:16
@shivakunv shivakunv force-pushed the amazonlinuxsupport branch 3 times, most recently from 426a2cf to 49b283c Compare October 28, 2024 16:51
@shivakunv shivakunv force-pushed the amazonlinuxsupport branch 2 times, most recently from eb0282a to c46bcaf Compare October 28, 2024 19:27
@shivakunv shivakunv force-pushed the amazonlinuxsupport branch 2 times, most recently from 3d7b0bb to d179a93 Compare October 29, 2024 15:38
@shivakunv shivakunv force-pushed the amazonlinuxsupport branch 2 times, most recently from 74b2554 to 539af6c Compare October 30, 2024 20:24
@shivakunv
Contributor Author

@cdesiniotis and @tariq1890 PTAL

# due to cuda repo cache issue , nvidia-fabric-manager refers to 565 version only
# install fabric-manager and nvidia-nscq
RUN if [ "$DRIVER_TYPE" != "vgpu" ] && [ "$TARGETARCH" != "arm64" ]; then \
dnf install -y nvidia-fabric-manager libnvidia-nscq-${DRIVER_BRANCH}-${DRIVER_VERSION}-1; fi
Contributor

We can't commit this as-is. I just looked at the packages uploaded here, so the following should work

Suggested change
- dnf install -y nvidia-fabric-manager libnvidia-nscq-${DRIVER_BRANCH}-${DRIVER_VERSION}-1; fi
+ dnf install -y nvidia-fabricmanager-${DRIVER_BRANCH}-${DRIVER_VERSION}-1 libnvidia-nscq-${DRIVER_BRANCH}-${DRIVER_VERSION}-1; fi

Note that you have nvidia-fabric-manager, when it should be nvidia-fabricmanager*.

Contributor Author

@shivakunv shivakunv Oct 31, 2024

Removed 560.35.03, since the fabric manager for 560.35.03 is not available:

dnf list *fabric* 
24.23 cuda-drivers-fabricmanager.x86_64     565.57.01-1           cuda-amzn2023-x86_64
24.23 cuda-drivers-fabricmanager-555.x86_64 555.42.06-1           cuda-amzn2023-x86_64
24.23 cuda-drivers-fabricmanager-560.x86_64 560.35.03-1           cuda-amzn2023-x86_64
24.23 cuda-drivers-fabricmanager-565.x86_64 565.57.01-1           cuda-amzn2023-x86_64
24.23 libfabric.x86_64                      1.14.0-2.amzn2023.0.2 amazonlinux
24.23 libfabric-devel.x86_64                1.14.0-2.amzn2023.0.2 amazonlinux
24.23 nvidia-fabric-manager.x86_64          565.57.01-1           cuda-amzn2023-x86_64
24.23 nvidia-fabric-manager-devel.x86_64    565.57.01-1           cuda-amzn2023-x86_64

Added a conditional check for both packages and the installation.

Contributor Author

Removed the if condition and added installation of nvidia-fabric-manager-${DRIVER_VERSION}-1.

@shivakunv shivakunv force-pushed the amazonlinuxsupport branch 4 times, most recently from 400da5c to ed626af Compare October 31, 2024 07:15
@shivakunv
Contributor Author

PTAL @cdesiniotis @tariq1890

# Initialize the fabric manager package variable
FABRIC_PACKAGE=""; \
if dnf list nvidia-fabric-manager-${DRIVER_VERSION}-1 &>/dev/null; then \
FABRIC_PACKAGE="nvidia-fabric-manager-${DRIVER_VERSION}-1"; \
Contributor

Looking at https://developer.download.nvidia.com/compute/cuda/repos/amzn2023/x86_64/, the only fabric manager packages available are named nvidia-fabric-manager-${DRIVER_VERSION}-1. Let's remove the conditional here and always use that package name. If the name ever changes, our builds will break and we will know right away.
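For illustration, the unconditional install suggested above could be sketched roughly as follows. This is only a sketch, not the code from the PR: the function name `install_fabric_manager` and the overridable `DNF` variable are hypothetical and exist only so the snippet can be dry-run without touching the system. The package name pattern is the one quoted in the comment above.

```shell
# Sketch only: DNF and install_fabric_manager are hypothetical names.
# DNF defaults to the real dnf binary but can be overridden for a dry run.
DNF="${DNF:-dnf}"

install_fabric_manager() {
    # $1 is the full driver version, e.g. the value of DRIVER_VERSION.
    # There is deliberately no "dnf list" probe and no fallback name:
    # if the package is missing, dnf exits non-zero and that status is
    # returned to the caller unchanged.
    "$DNF" install -y "nvidia-fabric-manager-${1}-1"
}
```

Called as `install_fabric_manager "$DRIVER_VERSION"` from a Dockerfile `RUN` step, a renamed or missing package makes the step exit non-zero, which fails the image build instead of silently skipping fabric manager.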

Contributor Author

Removed the if condition and added installation of nvidia-fabric-manager-${DRIVER_VERSION}-1.

Comment on lines +402 to +406
if [ -f /sys/module/nvidia_fs/refcnt ]; then
nvidia_fs_refs=$(< /sys/module/nvidia_fs/refcnt)
rmmod_args+=("nvidia-fs")
((++nvidia_deps))
fi
Contributor

Is this change required? We have a separate sidecar container for loading / unloading nvidia-fs.

Signed-off-by: shiva kumar <[email protected]>
3 participants