
Driver build fails on AWS g5g.xlarge #570

Open
martin31821 opened this issue Aug 18, 2023 · 4 comments

@martin31821

1. Quick Debug Information

  • OS/Version (e.g. RHEL 8.6, Ubuntu 22.04): Ubuntu 20.04 for EKS (ARM) / ami-09b6385a90c8d3cee
  • Kernel Version: 5.15.0-1041-aws
  • Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): containerd 1.7.2
  • K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): EKS 1.25.11
  • GPU Operator Version: 23.6.0

2. Issue or feature description

On AWS g5g.xlarge (the smallest GPU node type), the driver build fails because it runs out of system memory.
It should be possible to limit build concurrency to a much lower level so that the build fits within the node's 8 GB of memory.

3. Steps to reproduce the issue

  1. Create an EKS cluster and set up the GPU Operator
  2. Spawn a g5g.xlarge node
  3. 💥 The driver build fails with an out-of-memory error

4. Information to attach (optional if deemed irrelevant)

[screenshot of the failing driver build attached]

Is there already a way to limit concurrency in the nvcr.io/nvidia/driver container, or is that not possible at the moment?

@martin31821 (Author)

After digging a bit deeper, the root cause seems to be in the nvidia-driver script: `_create_driver_package()` runs `make -s -j SYSSRC=/lib/modules/${KERNEL_VERSION}/build nv-linux.o nv-modeset-linux.o > /dev/null`, which starts all compile jobs at once (unbounded `-j`) and overloads the small node.

I'll try to override the script to limit concurrency there.
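
A minimal sketch of what such an override might look like, assuming the make invocation quoted above and a hard-coded job cap (the value of 2 is just a guess for an 8 GB node, not a tested setting):

```sh
# Sketch of a patched _create_driver_package(): cap the number of parallel
# compile jobs instead of letting `make -j` (no value) spawn one per target.
MAX_THREADS=2  # assumed value for an 8 GB g5g.xlarge node
make -s -j ${MAX_THREADS} SYSSRC=/lib/modules/${KERNEL_VERSION}/build \
    nv-linux.o nv-modeset-linux.o > /dev/null
```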

@shivamerla (Contributor)

Thanks for reporting this, @martin31821. We will look into making the max thread count configurable for low-memory systems.
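
One possible shape for that, as a sketch only (the MAX_THREADS variable name and the nproc default are assumptions, not a documented setting):

```sh
# Default to one compile job per CPU, but let low-memory systems override it
# via an environment variable passed into the driver container.
MAX_THREADS=${MAX_THREADS:-$(nproc)}
make -s -j ${MAX_THREADS} SYSSRC=/lib/modules/${KERNEL_VERSION}/build \
    nv-linux.o nv-modeset-linux.o > /dev/null
```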

@rockholla commented Nov 27, 2023

@shivamerla created a draft PR in the driver container images project as a start: https://gitlab.com/nvidia/container-images/driver/-/merge_requests/285.

I'm not sure whether we also want the operator to react automatically, using NFD data, to cases like the GPU core count outweighing the available memory in GB when it generates the driver spec, passing in a computed --max-threads argument, or whether we should leave that to the user-managed driver CRD. I'm happy to help further at that level as well, but this seemed like a good checkpoint to discuss.
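
For discussion, a rough illustration of the kind of memory-aware sizing described above (a hypothetical heuristic, not what the merge request implements; the ~2 GB-per-job figure is an assumption):

```sh
# Derive a --max-threads value from available memory, assuming each parallel
# kernel-module compile job needs roughly 2 GB of RAM.
mem_gb=$(awk '/MemAvailable/ {printf "%d", $2/1048576}' /proc/meminfo)
max_threads=$(( mem_gb / 2 ))
[ "${max_threads}" -lt 1 ] && max_threads=1
echo "would pass --max-threads=${max_threads} to the driver build"
```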

@rockholla

Updated PR after the relevant repo moved to GitHub: NVIDIA/gpu-driver-container#6
