
Driver build fails on AWS g5g.xlarge #570

Open
martin31821 opened this issue Aug 18, 2023 · 4 comments

@martin31821

1. Quick Debug Information

  • OS/Version (e.g. RHEL 8.6, Ubuntu 22.04): Ubuntu 20.04 for EKS (ARM) / ami-09b6385a90c8d3cee
  • Kernel Version: 5.15.0-1041-aws
  • Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): containerd 1.7.2
  • K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): EKS 1.25.11
  • GPU Operator Version: 23.6.0

2. Issue or feature description

On AWS g5g.xlarge (the smallest GPU node type), the driver build fails because it runs out of system memory.
It should be possible to limit build concurrency to a much lower level so that the build fits within the node's 8 GB of memory.

3. Steps to reproduce the issue

  1. Create an EKS cluster and set up the GPU Operator
  2. Spawn a g5g.xlarge node
  3. 💥 The driver build fails with an out-of-memory error

4. Information to attach (optional if deemed irrelevant)

[screenshot of the failing driver build attached]

Is there already a way to limit concurrency in the nvcr.io/nvidia/driver container, or is that not possible at the moment?

@martin31821 (Author)

After digging a bit deeper, the root cause seems to be in the nvidia-driver script: `_create_driver_package()` runs `make -s -j SYSSRC=/lib/modules/${KERNEL_VERSION}/build nv-linux.o nv-modeset-linux.o > /dev/null`, which starts all compile jobs at once (unbounded `-j`) and overloads the small node.

I'll try to override the script to limit concurrency there.
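
A minimal sketch of what such an override might look like, assuming the make invocation quoted above and a hard-coded job cap (the value of 2 is just a guess for an 8 GB node, not a tested setting):

```sh
# Sketch of a patched _create_driver_package(): cap the number of parallel
# compile jobs instead of letting `make -j` (no value) spawn one per target.
MAX_THREADS=2  # assumed value for an 8 GB g5g.xlarge node
make -s -j ${MAX_THREADS} SYSSRC=/lib/modules/${KERNEL_VERSION}/build \
    nv-linux.o nv-modeset-linux.o > /dev/null
```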

@shivamerla (Contributor)

Thanks for reporting this, @martin31821. We will look into making the max thread count configurable for low-memory systems.
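
One possible shape for that, as a sketch only (the MAX_THREADS variable name and the nproc default are assumptions, not a documented setting):

```sh
# Default to one compile job per CPU, but let low-memory systems override it
# via an environment variable passed into the driver container.
MAX_THREADS=${MAX_THREADS:-$(nproc)}
make -s -j ${MAX_THREADS} SYSSRC=/lib/modules/${KERNEL_VERSION}/build \
    nv-linux.o nv-modeset-linux.o > /dev/null
```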

@rockholla commented Nov 27, 2023

@shivamerla created a draft PR in the driver container images project as a start: https://gitlab.com/nvidia/container-images/driver/-/merge_requests/285.

I'm not sure whether we also want the operator to react automatically, using NFD data, to cases like the GPU core count outweighing the available memory in GB when it generates the driver spec, passing in a computed --max-threads argument, or whether we should leave that to the user-managed driver CRD. I'm happy to help further at that level as well, but this seemed like a good checkpoint to discuss.
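
For discussion, a rough illustration of the kind of memory-aware sizing described above (a hypothetical heuristic, not what the merge request implements; the ~2 GB-per-job figure is an assumption):

```sh
# Derive a --max-threads value from available memory, assuming each parallel
# kernel-module compile job needs roughly 2 GB of RAM.
mem_gb=$(awk '/MemAvailable/ {printf "%d", $2/1048576}' /proc/meminfo)
max_threads=$(( mem_gb / 2 ))
[ "${max_threads}" -lt 1 ] && max_threads=1
echo "would pass --max-threads=${max_threads} to the driver build"
```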

@rockholla

Updated PR after the relevant repo moved to GitHub: NVIDIA/gpu-driver-container#6
