Require Clarity on Running with SLURM Executor #12615
Comments
Even if I remove the mounts I get the same issue. I'm not sure how to set up the right permissions.
If I set it to a tmp dir it works but the run itself crashes with no logs -
No such file as the one mentioned above
@hemildesai can you please opine?
Hi @aflah02, are you running the SlurmExecutor from your local workstation or on the Slurm login node?
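One quick way to answer that question from inside whatever shell launches the script is to check whether the Slurm client tools are on the PATH. The snippet below is only a minimal sketch: the helper name `on_slurm_login_node` is hypothetical, and it assumes that a login node exposes `sbatch` and `squeue` while a workstation (or a Docker container started on one) does not.

```python
import shutil


def on_slurm_login_node(which=shutil.which) -> bool:
    """Return True if the Slurm client tools are visible on PATH.

    `which` is injectable so the check can be exercised without Slurm installed.
    Assumption: a login node has both `sbatch` and `squeue` available.
    """
    return all(which(tool) is not None for tool in ("sbatch", "squeue"))


if __name__ == "__main__":
    if on_slurm_login_node():
        print("Slurm client tools found: this looks like a login node.")
    else:
        print("No Slurm client tools on PATH: jobs would have to be submitted over SSH.")
```

Note that a container started with `docker run` usually does not see the host's Slurm binaries, so this check can report "no Slurm" even when the host itself is a login node.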
Hi @hemildesai
Makes sense, I will plan on merging it soon.
Thanks, do you have any ETA by any chance? Totally understandable if there isn't one.
Hi
Based on the tutorials, I've written an LLM pretraining script which works when I run it via the local executor on non-Slurm as well as single-node Slurm machines. I now want to run it on multiple nodes. To do that, I adapted the SLURM Executor from the tutorial. Here is my code -
I run this inside a Docker container, as otherwise I get import issues. So I first run this command -
docker run --gpus all --ulimit stack=6718846 --net=host --rm -it -v ${PWD}:/workspace -v /scratch/sws0/user/afkhan/Nemo_Work:/Storage -w /workspace nvcr.io/nvidia/nemo:25.02.rc7 bash
Now when I run the above file using
python file_name.py
I am prompted for the SSH password. Once I provide that, I get the following error -
I am not sure what I am doing wrong here. I need to mount to run it inside the Slurm session, right? Also, what is the right way to use the Slurm executor?
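Since the job runs on the cluster rather than inside the local Docker container, the mount sources most likely need to exist on the cluster side. A small check like the one below, run on the login node, can rule that out. It is a minimal sketch, not part of NeMo-Run: the function names are hypothetical, and it assumes mounts are written as `src:dst` strings (optionally with a trailing option such as `:ro`), in the same style as the `docker run -v` flags above.

```python
from pathlib import Path


def split_mount(mount: str) -> tuple[str, str]:
    """Split a 'src:dst' or 'src:dst:opts' mount string into (src, dst)."""
    parts = mount.split(":")
    if len(parts) < 2 or not parts[0] or not parts[1]:
        raise ValueError(f"expected 'src:dst', got {mount!r}")
    return parts[0], parts[1]


def missing_mount_sources(mounts: list[str]) -> list[str]:
    """Return the mount sources that do not exist on this machine."""
    return [src for src, _ in map(split_mount, mounts) if not Path(src).exists()]


if __name__ == "__main__":
    # Source path taken from the docker command above; adjust for your cluster.
    mounts = ["/scratch/sws0/user/afkhan/Nemo_Work:/Storage"]
    for src in missing_mount_sources(mounts):
        print(f"mount source not found here: {src}")
```

An empty result only shows that the paths exist on the machine where the check ran. It says nothing about permissions on the compute nodes.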