Error when parsing GPUs on a node when only specifying the node name: --include=node3 vs --include=node3:1,2,4
#6671
Comments
Hi @stephen-nju - can you please update the title to better reflect the issue? The current title was copied from the CI workflow failure script. I tried to rewrite it, but more information would help. Could you add a sample repro case or more details?
DeepSpeed version: 0.15.3
The only difference between the two scripts below is the --include parameter.
optinonal paramas neftune_noise_alpha is null
optinonal paramas neftune_noise_alpha is null
Hi @stephen-nju - I'm still not sure I follow what the problem is. Could you try describing it one more time? You believe there is a bug when passing only the node name to the --include= argument? Would you consider opening a PR that changes the parsing to what you believe is correct?
Hi @loadams - I think the argument --include=node3 should be equivalent to --include=node3:1,2,3,4,5,6,7,8 when node3 has 8 GPUs. But when --include=node3 is set, the program raises "IndexError: list index out of range". It does not fall back to the default devices (all 8 GPUs) on node3.
When using --include=node3, the DeepSpeed parser raises an error, but --include=node3:1,2,3,4,5,6,7,8 works.
I checked the runner.py code: when SLOT_LIST_START is not in the config, the devices are set to [], but the help text for the --include argument says "If :SLOT is omitted, include all slots on that host".
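To make the expected behavior concrete, here is a minimal sketch of a parser that follows the quoted help text. It is not DeepSpeed's runner.py code: `parse_include`, the `available` mapping, and the use of `@` to separate node specs are assumptions made for illustration only.

```python
def parse_include(include_str, available):
    """Hypothetical helper, not DeepSpeed's implementation.

    include_str: e.g. "node3" or "node3:1,2,4" (node specs assumed to be joined by "@").
    available:   dict mapping host name -> list of slot (GPU) ids on that host.
    Returns a dict mapping host name -> list of selected slot ids.
    """
    selected = {}
    for node_spec in include_str.split("@"):
        # partition() returns an empty separator when ":" is absent
        name, sep, slot_str = node_spec.partition(":")
        if name not in available:
            raise ValueError(f"unknown host: {name!r}")
        if sep:
            # Explicit slot list: "node3:1,2,4" -> [1, 2, 4]
            slots = [int(s) for s in slot_str.split(",")]
            for s in slots:
                if s not in available[name]:
                    raise ValueError(f"no slot {s} on host {name!r}")
        else:
            # ":SLOT" omitted: include all slots on that host
            slots = list(available[name])
        selected[name] = slots
    return selected


hosts = {"node3": list(range(8))}  # assumed: 8 GPUs with ids 0..7
print(parse_include("node3", hosts))
print(parse_include("node3:1,2,4", hosts))
```

With this behavior, a bare host name selects every slot on the host instead of an empty device list, which is what the help text describes.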