mount neuron devices for local_docker/aws_batch scheduler #920
@d4l3k, @kiukchung could you help review, or find the right point of contact to review this change?
Code looks good to me -- just wondering if there are any backwards compatibility issues
    gpu=0,
    memMB=32 * GiB,
    capabilities={K8S_ITYPE: "trn1.2xlarge"},
    devices={NEURON_DEVICE: 1},
How were these being used before without these device mounts? Was there an init script or host config that handled this?
There is an optional package, aws-neuronx-oci-hook, which allows setting "AWS_NEURON_VISIBLE_DEVICES":
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/tutorials/build-run-neuron-container.html#container-devices
This is similar to setting CUDA_VISIBLE_DEVICES for NVIDIA containers.
For Kubernetes, there is a device plugin which uses the aws.amazon.com/neurondevice resource.
Also looks like there are some lint issues
@ryxli still has build failures
Ran lintrunner -a one more time.
Unsure what the failing test is; my environment is py310.
I don't think this is related to these changes.
@ryxli I expect that's flaky -- rerunning that build, will land once this is green
Add Neuron device mounts for AWS trn instances, and mount these devices in the local_docker scheduler.
Docker-native way to expose Neuron devices to containers:
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/tutorials/build-run-neuron-container.html#container-devices
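As a rough illustration of the Docker-native approach, the sketch below turns a `devices` mapping like the one in the diff into Docker device strings of the form `<host_path>:<container_path>:<permissions>`. The helper name `neuron_device_mounts` is hypothetical and is not the code in this PR; the value assigned to `NEURON_DEVICE` is assumed to match the Kubernetes resource name mentioned in the discussion, and the `rwm` permission string is an assumption as well.

```python
from typing import Dict, List

# Assumption: the resource key matches the Kubernetes device-plugin name
# mentioned in the discussion; the real constant may be defined differently.
NEURON_DEVICE = "aws.amazon.com/neurondevice"


def neuron_device_mounts(devices: Dict[str, int]) -> List[str]:
    """Hypothetical helper: map a requested Neuron device count to Docker
    device strings (/dev/neuron0, /dev/neuron1, ...)."""
    count = devices.get(NEURON_DEVICE, 0)
    # Each entry exposes one host device at the same path in the container,
    # with read, write, and mknod permissions (assumed here).
    return [f"/dev/neuron{i}:/dev/neuron{i}:rwm" for i in range(count)]


if __name__ == "__main__":
    for mount in neuron_device_mounts({NEURON_DEVICE: 2}):
        print(mount)
```

A list in this shape is what the Docker SDK for Python accepts in the `devices` argument of `containers.run`, so a scheduler could pass it through unchanged; whether the PR does exactly that is not shown in this conversation.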
Test plan:
Updated unit tests.
Dry run with the dist.ddp component:
torchx run -s local_docker --dryrun dist.ddp -h aws_trn1.32xlarge -j 1