mount neuron devices for local_docker/aws_batch scheduler #920

Merged
merged 1 commit into from
Jun 20, 2024

Conversation

Contributor

@ryxli ryxli commented Jun 14, 2024

Add Neuron device mounts for AWS Trn instances, and mount them for the local_docker scheduler.

Docker-native way to expose Neuron devices to containers:
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/tutorials/build-run-neuron-container.html#container-devices

Test plan:

updated unit tests

663 passed, 104 warnings in 224.46s (0:03:44)

with dist.ddp component

torchx run -s local_docker --dryrun dist.ddp -h aws_trn1.32xlarge -j 1

=== SCHEDULER REQUEST ===
- !!python/object:torchx.schedulers.docker_scheduler.DockerContainer
  command:
  - bash
  - -c
  - torchrun --rdzv_backend c10d --rdzv_endpoint localhost:0 --rdzv_id 'abc-fm1vcp71mkn0dc'
    --nnodes 1 --nproc_per_node 1 --tee 3 --role '' -m abc
  image: sha256:e7e0cef667c97bae5bdee516d246459fad63f0651f6735d12075f4775e18893e
  kwargs:
    devices:
    - /dev/infiniband/uverbs0:/dev/infiniband/uverbs0:rwm
    - /dev/infiniband/uverbs1:/dev/infiniband/uverbs1:rwm
    - /dev/infiniband/uverbs2:/dev/infiniband/uverbs2:rwm
    - /dev/infiniband/uverbs3:/dev/infiniband/uverbs3:rwm
    - /dev/infiniband/uverbs4:/dev/infiniband/uverbs4:rwm
    - /dev/infiniband/uverbs5:/dev/infiniband/uverbs5:rwm
    - /dev/infiniband/uverbs6:/dev/infiniband/uverbs6:rwm
    - /dev/infiniband/uverbs7:/dev/infiniband/uverbs7:rwm
    - /dev/neuron0:/dev/neuron0:rwm
    - /dev/neuron1:/dev/neuron1:rwm
    - /dev/neuron2:/dev/neuron2:rwm
    - /dev/neuron3:/dev/neuron3:rwm
    - /dev/neuron4:/dev/neuron4:rwm
    - /dev/neuron5:/dev/neuron5:rwm
    - /dev/neuron6:/dev/neuron6:rwm
    - /dev/neuron7:/dev/neuron7:rwm
    - /dev/neuron8:/dev/neuron8:rwm
    - /dev/neuron9:/dev/neuron9:rwm
    - /dev/neuron10:/dev/neuron10:rwm
    - /dev/neuron11:/dev/neuron11:rwm
    - /dev/neuron12:/dev/neuron12:rwm
    - /dev/neuron13:/dev/neuron13:rwm
    - /dev/neuron14:/dev/neuron14:rwm
    - /dev/neuron15:/dev/neuron15:rwm
    environment:
      LOGLEVEL: WARNING
      TORCHX_JOB_ID: local_docker://torchx/abc-fm1vcp71mkn0dc
      TORCHX_RANK0_HOST: abc-fm1vcp71mkn0dc-abc-0
      TORCHX_TRACKING_EXPERIMENT_NAME: default-experiment
    hostname: abc-fm1vcp71mkn0dc-abc-0
    labels:
      torchx.pytorch.org/app-id: abc-fm1vcp71mkn0dc
      torchx.pytorch.org/replica-id: '0'
      torchx.pytorch.org/role-name: abc
      torchx.pytorch.org/version: 0.7.0dev0
    mem_limit: 503296m
    mounts: []
    name: abc-fm1vcp71mkn0dc-abc-0
    nano_cpus: 128000000000
    network: torchx
    privileged: false
    shm_size: 503296m
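
For illustration, the `devices` list in the dry-run output above follows a simple `host:container:permissions` pattern per device index. A minimal sketch of how such a list could be generated is shown here. The helper name `device_mounts` is hypothetical and is not taken from the torchx code in this PR; the counts (8 EFA devices, 16 Neuron devices) are read off the output above for `aws_trn1.32xlarge`.

```python
from typing import List


def device_mounts(host_prefix: str, count: int, perms: str = "rwm") -> List[str]:
    """Build Docker device strings of the form host:container:permissions.

    Hypothetical helper for illustration only; not the torchx implementation.
    """
    return [f"{host_prefix}{i}:{host_prefix}{i}:{perms}" for i in range(count)]


# Counts taken from the dry-run output above for aws_trn1.32xlarge.
efa = device_mounts("/dev/infiniband/uverbs", 8)
neuron = device_mounts("/dev/neuron", 16)
devices = efa + neuron

print(devices[0])
print(devices[-1])
```

The resulting list has the same shape as the `kwargs.devices` entry in the scheduler request, which is what the Docker SDK accepts for exposing host devices to a container.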

@facebook-github-bot added the CLA Signed label Jun 14, 2024
@ryxli changed the title from "support mounting neuron devices for local_docker scheduler" to "mount neuron devices for local_docker/aws_batch scheduler" Jun 14, 2024
Contributor Author

ryxli commented Jun 17, 2024

@d4l3k, @kiukchung could you help review this change, or find the right POC to review it?
Thanks

Member

@d4l3k d4l3k left a comment

Code looks good to me -- just wondering if there are any backwards compatibility issues

gpu=0,
memMB=32 * GiB,
capabilities={K8S_ITYPE: "trn1.2xlarge"},
devices={NEURON_DEVICE: 1},
Member

How were these being used before without these device mounts? Was there an init script/host config that handled this?

Contributor Author

There is an optional package, aws-neuronx-oci-hook, which allows setting “AWS_NEURON_VISIBLE_DEVICES”:
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/tutorials/build-run-neuron-container.html#container-devices

This is similar to the NVIDIA container setting CUDA_VISIBLE_DEVICES.
For Kubernetes, there is a plugin which uses aws.amazon.com/neurondevice.
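
To make the contrast in this reply concrete, here is a rough sketch of the two ways a container launch could expose Neuron devices: explicit device mounts (the approach in this PR) versus an environment variable read by the optional OCI hook. The function `neuron_container_kwargs` is hypothetical, and the comma-separated index format used for `AWS_NEURON_VISIBLE_DEVICES` is an assumption that should be checked against the Neuron documentation linked above.

```python
from typing import Any, Dict, List


def neuron_container_kwargs(num_devices: int, use_oci_hook: bool = False) -> Dict[str, Any]:
    """Return container kwargs exposing Neuron devices (illustrative only).

    Hypothetical helper; not part of torchx or the Docker SDK.
    """
    if use_oci_hook:
        # Assumes aws-neuronx-oci-hook is installed on the host.
        # The comma-separated index format is an assumption.
        visible = ",".join(str(i) for i in range(num_devices))
        return {"environment": {"AWS_NEURON_VISIBLE_DEVICES": visible}}
    # Explicit mounts: no host-side hook is required.
    devices: List[str] = [
        f"/dev/neuron{i}:/dev/neuron{i}:rwm" for i in range(num_devices)
    ]
    return {"devices": devices}


print(neuron_container_kwargs(2))
print(neuron_container_kwargs(2, use_oci_hook=True))
```

Explicit mounts only depend on the device nodes existing on the host, whereas the environment-variable route depends on the extra hook package being present.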

Member

d4l3k commented Jun 17, 2024

Also, it looks like there are some lint issues; lintrunner -a should fix them.

Member

d4l3k commented Jun 18, 2024

@ryxli still has build failures

Contributor Author

ryxli commented Jun 18, 2024

@d4l3k

ran lintrunner -a one more time

Contributor Author

ryxli commented Jun 19, 2024

Unsure what the failing test is; my environment is py310.
https://github.com/pytorch/torchx/blob/main/torchx/schedulers/test/local_scheduler_test.py#L1115

FAILED torchx/schedulers/test/local_scheduler_test.py::LocalDirectorySchedulerTest::test_no_orphan_process_function - AssertionError: OSError not raised

I don't think this is related to these changes.

Member

d4l3k commented Jun 20, 2024

@ryxli I expect that's flaky -- rerunning that build, will land once this is green

@d4l3k d4l3k merged commit bc09b55 into pytorch:main Jun 20, 2024
22 checks passed
@ryxli ryxli deleted the neuron_device branch June 20, 2024 20:58