[HOWTO] TPU POD Training on multiple nodes (bits_and_tpu branch) #1237
-
Hey, Thanks,
-
@eliahuhorwitz it's on the todo list, but I have not tried it. I assume it will just involve changing how the training launches, but I'm not 100% sure.
-
Hi! I tried it and it works without issues. I just followed the instructions in XLA's README. This is the Dockerfile I used:
FROM gcr.io/tpu-pytorch/xla:r1.10_3.8_tpuvm
RUN pip install --upgrade pip
RUN mkdir /<your-dir>/
WORKDIR /<your-dir>/
# I have `git+https://github.com/rwightman/pytorch-image-models.git@fafece230b8c8325fd6144efbab25cbc6cf5ca5c`
# in my `requirements.txt`. This is a specific commit from `bits_and_tpu`, but I guess `@bits_and_tpu` should also work.
COPY requirements.txt requirements.txt
RUN pip install -r requirements.txt
RUN pip install wandb
# Assuming that you are in the directory with your code
COPY . .
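The launch command below references an image hosted on GCR, so the image presumably has to be built and pushed first. A minimal sketch, assuming standard Docker/gcloud tooling; the project and image names are placeholders I made up:

# Build the image from the Dockerfile above (run in your code directory)
docker build -t gcr.io/<your-project>/<your-image>:latest .
# Let Docker authenticate against Google Container Registry, then push
gcloud auth configure-docker
docker push gcr.io/<your-project>/<your-image>:latest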
Training is then launched with:

python3 -m torch_xla.distributed.xla_dist --tpu=${TPU_NAME} --restart-tpuvm-pod-server --docker-image=gcr.io/path-to-my-image --docker-run-flag=--rm=true --docker-run-flag=--shm-size=40GB --docker-run-flag=--mount=type=bind,source="$(pwd)"/output,target=/<your-dir>/output -- python launch_xla.py --num-devices 8 train.py <data-dir> --output /<your-dir>/output --use-mp-loader --model <your-model> --other-args

This worked on a TPU pod. To support the `--use-mp-loader` flag, I added the following to the training script:

if args.use_mp_loader and dev_env.type_xla:
    import torch_xla.distributed.parallel_loader as pl
    # Wrap both loaders in MpDeviceLoader so batches are moved to the XLA device
    assert isinstance(loader_train, fetcher.Fetcher)
    assert isinstance(loader_eval, fetcher.Fetcher)
    loader_train.use_mp_loader = True
    loader_train._loader = pl.MpDeviceLoader(loader_train._loader, dev_env.device)
    loader_eval.use_mp_loader = True
    loader_eval._loader = pl.MpDeviceLoader(loader_eval._loader, dev_env.device)

I am not sure this is necessary; I did it just because the distributed training examples in PyTorch XLA's repo use it.
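For context, this is roughly how the PyTorch XLA distributed examples use MpDeviceLoader: it wraps an ordinary DataLoader and yields batches already transferred to the XLA device. A minimal sketch based on the public torch_xla API; the dummy dataset and batch size are placeholders, not from timm:

import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl
from torch.utils.data import DataLoader, TensorDataset

device = xm.xla_device()  # XLA device owned by this process

# Ordinary CPU-side loader over placeholder data
dataset = TensorDataset(torch.randn(64, 3, 224, 224), torch.randint(0, 10, (64,)))
cpu_loader = DataLoader(dataset, batch_size=8)

# MpDeviceLoader prefetches batches and copies them to the XLA device,
# so the training loop receives device tensors directly.
device_loader = pl.MpDeviceLoader(cpu_loader, device)

for images, labels in device_loader:
    # images and labels are already on the XLA device here
    pass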
-
@dedeswim thanks for the details. Re the MP loader, I did some checks and didn't notice a significant difference on TPU-VM (v3 is what I tried); I suspect it was more beneficial for the previous two-VM configuration, but it could depend heavily on the workload. My checks were not extensive and covered only a v3-8 config.