[HOWTO] TPU POD Training on multiple nodes (bits_and_tpu branch) #1237
-
Hey, Thanks,
-
@eliahuhorwitz it's on the todo list, but I have not tried it. I assume it will just involve changing how the training launches, but I'm not 100% sure.
-
Hi! I tried it and it works without issues. I just followed the instructions in XLA's README. This is the Dockerfile I used:
FROM gcr.io/tpu-pytorch/xla:r1.10_3.8_tpuvm
RUN pip install --upgrade pip
RUN mkdir /<your-dir>/
WORKDIR /<your-dir>/
# I have `git+https://github.com/rwightman/pytorch-image-models.git@fafece230b8c8325fd6144efbab25cbc6cf5ca5c`
# in my `requirements.txt`. This is a specific commit from `bits_and_tpu`, but I guess `@bits_and_tpu` should also work.
COPY requirements.txt requirements.txt
RUN pip install -r requirements.txt
RUN pip install wandb
# Assuming that you are in the directory with your code
COPY . .
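The launch command below references an image hosted on GCR, so the image presumably has to be built and pushed first. A minimal sketch, assuming standard Docker/gcloud tooling; the project and image names are placeholders I made up:

# Build the image from the Dockerfile above (run in your code directory)
docker build -t gcr.io/<your-project>/<your-image>:latest .
# Let Docker authenticate against Google Container Registry, then push
gcloud auth configure-docker
docker push gcr.io/<your-project>/<your-image>:latest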
Training is then launched with:

python3 -m torch_xla.distributed.xla_dist --tpu=${TPU_NAME} --restart-tpuvm-pod-server --docker-image=gcr.io/path-to-my-image --docker-run-flag=--rm=true --docker-run-flag=--shm-size=40GB --docker-run-flag=--mount=type=bind,source="$(pwd)"/output,target=/<your-dir>/output -- python launch_xla.py --num-devices 8 train.py <data-dir> --output /<your-dir>/output --use-mp-loader --model <your-model> --other-args

This worked on a TPU pod. To support the `--use-mp-loader` flag, I added the following to the training script:

if args.use_mp_loader and dev_env.type_xla:
    import torch_xla.distributed.parallel_loader as pl
    # Wrap both loaders in MpDeviceLoader so batches are moved to the XLA device
    assert isinstance(loader_train, fetcher.Fetcher)
    assert isinstance(loader_eval, fetcher.Fetcher)
    loader_train.use_mp_loader = True
    loader_train._loader = pl.MpDeviceLoader(loader_train._loader, dev_env.device)
    loader_eval.use_mp_loader = True
    loader_eval._loader = pl.MpDeviceLoader(loader_eval._loader, dev_env.device)

I am not sure this is necessary; I did it just because the distributed training examples in PyTorch XLA's repo use it.
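For context, this is roughly how the PyTorch XLA distributed examples use MpDeviceLoader: it wraps an ordinary DataLoader and yields batches already transferred to the XLA device. A minimal sketch based on the public torch_xla API; the dummy dataset and batch size are placeholders, not from timm:

import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl
from torch.utils.data import DataLoader, TensorDataset

device = xm.xla_device()  # XLA device owned by this process

# Ordinary CPU-side loader over placeholder data
dataset = TensorDataset(torch.randn(64, 3, 224, 224), torch.randint(0, 10, (64,)))
cpu_loader = DataLoader(dataset, batch_size=8)

# MpDeviceLoader prefetches batches and copies them to the XLA device,
# so the training loop receives device tensors directly.
device_loader = pl.MpDeviceLoader(cpu_loader, device)

for images, labels in device_loader:
    # images and labels are already on the XLA device here
    pass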
-
@dedeswim thanks for the details. Re the MP loader, I did some checks and didn't notice a significant difference on TPU-VM (v3 is what I tried); I suspect it was more beneficial for the previous two-VM configuration, but it could depend heavily on the workload. My checks were not extensive and covered only a v3-8 config.