device_put + jit vs pmap for data parallel training of neural networks #16282
Unanswered
YunfanZhang42 asked this question in Q&A
Hi,

I have a question regarding best practices for data parallel training of complicated neural networks. I am using JAX + Flax to train a ViT-based model in a data parallel/SPMD manner on TPUs. After reading the documentation, I see two ways of performing SPMD training of neural networks:

1. Replicate the model parameters and optimizer state with `flax.jax_utils.replicate`, use `pmean` in the `train_step` function to average the gradients and batch statistics across devices, and finally use `pmap` to parallelize the `train_step` function. This is what google-research/vision_transformer and google-research/scenic do, so I also implemented my code in this way. (A minimal sketch follows this list.)
2. Use `device_put` to shard the inputs and replicate the model parameters and optimizer state, write the `train_step` function as usual, and then `jit` the `train_step`. This method seems more elegant and does not require averaging gradients and batch statistics manually, but I am not sure it is feasible/recommended for a complex neural network. (See the second sketch below.)

I am wondering what the preferred way to perform SPMD training of neural nets will be moving forward. Also, I am wondering whether more documentation on this would make sense. Thanks for your help!
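For concreteness, here is a minimal sketch of option 1 (`pmap` + `replicate` + `pmean`). The tiny MLP, the Adam optimizer, and the batch shapes are illustrative placeholders, not the ViT setup described above.

```python
# Option 1 sketch: replicate state, pmean gradients inside train_step, pmap the step.
import jax
import jax.numpy as jnp
import flax.linen as nn
import optax
from flax import jax_utils
from flax.training import train_state


class MLP(nn.Module):  # placeholder model, not the ViT from the question
    @nn.compact
    def __call__(self, x):
        x = nn.Dense(128)(x)
        x = nn.relu(x)
        return nn.Dense(10)(x)


def create_state(rng):
    model = MLP()
    params = model.init(rng, jnp.ones((1, 784)))["params"]
    return train_state.TrainState.create(
        apply_fn=model.apply, params=params, tx=optax.adam(1e-3)
    )


def train_step(state, batch):
    def loss_fn(params):
        logits = state.apply_fn({"params": params}, batch["image"])
        return optax.softmax_cross_entropy_with_integer_labels(
            logits, batch["label"]
        ).mean()

    loss, grads = jax.value_and_grad(loss_fn)(state.params)
    # Average gradients (and the loss, for logging) across devices.
    grads = jax.lax.pmean(grads, axis_name="batch")
    loss = jax.lax.pmean(loss, axis_name="batch")
    return state.apply_gradients(grads=grads), loss


# axis_name must match the name used in the pmean calls above.
p_train_step = jax.pmap(train_step, axis_name="batch")

state = jax_utils.replicate(create_state(jax.random.PRNGKey(0)))

# Each batch needs a leading device axis: [n_devices, per_device_batch, ...].
n_dev = jax.local_device_count()
batch = {
    "image": jnp.ones((n_dev, 32, 784)),
    "label": jnp.zeros((n_dev, 32), dtype=jnp.int32),
}
state, loss = p_train_step(state, batch)
```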
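And a corresponding sketch of option 2, using `jax.device_put` with the `jax.sharding` API (JAX 0.4+) and a plain `jit`-compiled `train_step`. The model, the mesh axis name `"data"`, and the batch sizes are again assumptions for illustration; the point is that the step function contains no explicit `pmean`, and the compiler inserts the cross-device gradient reduction.

```python
# Option 2 sketch: shard the batch with device_put, replicate the state, jit the step.
import jax
import jax.numpy as jnp
import flax.linen as nn
import optax
from flax.training import train_state
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P


class MLP(nn.Module):  # placeholder model, not the ViT from the question
    @nn.compact
    def __call__(self, x):
        x = nn.Dense(128)(x)
        x = nn.relu(x)
        return nn.Dense(10)(x)


def create_state(rng):
    model = MLP()
    params = model.init(rng, jnp.ones((1, 784)))["params"]
    return train_state.TrainState.create(
        apply_fn=model.apply, params=params, tx=optax.adam(1e-3)
    )


# Written exactly like a single-device step: no pmean, no leading device axis.
def train_step(state, batch):
    def loss_fn(params):
        logits = state.apply_fn({"params": params}, batch["image"])
        return optax.softmax_cross_entropy_with_integer_labels(
            logits, batch["label"]
        ).mean()

    loss, grads = jax.value_and_grad(loss_fn)(state.params)
    return state.apply_gradients(grads=grads), loss


jit_train_step = jax.jit(train_step)

# One mesh axis over all local devices; shard data along it, replicate params.
mesh = Mesh(jax.local_devices(), axis_names=("data",))
data_sharding = NamedSharding(mesh, P("data"))
replicated = NamedSharding(mesh, P())

state = jax.device_put(create_state(jax.random.PRNGKey(0)), replicated)

# Global batch; its leading dimension must be divisible by the device count.
batch = jax.device_put(
    {"image": jnp.ones((256, 784)), "label": jnp.zeros((256,), dtype=jnp.int32)},
    data_sharding,
)
state, loss = jit_train_step(state, batch)
```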
Replies: 1 comment

After some digging on previous issues, it seems that the difference is mostly in efficiency.