Adding support for hvd LaidOutVariable #1
Comments
The goal here would be to be able to train the MNIST demo model with the new backend implementation: https://github.com/DifferentiableUniverseInitiative/mesh/blob/master/examples/mnist.py
Hi @EiffL, I'm leaving this comment as a checkpoint of where we were when the IDRIS hackathon ended. @b-remy and I were trying to solve this issue but didn't have time to finish. I uploaded my progress to the branch tob_vars, and Benjamin's is on his branch ben-variable. With our current implementation we were able to run a forward pass of a simple dense network. However, when we tried to optimise the network we hit an error, and we're not sure of its origin or how to solve it. The error we found was this one, and it is always related to Assign and the optimisation.
To reproduce the error, one can run this test script with this job script. The implementation of the LaidOutVariable is here.
Thanks so much @tobias-liaudat, will take a look!
OK, found a couple of problems; the update seems to work now. Will create a proper branch ^^
I've opened branch
|
We can probably try to train the MNIST model first, with the restore and save parts commented out.
OK, so I tried something here, with this script: https://github.com/DifferentiableUniverseInitiative/mesh/blob/u/EiffL/toy_model/examples/toy_model_gpu.sh It apparently runs, and can save and restore, but it's not clear whether it's actually training ^^" the loss function doesn't go down much.
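One way to make "is it actually training?" less of a judgment call is to compare the average loss at the start and at the end of a run. The snippet below is only a minimal sketch, not code from the repository: the function name `loss_is_decreasing`, the window size, the 5% threshold, and the two loss curves are all made up for illustration.

```python
from statistics import mean


def loss_is_decreasing(losses, window=10, min_rel_drop=0.05):
    """Return True if the mean of the last `window` losses is at least
    `min_rel_drop` (relative to the starting mean) below the mean of the
    first `window` losses."""
    if len(losses) < 2 * window:
        raise ValueError("need at least 2 * window loss values")
    start = mean(losses[:window])
    end = mean(losses[-window:])
    return (start - end) >= min_rel_drop * abs(start)


# Hypothetical loss curves, for illustration only.
flat = [2.30 - 0.0001 * i for i in range(100)]        # barely moves
training = [2.30 * (0.97 ** i) for i in range(100)]   # clearly goes down

print(loss_is_decreasing(flat))
print(loss_is_decreasing(training))
```

Averaging over a window rather than comparing single steps keeps the check from being fooled by step-to-step noise; the threshold would need tuning for a real model.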
In our current prototype of the Horovod mesh implementation, we are using code copy/pasted from the TPU SIMD implementation; it is not expected to work...
mesh/mesh_tensorflow/hvd_simd_mesh_impl.py
Line 112 in 70a13d3
This issue is to document the reimplementation of these variables using the Horovod backend. As points of reference, we can look at how these variables are implemented in both the Device Placement impl and the SIMD impl.
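For orientation, here is a rough sketch of the bookkeeping a laid-out variable needs: given a rank, work out which slice of the full tensor that process should hold. This is not the mesh_tensorflow API and not the code in `hvd_simd_mesh_impl.py`; the helper names `rank_to_coords` and `slices_for_rank` are hypothetical, and it assumes row-major ordering of ranks over the mesh and tensor dimensions that divide evenly.

```python
def rank_to_coords(rank, mesh_shape):
    """Convert a flat rank into mesh coordinates (row-major order assumed)."""
    coords = []
    for size in reversed(mesh_shape):
        rank, c = divmod(rank, size)
        coords.append(c)
    return tuple(reversed(coords))


def slices_for_rank(tensor_shape, rank, mesh_shape, layout):
    """Return the slices of the full tensor owned by `rank`.

    layout[i] is the mesh axis that tensor axis i is split along,
    or None if that tensor axis is replicated on every process.
    """
    coords = rank_to_coords(rank, mesh_shape)
    slices = []
    for axis, mesh_axis in enumerate(layout):
        if mesh_axis is None:
            slices.append(slice(None))  # replicated: keep the whole axis
        else:
            n_parts = mesh_shape[mesh_axis]
            if tensor_shape[axis] % n_parts != 0:
                raise ValueError("tensor axis must divide evenly (assumption)")
            size = tensor_shape[axis] // n_parts
            start = coords[mesh_axis] * size
            slices.append(slice(start, start + size))
    return tuple(slices)


# A 4x4 tensor on a 2x2 mesh: rows split along mesh axis 0, columns replicated.
# In a Horovod job the rank would come from hvd.rank() instead of a loop.
for rank in range(4):
    print(rank, slices_for_rank((4, 4), rank, (2, 2), (0, None)))
```

The returned tuple can be used directly to index an array holding the full value, so each process would initialise or assign only its own slice.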