Adding support for hvd LaidOutVariable #1

EiffL · 2021-05-12T10:37:35Z

In our current prototype of the horovod mesh implementation, we are using code copy-pasted from the TPU SIMD implementation; it is not expected to work...

mesh/mesh_tensorflow/hvd_simd_mesh_impl.py

Line 112 in 70a13d3

class LaidOutVariable(object):

This issue is to document the reimplementation of these variables using the horovod backend. As points of reference, we can look at how these variables are implemented in both the Device Placement impl and the SIMD impl.

EiffL · 2021-05-12T10:40:10Z

The goal here would be to train the MNIST demo model with the new backend implementation: https://github.com/DifferentiableUniverseInitiative/mesh/blob/master/examples/mnist.py

tobias-liaudat · 2021-05-27T20:29:15Z

Hi @EiffL, I'm leaving this comment as a checkpoint of where we were when we finished the IDRIS hackathon.

With @b-remy, we were trying to solve this issue but did not have time to finish. I uploaded my progress to the branch tob_vars, and Benjamin uploaded his to the branch ben-variable.

With our current implementation we were able to run a forward pass of a simple dense network. However, when we tried to optimise the network, we got an error, and we're not sure of its origin or how to solve it.

The error we found is the one below; it is always related to the Assign and the optimisation.

Traceback (most recent call last):
  File "optim_demo.py", line 174, in <module>
    (type(obj).__name__, types_str))
TypeError: Can not convert a Assign into a Tensor or Operation.

TypeError: Fetch argument <mesh_tensorflow.ops.Assign object at 0x14d1fcb94c50> has invalid type <class 'mesh_tensorflow.ops.Assign'>, must be a string or Tensor. (Can not convert a Assign into a Tensor or Operation.)

To reproduce the error one can run this test script with this job script. The implementation of the LaidOutVariable is here.

EiffL · 2021-05-27T20:34:43Z

Thanks so much @tobias-liaudat, will take a look!

EiffL · 2021-05-27T21:23:32Z

ok found a couple of problems, the update seems to work now, will create a proper branch ^^

EiffL · 2021-05-27T21:54:25Z

I've opened branch variables starting from tob_vars, and cleaned it up a bit.
Problems were:

lowering needs to happen after computing the mesh gradients and building the update ops, otherwise the update ops are not registered in the lowering
a small problem in the assignment of slices
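To make the first point concrete, here is a toy sketch in plain Python. It does not use the mesh_tensorflow API: `ToyGraph`, `ToyOp` and `ToyLowering` are hypothetical stand-ins, built on the assumption that a lowering only records the ops that exist in the graph at the moment it is constructed. Under that assumption, an update op created after the lowering cannot be looked up in it.

```python
class ToyOp:
    """Hypothetical stand-in for a graph operation (not a mesh_tensorflow class)."""

    def __init__(self, name):
        self.name = name


class ToyGraph:
    """Hypothetical stand-in for a graph that keeps an ordered list of ops."""

    def __init__(self):
        self.operations = []

    def add(self, name):
        op = ToyOp(name)
        self.operations.append(op)
        return op


class ToyLowering:
    """Assumption: lowering snapshots the ops present when it is constructed."""

    def __init__(self, graph):
        self.operations = {op: "lowered/" + op.name for op in graph.operations}

    def lowered_operation(self, op):
        # Raises KeyError for any op added to the graph after lowering.
        return self.operations[op]


def build_forward(graph):
    return graph.add("loss")


def build_update(graph):
    graph.add("gradients")
    return graph.add("assign")


# Wrong order: lower first, then build gradients and the update op.
graph = ToyGraph()
build_forward(graph)
early_lowering = ToyLowering(graph)
update_op = build_update(graph)
try:
    early_lowering.lowered_operation(update_op)
    early_ok = True
except KeyError:
    early_ok = False

# Right order: build gradients and the update op, then lower.
graph = ToyGraph()
build_forward(graph)
update_op = build_update(graph)
late_lowering = ToyLowering(graph)
late_result = late_lowering.lowered_operation(update_op)

print(early_ok, late_result)
```

In the wrong order the lookup of the update op fails; in the right order the lowering returns something that could be handed to the session instead of the un-lowered Assign object.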

EiffL · 2021-05-27T22:00:22Z

probably, we can first try to train the mnist model by commenting out the restore and save parts

EiffL · 2021-05-28T14:53:58Z

oookkkkk so I tried something here:
https://github.com/DifferentiableUniverseInitiative/mesh/tree/u/EiffL/toy_model

with this script https://github.com/DifferentiableUniverseInitiative/mesh/blob/u/EiffL/toy_model/examples/toy_model_gpu.sh

It apparently runs and can save and restore, but it's not clear if it's actually training ^^" The loss function doesn't go down much.

EiffL added the horovod Issues related to the horovod backend label May 12, 2021

EiffL mentioned this issue May 12, 2021

Implementation of Horovod backend in Mesh TensorFlow DifferentiableUniverseInitiative/IDRIS-hackathon#3

Open

5 tasks

EiffL mentioned this issue May 27, 2021

Adding support for LaidOutVariables #9

Open

2 tasks
