
The code may fail to transfer grads from TPU core memory to CPU host memory #1

Open
shawwn opened this issue Jun 29, 2021 · 2 comments


shawwn commented Jun 29, 2021

Hiya! I was talking with Skye over at jax-ml/jax#2108 (comment):

one issue with the code snippet you have at the top for transferring grads from TPU to CPU memory: device_put is a no-op inside jit (I think there may be an issue about making this less of a gotcha, but I can't find it now). Using device_put with a CPU device is the right idea though, it just has to happen outside of a compiled function.

The code snippet Skye is referring to was copied from this codebase. So I thought I should open an issue, because it sounds like that code is nonfunctional.

I’m posting this from my phone, but I’ll add more details in a little while.


shawwn commented Jun 29, 2021

Some extra details, as promised.

In swarm_layer.py:

@partial(jax.jit, static_argnums=3)
def opt_jit(grad_acc, opt_state, params, optimizer):
    total_grad = jax.tree_map(lambda x: jnp.mean(x, axis=0), grad_acc)
    cpu_device = jax.devices("cpu")[0]
    total_grad = jax.device_put(total_grad, device=cpu_device)
    cpu_params = jax.device_put(jax.tree_map(lambda x: x[0], params), device=cpu_device)

I don't think this code is functioning properly. According to Skye, device_put is a no-op inside jit, so I assume each of those calls to device_put() has no effect and the grads and params never actually leave the TPU.
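
To illustrate the gotcha in isolation, here's a minimal standalone sketch (my own toy example, not code from this repo; the comments are just my reading of Skye's explanation):

import jax
import jax.numpy as jnp

cpu_device = jax.devices("cpu")[0]

@jax.jit
def put_on_cpu(x):
    # Inside jit this call gets traced away, so it doesn't actually move the
    # result into host memory.
    return jax.device_put(x, device=cpu_device)

x = jnp.arange(4.0)
y = put_on_cpu(x)                         # y still lives on the default backend (e.g. a TPU core)
z = jax.device_put(y, device=cpu_device)  # outside of jit, this really does transfer to host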

I'm not experienced enough with Jax to know the best way to fix the problem. What do you think the right solution is?

Commenting out @partial(jax.jit, static_argnums=3) seems like the most straightforward "solution." But I don't know anything about Jax's JIT (yet), so I don't know if that makes any sense, or what the tradeoffs are.
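
Alternatively, here's a rough sketch of what Skye's suggestion might look like applied to opt_jit (untested, and the reduce_grads/opt_step split and names are just mine): keep the gradient reduction jitted, and do the device_put calls outside the compiled function.

import jax
import jax.numpy as jnp

@jax.jit
def reduce_grads(grad_acc):
    # Average the accumulated gradients across the accumulation axis, on-device.
    return jax.tree_map(lambda x: jnp.mean(x, axis=0), grad_acc)

def opt_step(grad_acc, opt_state, params, optimizer):
    cpu_device = jax.devices("cpu")[0]
    total_grad = reduce_grads(grad_acc)
    # Outside of jit, these device_put calls should actually copy the buffers to host memory.
    total_grad = jax.device_put(total_grad, device=cpu_device)
    cpu_params = jax.device_put(jax.tree_map(lambda x: x[0], params), device=cpu_device)
    # ... the rest of opt_jit (the optimizer update) would go here.

No idea whether the extra dispatch and host round-trip are acceptable performance-wise here, though.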

kingoflolz (Owner) commented

@shawwn hrm, thanks for letting me know. I was originally trying to do CPU offload of the optimizer parameters to fit a bigger model on the TPU, but the better way to do that now would be to integrate part of mesh transformer jax to perform model parallel sharding within the 8 TPU devices. I think I'll leave it for now and refactor if/when I get around to it, at which point this entire approach would be unnecessary.
