gradients_memory requires more memory than tf.Optimizer.minimize #11

Open
gchlebus opened this issue Jan 22, 2018 · 12 comments

@gchlebus

I would like to use the memory saving gradients to train a U-net model with bigger patches and/or an increased batch size. I implemented a toy example to assess the memory usage when switching from tf.Optimizer.minimize to the memory saving gradients: https://github.com/gchlebus/gchlebus.github.io/blob/ca55f92d816ebe4659721b61e1a1f4f3b5c3e4f1/code/profiling-tf-models/u_net.py

Surprisingly, I found that the memory saving gradients require more memory than tf.Optimizer.minimize, but less than tf.gradients. I queried the peak memory usage using mem_util.py (see the sketch after the list below).
Memory usage:

  • tf.train.AdamOptimizer().minimize(loss): 75 MB
  • tf.gradients(loss, tf.trainable_variables()) + optimizer.apply_gradients(): 107 MB
  • gradients_memory(loss, tf.trainable_variables()) + optimizer.apply_gradients(): 96 MB
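For reference, a rough sketch of how these peak numbers can be queried; sess and train_op are assumed to come from the linked u_net.py example, and mem_util.peak_memory is the helper from this repository's mem_util.py (treat the exact call as an assumption):

import tensorflow as tf
import mem_util  # mem_util.py from this repository

run_metadata = tf.RunMetadata()
sess.run(train_op,
         options=tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE),
         run_metadata=run_metadata)
# peak_memory() aggregates the allocator records in run_metadata per device.
print(mem_util.peak_memory(run_metadata))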

I have two questions:

  1. How come the memory saving gradients require more memory than tf.train.AdamOptimizer.minimize? Am I using them wrongly?
  2. Why does the peak memory usage differ between the 1st and 2nd bullet points? I thought that the minimize function does tf.gradients + optimizer.apply_gradients().

I would greatly appreciate your feedback.

@yaroslavvb
Collaborator

RE: Why doesn't gradients_memory save any memory?

The memory strategy heuristic works by selecting articulation points. This seems to be the wrong approach for U-Net: the main part of the network doesn't have any articulation points.

[screenshot of the U-Net architecture diagram]

There are probably some articulation points on the edges of the network that the heuristic chooses. A bad choice of checkpoints can result in a strategy that uses more memory than the original graph, so I wouldn't use gradients_memory here, but instead use manual checkpoints. The choice of checkpoints for this network needs some thought.
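For illustration, a minimal sketch of the manual-checkpoint route with memory_saving_gradients from this repository; manual_checkpoints stands for a hypothetical hand-picked list of activation tensors from the U-Net:

import tensorflow as tf
import memory_saving_gradients

# manual_checkpoints: hand-picked activation tensors to keep in memory;
# everything between them is recomputed during the backward pass.
grads = memory_saving_gradients.gradients(
    loss, tf.trainable_variables(), checkpoints=manual_checkpoints)
train_op = tf.train.AdamOptimizer().apply_gradients(
    zip(grads, tf.trainable_variables()))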

As to why apply_gradients takes 25MB more memory than minimize, maybe it's something to do with TensorFlow's internal optimizers, which also rewrite things for improved memory usage. You could use mem_util to plot a timeline of tensors and figure out the difference. You could also turn off the graph optimizers as below:

import tensorflow as tf
from tensorflow.core.protobuf import rewriter_config_pb2  # needed for RewriterConfig.OFF

def create_session():
  # Disable TF graph optimizations so measurements reflect the graph as written.
  optimizer_options = tf.OptimizerOptions(opt_level=tf.OptimizerOptions.L0)
  config = tf.ConfigProto(operation_timeout_in_ms=150000,
                          graph_options=tf.GraphOptions(optimizer_options=optimizer_options))
  config.graph_options.rewrite_options.constant_folding = rewriter_config_pb2.RewriterConfig.OFF
  config.graph_options.place_pruned_graph = True
  return tf.Session(config=config)

@gchlebus
Author

gchlebus commented Jan 22, 2018

I dug a bit into the minimize function. It turns out that the difference in peak memory consumption between minimize and tf.gradients is caused by the fact that minimize (or, more precisely, compute_gradients, which is called internally) calls tf.gradients with gate_gradients=True. This is not the case when calling tf.gradients directly. Moreover, calling gradients_memory(loss, tf.trainable_variables(), gate_gradients=True) results in 66 MB peak memory usage, which is indeed the lowest score.
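In code, that last configuration looks roughly like this (loss is assumed to come from the linked u_net.py example; gate_gradients is simply forwarded to the underlying tf.gradients call):

import tensorflow as tf
import memory_saving_gradients

optimizer = tf.train.AdamOptimizer()
# gate_gradients=True mirrors what Optimizer.compute_gradients() passes to
# tf.gradients internally.
grads = memory_saving_gradients.gradients_memory(
    loss, tf.trainable_variables(), gate_gradients=True)
train_op = optimizer.apply_gradients(zip(grads, tf.trainable_variables()))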

I would be interested to know whether a manual selection of checkpoints in the U-net architecture would allow the peak memory usage to be reduced even further. How would you choose the checkpoints?

@gchlebus
Author

I did a peak memory vs. batch size plot for the U-Net model using tf.gradients and gradients_memory. I found the slope increase at batch size = 3 for gradients_memory interesting. Could it be that the automatic checkpoint selection depends on the batch size?
[plot: peak memory vs. batch size for tf.gradients and gradients_memory]

@yaroslavvb
Collaborator

Nope, automatic selection depends on the layout of the computation graph, and batch size doesn't change the computation graph (it just changes the size of individual nodes).

@netheril96

So why doesn't OpenAI implement a strategy similar to swap_memory for memory_saving_gradients? I'd wager that swapping GPU memory to and from the host is faster than recomputation.

@yaroslavvb
Collaborator

@netheril96 swapping is slow; it's 7-10x faster to recompute on the GPU for most ops.

@danieltudosiu

@gchlebus I am working with a VAE which is roughly the same as the U-Net. I was wondering where you put the checkpoints? Thanks!

@netheril96

@yaroslavvb May I ask a tangential question?

What tool did you use to create that U-Net graph? It looks awesome, so I want to learn to use that tool too.

@yaroslavvb
Collaborator

@netheril96 That one I just screenshotted from the U-Net paper. Not sure what tool they used for it, but it could be done easily in OmniGraffle, which is what I used for the diagrams in the blog post.

@netheril96

@yaroslavvb Oh. I was hoping for an automatic tool to generate beautiful graphs from code. TensorBoard visualizations are too ugly. Thanks anyway.

@gchlebus
Author

@gchlebus I am working with a VAE which is roughly the same as the U-Net. I was wondering where you put the checkpoints? Thanks!

As far as I remember, I put one checkpoint at the lowest U-Net level. This made no difference in terms of speed or memory consumption compared to the default checkpoint locations.
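Roughly, assuming bottleneck is the output tensor of that lowest level, this amounted to something like:

grads = memory_saving_gradients.gradients(
    loss, tf.trainable_variables(), checkpoints=[bottleneck])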

@kuonb

kuonb commented May 9, 2019

@gchlebus how did you add the checkpoints? I am trying this:

output = Block(nfi=64, fs=(5,5,5))(prev_output) # Block with 3D convolutions
tf.add_to_collection('checkpoints', output)

But when I assign tf.__dict__["gradients"] = memory_gradients it does not find anything and raises an Exception.
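For reference, the explicit (non-monkey-patched) form of what I am trying would, as far as I understand the API, look roughly like this:

import tensorflow as tf
import memory_saving_gradients

# gradients_collection reads the tensors added via
# tf.add_to_collection('checkpoints', ...) and recomputes everything else.
grads = memory_saving_gradients.gradients_collection(loss, tf.trainable_variables())
train_op = tf.train.AdamOptimizer().apply_gradients(
    zip(grads, tf.trainable_variables()))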
