gradients_memory requires more memory than tf.Optimizer.minimize #11

Open
gchlebus opened this issue Jan 22, 2018 · 12 comments

@gchlebus

I would like to use the memory saving gradients to train a U-net model with bigger patches and/or an increased batch size. I implemented a toy example to assess the memory usage when switching from tf.Optimizer.minimize to the memory saving gradients: https://github.com/gchlebus/gchlebus.github.io/blob/ca55f92d816ebe4659721b61e1a1f4f3b5c3e4f1/code/profiling-tf-models/u_net.py

Surprisingly, I found that the memory saving gradients require more memory than tf.Optimizer.minimize, but less than tf.gradients. I queried the peak memory usage using mem_util.py (see the sketch after the list below).
Memory usage:

  • tf.train.AdamOptimizer().minimize(loss): 75 MB
  • tf.gradients(loss, tf.trainable_variables()) + optimizer.apply_gradients(): 107 MB
  • gradients_memory(loss, tf.trainable_variables()) + optimizer.apply_gradients(): 96 MB
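For reference, a rough sketch of how these peak numbers can be queried; sess and train_op are assumed to come from the linked u_net.py example, and mem_util.peak_memory is the helper from this repository's mem_util.py (treat the exact call as an assumption):

import tensorflow as tf
import mem_util  # mem_util.py from this repository

run_metadata = tf.RunMetadata()
sess.run(train_op,
         options=tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE),
         run_metadata=run_metadata)
# peak_memory() aggregates the allocator records in run_metadata per device.
print(mem_util.peak_memory(run_metadata))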

I have two questions:

  1. How come the memory saving gradients require more memory than tf.train.AdamOptimizer.minimize? Am I using them wrongly?
  2. Why does the peak memory usage differ between the 1st and 2nd bullet points? I thought that the minimize function does tf.gradients + optimizer.apply_gradients().

I would greatly appreciate your feedback.

@yaroslavvb
Collaborator

RE: Why doesn't gradients_memory save any memory?

The memory strategy heuristic works by selecting articulation points. This seems to be the wrong approach for U-Net: the main part of the network doesn't have any articulation points.

[screenshot of the U-Net architecture diagram]

There are probably some articulation points on the edges of the network that the heuristic chooses. A bad choice of checkpoints can result in a strategy that uses more memory than the original graph, so I wouldn't use gradients_memory here, but instead use manual checkpoints. The choice of checkpoints for this network needs some thought.
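For illustration, a minimal sketch of the manual-checkpoint route with memory_saving_gradients from this repository; manual_checkpoints stands for a hypothetical hand-picked list of activation tensors from the U-Net:

import tensorflow as tf
import memory_saving_gradients

# manual_checkpoints: hand-picked activation tensors to keep in memory;
# everything between them is recomputed during the backward pass.
grads = memory_saving_gradients.gradients(
    loss, tf.trainable_variables(), checkpoints=manual_checkpoints)
train_op = tf.train.AdamOptimizer().apply_gradients(
    zip(grads, tf.trainable_variables()))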

As to why apply_gradients takes 25MB more memory than minimize, maybe it's something to do with TensorFlow's internal optimizers, which also rewrite things for improved memory usage. You could use mem_util to plot a timeline of tensors and figure out the difference. You could also turn off the graph optimizers as below:

import tensorflow as tf
from tensorflow.core.protobuf import rewriter_config_pb2  # needed for RewriterConfig.OFF

def create_session():
  # Disable TF graph optimizations so measurements reflect the graph as written.
  optimizer_options = tf.OptimizerOptions(opt_level=tf.OptimizerOptions.L0)
  config = tf.ConfigProto(operation_timeout_in_ms=150000,
                          graph_options=tf.GraphOptions(optimizer_options=optimizer_options))
  config.graph_options.rewrite_options.constant_folding = rewriter_config_pb2.RewriterConfig.OFF
  config.graph_options.place_pruned_graph = True
  return tf.Session(config=config)

@gchlebus
Author

gchlebus commented Jan 22, 2018

I dug a bit into the minimize function. It turns out that the difference in peak memory consumption between minimize and tf.gradients is caused by the fact that minimize (or, more precisely, compute_gradients, which is called internally) calls tf.gradients with gate_gradients=True. This is not the case when calling tf.gradients directly. Moreover, calling gradients_memory(loss, tf.trainable_variables(), gate_gradients=True) results in 66 MB peak memory usage, which is indeed the lowest score.
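In code, that last configuration looks roughly like this (loss is assumed to come from the linked u_net.py example; gate_gradients is simply forwarded to the underlying tf.gradients call):

import tensorflow as tf
import memory_saving_gradients

optimizer = tf.train.AdamOptimizer()
# gate_gradients=True mirrors what Optimizer.compute_gradients() passes to
# tf.gradients internally.
grads = memory_saving_gradients.gradients_memory(
    loss, tf.trainable_variables(), gate_gradients=True)
train_op = optimizer.apply_gradients(zip(grads, tf.trainable_variables()))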

I would be interested to know whether a manual selection of checkpoints in the U-net architecture would allow the peak memory usage to be reduced even further. How would you choose the checkpoints?

@gchlebus
Author

I did a peak memory vs. batch size plot for the U-Net model using tf.gradients and gradients_memory. I found the slope increase at batch size = 3 for gradients_memory interesting. Could it be that the automatic checkpoint selection depends on the batch size?
[plot: peak memory vs. batch size for tf.gradients and gradients_memory]

@yaroslavvb
Collaborator

Nope, automatic selection depends on the layout of the computation graph, and batch size doesn't change the computation graph (it just changes the size of individual nodes).

@netheril96

So why doesn't OpenAI implement a strategy similar to swap_memory for memory_saving_gradients? I'd wager that swapping GPU memory to and from the host is faster than recomputation.

@yaroslavvb
Collaborator

@netheril96 swapping is slow; it's 7-10x faster to recompute on the GPU for most ops.

@danieltudosiu

@gchlebus I am working with a VAE which is roughly the same as the U-Net. I was wondering where you put the checkpoints? Thanks!

@netheril96

@yaroslavvb May I ask a tangential question?

What tool did you use to create that U-Net graph? It looks awesome, so I want to learn to use that tool too.

@yaroslavvb
Collaborator

@netheril96 That one I just screenshotted from the U-Net paper. Not sure what tool they used for it, but it could be done easily in OmniGraffle, which is what I used for the diagrams in the blog post.

@netheril96

@yaroslavvb Oh. I was hoping for an automatic tool to generate beautiful graphs from code. TensorBoard visualizations are too ugly. Thanks anyway.

@gchlebus
Author

@gchlebus I am working with a VAE which is roughly the same as the U-Net. I was wondering where you put the checkpoints? Thanks!

As far as I remember, I put one checkpoint at the lowest U-Net level. This made no difference in terms of speed or memory consumption compared to the default checkpoint locations.
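Roughly, assuming bottleneck is the output tensor of that lowest level, this amounted to something like:

grads = memory_saving_gradients.gradients(
    loss, tf.trainable_variables(), checkpoints=[bottleneck])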

@kuonb

kuonb commented May 9, 2019

@gchlebus how did you add the checkpoints? I am trying this:

output = Block(nfi=64, fs=(5,5,5))(prev_output) # Block with 3D convolutions
tf.add_to_collection('checkpoints', output)

But when I assign tf.__dict__["gradients"] = memory_gradients it does not find anything and raises an Exception.
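For reference, the explicit (non-monkey-patched) form of what I am trying would, as far as I understand the API, look roughly like this:

import tensorflow as tf
import memory_saving_gradients

# gradients_collection reads the tensors added via
# tf.add_to_collection('checkpoints', ...) and recomputes everything else.
grads = memory_saving_gradients.gradients_collection(loss, tf.trainable_variables())
train_op = tf.train.AdamOptimizer().apply_gradients(
    zip(grads, tf.trainable_variables()))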
