Gradient checkpointing seems to conflict with Keras batch norm #47
It seems like a combination of ...
Currently that list filters by op name; a better approach may be to filter by op type instead, e.g. excluding anything like ...
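For illustration, type-based filtering could look something like the sketch below; the excluded type set and the way candidate ops are collected here are assumptions, not the library's actual code:

```python
import tensorflow as tf

# Assumed example: collect candidate ops from the default graph and drop
# variable-related ops by their type rather than by matching a substring of the name.
EXCLUDED_OP_TYPES = {'ReadVariableOp', 'VarHandleOp', 'VariableV2', 'Const'}

fwd_ops = [op for op in tf.get_default_graph().get_operations()
           if op.type not in EXCLUDED_OP_TYPES]
```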
Thanks for your insight, I will definitely give it a shot on the weekend. I have a layer with comparatively small input and output tensors but large intermediate tensors, which are kept in memory for backpropagation. This is what I tried to recreate in my minimal example (see the code below); I also attached an image of the graph. What I do not understand is that, even with a high number of layers (n in the code below), it does not run out of memory:

```python
import shutil
import os.path as osp

import numpy as np
import tensorflow as tf


def massive_layer(t):
    with tf.name_scope('massive_layer'):
        upsample = tf.tile(t[None, None, None], [32, 1024, 1024])
        upsample = upsample + tf.reduce_mean(upsample)
        reduce = tf.reduce_mean(upsample * 0.5)
        return reduce


var = tf.Variable(np.random.normal(size=()), dtype=tf.float32)
d = var
n = 3
for i in range(n):
    d = massive_layer(d)
grads = tf.gradients(d, var)[0]

shutil.rmtree(osp.join('/tmp', 'custom_gradients_testtb'), ignore_errors=True)
tb_saver = tf.summary.FileWriter(osp.join(
    '/tmp', 'custom_gradients_testtb',
))

with tf.Session() as s:
    s.run(tf.global_variables_initializer())
    tb_saver.add_graph(s.graph)
    print(s.run(d))
    print(s.run([d, grads]))
```
TF only stores activations if you request gradients; if you don't request them in your session.run, it will discard the activations.
I know. Still, even when requesting the gradients (see the last line of the code example), it is not running out of memory for arbitrary sizes.
TensorFlow can store the input of the grad function instead of the output, e.g. it does this for ReLU. Not sure if that's the case for tile, but your input is quite small. Try stacking a couple of layers on top of each other.
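A sketch of that suggestion, adapting the example above so that the tensors passed between layers are themselves large; the layer count, shapes, and the `big_layer` body are made up for illustration:

```python
import numpy as np
import tensorflow as tf


def big_layer(t):
    # Both the input and the output of each layer are large (32 x 1024 x 1024),
    # so the activations kept for backprop are no longer tiny scalars.
    with tf.name_scope('big_layer'):
        h = t * tf.Variable(np.random.normal(size=()), dtype=tf.float32)
        return h + tf.reduce_mean(h)


x = tf.tile(tf.Variable(1.0)[None, None, None], [32, 1024, 1024])
out = x
for _ in range(8):  # stack several layers on top of each other
    out = big_layer(out)

loss = tf.reduce_mean(out)
grads = tf.gradients(loss, tf.trainable_variables())
```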
I have the same problem using TensorFlow 1.15 and 1.14. It all behaves pretty weird between versions, e.g. installing it via pip does not work (I think this is due to how conda handles CUDA). To reproduce: this works, while gradient_memory only gives something like a 3-4x memory increase (OOM with batch size 4 for my image size, while batch size 1 works without memory-saving gradients). Setup b): changing ... to ... produces no errors, but has no effect on memory size. This probably means it is not used.
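For what it's worth, a quick sanity check along these lines might tell whether the patched gradients function is ever reached. This assumes the `tf.__dict__["gradients"]` monkey-patch described in the project's README plus a similar patch of the Keras backend module, with `gradients_memory` as the drop-in replacement; the exact patch points for a given TF version are an assumption here:

```python
import tensorflow as tf
import memory_saving_gradients
from tensorflow.python.keras import backend as K  # module tf.keras uses internally

# Assumed setup, following the README's monkey-patching idea.
tf.__dict__["gradients"] = memory_saving_gradients.gradients_memory
K.__dict__["gradients"] = memory_saving_gradients.gradients_memory

# If either check prints False, the code path that builds your training op never
# calls the patched function, which would explain "no effect on memory size".
print(tf.gradients is memory_saving_gradients.gradients_memory)
print(K.gradients is memory_saving_gradients.gradients_memory)
```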
@yaroslavvb could you have a look and see what could be done? (I don't want to switch back to the original Keras, since they will stop updating it in April 2020.)
@shsshs sorry, I haven't kept up with TF backend changes (I mostly stay in PyTorch-land nowadays), but I'll be happy to merge any PRs that fix it.
It seems that changing the word "/read" to "/Read" in line 90 and line 92 works.
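If I read that right, the filter in question drops variable-read ops from the candidate forward ops by substring-matching the op name, and with tf.keras / resource variables those ops are named `.../Read/ReadVariableOp` rather than `.../read`, so the lowercase pattern misses them. A purely illustrative sketch of such a filter (not the library's actual code at those line numbers):

```python
import tensorflow as tf

# Illustrative: candidate forward ops collected from the default graph.
fwd_ops = tf.get_default_graph().get_operations()

# Matching only '/read' misses resource-variable reads named '.../Read/ReadVariableOp';
# matching '/Read' as well (or filtering by op type) excludes those too.
fwd_ops = [op for op in fwd_ops
           if '/read' not in op.name and '/Read' not in op.name]
```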
I tried this out but get an error when computing the gradients with the provided function using manually selected checkpoints. I get three different errors at the same time, and I am not sure what part of my graph is actually causing them, so I would appreciate some hints so that I can come up with a minimal non-working example. I currently use TF 1.13.1 and in particular tf.keras.layers.BatchNormalization (just mentioning this because it pops up in the error message). Is there any hope that this will be an easy fix?
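For reference, a minimal sketch of the kind of setup described here; the model is a stand-in containing a BatchNormalization layer, and the `checkpoints=` keyword is how I understand the provided `memory_saving_gradients.gradients` function accepts manually selected tensors:

```python
import tensorflow as tf
import memory_saving_gradients

# Stand-in tf.keras model containing a BatchNormalization layer (TF 1.x graph mode).
inputs = tf.keras.Input(shape=(128,))
h = tf.keras.layers.Dense(256, activation='relu')(inputs)
h = tf.keras.layers.BatchNormalization()(h)
outputs = tf.keras.layers.Dense(10)(h)
model = tf.keras.Model(inputs, outputs)

loss = tf.reduce_mean(model.output)

# Manually selected checkpoint: keep this tensor, recompute everything else on backprop.
grads = memory_saving_gradients.gradients(loss, model.trainable_weights,
                                          checkpoints=[h])
```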