[BUG][all_reduce] INVALID_ARGUMENT: You must feed a value for placeholder tensor #19246
Comments
The current situation is:
Hi @edwardyehuang, if you want to use Keras 2 with the TF 2.x backend, you need to install the tf-keras package. Also, could you please confirm the behaviour with TF 2.16, which is compatible with Keras 3? Thanks!
Thanks for the reply. I am discussing a bug in Keras 3; my code works fine in Keras 2. This issue also happens in TensorFlow 2.16.
Update: after removing [...]. @SuryanarayanaY, note that this is a bug, not a [...].
Hi @edwardyehuang, if possible could we have a reproducible code snippet for this? If this happens with TF 2.16 and Keras 3, it may need investigation. If you are working on this, I will leave it as it is for now.
A reproducible code snippet is presented below. Make sure you test it on at least 2 GPUs (and set `batch_size >= num_gpus`). Remove either [...]. The code below works fine with Keras 2.15; it only has the error in Keras 3. It looks like this is caused by the [...]. However, given my limited knowledge and time, I'm unable to provide a quick fix, so I need help.

```python
import keras
import tensorflow as tf

BATCH_SIZE = 4

tf.get_logger().setLevel('INFO')

strategy = tf.distribute.MirroredStrategy()

# Make model ##########################################################################

with strategy.scope():

    class SimpleModel(keras.Model):

        def __init__(self, name=None):
            super().__init__(name=name)

        def build(self, input_shape):
            self.l = keras.layers.Conv2D(3, (1, 1), padding='same')
            super().build(input_shape)

        def call(self, inputs, training=False):
            x = inputs
            if training:
                x = tf.distribute.get_replica_context().all_reduce(
                    tf.distribute.ReduceOp.SUM, x)
            x = self.l(x)
            return x

    m = SimpleModel()
    m.compile(
        optimizer=keras.optimizers.SGD(learning_rate=1e-3),
        loss=keras.losses.MeanSquaredError(),
    )

# Make dataset ##########################################################################

def simple_data_generator(num_samples=-1, size=(17, 17)):
    counter = 0
    while True:
        random_data = tf.random.uniform(
            shape=size,
            minval=-2,
            maxval=2,
            dtype=tf.float32,
        )
        yield random_data, random_data
        counter += 1
        if num_samples > 0 and counter >= num_samples:
            break

io_shape = (17, 17, 3)

train_dataset = tf.data.Dataset.from_generator(
    simple_data_generator,
    args=(-1, io_shape),
    output_signature=(
        tf.TensorSpec(shape=io_shape, dtype=tf.float32),
        tf.TensorSpec(shape=io_shape, dtype=tf.float32),
    ),
)
train_dataset = train_dataset.batch(BATCH_SIZE)

options = tf.data.Options()
options.experimental_distribute.auto_shard_policy = tf.data.experimental.AutoShardPolicy.DATA
train_dataset = train_dataset.with_options(options)
train_dataset = train_dataset.prefetch(tf.data.experimental.AUTOTUNE)

# Train #################################################################################

m.fit(train_dataset, epochs=1, verbose=1, steps_per_epoch=1000)
```

Note that I found Keras 3 results in 2 more [...].
Same bug when I implement a gradient accumulation feature. Can somebody help me, please?
@SuryanarayanaY I just noticed that Kaggle provides 2×T4 GPUs, so here is the code on Kaggle:
Hi @edwardyehuang, thanks for the reminder. I have replicated the issue in a multi-GPU VM environment and attached the logs below.
Any update about this? It is the only barrier for me to move to Keras 3, and I can contribute a lot of other functions after that.
As an additional update, I believe that this bug is triggered whenever applying the [...].
@fchollet @sachinprasadhs @SuryanarayanaY A humble suggestion: I believe this issue should be the top priority for the Keras team to solve. The existence of this issue makes it impossible for Keras 3 to perform correct distributed training.
Here's a smaller repro that doesn't require GPU: |
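The linked repro isn't preserved in this thread. As a stand-in, here is a hedged sketch of what a GPU-free reproduction might look like: a CPU-only variant of the earlier snippet that splits the single physical CPU into two logical devices so MirroredStrategy runs with two replicas. The device setup is an assumption, not taken from the original notebook.

```python
import keras
import tensorflow as tf

# Assumption: split the physical CPU into two logical devices so that
# MirroredStrategy has two replicas without any GPU present.
cpus = tf.config.list_physical_devices('CPU')
tf.config.set_logical_device_configuration(
    cpus[0],
    [tf.config.LogicalDeviceConfiguration(),
     tf.config.LogicalDeviceConfiguration()],
)
strategy = tf.distribute.MirroredStrategy(['/cpu:0', '/cpu:1'])

with strategy.scope():
    class SimpleModel(keras.Model):
        def __init__(self, name=None):
            super().__init__(name=name)
            self.l = keras.layers.Conv2D(3, (1, 1), padding='same')

        def call(self, inputs, training=False):
            x = inputs
            if training:
                # The collective suspected of triggering the Keras 3 error.
                x = tf.distribute.get_replica_context().all_reduce(
                    tf.distribute.ReduceOp.SUM, x)
            return self.l(x)

    m = SimpleModel()
    m.compile(optimizer=keras.optimizers.SGD(learning_rate=1e-3),
              loss=keras.losses.MeanSquaredError())

x = tf.random.uniform((8, 17, 17, 3), minval=-2, maxval=2)
m.fit(x, x, batch_size=4, epochs=1)
```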
I looked into this for a bit; my hunch is there could be a subtle difference in the way the training step is handled here: keras/keras/src/backend/tensorflow/trainer.py, line 307 at d910dcb.

@grasskin also mentioned conditionals might not work in replica contexts that call merge_call (same location: keras/keras/src/backend/tensorflow/trainer.py, line 307 at d910dcb).
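One way to probe that hunch is to run the same collective through strategy.run directly, bypassing Keras's compiled train function entirely; if a sketch like the following succeeds while model.fit fails, the trainer's tf.function wrapping becomes the likelier culprit. This is a hypothetical diagnostic, not code from the thread, and all names are illustrative.

```python
import keras
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # or the two-logical-CPU setup above

with strategy.scope():
    conv = keras.layers.Conv2D(3, (1, 1), padding='same')
    conv.build((None, 17, 17, 3))  # create variables outside the replica fn

@tf.function
def step(x):
    def replica_fn(x):
        # Same all_reduce as in SimpleModel.call, with no Keras trainer involved.
        x = tf.distribute.get_replica_context().all_reduce(
            tf.distribute.ReduceOp.SUM, x)
        return conv(x)
    return strategy.run(replica_fn, args=(x,))

# One per-replica batch per replica, built via the standard distribute API.
per_replica = strategy.experimental_distribute_values_from_function(
    lambda ctx: tf.random.uniform((2, 17, 17, 3)))
print(step(per_replica))
```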
Any updates? I just ran the Colab you provided with the latest tf + keras nightly version, and a new error appeared instead of the old one.
Hi, any updates regarding this issue? @fchollet @jeffcarp @SamanehSaadat @qlzh727 |
Is it possible that the error is related to the generators instead? See the notebooks running [...]. Note that I'm just a user, and not a multi-GPU user, so I may be wrong. On GPU there are warnings, but it seems to complete all epochs.
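To test the generator hypothesis, one could swap the generator pipeline for a tensor-backed dataset; if the error persists, the generator is ruled out. A minimal sketch, with shapes following the earlier repro (NUM_SAMPLES is an arbitrary illustrative value):

```python
import tensorflow as tf

BATCH_SIZE = 4
NUM_SAMPLES = 64

# Same shapes as the generator-based pipeline, but backed by in-memory tensors.
data = tf.random.uniform((NUM_SAMPLES, 17, 17, 3), minval=-2, maxval=2)
train_dataset = (
    tf.data.Dataset.from_tensor_slices((data, data))
    .batch(BATCH_SIZE)
    .prefetch(tf.data.AUTOTUNE)
)
# Then train exactly as before: m.fit(train_dataset, epochs=1)
```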
Your code might have a mistake: for example, you defined a Conv layer but never used it. Please refer to my comments from March 7th for the details.
The example I took from your Kaggle uses the layer, buddy.

```python
class SimpleModel(keras.Model):

    def __init__(self, name=None):
        super().__init__(name=name)
        self.l = keras.layers.Conv2D(3, (1, 1), padding='same')

    def call(self, inputs, training=False):
        x = inputs
        if training:
            x = tf.distribute.get_replica_context().all_reduce(
                tf.distribute.ReduceOp.SUM, x)
        x = self.l(x)  # <---
        return x
```

In the other example I linked, it works just fine adding that line as well, as you can see here.
It may not be related to the generators, because just now I found that my original Kaggle code also works.
Now the warning message is about "Skipping the delay kernel, which will reduce measurement accuracy," following recent TensorFlow or Keras commits. I will re-evaluate this issue later this month. |
Note that this bug may also exist on TPUs. I just tested it on a TPU Pod.
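For reference, a TPU test presumably goes through the standard TPUStrategy setup rather than MirroredStrategy. A minimal sketch; the resolver argument is an environment-specific assumption, not a detail from the thread:

```python
import tensorflow as tf

# Standard TPU initialization. tpu='' assumes a Colab/Kaggle-style environment
# where the TPU address is auto-discovered; a Pod would pass its own address.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='')
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)
# The SimpleModel repro above can then be built under strategy.scope().
```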
Well, currently, I have no idea how to debug this issue because there is no useful information.