RuntimeError: merge_call called while defining a new graph or a tf.function. #301
Comments
Hello, @innat! I have not had the time to add multi-GPU support to GradientAccumulator, but can make an attempt at it today. However, batch training + gradient accumulation + mixed precision works seamlessly. I have been using it for various projects already.
Thanks for your response. I noticed that @stefan-falk also faced a similar error (tensorflow/tensorflow#50454), which I reported above. He tried many ways (HERE); it may give some insight. Regarding mixed precision, as I said, I was wondering if we need to call [...]. cc @MrForExample
Hmm, that's interesting. However, can't it be argued that overloading the [...]? Will start on the multi-GPU support now. Did you have a gist I could use for debugging/testing, @innat? Also note that the GradientAccumulator (without multi-GPU) also works with TPUs. But I am only able to run tests locally, as I doubt I am allowed to use multi-GPUs in a single colab session.
Here is a gist (also mentioned above).
As mentioned in the other ticket, Graphcore had a design as an optimizer wrapper including cross-replica support: /cc @georgepaw
As the error suggests, aggregating gradients inside a nested tf.function is not yet supported.
@innat Is eager mode OK for you? It has a performance cost, but it seems this works fine here.
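A minimal sketch of that eager-mode workaround, assuming a MirroredStrategy setup like the one in the gist (build_model() here is a placeholder, not from the thread): compiling with run_eagerly=True makes Keras skip tf.function tracing of train_step, which sidesteps the merge_call restriction at the cost of speed.

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    # Any model with a custom train_step that accumulates gradients.
    model = build_model()  # placeholder for the model from the gist
    model.compile(
        optimizer=tf.keras.optimizers.Adam(),
        loss="sparse_categorical_crossentropy",
        run_eagerly=True,  # run train_step eagerly; avoids the nested tf.function issue
    )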
@SuryanarayanaY |
[Info added] Optimizer.apply_gradients exposes a skip_gradients_aggregation flag:

Optimizer.apply_gradients(
    grads_and_vars, name=None, skip_gradients_aggregation=False, **kwargs
)
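For reference, a hedged sketch of how that flag can be used, assuming the TF >= 2.11 Keras optimizer API shown above (the function name apply_manually_aggregated is illustrative, not from this thread): the caller takes over cross-replica aggregation, e.g. with ReplicaContext.all_reduce, and then passes skip_gradients_aggregation=True so the optimizer does not aggregate a second time.

import tensorflow as tf

def apply_manually_aggregated(optimizer, grads, variables):
    # Sum gradients across replicas ourselves...
    replica_ctx = tf.distribute.get_replica_context()
    reduced = replica_ctx.all_reduce(tf.distribute.ReduceOp.SUM, grads)
    # ...then tell the optimizer not to aggregate them again.
    optimizer.apply_gradients(
        zip(reduced, variables),
        skip_gradients_aggregation=True,
    )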
@innat the root cause of this error is the [...]. One option is to work around it using [...]. Here's a modified version of your colab which uses this approach and seems to be working. It's probably marginally less performant than if the graph could be fully compiled with the conditional in it, but merging a subgraph which has a conditional on a synchronized variable is (I think) a fundamental limitation of running TF in distributed mode.
@ianstenbit thanks for the reply. From my gist:

Epoch 1/3
10000/10000 - 23s - loss: 0.2041 - accuracy: 0.9387
Epoch 2/3
10000/10000 - 23s - loss: 0.0937 - accuracy: 0.9708
Epoch 3/3
10000/10000 - 23s - loss: 0.0667 - accuracy: 0.9791
<keras.callbacks.History at 0x7f983006fe50>

With yours:

Epoch 1/3
10000/10000 - 68s - loss: 0.6961 - accuracy: 0.8416
Epoch 2/3
10000/10000 - 22s - loss: 0.6387 - accuracy: 0.8541
Epoch 3/3
10000/10000 - 22s - loss: 0.6387 - accuracy: 0.8541
<keras.callbacks.History at 0x7f97d41fd1d0>
@innat looks like I had a silly mistake in the line of code where I was zeroing out gradients after applying them: I had [...]. After making these changes, I got results much closer to your original ones. It occurred to me, though, that to avoid any rounding errors it's probably better to use [...].
It's still not precisely the same numerically as your original implementation. I think this may be because calling [...].
Thanks for the update. Could you please check with multiple epochs (i.e. 10)? I observe that the loss and accuracy don't change after 2 epochs. Tested with [...].
Yes, I see this behavior, and I think it's probably due to calling [...].
I think in order to correctly perform gradient accumulation, you'd likely need to subclass [...]. This seems like a constraint of [...]. @rchao to confirm.
Thanks Ian. Yes, this appears to be unsupported by tf.distribute at this time, and I would recommend filing an issue on tf.distribute if you would like such support.
Check #301 |
Here, the aim is to make it possible to execute it within custom fit (overriding the train_step).
@rchao could you please create an issue? Or, this technique should be supported: #107 cc @chenmoneygithub @4uiiurz1 I read on SO that you extended this technique for multi-GPU support. Could you please give some feedback regarding that? Thanks.
Is there a specific reason why you don't want to wrap the optimizer? The main reason why I never did that was that I failed to find a working implementation. I found quite a few attempts, some even ran (to an extent), but when running a simple benchmark, training results were quite different from regular batch training. Just now, I managed to get an optimizer wrapper working (see here). This was based on the work by @stefan-falk and @fsx950223. At least it yields extremely similar results to regular batch training. If you wish to try it out, there is a test script here, in the GradientAccumulator repo. I was unable to test multi-GPU support, as I do not have access to one until tomorrow. But I can update you on the matter, likely tomorrow. Note that right now, only SGD is supported. Will need to debug why dynamic optimizers such as Adam are not working as well as SGD. I'm not observing the same with the [...].
I don't mind using that, but I strongly prefer to override the train step. Adding a new ticket: tensorflow/tensorflow#59487
No worries. If anyone is interested in playing around with the optimizer wrapper solution, here is a gist demonstrating that the optimizer wrapping solution works with [...]. I don't have access to multiple GPUs atm, but perhaps someone else does and is interested in trying.
I quickly tested on Kaggle (2x T4 GPU) with TF 2.6.4 and got the following error.
You will not face this error in colab (with tf 2.6.4).
Oh, OK. Nice to know! Will have to do some further debugging. Cheers :] Anyways, the gist serves as a nice foundation for making a proper solution. |
I was able to reproduce the bug in Kaggle, @innat. Love that you have access to two GPUs for free on Kaggle! I've shared my Kaggle notebook here, if anyone wishes to debug this further. Any ideas would be much appreciated! It seems to work just fine with one GPU, but fails during the gradient update with multiple GPUs in MirroredStrategy. Note that switching to tf 2.8.0 yields a different error, which might be easier for some of you to unravel: [...]
@ianstenbit
Is this a possible alternative to the gradient accumulation technique? What does it mean when it says "number of batches to run during each tf.function call"? Is the corresponding gradient accumulated for each batch?

import numpy as np
import tensorflow as tf
from tensorflow import keras

strategy = tf.distribute.MirroredStrategy()  # assumed: the strategy used in the gist

class CustomModel(keras.Model):
    def train_step(self, data):
        x, y = data
        with tf.GradientTape() as tape:
            y_pred = self(x, training=True)  # Forward pass
            loss = self.compiled_loss(y, y_pred, regularization_losses=self.losses)
        print()
        print(x.shape, y.shape, tf.shape(x)[0].numpy())
        print()
        # Compute gradients
        trainable_vars = self.trainable_variables
        gradients = tape.gradient(loss, trainable_vars)
        self.optimizer.apply_gradients(zip(gradients, trainable_vars))
        self.compiled_metrics.update_state(y, y_pred)
        return {m.name: m.result() for m in self.metrics}

x = np.random.random((100, 32))
y = np.random.random((100, 1))

with strategy.scope():
    # Construct and compile an instance of CustomModel
    inputs = keras.Input(shape=(32,))
    outputs = keras.layers.Dense(1)(inputs)
    model = CustomModel(inputs, outputs)
    model.compile(
        optimizer="adam",
        loss="mse",
        metrics=["mae"],
        steps_per_execution=1,
        run_eagerly=1
    )

model.fit(
    x, y,
    validation_data=(x, y),
    epochs=1,
    batch_size=32
)

With [...]

And with [...]

It looks like a possible alternative to the gradient accumulation technique. I'd like to know what happens when [...]. Also, does [...]?
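As an aside (not part of the reply below): per the Keras docs, steps_per_execution only controls how many batches are run inside a single tf.function call; each of those batches still computes and applies its own gradient update, so on its own it is not gradient accumulation. A minimal sketch of the setting (values here are illustrative):

model.compile(
    optimizer="adam",
    loss="mse",
    metrics=["mae"],
    # Run 4 batches per tf.function call; gradients are still applied per batch,
    # so the effective batch size does not change.
    steps_per_execution=4,
)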
Hi @innat
If [...]
I think the main reason for the problem is that tensorflow does not allow control flow containing any synchronization op in the replica context wrapped by tf.function. I guess that tf.function will build a graph for each branch of the control flow, so each replica may enter a different branch, which will cause conflicts in synchronization. The key is "replica context", so switching to executing the control flow in the cross-replica context (via merge_call) avoids the error. Here is an example:

def apply_accumulated_gradients(grads_and_vars):
    # actually apply gradients logic
    pass

should_apply = ...  # a boolean flag

def apply_gradients_cross_replica(strategy, grads_and_vars):
    def _apply_fn():
        strategy.extended.call_for_each_replica(
            apply_accumulated_gradients, args=(grads_and_vars,))
    tf.cond(should_apply, _apply_fn, lambda: None)

# execute control flow with synchronization op in the cross-replica context
tf.distribute.get_replica_context().merge_call(
    apply_gradients_cross_replica, args=(grads_and_vars,))
@AIGideon |
@innat Yes, I tested that it can work perfectly with [...]. I don't know whether the keras3 implementation solves this problem, but switching to a cross-replica context from within a replica context is a very common usage in tf. I just wonder why the keras2 (tf-keras) community has been troubled by the implementation of gradient accumulation for such a long time and no solid solution has ever been given. I've seen other implementations from the community, and most of them are based on the following three approaches to avoid control flow: [...]
Back to the topic: I think the best way to implement gradient accumulation in keras2 (tf-keras) is to organize my example code above into a generic OptimizerWrapper that can receive any optimizer.
Could you please share a complete gist with your approach? |
@innat OK, I will give an example based on tensorflow==2.12.0 (which takes the new keras optimizer api under keras/optimizers/optimizer_experimental/ as the default optimizer instead of optimizer_v2):

import tensorflow as tf
from typing import Iterable, List, Tuple


class GradientAccumulationOptimizer(tf.keras.optimizers.Optimizer):
    def __init__(
        self,
        optimizer: tf.keras.optimizers.Optimizer,
        gradient_accumulation_steps: int = 1,
        name: str = 'GradientAccumulationOptimizer',
        **kwargs
    ):
        super().__init__(name=name, **kwargs)
        self.optimizer = optimizer
        self.gradient_accumulation_steps = gradient_accumulation_steps

    def apply_gradients(
        self,
        grads_and_vars: Iterable[Tuple[tf.Tensor, tf.Variable]],
        *args,
        **kwargs
    ):
        grads_and_vars = list(grads_and_vars)
        vars = [var for _, var in grads_and_vars]
        if not hasattr(self, '_built') or not self._built:
            self.build(vars)

        self.step.assign_add(1)
        should_apply = tf.equal(self.step % self.gradient_accumulation_steps, 0)

        # update accumulated gradients
        self._update_accumulated_grads(grads_and_vars)

        # apply gradients
        def _cross_replica_apply_gradients(strategy, grads_and_vars):
            def _apply_fn():
                strategy.extended.call_for_each_replica(
                    self._apply_accumulated_grads,
                    args=(grads_and_vars, *args), kwargs=kwargs)
            tf.cond(should_apply, _apply_fn, lambda: None)

        tf.distribute.get_replica_context().merge_call(
            _cross_replica_apply_gradients, args=(grads_and_vars,))

        # reset accumulated gradients if necessary
        tf.cond(should_apply, self._reset_accumulated_grads, lambda: None)
        return self.optimizer.iterations

    def _update_accumulated_grads(
        self,
        grads_and_vars: List[Tuple[tf.Tensor, tf.Variable]]
    ):
        for i, (grad, _) in enumerate(grads_and_vars):
            self.accumulated_grads[i].assign_add(grad)

    def _apply_accumulated_grads(
        self,
        grads_and_vars: List[Tuple[tf.Tensor, tf.Variable]],
        *args,
        **kwargs
    ):
        accumulated_grads_and_vars = [
            (
                self.accumulated_grads[i] / tf.cast(
                    self.gradient_accumulation_steps,
                    self.accumulated_grads[i].dtype),
                var
            )
            for i, (_, var) in enumerate(grads_and_vars)
        ]
        self.optimizer.apply_gradients(
            accumulated_grads_and_vars, *args, **kwargs)

    def _reset_accumulated_grads(self):
        for grad in self.accumulated_grads:
            grad.assign(tf.zeros_like(grad))

    def build(self, var_list: List[tf.Variable]):
        super().build(var_list)
        self.optimizer.build(var_list)
        self.accumulated_grads = [
            tf.Variable(
                initial_value=tf.zeros_like(var),
                trainable=False,
                aggregation=tf.VariableAggregation.NONE)
            for var in var_list
        ]
        self.step = tf.Variable(
            initial_value=0, trainable=False, dtype=tf.int64,
            aggregation=tf.VariableAggregation.ONLY_FIRST_REPLICA)
        self._built = True

You can use it to wrap any optimizer like [...]. I haven't tried later tensorflow versions, but if you use an earlier version, some modifications may be needed: [...]
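A usage sketch under stated assumptions (the wrapped optimizer, step count, and toy model below are illustrative, not from the comment): wrap any built-in optimizer and pass the wrapper to compile; accumulated gradients are then averaged and applied every gradient_accumulation_steps batches.

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    base_opt = tf.keras.optimizers.Adam(learning_rate=1e-3)
    optimizer = GradientAccumulationOptimizer(
        base_opt, gradient_accumulation_steps=4)  # accumulate 4 micro-batches

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer=optimizer, loss="mse")

# Train as usual; the wrapper applies the averaged gradients every 4th step.
# model.fit(x, y, epochs=3, batch_size=32)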
@AIGideon |
Thanks to @AIGideon, @innat and @andreped. I could implement GAOptimizer by modifying @AIGideon's code and referring to @andreped's implementation.
<ga.py>
<test_ga.py>
System information.

Describe the problem
I have code that works fine but gives the following error if I use with strategy.scope().

Describe the expected behavior
I think it should work.

Standalone code to reproduce the issue
The code is for gradient accumulation techniques. Here it is done by overriding the train_step with the fit method. This code works fine (as said above) without with strategy.scope(). Now, I would like to use it for multi-GPU cases, and so I use the strategy scope but ended up with the above-mentioned error. Gist.

Follow-up Questions
Is BATCH_SIZE = 32 * strategy.num_replicas_in_sync needed inside the train_step method, or will it be handled automatically?
For mixed precision, do we need to wrap the optimizer with LossScaleOptimizer and use optimizer.get_scaled_loss(loss) and optimizer.get_unscaled_gradients(gradients)?
But the official documentation talks about normal fit and custom-loop training cases. In the case of a custom loop, it is suggested to wrap the optimizer and scale the loss and gradients, but what about the combination of fit and a custom loop (overriding train_step)? Does it still need to wrap the optimizer and scale the loss and gradients, or will that be handled by the API?

Others: #107 cc @chenmoneygithub @nikitamaia @bhack
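For reference, a hedged sketch of the loss-scaling pattern the mixed precision guide describes for custom training steps; whether fit applies it automatically when train_step is overridden is exactly the open question above, so treat this as an assumption-laden illustration (the class name GAModel is made up), not the confirmed answer.

import tensorflow as tf
from tensorflow import keras

keras.mixed_precision.set_global_policy("mixed_float16")

class GAModel(keras.Model):  # illustrative name, not from the issue
    def train_step(self, data):
        x, y = data
        with tf.GradientTape() as tape:
            y_pred = self(x, training=True)
            loss = self.compiled_loss(y, y_pred)
            # Scale the loss so float16 gradients do not underflow.
            # self.optimizer is a LossScaleOptimizer here because compile
            # wraps it automatically under the mixed_float16 policy.
            scaled_loss = self.optimizer.get_scaled_loss(loss)
        scaled_grads = tape.gradient(scaled_loss, self.trainable_variables)
        # Unscale before (accumulating and) applying the gradients.
        grads = self.optimizer.get_unscaled_gradients(scaled_grads)
        self.optimizer.apply_gradients(zip(grads, self.trainable_variables))
        self.compiled_metrics.update_state(y, y_pred)
        return {m.name: m.result() for m in self.metrics}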