Add gradient accumulation support for all backends, and enable optimizer EMA for JAX and torch #18951
Conversation
@qlzh727 please review this PR -- in particular, check whether the EMA variables and the gradient accumulators are updated in a way that is correct in a tf.distribute context
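For context on the feature under review, here is a minimal, hypothetical sketch of the gradient-accumulation idea (the class name, structure, and `gradient_accumulation_steps` knob are illustrative, not Keras's actual implementation): incoming gradients are averaged into accumulator variables and only applied to the weights once per cycle.

```python
import tensorflow as tf

class AccumulatingSGD:
    """Toy optimizer: average grads over N steps, apply once per cycle."""

    def __init__(self, learning_rate=0.01, gradient_accumulation_steps=4):
        self.learning_rate = learning_rate
        self.steps = gradient_accumulation_steps
        self.iterations = tf.Variable(0, dtype=tf.int64)
        self._acc_grads = None

    def apply_gradients(self, grads_and_vars):
        grads, variables = zip(*grads_and_vars)
        if self._acc_grads is None:
            self._acc_grads = [tf.Variable(tf.zeros_like(g)) for g in grads]
        # Fold the new gradient into a running average for this cycle.
        n = tf.cast(self.iterations % self.steps, tf.float32)
        for acc, g in zip(self._acc_grads, grads):
            acc.assign((acc * n + g) / (n + 1.0))
        # On the last step of the cycle, apply the averaged grads and reset.
        if (self.iterations + 1) % self.steps == 0:
            for acc, var in zip(self._acc_grads, variables):
                var.assign_sub(self.learning_rate * acc)
                acc.assign(tf.zeros_like(acc))
        self.iterations.assign_add(1)
```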
Codecov Report

```
@@            Coverage Diff             @@
##           master   #18951      +/-   ##
==========================================
+ Coverage   79.55%   79.63%   +0.07%
==========================================
  Files         337      338       +1
  Lines       35056    35182     +126
  Branches     6872     6908      +36
==========================================
+ Hits        27890    28017     +127
+ Misses       5587     5585       -2
- Partials     1579     1580       +1
```

Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
@haifeng-jin can you please review the changes in
EDIT: actually, I fixed it...
Nice!
```diff
-    def _distributed_apply_gradients_fn(
-        self, distribution, grads_and_vars, **kwargs
+    def _distributed_tf_update_step(
+        self, distribution, grads_and_vars, learning_rate
```
Seems that the learning_rate is not used here; the value in apply_grad_to_update_var is retrieved from self._get_current_learning_rate(). Did I miss anything?
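A hypothetical reconstruction of the pattern the comment describes (a simplified skeleton, not the actual diff; method bodies are stubbed):

```python
class _OptimizerSketch:
    """Hypothetical skeleton; only the distributed update path is shown."""

    def _get_current_learning_rate(self):
        ...  # returns the (possibly scheduled) learning rate

    def update_step(self, grad, var, learning_rate):
        ...  # per-variable update rule

    def _distributed_tf_update_step(
        self, distribution, grads_and_vars, learning_rate
    ):
        def apply_grad_to_update_var(var, grad):
            # Re-reads the learning rate itself, so the `learning_rate`
            # argument to this wrapper goes unused.
            lr = self._get_current_learning_rate()
            return self.update_step(grad, var, lr)

        for grad, var in grads_and_vars:
            distribution.extended.update(
                var, apply_grad_to_update_var, args=(grad,), group=False
            )
```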
```python
                grads, trainable_variables, self.learning_rate
            )
```

```python
        if self.use_ema:
```
Just curious, does the existing JAX optimizer support EMA? Or is it from the base_optimizer?
Now JAX supports the feature -- the new unit tests check it. The previous JAX optimizer (before this PR) didn't -- only TF did.
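For reference, a minimal sketch of the EMA rule such tests would exercise (an assumed form based on the optimizer's `use_ema` / `ema_momentum` options, not the exact Keras code): the optimizer keeps one shadow variable per trainable variable and blends it toward the live value after each update.

```python
# Assumed EMA update: blend each shadow variable toward its model variable.
def update_ema(ema_vars, variables, ema_momentum=0.99):
    for ema_var, var in zip(ema_vars, variables):
        ema_var.assign(ema_momentum * ema_var + (1.0 - ema_momentum) * var)
```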
```diff
@@ -7,6 +7,36 @@

 class OptimizerTest(testing.TestCase):
```
Can we also update the test in tensorflow/optimizer_distribute_test.py with distribution-related test cases (for EMA and gradient accumulation)?
```python
                for i in range(len(grads))
            ]
            for n_g_acc, g_acc in zip(new_g_accs, self._accumulated_gradients):
                g_acc.assign(n_g_acc)
```
In the tf.distribute context, I think this probably should use https://www.tensorflow.org/api_docs/python/tf/distribute/StrategyExtended#update
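A hedged sketch of what that suggestion could look like, reusing `new_g_accs`, `self._accumulated_gradients`, and `distribution` from the hunk above (`_assign_acc` is an illustrative helper, not code from the PR):

```python
# Route the accumulator write through StrategyExtended.update so it runs
# with the right replica/variable placement under tf.distribute.
def _assign_acc(acc_var, value):
    return acc_var.assign(value)

for n_g_acc, g_acc in zip(new_g_accs, self._accumulated_gradients):
    distribution.extended.update(g_acc, _assign_acc, args=(n_g_acc,))
```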
Ok, but what about self.iterations? We update it with a simple assign. When is it ok to use assign and when is it not?
I think assign works when each replica is supposed to get the same value, e.g. iterations is always the same across all replicas. I don't think that's the case for the accumulated grads (each replica should get a different value, and eventually the overall value should get a mean reduce?).
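To illustrate the distinction with a self-contained sketch (not PR code): a per-replica value has to be mean-reduced before it can be written into a single shared variable, while a replica-uniform counter like `iterations` can be assigned directly.

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

def per_replica_grad():
    # Stand-in for gradient computation: each replica produces a different
    # value, so a plain assign of this value would diverge across replicas.
    ctx = tf.distribute.get_replica_context()
    return tf.cast(ctx.replica_id_in_sync_group + 1, tf.float32)

per_replica = strategy.run(per_replica_grad)
# Mean-reduce across replicas before writing into a shared accumulator.
mean_grad = strategy.reduce(tf.distribute.ReduceOp.MEAN, per_replica, axis=None)
```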
```diff
@@ -163,30 +163,24 @@ def test_ema(self):
     def test_gradient_accumulation(self):
         with self.strategy.scope():
             v = backend.Variable([[1.0, 2.0], [3.0, 4.0]])
-            grads = backend.convert_to_tensor([[1.0, 1.0], [1.0, 1.0]])
+            grads = backend.convert_to_tensor([[1.0, 1.0], [2.0, 2.0]])
```
Thanks for adding the test. Can we test with grads that have a different value on each replica? You can create a distributed value via https://www.tensorflow.org/api_docs/python/tf/distribute/DistributedValues
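For reference, a hedged sketch of how such per-replica values can be built with that API (shapes and values are illustrative):

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

def value_fn(ctx):
    # Give each replica a different gradient tensor (illustrative values).
    return tf.fill(
        [2, 2], tf.cast(ctx.replica_id_in_sync_group + 1, tf.float32)
    )

# Yields a tf.distribute.DistributedValues with one tensor per replica.
distributed_grads = strategy.experimental_distribute_values_from_function(
    value_fn
)
```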
Best I can tell, it is not possible to call an optimizer (or any other Keras function) with DistributedValues. The reason is that DistributedValues is a stand-in for a tensor, but it does not implement the tensor API (no .shape, no .dtype, etc.).