
Perform gradient clipping on global batch when using gradient accumulation #9

Open
wants to merge 6 commits into base: main

Conversation

@ashors1 (Contributor) commented Feb 14, 2023

Refactoring to allow gradient clipping to be performed on the full batch rather than on subbatches when using ShardedStaticAccumulator. Note that this refactor maintains support for enable_skip_step_on_gradient_anomalies. When using ShardedStaticAccumulator with x subbatches, it requires x+1 gradient norm calculations per global batch (one per subbatch to determine whether the step should be skipped, plus one when applying gradient clipping in the base optimizer update) and one gradient clip per global batch.

This PR should be taken together with the corresponding Praxis PR.
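
For illustration, here is a minimal, standalone sketch of the intended flow (this is not the Paxml/Praxis implementation; names such as accumulate_and_clip, subbatch_grads, and skip_norm_value are hypothetical): the gradient norm is still computed once per subbatch for the skip-step check, the subbatch gradients are accumulated into the global-batch gradient, and a single clip by global norm is applied when the optimizer update runs.

    import jax
    import jax.numpy as jnp
    import optax

    def grad_norm(grads):
      # Global L2 norm over a gradient pytree.
      leaves = jax.tree_util.tree_leaves(grads)
      return jnp.sqrt(sum(jnp.sum(jnp.square(g)) for g in leaves))

    def accumulate_and_clip(subbatch_grads, clip_value, skip_norm_value):
      """Accumulates subbatch gradients, then clips once on the global batch."""
      num_sub_batches = len(subbatch_grads)

      # One grad-norm computation per subbatch, used only for the
      # skip-step-on-gradient-anomaly check (x norms for x subbatches).
      valid_step = jnp.array(True)
      for g in subbatch_grads:
        n = grad_norm(g)
        valid_step = valid_step & jnp.isfinite(n) & (n < skip_norm_value)

      # Average the subbatch gradients to form the global-batch gradient.
      global_grads = jax.tree_util.tree_map(
          lambda *gs: sum(gs) / num_sub_batches, *subbatch_grads)

      # One more grad-norm computation (the "+1") inside the global-batch clip.
      clipper = optax.clip_by_global_norm(clip_value)
      clipped, _ = clipper.update(global_grads, clipper.init(global_grads))
      return clipped, valid_step

With x subbatches, this costs x norm computations for the skip-step check plus one more inside the clip, which is where the x+1 count above comes from.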

@zhangqiaorjc self-assigned this Mar 3, 2023
@zhangqiaorjc added the pull ready (Used to import PR as CL) label Mar 5, 2023
@zhangqiaorjc (Member) commented:

@ashors1 sorry for the late review, could you rebase to head? I want to import it and run some internal CI, thanks!

@zhangqiaorjc (Member) commented:

There are quite a few redundant whitespace characters. Could you run a Python linter to remove them?

@zhangqiaorjc added and removed the pull ready (Used to import PR as CL) label Mar 8, 2023
if optimizer_name is None:
  optimizer_name = ''
else:
  optimizer_name = optimizer_name + '/'
Member:

I think you are missing the following code block from the original scale_gradient?

    if clip_gradient_norm_to_value is None:
      clip_gradient_norm_to_value = p.optimizer.clip_gradient_norm_to_value
    if clip_gradient_single_norm_to_value is None:
      clip_gradient_single_norm_to_value = (
          p.optimizer.clip_gradient_single_norm_to_value
      )

else:
  optimizer_name = optimizer_name + '/'
self.get_individual_grad_norms(raw_grads,
                               optimizer_name)
Member:

nit: let's not line break here; optimizer_name can go on the previous line.

Member:

Actually, can we move get_individual_grad_norms back inline? It's not used anywhere else, and it seems more consistent with the inlined global grad norm below.

if p.check_valid_step:
  # Mark the step as invalid if any gradient anomaly is detected (e.g. Nan
  # or Inf, or excessively big gradient norm).
  valid_step = self.keep_step(raw_grad_norm)
Member:

Let's move keep_step back to a free function inside get_grad_norm_valid_step rather than a new instance method?

The original code is a bit complicated; let's avoid refactoring too much, because it might make it harder to spot whether the existing logic still holds.
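
A rough standalone sketch of the suggested shape (simplified and hypothetical; the real Praxis method reads its skip threshold from the learner hparams and handles more cases):

    import jax
    import jax.numpy as jnp

    def get_grad_norm_valid_step(grads, skip_step_gradient_norm_value=0.0):
      # keep_step stays a local free function, as in the original code, rather
      # than becoming a new instance method on the learner.
      def keep_step(grad_norm):
        keep = jnp.isfinite(grad_norm)
        if skip_step_gradient_norm_value:
          keep = jnp.logical_and(
              keep, grad_norm < skip_step_gradient_norm_value)
        return keep

      # Global norm over the gradient pytree (inlined, as in the existing code).
      leaves = jax.tree_util.tree_leaves(grads)
      grad_norm = jnp.sqrt(sum(jnp.sum(jnp.square(g)) for g in leaves))
      return grad_norm, keep_step(grad_norm)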

grads, valid_step = self.scale_gradients(grads)
grad_norm, valid_step = self.get_grad_norm_valid_step(grads)

using_ga = hasattr(p.optimizer, 'num_sub_batches')
Member:

nit: let's use using_grad_accum

Most readers might not know what ga means.

@@ -588,8 +631,16 @@ def scale_gradients_by_optimizer(
   ) -> Tuple[NestedMap, JTensor]:
     optimizer_mask, default_mask = self.get_masks(var_weight_hparams)

-    all_grads, all_valid_step = self.scale_gradients(
-        jax.tree_map(lambda x, y: x * y, raw_grads, default_mask),
+    raw_grads = jax.tree_map(lambda x, y: x * y, raw_grads, default_mask)
Member:

Let's not reuse raw_grads; let's call this grads_after_mask, because you've introduced a subtle bug here: if you look at line 659 inside the auxiliary_optimizers loop, you are now combining this outer mask with the inner mask.

I would not overwrite the raw_grads variable, just:

grads_after_mask = jax.tree_map(lambda x, y: x * y, raw_grads, default_mask)
grad_norm, all_valid_step = self.get_grad_norm_valid_step(
    grads_after_mask,
    optimizer_name='main',
)

so that inside the auxiliary_optimizers loop, raw_grads is combined only with each auxiliary optimizer's mask.
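
To make the concern concrete, here is a simplified sketch (function and mask names are illustrative, not the Paxml code) of why overwriting raw_grads would combine the default mask with each auxiliary mask:

    import jax

    def scale_gradients_by_optimizer(raw_grads, default_mask, aux_masks):
      # Keep the default-masked gradients in a separate variable...
      grads_after_mask = jax.tree_util.tree_map(
          lambda g, m: g * m, raw_grads, default_mask)
      # ...so that each auxiliary optimizer masks the *original* gradients.
      aux_grads = [
          jax.tree_util.tree_map(lambda g, m: g * m, raw_grads, mask)
          for mask in aux_masks
      ]
      # If raw_grads had been overwritten with grads_after_mask above, each
      # aux_grads entry would effectively be raw_grads * default_mask * mask
      # instead of raw_grads * mask.
      return grads_after_mask, aux_grads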

@nluehr commented Jun 30, 2023

@zhangqiaorjc is there a reason this has been approved but not merged yet?

ashors1 pushed a commit to ashors1/paxml that referenced this pull request Jul 18, 2023
…kage/tensorflow-2.11.1

PiperOrigin-RevId: 524892551