RuntimeError: hit nan for variance_normalized #30

Open
gcp opened this issue Aug 31, 2021 · 7 comments
Comments

gcp commented Aug 31, 2021

Calling Ranger21 with mostly default parameters:

    optimizer = ranger21.Ranger21(
        net.parameters(), lr=0.001, num_epochs=50, weight_decay=1e-5,
        num_batches_per_epoch=len(train_loader)
    )

Training seems fine for half a day with decent progress on all loss metrics, but then halts:

File "./train_pt.py", line 727, in <module>
    main(sys.argv[1:])
  File "./train_pt.py", line 612, in main
    optimizer.step()
  File "/home/morbo/git/sjeng/train/venv19/lib/python3.8/site-packages/torch/optim/optimizer.py", line 88, in wrapper
    return func(*args, **kwargs)
  File "/home/morbo/git/sjeng/train/venv19/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/home/morbo/git/Ranger21/ranger21/ranger21.py", line 714, in step
    raise RuntimeError("hit nan for variance_normalized")
RuntimeError: hit nan for variance_normalized
swarmt commented Sep 12, 2021

Am also seeing this.

gcp commented Sep 13, 2021

To be fair, I'm now also seeing this with Facebook's MADGRAD, so I wonder whether Adam/MADGRAD are just more likely to trigger this kind of divergence, or whether a bug slipped into the training data.

Basically, one of the loss values goes NaN, and this causes the optimizer to fail instantly (I guess SGD just recovers if that happens).
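
For reference, a minimal sketch of guarding against this in a plain PyTorch training loop (assuming the usual `model`, `criterion`, `optimizer`, and `train_loader` objects; purely illustrative, not from the thread):

    import torch

    for inputs, targets in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        # If the loss has already gone NaN/inf, skip this update instead of
        # letting the optimizer fold the bad gradients into its state.
        if not torch.isfinite(loss):
            continue
        loss.backward()
        optimizer.step()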

swarmt commented Sep 13, 2021

Reducing my learning rate solved it.

@TomStarshak

I've had the same issue. Reducing the learning rate did help, but even 1e-5 with default parameters and 1e-6 with madgrad still gave NaN loss values. Curious if there's something else I can do.

dnhkng commented Sep 24, 2021

I've just hit it too :(

swarmt commented Sep 29, 2021

I found my error. I had some training data with values way outside my expected range of 0-1, which I found by adding an assert in my dataloader.
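
For anyone hitting the same thing, a minimal sketch of that kind of range check, assuming a map-style torch Dataset (class and field names here are illustrative, not from the thread):

    import torch
    from torch.utils.data import Dataset

    class RangeCheckedDataset(Dataset):
        def __init__(self, samples, targets):
            self.samples = samples  # tensors expected to lie in [0, 1]
            self.targets = targets

        def __len__(self):
            return len(self.samples)

        def __getitem__(self, idx):
            x = self.samples[idx]
            # Fail loudly on bad training data here, instead of letting it
            # surface much later as a NaN inside the optimizer.
            assert torch.isfinite(x).all(), f"non-finite values in sample {idx}"
            assert 0.0 <= x.min() and x.max() <= 1.0, f"sample {idx} outside [0, 1]"
            return x, self.targets[idx]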

Sopel97 commented Mar 17, 2022

I integrated ranger21 into https://github.com/glinscott/nnue-pytorch and am exploring different parameters. I'm hitting this issue consistently after the first step of training.

This is what I'm using:

    optimizer = ranger21.Ranger21(
        train_params,
        lr=8.75e-4, betas=(0.9, 0.999), eps=1.0e-7,
        using_gc=False, using_normgc=False,
        weight_decay=0,
        num_batches_per_epoch=int(self.epoch_size / self.batch_size),
        num_epochs=self.max_epochs,
        warmdown_active=False, use_warmup=False,
        use_adaptive_gradient_clipping=False,
        softplus=False,
        use_madgrad=True,
        pnm_momentum_factor=0.0)

Changing lr, eps, weight_decay, use_adaptive_gradient_clipping, and use_warmup appears to have no effect. The NaN comes from the forward pass in the second step, so some weights must become NaN after the first update. The Adam and AdaBelief cores work fine.
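
A quick way to confirm that diagnosis is to check the parameters right after the update; a sketch, assuming access to the usual `model` and `optimizer` objects (not part of the nnue-pytorch trainer):

    import torch

    optimizer.step()
    # Report which parameters went non-finite immediately after the first
    # update, before the next forward pass turns them into a NaN loss.
    for name, param in model.named_parameters():
        if not torch.isfinite(param).all():
            print(f"non-finite values in parameter: {name}")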
