RuntimeError: hit nan for variance_normalized #30

Open
gcp opened this issue Aug 31, 2021 · 7 comments
Comments

gcp commented Aug 31, 2021

Calling Ranger21 with mostly default parameters:

    optimizer = ranger21.Ranger21(
        net.parameters(), lr=0.001, num_epochs=50, weight_decay=1e-5,
        num_batches_per_epoch=len(train_loader)
    )

Training seems fine for half a day with decent progress on all loss metrics, but then halts:

File "./train_pt.py", line 727, in <module>
    main(sys.argv[1:])
  File "./train_pt.py", line 612, in main
    optimizer.step()
  File "/home/morbo/git/sjeng/train/venv19/lib/python3.8/site-packages/torch/optim/optimizer.py", line 88, in wrapper
    return func(*args, **kwargs)
  File "/home/morbo/git/sjeng/train/venv19/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/home/morbo/git/Ranger21/ranger21/ranger21.py", line 714, in step
    raise RuntimeError("hit nan for variance_normalized")
RuntimeError: hit nan for variance_normalized
swarmt commented Sep 12, 2021

Am also seeing this.

gcp commented Sep 13, 2021

To be fair, I'm now also seeing this with Facebook's MADGRAD, so I wonder whether Adam/MADGRAD are just more likely to trigger this kind of divergence, or whether a bug slipped into the training data.

Basically, one of the loss values goes NaN, and this causes the optimizer to fail instantly (I guess SGD just recovers if that happens).
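
For reference, a minimal sketch of guarding against this in a plain PyTorch training loop (assuming the usual `model`, `criterion`, `optimizer`, and `train_loader` objects; purely illustrative, not from the thread):

    import torch

    for inputs, targets in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        # If the loss has already gone NaN/inf, skip this update instead of
        # letting the optimizer fold the bad gradients into its state.
        if not torch.isfinite(loss):
            continue
        loss.backward()
        optimizer.step()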

swarmt commented Sep 13, 2021

Reducing my learning rate solved it.

@TomStarshak

I've had the same issue. Reducing the learning rate did help, but even 1e-5 with default parameters and 1e-6 with madgrad still gave NaN loss values. Curious if there's something else I can do.

dnhkng commented Sep 24, 2021

I've just hit it too :(

swarmt commented Sep 29, 2021

I found my error. I had some training data with values way outside my expected range of 0-1, which I found by adding an assert in my dataloader.
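
For anyone hitting the same thing, a minimal sketch of that kind of range check, assuming a map-style torch Dataset (class and field names here are illustrative, not from the thread):

    import torch
    from torch.utils.data import Dataset

    class RangeCheckedDataset(Dataset):
        def __init__(self, samples, targets):
            self.samples = samples  # tensors expected to lie in [0, 1]
            self.targets = targets

        def __len__(self):
            return len(self.samples)

        def __getitem__(self, idx):
            x = self.samples[idx]
            # Fail loudly on bad training data here, instead of letting it
            # surface much later as a NaN inside the optimizer.
            assert torch.isfinite(x).all(), f"non-finite values in sample {idx}"
            assert 0.0 <= x.min() and x.max() <= 1.0, f"sample {idx} outside [0, 1]"
            return x, self.targets[idx]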

Sopel97 commented Mar 17, 2022

I integrated ranger21 into https://github.com/glinscott/nnue-pytorch and am exploring different parameters. I'm hitting this issue consistently after the first step of training.

This is what I'm using:

    optimizer = ranger21.Ranger21(
        train_params,
        lr=8.75e-4, betas=(0.9, 0.999), eps=1.0e-7,
        using_gc=False, using_normgc=False,
        weight_decay=0,
        num_batches_per_epoch=int(self.epoch_size / self.batch_size),
        num_epochs=self.max_epochs,
        warmdown_active=False, use_warmup=False,
        use_adaptive_gradient_clipping=False,
        softplus=False,
        use_madgrad=True,
        pnm_momentum_factor=0.0)

Changing lr, eps, weight_decay, use_adaptive_gradient_clipping, and use_warmup appears to have no effect. The NaN comes from the forward pass in the second step, so some weights must become NaN after the first update. The Adam and AdaBelief cores work fine.
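
A quick way to confirm that diagnosis is to check the parameters right after the update; a sketch, assuming access to the usual `model` and `optimizer` objects (not part of the nnue-pytorch trainer):

    import torch

    optimizer.step()
    # Report which parameters went non-finite immediately after the first
    # update, before the next forward pass turns them into a NaN loss.
    for name, param in model.named_parameters():
        if not torch.isfinite(param).all():
            print(f"non-finite values in parameter: {name}")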
