Training on CPU: "invalid on input" #4

Open
fdietze opened this issue Aug 27, 2019 · 5 comments

@fdietze
fdietze commented Aug 27, 2019

Hi, I'm trying to run the parity experiment locally on my CPU:

python exps/parity.py --seq=20

But at epoch 18 I get the error "invalid on input":

Epoch 0 Test  Loss 1.3451 Err: 0.5060: 100%|████████████████████████████████| 2/2 [00:00<00:00,  4.53it/s]
TESTING SET RESULTS: Average loss: 1.3595 Err: 0.5100
Epoch 1 Train Loss 0.6839 Err: 0.4100: 100%|██████████████████████████████| 90/90 [00:05<00:00, 17.56it/s]
Epoch 1 Test  Loss 0.7007 Err: 0.5060: 100%|████████████████████████████████| 2/2 [00:00<00:00,  5.67it/s]
TESTING SET RESULTS: Average loss: 0.7005 Err: 0.5100
Epoch 2 Train Loss 0.6832 Err: 0.4100: 100%|██████████████████████████████| 90/90 [00:04<00:00, 18.22it/s]
Epoch 2 Test  Loss 0.7004 Err: 0.5060: 100%|████████████████████████████████| 2/2 [00:00<00:00,  5.39it/s]
TESTING SET RESULTS: Average loss: 0.6999 Err: 0.5100
Epoch 3 Train Loss 0.6813 Err: 0.4100: 100%|██████████████████████████████| 90/90 [00:05<00:00, 17.29it/s]
Epoch 3 Test  Loss 0.7003 Err: 0.5060: 100%|████████████████████████████████| 2/2 [00:00<00:00,  4.58it/s]
TESTING SET RESULTS: Average loss: 0.6994 Err: 0.5100
Epoch 4 Train Loss 0.6814 Err: 0.3900: 100%|██████████████████████████████| 90/90 [00:06<00:00, 14.00it/s]
Epoch 4 Test  Loss 0.7008 Err: 0.5120: 100%|████████████████████████████████| 2/2 [00:00<00:00,  4.53it/s]
TESTING SET RESULTS: Average loss: 0.7012 Err: 0.5140
Epoch 5 Train Loss 0.6823 Err: 0.4100: 100%|██████████████████████████████| 90/90 [00:06<00:00, 14.10it/s]
Epoch 5 Test  Loss 0.6999 Err: 0.5060: 100%|████████████████████████████████| 2/2 [00:00<00:00,  4.49it/s]
TESTING SET RESULTS: Average loss: 0.6996 Err: 0.5100
Epoch 6 Train Loss 0.6763 Err: 0.3800: 100%|██████████████████████████████| 90/90 [00:06<00:00, 14.06it/s]
Epoch 6 Test  Loss 0.7024 Err: 0.5120: 100%|████████████████████████████████| 2/2 [00:00<00:00,  4.52it/s]
TESTING SET RESULTS: Average loss: 0.7028 Err: 0.5130
Epoch 7 Train Loss 0.6836 Err: 0.4100: 100%|██████████████████████████████| 90/90 [00:06<00:00, 13.74it/s]
Epoch 7 Test  Loss 0.6986 Err: 0.5060: 100%|████████████████████████████████| 2/2 [00:00<00:00,  4.26it/s]
TESTING SET RESULTS: Average loss: 0.6986 Err: 0.5100
Epoch 8 Train Loss 0.6854 Err: 0.4100: 100%|██████████████████████████████| 90/90 [00:06<00:00, 13.82it/s]
Epoch 8 Test  Loss 0.6983 Err: 0.5060: 100%|████████████████████████████████| 2/2 [00:00<00:00,  4.53it/s]
TESTING SET RESULTS: Average loss: 0.6979 Err: 0.5100
Epoch 9 Train Loss 0.6882 Err: 0.4500: 100%|██████████████████████████████| 90/90 [00:06<00:00, 13.97it/s]
Epoch 9 Test  Loss 0.6986 Err: 0.5060: 100%|████████████████████████████████| 2/2 [00:00<00:00,  4.12it/s]
TESTING SET RESULTS: Average loss: 0.6974 Err: 0.5100
Epoch 10 Train Loss 0.6878 Err: 0.4100: 100%|█████████████████████████████| 90/90 [00:06<00:00, 13.93it/s]
Epoch 10 Test  Loss 0.6985 Err: 0.5060: 100%|███████████████████████████████| 2/2 [00:00<00:00,  4.57it/s]
TESTING SET RESULTS: Average loss: 0.6970 Err: 0.5100
Epoch 11 Train Loss 0.6875 Err: 0.4100: 100%|█████████████████████████████| 90/90 [00:06<00:00, 14.09it/s]
Epoch 11 Test  Loss 0.6981 Err: 0.5060: 100%|███████████████████████████████| 2/2 [00:00<00:00,  4.25it/s]
TESTING SET RESULTS: Average loss: 0.6974 Err: 0.5100
Epoch 12 Train Loss 0.6830 Err: 0.3900: 100%|█████████████████████████████| 90/90 [00:06<00:00, 13.90it/s]
Epoch 12 Test  Loss 0.6983 Err: 0.5120: 100%|███████████████████████████████| 2/2 [00:00<00:00,  4.35it/s]
TESTING SET RESULTS: Average loss: 0.6988 Err: 0.5130
Epoch 13 Train Loss 0.6857 Err: 0.4100: 100%|█████████████████████████████| 90/90 [00:06<00:00, 13.26it/s]
Epoch 13 Test  Loss 0.6980 Err: 0.5060: 100%|███████████████████████████████| 2/2 [00:00<00:00,  4.44it/s]
TESTING SET RESULTS: Average loss: 0.6977 Err: 0.5100
Epoch 14 Train Loss 0.6796 Err: 0.4500: 100%|█████████████████████████████| 90/90 [00:06<00:00, 13.64it/s]
Epoch 14 Test  Loss 0.6982 Err: 0.4860: 100%|███████████████████████████████| 2/2 [00:00<00:00,  4.52it/s]
TESTING SET RESULTS: Average loss: 0.6989 Err: 0.5030
Epoch 15 Train Loss 0.6886 Err: 0.4800: 100%|█████████████████████████████| 90/90 [00:06<00:00, 13.89it/s]
Epoch 15 Test  Loss 0.6974 Err: 0.5060: 100%|███████████████████████████████| 2/2 [00:00<00:00,  4.44it/s]
TESTING SET RESULTS: Average loss: 0.6960 Err: 0.5100
Epoch 16 Train Loss 0.6856 Err: 0.4500: 100%|█████████████████████████████| 90/90 [00:06<00:00, 13.99it/s]
Epoch 16 Test  Loss 0.6979 Err: 0.5080: 100%|███████████████████████████████| 2/2 [00:00<00:00,  4.22it/s]
TESTING SET RESULTS: Average loss: 0.6997 Err: 0.5080
Epoch 17 Train Loss 0.6784 Err: 0.3800: 100%|█████████████████████████████| 90/90 [00:06<00:00, 14.12it/s]
Epoch 17 Test  Loss 0.7000 Err: 0.5120: 100%|███████████████████████████████| 2/2 [00:00<00:00,  4.32it/s]
TESTING SET RESULTS: Average loss: 0.7011 Err: 0.5130
Epoch 18 Train Loss 0.0310 Err: 0.0000:  49%|██████████████▏              | 44/90 [00:03<00:04, 11.23it/s]invalid on input
invalid on input
invalid on input
invalid on input
invalid on input
invalid on input
Epoch 18 Train Loss 0.0234 Err: 0.0000:  51%|██████████████▊              | 46/90 [00:03<00:03, 11.16it/s]invalid on input
invalid on input
invalid on input
[...]

What could be wrong?
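
In the log above, the training loss stays near 0.68 through epoch 17 and then drops to 0.0310 in epoch 18, right where the warnings begin. A minimal sketch for locating that point in a saved log, assuming the output was captured to a text file; the helper name `first_warning` and the regular expression are illustrative and not part of the repository:

```python
import re
from typing import Iterable, Optional, Tuple

# Matches the training progress prefix, e.g. "Epoch 18 Train Loss 0.0310 Err: 0.0000".
# A loss printed as "nan" would not match this pattern.
TRAIN_RE = re.compile(r"Epoch (\d+) Train Loss ([\d.]+) Err: ([\d.]+)")


def first_warning(
    log_lines: Iterable[str], marker: str = "invalid on input"
) -> Optional[Tuple[int, float, float]]:
    """Return (epoch, loss, err) from the last training progress update seen
    up to and including the first line containing `marker`, or None if the
    marker never appears."""
    last = None
    for line in log_lines:
        # A progress bar may rewrite the same line several times; keep the latest match.
        for m in TRAIN_RE.finditer(line):
            last = (int(m.group(1)), float(m.group(2)), float(m.group(3)))
        if marker in line:
            return last
    return None


if __name__ == "__main__":
    import sys

    # Usage: python find_warning.py train.log  (file name is an example)
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        print(first_warning(fh))
```

This only reports where the first warning shows up; it does not explain the cause.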

@xflash96
Member

Sorry for the delayed reply. The "invalid on input" warning (satnet_cpp:194) means that there are NaN or Inf values in the gradient, which didn't happen during our tests. Could you describe your environment (CPU spec, numpy/PyTorch version) so we can reproduce the bug?
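
To narrow this down from the Python side, here is a minimal sketch that lists parameters whose gradient contains NaN or Inf after `backward()`. The helper `nonfinite_grads` is hypothetical and not part of this repository, and since the warning itself is raised inside the C++ extension, this check may not catch exactly the same condition:

```python
from typing import List

import torch
from torch import nn


def nonfinite_grads(model: nn.Module) -> List[str]:
    """Return the names of parameters whose .grad contains NaN or Inf."""
    bad = []
    for name, param in model.named_parameters():
        # Parameters that did not take part in the backward pass have grad None.
        if param.grad is not None and not torch.isfinite(param.grad).all():
            bad.append(name)
    return bad


if __name__ == "__main__":
    # Toy stand-in for the real model, only to show how the helper is called.
    model = nn.Linear(4, 1)
    loss = model(torch.randn(8, 4)).pow(2).mean()
    loss.backward()
    print(nonfinite_grads(model))

    # Inject a NaN by hand to show what a positive result looks like.
    model.weight.grad[0, 0] = float("nan")
    print(nonfinite_grads(model))
```

Calling it right after `loss.backward()` in the training loop, and printing the batch index when the list is non-empty, should show which step first produces a bad gradient.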

@fdietze
Author

fdietze commented Nov 9, 2019

No worries, sorry for my late reply now :)

I haven't found the time to try again yet. I'll report back when I do.

@fdietze
Author

fdietze commented Dec 30, 2019

So I found the time to try again. Still the same problem, but at a later epoch.

Manjaro Linux, Linux 5.3.18-1 (Running in Virtualbox)
CPU: Intel i7-8550U
Python 3.8.1
numpy 1.18.0
torch 1.3.1
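
A short sketch for collecting the same details automatically, using only the standard library (`importlib.metadata` needs Python 3.8 or later). The function name `environment_report` is illustrative, and the processor string can be empty on some systems:

```python
import platform
import sys
from importlib import metadata
from typing import Dict, Iterable


def environment_report(packages: Iterable[str] = ("numpy", "torch")) -> Dict[str, str]:
    """Collect OS, CPU, Python, and installed package versions as strings."""
    report = {
        "os": platform.platform(),
        "machine": platform.machine(),
        "processor": platform.processor(),  # may be an empty string
        "python": sys.version.split()[0],
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```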

Tell me if you need more information.

Thanks for your help!

@xflash96
Member

Sorry for the late update. I've updated the APIs to work with PyTorch 1.7.0.
I also fixed the bug in the CPU version.
Could you confirm that it works on your side as well?

@fdietze
Author

fdietze commented Dec 30, 2020

Thank you for the update. I'll report back when I try again.
