train loss equal to 0 #2

Open · suzy0223 opened this issue Feb 14, 2023 · 5 comments

@suzy0223

Hi, I tried to run train.py on METR-LA. Because of the TensorFlow version, I used tf_upgrade_v2 to migrate model.py to TF 2.x.
Specifically:
(1) line 76: 'tf.nn.rnn_cell.GRUCell' to 'tf.compat.v1.nn.rnn_cell.GRUCell'
(2) lines 80 and 87: 'tf.layers.dense' to 'tf.compat.v1.layers.dense'

Then, when I ran train.py and test.py, I hit several issues:
(1) line 47 in test.py and line 58 in train.py: the x.value lookup fails because x is already an int. After changing "x.value for x in xxx" to "x for x in xxx", it works.
(2) After 4-5 epochs, the training and validation losses drop to 0 and the test result becomes NaN. I have run the code several times and the issue does not go away. Meanwhile, test.py itself runs normally and outputs results. Since the repository does not include metr-la.h5, I used the file downloaded from IGNNK.

Now I am not sure what is causing these issues. I would appreciate any suggestions.
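
For concreteness, the migrated calls end up looking roughly like this. This is only a sketch: the dummy input, hidden size, and the dynamic_rnn wiring are placeholders I made up for the snippet, not values copied from model.py; only the compat renames themselves match what I changed.

```python
import tensorflow as tf

tf.compat.v1.disable_eager_execution()  # graph mode; see my next comment

# Dummy input just to make the snippet self-contained: [batch, time steps, sensors].
x = tf.zeros([8, 12, 207])

# line 76: GRUCell now lives under tf.compat.v1
cell = tf.compat.v1.nn.rnn_cell.GRUCell(64)
outputs, state = tf.compat.v1.nn.dynamic_rnn(cell, x, dtype=tf.float32)

# lines 80 and 87: the same move for tf.layers.dense
pred = tf.compat.v1.layers.dense(outputs, 1)

# line 58 in train.py / line 47 in test.py: shape entries are plain ints (or None)
# in TF 2.x, so "x.value for x in pred.shape" becomes:
dims = [d for d in pred.shape]
```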

@suzy0223 (Author)

Besides, add 'tf.compat.v1.disable_eager_execution()' at the beginning of the placeholder(h) function.
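
Roughly where that call goes; the placeholder shapes below are only illustrative, since I am not copying the exact shapes the repo's placeholder(h) builds:

```python
import tensorflow as tf

def placeholder(h):
    # Must run before any tf.compat.v1.placeholder is created, otherwise TF 2.x
    # raises "tf.placeholder() is not compatible with eager execution".
    tf.compat.v1.disable_eager_execution()
    # Illustrative shapes only: [batch, horizon h, sensors].
    inputs = tf.compat.v1.placeholder(tf.float32, shape=[None, h, 207], name='inputs')
    labels = tf.compat.v1.placeholder(tf.float32, shape=[None, h, 207], name='labels')
    return inputs, labels
```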

@wujiangzhu

I met the same problem: the loss becomes 0 after several epochs. Could you help me? Much appreciated!

@Aminsheykh98

I translated their code into PyTorch and ran into the same issue you mentioned. I think the problem is that they didn't normalize the inputs (presumably so that masking NaN values in the loss function would stay simple), but this causes the gradients to explode after 4 or 5 epochs.
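
Roughly what I mean, as a sketch from my PyTorch version. The function names are my own, and I assume the usual METR-LA convention that a 0.0 speed reading marks a missing value:

```python
import torch

def zscore(x, mean, std):
    # Normalize with statistics computed on the training split,
    # ideally ignoring missing readings when computing mean/std.
    return (x - mean) / std

def masked_mae(pred, target, null_val=0.0):
    # Exclude missing readings (0.0 in METR-LA) from the loss instead of
    # skipping input normalization altogether.
    mask = (target != null_val).float()
    mask = mask / (mask.mean() + 1e-8)   # keep the loss scale independent of how much is masked
    loss = torch.abs(pred - target) * mask
    return loss.mean()
```

Rescaling the mask by its mean keeps the loss magnitude comparable whether a batch has many or few missing readings.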

@zhusuwen commented Jun 5, 2024

When I lowered the learning rate to 0.0001, training worked fine, but the results were not as good as those in the paper.

@Reset-quick

Hello, may I ask whether you have solved this problem now? Were you able to reproduce the results reported in the paper?
