train loss equal to 0 #2

Open · suzy0223 opened this issue Feb 14, 2023 · 5 comments

@suzy0223

Hi, I tried to run train.py on METR-LA. Because of the TensorFlow version, I used tf_upgrade_v2 to migrate model.py to TF 2.x.
Specifically:
(1) line 76: 'tf.nn.rnn_cell.GRUCell' to 'tf.compat.v1.nn.rnn_cell.GRUCell'
(2) lines 80 and 87: 'tf.layers.dense' to 'tf.compat.v1.layers.dense'

Then, when I ran train.py and test.py, I hit several issues:
(1) line 47 in test.py and line 58 in train.py: the x.value lookup fails because x is already an int. After changing "x.value for x in xxx" to "x for x in xxx", it works.
(2) After 4-5 epochs, the training and validation losses drop to 0 and the test result becomes NaN. I have run the code several times and the issue does not go away. Meanwhile, test.py itself runs normally and outputs results. Since the repository does not include metr-la.h5, I used the file downloaded from IGNNK.

Now I am not sure what is causing these issues. I would appreciate any suggestions.
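
For concreteness, the migrated calls end up looking roughly like this. This is only a sketch: the dummy input, hidden size, and the dynamic_rnn wiring are placeholders I made up for the snippet, not values copied from model.py; only the compat renames themselves match what I changed.

```python
import tensorflow as tf

tf.compat.v1.disable_eager_execution()  # graph mode; see my next comment

# Dummy input just to make the snippet self-contained: [batch, time steps, sensors].
x = tf.zeros([8, 12, 207])

# line 76: GRUCell now lives under tf.compat.v1
cell = tf.compat.v1.nn.rnn_cell.GRUCell(64)
outputs, state = tf.compat.v1.nn.dynamic_rnn(cell, x, dtype=tf.float32)

# lines 80 and 87: the same move for tf.layers.dense
pred = tf.compat.v1.layers.dense(outputs, 1)

# line 58 in train.py / line 47 in test.py: shape entries are plain ints (or None)
# in TF 2.x, so "x.value for x in pred.shape" becomes:
dims = [d for d in pred.shape]
```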

@suzy0223 (Author)

Besides, add 'tf.compat.v1.disable_eager_execution()' at the beginning of the placeholder(h) function.
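
Roughly where that call goes; the placeholder shapes below are only illustrative, since I am not copying the exact shapes the repo's placeholder(h) builds:

```python
import tensorflow as tf

def placeholder(h):
    # Must run before any tf.compat.v1.placeholder is created, otherwise TF 2.x
    # raises "tf.placeholder() is not compatible with eager execution".
    tf.compat.v1.disable_eager_execution()
    # Illustrative shapes only: [batch, horizon h, sensors].
    inputs = tf.compat.v1.placeholder(tf.float32, shape=[None, h, 207], name='inputs')
    labels = tf.compat.v1.placeholder(tf.float32, shape=[None, h, 207], name='labels')
    return inputs, labels
```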

@wujiangzhu

I met the same problem: the loss becomes 0 after several epochs. Could you help me? Much appreciated!

@Aminsheykh98

I translated their code into PyTorch and ran into the same issue you mentioned. I think the problem is that they didn't normalize the inputs (presumably so that masking NaN values in the loss function would stay simple), but this causes the gradients to explode after 4 or 5 epochs.
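
Roughly what I mean, as a sketch from my PyTorch version. The function names are my own, and I assume the usual METR-LA convention that a 0.0 speed reading marks a missing value:

```python
import torch

def zscore(x, mean, std):
    # Normalize with statistics computed on the training split,
    # ideally ignoring missing readings when computing mean/std.
    return (x - mean) / std

def masked_mae(pred, target, null_val=0.0):
    # Exclude missing readings (0.0 in METR-LA) from the loss instead of
    # skipping input normalization altogether.
    mask = (target != null_val).float()
    mask = mask / (mask.mean() + 1e-8)   # keep the loss scale independent of how much is masked
    loss = torch.abs(pred - target) * mask
    return loss.mean()
```

Rescaling the mask by its mean keeps the loss magnitude comparable whether a batch has many or few missing readings.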

@zhusuwen commented Jun 5, 2024

When I lowered the learning rate to 0.0001, training worked fine, but the results were not as good as those in the paper.

@Reset-quick

Hello, may I ask whether you have solved this problem now? Were you able to reproduce the results reported in the paper?
