
GPU issues? #8

Open
Feheragyar opened this issue Apr 12, 2023 · 18 comments · Fixed by VZoche-Golob/keras-tuner-cv#5

Comments

@Feheragyar
Contributor

I have been using your extension on CPUs and it runs perfectly. I recently moved over to using a GPU, and the loss calculation now looks completely chaotic. Are there issues in the implementation that prohibit the use of GPUs?
Here is a snippet so you can see the loss calculation issues (the best loss is 'None'; the recovered weights result in previously unseen loss values after early stopping; and because of the 'None' best loss value, the best hyperparameters remain as they were set for the very first trial):


Inner Cross-Validation 5/5

Epoch 1/50
6/6 [==============================] - 5s 575ms/step - loss: 0.5369 - mean_squared_error: 0.5369 - mean_absolute_error: 0.6359 - mean_absolute_percentage_error: 263.9126 - root_mean_squared_error: 0.7327 - val_loss: 0.0721 - val_mean_squared_error: 0.0721 - val_mean_absolute_error: 0.2148 - val_mean_absolute_percentage_error: 22.1264 - val_root_mean_squared_error: 0.2685
Epoch 2/50
6/6 [==============================] - 3s 475ms/step - loss: 0.1652 - mean_squared_error: 0.1652 - mean_absolute_error: 0.3106 - mean_absolute_percentage_error: 323.5719 - root_mean_squared_error: 0.4065 - val_loss: 0.0850 - val_mean_squared_error: 0.0850 - val_mean_absolute_error: 0.2492 - val_mean_absolute_percentage_error: 25.4391 - val_root_mean_squared_error: 0.2915
Epoch 3/50
6/6 [==============================] - 3s 478ms/step - loss: 0.1079 - mean_squared_error: 0.1079 - mean_absolute_error: 0.2405 - mean_absolute_percentage_error: 256.0751 - root_mean_squared_error: 0.3284 - val_loss: 0.0103 - val_mean_squared_error: 0.0103 - val_mean_absolute_error: 0.0714 - val_mean_absolute_percentage_error: 7.3397 - val_root_mean_squared_error: 0.1013
Epoch 4/50
6/6 [==============================] - 3s 478ms/step - loss: 0.1035 - mean_squared_error: 0.1035 - mean_absolute_error: 0.1980 - mean_absolute_percentage_error: 354.6868 - root_mean_squared_error: 0.3217 - val_loss: 0.0538 - val_mean_squared_error: 0.0538 - val_mean_absolute_error: 0.2179 - val_mean_absolute_percentage_error: 22.2260 - val_root_mean_squared_error: 0.2319
Epoch 5/50
6/6 [==============================] - 3s 481ms/step - loss: 0.1149 - mean_squared_error: 0.1149 - mean_absolute_error: 0.2556 - mean_absolute_percentage_error: 254.6845 - root_mean_squared_error: 0.3389 - val_loss: 0.0229 - val_mean_squared_error: 0.0229 - val_mean_absolute_error: 0.1178 - val_mean_absolute_percentage_error: 12.0714 - val_root_mean_squared_error: 0.1513
Epoch 6/50
6/6 [==============================] - 2s 381ms/step - loss: 0.0978 - mean_squared_error: 0.0978 - mean_absolute_error: 0.2223 - mean_absolute_percentage_error: 208.5932 - root_mean_squared_error: 0.3127 - val_loss: 0.0734 - val_mean_squared_error: 0.0734 - val_mean_absolute_error: 0.2140 - val_mean_absolute_percentage_error: 22.2007 - val_root_mean_squared_error: 0.2710
Epoch 7/50
6/6 [==============================] - 1s 225ms/step - loss: 0.0789 - mean_squared_error: 0.0789 - mean_absolute_error: 0.2038 - mean_absolute_percentage_error: 213.5430 - root_mean_squared_error: 0.2808 - val_loss: 0.0186 - val_mean_squared_error: 0.0186 - val_mean_absolute_error: 0.0969 - val_mean_absolute_percentage_error: 10.0373 - val_root_mean_squared_error: 0.1364
Epoch 8/50
6/6 [==============================] - 1s 228ms/step - loss: 0.0708 - mean_squared_error: 0.0708 - mean_absolute_error: 0.1652 - mean_absolute_percentage_error: 276.1188 - root_mean_squared_error: 0.2662 - val_loss: 0.0087 - val_mean_squared_error: 0.0087 - val_mean_absolute_error: 0.0701 - val_mean_absolute_percentage_error: 7.1587 - val_root_mean_squared_error: 0.0935
Epoch 9/50
6/6 [==============================] - 1s 219ms/step - loss: 0.0676 - mean_squared_error: 0.0676 - mean_absolute_error: 0.1503 - mean_absolute_percentage_error: 282.9794 - root_mean_squared_error: 0.2600 - val_loss: 0.0090 - val_mean_squared_error: 0.0090 - val_mean_absolute_error: 0.0536 - val_mean_absolute_percentage_error: 5.5848 - val_root_mean_squared_error: 0.0950
Epoch 10/50
6/6 [==============================] - 2s 409ms/step - loss: 0.0663 - mean_squared_error: 0.0663 - mean_absolute_error: 0.1536 - mean_absolute_percentage_error: 242.2759 - root_mean_squared_error: 0.2574 - val_loss: 0.0151 - val_mean_squared_error: 0.0151 - val_mean_absolute_error: 0.0738 - val_mean_absolute_percentage_error: 7.7006 - val_root_mean_squared_error: 0.1227
Epoch 11/50
6/6 [==============================] - 3s 481ms/step - loss: 0.0696 - mean_squared_error: 0.0696 - mean_absolute_error: 0.1742 - mean_absolute_percentage_error: 183.5706 - root_mean_squared_error: 0.2638 - val_loss: 0.0395 - val_mean_squared_error: 0.0395 - val_mean_absolute_error: 0.1167 - val_mean_absolute_percentage_error: 12.3000 - val_root_mean_squared_error: 0.1986
Epoch 12/50
6/6 [==============================] - 2s 269ms/step - loss: 0.0635 - mean_squared_error: 0.0635 - mean_absolute_error: 0.1620 - mean_absolute_percentage_error: 193.5781 - root_mean_squared_error: 0.2520 - val_loss: 0.0258 - val_mean_squared_error: 0.0258 - val_mean_absolute_error: 0.0838 - val_mean_absolute_percentage_error: 8.8847 - val_root_mean_squared_error: 0.1606
Epoch 13/50
6/6 [==============================] - 2s 409ms/step - loss: 0.0594 - mean_squared_error: 0.0594 - mean_absolute_error: 0.1509 - mean_absolute_percentage_error: 208.7011 - root_mean_squared_error: 0.2438 - val_loss: 0.0404 - val_mean_squared_error: 0.0404 - val_mean_absolute_error: 0.1378 - val_mean_absolute_percentage_error: 14.4424 - val_root_mean_squared_error: 0.2011
Restoring model weights from the end of the best epoch.
Epoch 00013: early stopping
1/1 [==============================] - 1s 579ms/step
1/1 [==============================] - 0s 500ms/step
1/1 [==============================] - 1s 1s/step - loss: 0.0499 - mean_squared_error: 0.0499 - mean_absolute_error: 0.1130 - mean_absolute_percentage_error: 234.8392 - root_mean_squared_error: 0.2234
1/1 [==============================] - 1s 609ms/step - loss: 0.1864 - mean_squared_error: 0.1864 - mean_absolute_error: 0.2046 - mean_absolute_percentage_error: 106.4081 - root_mean_squared_error: 0.4317
Trial 1 Complete [00h 02m 55s]

Best val_loss So Far: None
Total elapsed time: 00h 02m 55s

@VZoche-Golob

When using the version included in pull request #5, I did not run into any issues during hyperparameter optimization on a GPU. Did you use the exact same code on a CPU and on a GPU?

@Feheragyar
Contributor Author

Yes, I used the #5 version and ran identical code on CPU and GPU.

@VZoche-Golob

Unfortunately, I cannot reproduce your issues. Could you please provide a hypermodel (e.g. for MNIST) and the HPO and training procedure as a code snippet (e.g. in a gist) that produces the issue?
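
Something as small as the following would be enough. This is only a rough sketch of what I mean, not tested code - the inner_cv import path, the way it wraps the tuner class, and its keyword arguments are my assumptions, so adjust them to whatever your actual setup looks like:

```python
import numpy as np
import keras_tuner
from tensorflow import keras
from sklearn.model_selection import KFold

from keras_tuner_cv.inner_cv import inner_cv  # assumed import path


def build_model(hp):
    # Minimal MNIST hypermodel with one tunable dense layer.
    model = keras.Sequential([
        keras.layers.Flatten(input_shape=(28, 28)),
        keras.layers.Dense(hp.Int("units", 32, 256, step=32), activation="relu"),
        keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(
        optimizer=keras.optimizers.Adam(hp.Choice("learning_rate", [1e-2, 1e-3])),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model


(x_train, y_train), _ = keras.datasets.mnist.load_data()
x_train = x_train.astype(np.float32) / 255.0

# Wrap a standard KerasTuner tuner with inner_cv (argument names are assumed).
tuner = inner_cv(keras_tuner.BayesianOptimization)(
    build_model,
    KFold(n_splits=5, shuffle=True, random_state=42),
    objective="val_loss",
    max_trials=3,
    directory="hpo",
    project_name="gpu_repro",
)

tuner.search(
    x_train,
    y_train,
    epochs=5,
    batch_size=128,
    callbacks=[keras.callbacks.EarlyStopping(patience=2, restore_best_weights=True)],
)
```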

@Feheragyar
Contributor Author

Here is the Gist with the full code I am running (it's LSTM tuning via BayesianOptimization) along with the data I use for training. The code grabs the NumPy-format data directly, so you can just run it as-is.

Thanks a lot for the help! Hope you can see the issues this way.

@VZoche-Golob

Using your data, I tried again to reproduce your issues after minor modifications of your code (Gist), using the fixGPUissues branch. I used the same computer with TensorFlow 2.11.0 and Keras-Tuner 1.1.3.

Again, I could not reproduce your issues (see the out-files in the Gist):

  • "Best val_loss So Far: None" never occurred - I have no idea where it might come from. According to the keras-tuner code, it would only be printed if no trials were completed.
  • The val_loss of a CV split within a trial was always exactly the best epoch's val_loss.

However, I used batch_size='full-batch' to ensure that keras-tuner-cv used the same batch sizes during training and evaluation of a CV split in a trial. Please be aware that inner_cv() of keras-tuner-cv always uses the full length of the training and the validation data, respectively, as the batch size when evaluating the trained model in a CV split.
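
For illustration only - a simplified sketch, not the actual code from the Gist or my out-files; tuner, x_train and y_train are placeholders for the tuner and training data - in my run, 'full-batch' effectively boils down to something like this:

```python
# Simplified sketch: a batch size that covers the whole training set means each
# training fold is processed as a single batch per epoch, which matches the
# full-batch evaluation that inner_cv() performs on each split.
tuner.search(
    x_train,
    y_train,
    epochs=50,
    batch_size=len(x_train),  # at least as large as any training fold
)
```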

@Feheragyar
Contributor Author

Thank you for the help! I will dig around in my environment; there must be something in it disagreeing with your package.

@Feheragyar
Contributor Author

Which version of tensorflow-gpu is recommended for use with keras-tuner-cv?

@VZoche-Golob

I have not tested it with any version other than TensorFlow 2.11.

@Feheragyar
Contributor Author

So you ran the script on Linux or WSL? Perhaps that's my issue - I'm running it on native Windows.

@VZoche-Golob

Using WSL2.

@VZoche-Golob

It seems that I get the same issue after updating from TensorFlow 2.11 and KerasTuner 1.1.3 to TensorFlow 2.12 and KerasTuner 1.3.5.
@Feheragyar: How did you handle this issue?

@Feheragyar
Contributor Author

I simply migrated to Linux. I gave up on virtual environments, as I couldn't make the library run after a full day of fiddling. I used the TF and tuner versions you cited in a previous comment (TensorFlow 2.11.0 and Keras-Tuner 1.1.3). I had no issues on native Linux (Ubuntu) using Anaconda.

@VZoche-Golob

@Feheragyar: Thanks for answering so quickly. Most probably that will be my solution as well...

Using TensorFlow 2.12 and KerasTuner 1.3.5, test_randomsearch, test_bayesianoptimization, and test_hyperband in keras_tuner_cv/test_inner_cv.py (https://github.com/VZoche-Golob/keras-tuner-cv) fail.

@Feheragyar
Contributor Author

No worries, I believe that's all I did; let me know if you run into trouble and I'll try to retrace my steps for you. I believe I also tested it with the most up-to-date TF while keeping the old tuner version, and it still worked perfectly.

@VZoche-Golob

I tried different versions of TensorFlow and KerasTuner. It seems that keras-tuner-cv currently only works with KerasTuner 1.1.3 and NumPy 1.20.

When using KerasTuner 1.1.3 with TensorFlow >2.11, you will get several deprecation warnings. However, even with TensorFlow 2.11, KerasTuner 1.1.3, and NumPy 1.20, you get:

lib/python3.9/site-packages/keras_tuner/tuners/bayesian.py:123: DeprecationWarning: np.float is a deprecated alias for the builtin float. To silence this warning, use float by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use np.float64 here. Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations

@VZoche-Golob

I think I found the issue: in KerasTuner 1.1.3, the status of a trial was set to "completed" by Oracle.end_trial(), but in v1.3.5 the status is set earlier, by BaseTuner._try_run_and_update_trial(), which did not exist in v1.1.3.
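
Simplified, from my reading of the two code bases (not verbatim), the call sequences differ roughly like this:

```python
# KerasTuner 1.1.3 (roughly): the oracle marks the trial as completed only at
# the very end, after the trial's results have been reported.
#
#   results = tuner.run_trial(trial, ...)
#   oracle.update_trial(trial.trial_id, results)
#   oracle.end_trial(trial.trial_id, "COMPLETED")   # status set here
#
# KerasTuner 1.3.5 (roughly): the status is already set inside the new helper,
# before end_trial() is ever reached.
#
#   tuner._try_run_and_update_trial(trial, ...)     # sets trial.status to COMPLETED
#   oracle.end_trial(trial)
```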

@VZoche-Golob

VZoche-Golob commented Sep 4, 2023

I fixed the issue for KerasTuner 1.3.5 in https://github.com/VZoche-Golob/keras-tuner-cv

@VZoche-Golob

After merging #5, this issue should be fixed.
