
GPU issues? #8

Open
Feheragyar opened this issue Apr 12, 2023 · 18 comments · Fixed by VZoche-Golob/keras-tuner-cv#5

Comments

@Feheragyar
Contributor

I have been using your extension on CPUs and it runs perfectly. I recently moved over to using a GPU, and the loss calculation now looks completely chaotic. Are there issues in the implementation that prohibit the use of GPUs?
Here is a snippet so you can see the loss calculation issues (the best loss is 'None'; the recovered weights result in previously unseen loss values after early stopping; and because of the 'None' best loss value, the best hyperparameters remain as they were set for the very first trial):


Inner Cross-Validation 5/5

Epoch 1/50
6/6 [==============================] - 5s 575ms/step - loss: 0.5369 - mean_squared_error: 0.5369 - mean_absolute_error: 0.6359 - mean_absolute_percentage_error: 263.9126 - root_mean_squared_error: 0.7327 - val_loss: 0.0721 - val_mean_squared_error: 0.0721 - val_mean_absolute_error: 0.2148 - val_mean_absolute_percentage_error: 22.1264 - val_root_mean_squared_error: 0.2685
Epoch 2/50
6/6 [==============================] - 3s 475ms/step - loss: 0.1652 - mean_squared_error: 0.1652 - mean_absolute_error: 0.3106 - mean_absolute_percentage_error: 323.5719 - root_mean_squared_error: 0.4065 - val_loss: 0.0850 - val_mean_squared_error: 0.0850 - val_mean_absolute_error: 0.2492 - val_mean_absolute_percentage_error: 25.4391 - val_root_mean_squared_error: 0.2915
Epoch 3/50
6/6 [==============================] - 3s 478ms/step - loss: 0.1079 - mean_squared_error: 0.1079 - mean_absolute_error: 0.2405 - mean_absolute_percentage_error: 256.0751 - root_mean_squared_error: 0.3284 - val_loss: 0.0103 - val_mean_squared_error: 0.0103 - val_mean_absolute_error: 0.0714 - val_mean_absolute_percentage_error: 7.3397 - val_root_mean_squared_error: 0.1013
Epoch 4/50
6/6 [==============================] - 3s 478ms/step - loss: 0.1035 - mean_squared_error: 0.1035 - mean_absolute_error: 0.1980 - mean_absolute_percentage_error: 354.6868 - root_mean_squared_error: 0.3217 - val_loss: 0.0538 - val_mean_squared_error: 0.0538 - val_mean_absolute_error: 0.2179 - val_mean_absolute_percentage_error: 22.2260 - val_root_mean_squared_error: 0.2319
Epoch 5/50
6/6 [==============================] - 3s 481ms/step - loss: 0.1149 - mean_squared_error: 0.1149 - mean_absolute_error: 0.2556 - mean_absolute_percentage_error: 254.6845 - root_mean_squared_error: 0.3389 - val_loss: 0.0229 - val_mean_squared_error: 0.0229 - val_mean_absolute_error: 0.1178 - val_mean_absolute_percentage_error: 12.0714 - val_root_mean_squared_error: 0.1513
Epoch 6/50
6/6 [==============================] - 2s 381ms/step - loss: 0.0978 - mean_squared_error: 0.0978 - mean_absolute_error: 0.2223 - mean_absolute_percentage_error: 208.5932 - root_mean_squared_error: 0.3127 - val_loss: 0.0734 - val_mean_squared_error: 0.0734 - val_mean_absolute_error: 0.2140 - val_mean_absolute_percentage_error: 22.2007 - val_root_mean_squared_error: 0.2710
Epoch 7/50
6/6 [==============================] - 1s 225ms/step - loss: 0.0789 - mean_squared_error: 0.0789 - mean_absolute_error: 0.2038 - mean_absolute_percentage_error: 213.5430 - root_mean_squared_error: 0.2808 - val_loss: 0.0186 - val_mean_squared_error: 0.0186 - val_mean_absolute_error: 0.0969 - val_mean_absolute_percentage_error: 10.0373 - val_root_mean_squared_error: 0.1364
Epoch 8/50
6/6 [==============================] - 1s 228ms/step - loss: 0.0708 - mean_squared_error: 0.0708 - mean_absolute_error: 0.1652 - mean_absolute_percentage_error: 276.1188 - root_mean_squared_error: 0.2662 - val_loss: 0.0087 - val_mean_squared_error: 0.0087 - val_mean_absolute_error: 0.0701 - val_mean_absolute_percentage_error: 7.1587 - val_root_mean_squared_error: 0.0935
Epoch 9/50
6/6 [==============================] - 1s 219ms/step - loss: 0.0676 - mean_squared_error: 0.0676 - mean_absolute_error: 0.1503 - mean_absolute_percentage_error: 282.9794 - root_mean_squared_error: 0.2600 - val_loss: 0.0090 - val_mean_squared_error: 0.0090 - val_mean_absolute_error: 0.0536 - val_mean_absolute_percentage_error: 5.5848 - val_root_mean_squared_error: 0.0950
Epoch 10/50
6/6 [==============================] - 2s 409ms/step - loss: 0.0663 - mean_squared_error: 0.0663 - mean_absolute_error: 0.1536 - mean_absolute_percentage_error: 242.2759 - root_mean_squared_error: 0.2574 - val_loss: 0.0151 - val_mean_squared_error: 0.0151 - val_mean_absolute_error: 0.0738 - val_mean_absolute_percentage_error: 7.7006 - val_root_mean_squared_error: 0.1227
Epoch 11/50
6/6 [==============================] - 3s 481ms/step - loss: 0.0696 - mean_squared_error: 0.0696 - mean_absolute_error: 0.1742 - mean_absolute_percentage_error: 183.5706 - root_mean_squared_error: 0.2638 - val_loss: 0.0395 - val_mean_squared_error: 0.0395 - val_mean_absolute_error: 0.1167 - val_mean_absolute_percentage_error: 12.3000 - val_root_mean_squared_error: 0.1986
Epoch 12/50
6/6 [==============================] - 2s 269ms/step - loss: 0.0635 - mean_squared_error: 0.0635 - mean_absolute_error: 0.1620 - mean_absolute_percentage_error: 193.5781 - root_mean_squared_error: 0.2520 - val_loss: 0.0258 - val_mean_squared_error: 0.0258 - val_mean_absolute_error: 0.0838 - val_mean_absolute_percentage_error: 8.8847 - val_root_mean_squared_error: 0.1606
Epoch 13/50
6/6 [==============================] - 2s 409ms/step - loss: 0.0594 - mean_squared_error: 0.0594 - mean_absolute_error: 0.1509 - mean_absolute_percentage_error: 208.7011 - root_mean_squared_error: 0.2438 - val_loss: 0.0404 - val_mean_squared_error: 0.0404 - val_mean_absolute_error: 0.1378 - val_mean_absolute_percentage_error: 14.4424 - val_root_mean_squared_error: 0.2011
Restoring model weights from the end of the best epoch.
Epoch 00013: early stopping
1/1 [==============================] - 1s 579ms/step
1/1 [==============================] - 0s 500ms/step
1/1 [==============================] - 1s 1s/step - loss: 0.0499 - mean_squared_error: 0.0499 - mean_absolute_error: 0.1130 - mean_absolute_percentage_error: 234.8392 - root_mean_squared_error: 0.2234
1/1 [==============================] - 1s 609ms/step - loss: 0.1864 - mean_squared_error: 0.1864 - mean_absolute_error: 0.2046 - mean_absolute_percentage_error: 106.4081 - root_mean_squared_error: 0.4317
Trial 1 Complete [00h 02m 55s]

Best val_loss So Far: None
Total elapsed time: 00h 02m 55s

@VZoche-Golob

When using the version included in pull request #5, I did not run into any issues during hyperparameter optimization on a GPU. Did you use the exact same code on a CPU and on a GPU?

@Feheragyar
Contributor Author

Yes, I used the #5 version and ran identical code on CPU and GPU.

@VZoche-Golob

Unfortunately, I cannot reproduce your issues. Could you please provide a hypermodel (e.g. for MNIST) and the HPO and training procedure as a code snippet (e.g. in a gist) that produces the issue?
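
Something as small as the following would be enough. This is only a rough sketch of what I mean, not tested code - the inner_cv import path, the way it wraps the tuner class, and its keyword arguments are my assumptions, so adjust them to whatever your actual setup looks like:

```python
import numpy as np
import keras_tuner
from tensorflow import keras
from sklearn.model_selection import KFold

from keras_tuner_cv.inner_cv import inner_cv  # assumed import path


def build_model(hp):
    # Minimal MNIST hypermodel with one tunable dense layer.
    model = keras.Sequential([
        keras.layers.Flatten(input_shape=(28, 28)),
        keras.layers.Dense(hp.Int("units", 32, 256, step=32), activation="relu"),
        keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(
        optimizer=keras.optimizers.Adam(hp.Choice("learning_rate", [1e-2, 1e-3])),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model


(x_train, y_train), _ = keras.datasets.mnist.load_data()
x_train = x_train.astype(np.float32) / 255.0

# Wrap a standard KerasTuner tuner with inner_cv (argument names are assumed).
tuner = inner_cv(keras_tuner.BayesianOptimization)(
    build_model,
    KFold(n_splits=5, shuffle=True, random_state=42),
    objective="val_loss",
    max_trials=3,
    directory="hpo",
    project_name="gpu_repro",
)

tuner.search(
    x_train,
    y_train,
    epochs=5,
    batch_size=128,
    callbacks=[keras.callbacks.EarlyStopping(patience=2, restore_best_weights=True)],
)
```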

@Feheragyar
Contributor Author

Here is the Gist with the full code I am running (it's LSTM tuning via BayesianOptimization) along with the data I use for training. The code grabs the NumPy-format data directly, so you can just run it as-is.

Thanks a lot for the help! Hope you can see the issues this way.

@VZoche-Golob

Using your data, I tried again to reproduce your issues after minor modifications of your code (Gist), using the fixGPUissues branch. I used the same computer with TensorFlow 2.11.0 and Keras-Tuner 1.1.3.

Again, I could not reproduce your issues (see the out-files in the Gist):

  • "Best val_loss So Far: None" never occurred - I have no idea where it might come from. According to the keras-tuner code, it would only be printed if no trials were completed.
  • The val_loss of a CV split within a trial was always exactly the best epoch's val_loss.

However, I used batch_size='full-batch' to ensure that keras-tuner-cv used the same batch sizes during training and evaluation of a CV split in a trial. Please be aware that inner_cv() of keras-tuner-cv always uses the full length of the training and the validation data, respectively, as the batch size when evaluating the trained model in a CV split.
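
For illustration only - a simplified sketch, not the actual code from the Gist or my out-files; tuner, x_train and y_train are placeholders for the tuner and training data - in my run, 'full-batch' effectively boils down to something like this:

```python
# Simplified sketch: a batch size that covers the whole training set means each
# training fold is processed as a single batch per epoch, which matches the
# full-batch evaluation that inner_cv() performs on each split.
tuner.search(
    x_train,
    y_train,
    epochs=50,
    batch_size=len(x_train),  # at least as large as any training fold
)
```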

@Feheragyar
Contributor Author

Thank you for the help! I will dig around in my environment; there must be something in it disagreeing with your package.

@Feheragyar
Contributor Author

Which version of tensorflow-gpu is recommended for use with keras-tuner-cv?

@VZoche-Golob

I have not tested it with any version other than TensorFlow 2.11.

@Feheragyar
Contributor Author

So you ran the script on Linux or WSL? Perhaps that's my issue - I'm running it on native Windows.

@VZoche-Golob

Using WSL2.

@VZoche-Golob

It seems that I get the same issue after updating from TensorFlow 2.11 and KerasTuner 1.1.3 to TensorFlow 2.12 and KerasTuner 1.3.5.
@Feheragyar: How did you handle this issue?

@Feheragyar
Contributor Author

I simply migrated to Linux. I gave up on virtual environments, as I couldn't make the library run after a full day of fiddling. I used the TF and tuner versions you cited in a previous comment (TensorFlow 2.11.0 and Keras-Tuner 1.1.3). I had no issues on native Linux (Ubuntu) using Anaconda.

@VZoche-Golob

@Feheragyar: Thanks for answering so quickly. Most probably that will be my solution as well...

Using TensorFlow 2.12 and KerasTuner 1.3.5, test_randomsearch, test_bayesianoptimization, and test_hyperband in keras_tuner_cv/test_inner_cv.py (https://github.com/VZoche-Golob/keras-tuner-cv) fail.

@Feheragyar
Contributor Author

No worries, I believe that's all I did; let me know if you run into trouble and I'll try to retrace my steps for you. I believe I also tested it with the most up-to-date TF while keeping the old tuner version, and it still worked perfectly.

@VZoche-Golob

I tried different versions of TensorFlow and KerasTuner. It seems that keras-tuner-cv currently only works with KerasTuner 1.1.3 and NumPy 1.20.

When using KerasTuner 1.1.3 with TensorFlow >2.11, you will get several deprecation warnings. However, even with TensorFlow 2.11, KerasTuner 1.1.3, and NumPy 1.20, you get:

lib/python3.9/site-packages/keras_tuner/tuners/bayesian.py:123: DeprecationWarning: np.float is a deprecated alias for the builtin float. To silence this warning, use float by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use np.float64 here. Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations

@VZoche-Golob

I think I found the issue: in KerasTuner 1.1.3, the status of a trial was set to "completed" by Oracle.end_trial(), but in v1.3.5 the status is set earlier, by BaseTuner._try_run_and_update_trial(), which did not exist in v1.1.3.
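
Simplified, from my reading of the two code bases (not verbatim), the call sequences differ roughly like this:

```python
# KerasTuner 1.1.3 (roughly): the oracle marks the trial as completed only at
# the very end, after the trial's results have been reported.
#
#   results = tuner.run_trial(trial, ...)
#   oracle.update_trial(trial.trial_id, results)
#   oracle.end_trial(trial.trial_id, "COMPLETED")   # status set here
#
# KerasTuner 1.3.5 (roughly): the status is already set inside the new helper,
# before end_trial() is ever reached.
#
#   tuner._try_run_and_update_trial(trial, ...)     # sets trial.status to COMPLETED
#   oracle.end_trial(trial)
```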

@VZoche-Golob

VZoche-Golob commented Sep 4, 2023

I fixed the issue for KerasTuner 1.3.5 in https://github.com/VZoche-Golob/keras-tuner-cv

@VZoche-Golob

After merging #5, this issue should be fixed.
