Model training not working in Cloudina #419

Open
pilarnavarro opened this issue Jul 6, 2024 · 6 comments
Labels
bug (Something isn't working), Spyfish, Support

Comments

@pilarnavarro
Collaborator

I attempted to run the Train_models.ipynb notebook for model training in Cloudina for the Spyfish Aotearoa project. I used data from Spyfish Aotearoa, which I downloaded using the Process_classifications.ipynb notebook. However, when I tried to download the model weights, I encountered an error:

Screenshot from 2024-07-06 16-22-26

Despite this error, I continued running the notebook, but I think the training never actually started. The training cell kept executing indefinitely, with the output shown in the attached screenshot.

Screenshot from 2024-07-06 17-59-52

After approximately half an hour, the kernel died.

@pilarnavarro added the bug (Something isn't working) label Jul 6, 2024
@victor-wildlife
Collaborator

@jannesgg any thoughts on this?

@jannesgg
Collaborator

jannesgg commented Aug 9, 2024

@victor-wildlife I believe this is related to uploading artifacts to MLflow, which can unfortunately take a long time depending on the number of files that need to be uploaded. One way to avoid this would be to stop uploading the entire dataset and instead upload only the names of the files used in the training and validation sets.
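
A minimal sketch of the file-name idea, assuming the dataset has `images/train` and `images/val` subfolders (that layout is an assumption, not something confirmed in this thread). `write_split_manifest` is a hypothetical helper and not part of the project code; it only writes a plain-text list of the files in each split.

```python
from pathlib import Path


def write_split_manifest(data_dir, out_file, splits=("train", "val")):
    """Write one '<split><TAB><relative path>' line per file; return the line count."""
    data_dir = Path(data_dir)
    lines = []
    for split in splits:
        split_dir = data_dir / "images" / split  # assumed folder layout
        if not split_dir.is_dir():
            continue  # skip splits that are not present
        for path in sorted(split_dir.rglob("*")):
            if path.is_file():
                rel = path.relative_to(data_dir).as_posix()
                lines.append(f"{split}\t{rel}")
    # A single small text file replaces the upload of every image
    Path(out_file).write_text("\n".join(lines) + ("\n" if lines else ""))
    return len(lines)
```

The resulting file could then be logged as a single artifact inside an active run, for example with `mlflow.log_artifact("manifest.txt")`, instead of uploading the image folders themselves.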

@jannesgg
Collaborator

jannesgg commented Sep 5, 2024

@pilarnavarro Have you tried this again since?

@pilarnavarro
Collaborator Author

Hello @jannesgg, I thought @victor-wildlife was going to try that, apologies for the misunderstanding. I’m not sure how to upload only the file names instead of the entire dataset. Could you guide me on how to do that?
I tried deleting the images and labels folders from the data folder, and at least the execution no longer gets stuck at the same point. Now, the execution continues until it fails to find the data, which makes sense.

image

@jannesgg
Collaborator

@pilarnavarro We have now updated the code so that the images and labels folders are each compressed into a zip archive and uploaded to MLflow as two files, rather than as two folders containing many small files, which was slowing down the process.

I suggest you now repeat the process without removing the images and labels folders as it should go much faster than before. Please let me know how it goes.
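
For readers following along, the zipping step can be sketched roughly as below. This is an illustration only, not the actual project code: `zip_dataset_folders` is a hypothetical name, and the `images` and `labels` folder names are taken from the discussion above.

```python
import shutil
from pathlib import Path


def zip_dataset_folders(data_dir, out_dir, folders=("images", "labels")):
    """Create one zip archive per folder and return the archive paths."""
    data_dir, out_dir = Path(data_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    archives = []
    for name in folders:
        src = data_dir / name
        if not src.is_dir():
            raise FileNotFoundError(f"Expected folder not found: {src}")
        # make_archive appends '.zip' to the base name and returns the full path
        archive = shutil.make_archive(str(out_dir / name), "zip", root_dir=src)
        archives.append(Path(archive))
    return archives
```

Each returned archive can then be uploaded as a single file, which avoids one upload request per image or label. The output directory should sit outside the folders being zipped so an archive does not try to include itself.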

@pilarnavarro
Collaborator Author

pilarnavarro commented Sep 28, 2024

@jannesgg thank you so much for making the changes to the code!
I ran the notebook with the recent updates in Cloudina, and it looks like the training is working properly. However, when I executed mlp.choose_baseline_model before the training, I encountered the following error:

ERROR:root:Failed to download the baseline model from MLFlow. The default baseline model will be used. module 'mlflow' has no attribute 'download_artifacts'.

It seems the training is being tracked in WandB instead of MLflow (or maybe in both?). Is that the expected behavior?
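
In case it helps with debugging, here is a small diagnostic sketch for the `download_artifacts` error. As far as I know, the function is exposed as `mlflow.artifacts.download_artifacts` in the MLflow releases that include it, but whether a top-level `mlflow.download_artifacts` exists in the installed version is exactly what is in doubt here, so the sketch checks both locations. `resolve_download_artifacts` is a hypothetical helper, not part of the project or of MLflow.

```python
import importlib


def resolve_download_artifacts(module_name="mlflow"):
    """Return (function, qualified name), or (None, None) if nothing is found."""
    # Candidate locations, checked in order. The second one is where I
    # believe the helper lives; treat that as an assumption to verify locally.
    candidates = [
        (module_name, "download_artifacts"),
        (f"{module_name}.artifacts", "download_artifacts"),
    ]
    for mod_name, attr in candidates:
        try:
            mod = importlib.import_module(mod_name)
        except ImportError:
            continue  # module or submodule is not installed
        fn = getattr(mod, attr, None)
        if callable(fn):
            return fn, f"{mod_name}.{attr}"
    return None, None
```

Running `resolve_download_artifacts()` in the Cloudina kernel, together with printing `mlflow.__version__`, should show which location (if any) is available in that environment.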

Sometimes, when running the training function, I encounter the following error:

INFO:root:Training failed due to: RESOURCE_DOES_NOT_EXIST: Run with id=89692da50d4e47688bc40994bfd0be04 not found.

Screenshot from 2024-09-28 17-14-04

That is, the training runs successfully sometimes, but other times, I receive this error either before the training starts or after some epochs. Do you know what might be causing this issue?

When running the notebook for training locally, I ran into a different issue. The registry is not set (it shows as None), so when selecting the baseline model, I get this error:

ERROR:root:Registry not supported.

I believe the environment variable MLFLOW_TRACKING_URI isn't set when running locally. I tried setting it to 127.0.0.1:5000, but the connection was refused. Could you clarify where I need to set this variable and what URI I should use to get it working locally?
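
A "connection refused" error usually means nothing is listening at that address, so my guess is that no tracking server was running locally. The sketch below adds the `http://` scheme to a bare `host:port` value and checks whether the port accepts connections before the variable is set. Both helper names are hypothetical, and where the project reads `MLFLOW_TRACKING_URI` is not something I know.

```python
import socket
from urllib.parse import urlparse


def normalize_tracking_uri(uri, default_scheme="http"):
    """Prefix a bare host:port with a scheme; leave full URIs unchanged."""
    uri = uri.strip()
    return uri if "://" in uri else f"{default_scheme}://{uri}"


def is_server_reachable(uri, timeout=2.0):
    """Return True if a TCP connection to the URI's host and port succeeds."""
    parsed = urlparse(normalize_tracking_uri(uri))
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        with socket.create_connection((parsed.hostname, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts
        return False
```

If `is_server_reachable("127.0.0.1:5000")` returns `False`, a local server can be started in a separate terminal with `mlflow server --host 127.0.0.1 --port 5000`. After that, setting `os.environ["MLFLOW_TRACKING_URI"] = normalize_tracking_uri("127.0.0.1:5000")` before the notebook imports the project code should be worth trying.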


3 participants