Model training not working in Cloudina #419
Comments
@jannesgg any thoughts on this?
@victor-wildlife I believe this is related to uploading artifacts to MLflow, which can take a long time when many files need to be uploaded. One way to avoid this would be to stop uploading the entire dataset and instead upload only the names of the files used in the training and validation sets.
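A minimal sketch of the "upload file names only" idea, assuming the training and validation images sit in two separate folders. The helper names (`build_manifest`, `write_manifest`, `log_manifest_to_mlflow`) and the manifest file name are hypothetical and not part of the project's code; the only MLflow call used is `mlflow.log_artifact`.

```python
import json
from pathlib import Path


def build_manifest(train_dir, val_dir, pattern="*"):
    """Collect file names (not file contents) for each split. Hypothetical helper."""
    def names(folder):
        return sorted(p.name for p in Path(folder).glob(pattern) if p.is_file())

    return {"train": names(train_dir), "val": names(val_dir)}


def write_manifest(manifest, out_path="dataset_manifest.json"):
    """Write the manifest to a single small JSON file and return its path."""
    out = Path(out_path)
    out.write_text(json.dumps(manifest, indent=2))
    return out


def log_manifest_to_mlflow(manifest_path):
    """Upload one small file instead of the whole dataset."""
    import mlflow  # imported lazily so the helpers above work without MLflow

    mlflow.log_artifact(str(manifest_path))
```

With this approach the upload time no longer depends on the number of images, at the cost that the artifact alone is not enough to reproduce the dataset.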
@pilarnavarro Have you tried this again since?
Hello @jannesgg, I thought @victor-wildlife was going to try that, apologies for the misunderstanding. I’m not sure how to upload only the file names instead of the entire dataset. Could you guide me on how to do that?
@pilarnavarro We have now updated the code so that the images and labels are each zipped and then uploaded to MLflow as two files instead of two folders containing many small files, which was slowing down the process. I suggest you repeat the process without removing the images and labels folders, as it should now go much faster than before. Please let me know how it goes.
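For illustration, the zip-then-upload approach described above could look roughly like the sketch below. This is not the repository's actual code: `zip_folder` and `log_zipped_dataset` are hypothetical names, and the folder names `images` and `labels` are assumptions taken from the comment.

```python
import shutil
from pathlib import Path


def zip_folder(folder, out_dir):
    """Pack one folder into <out_dir>/<folder name>.zip and return the archive path.

    out_dir should be outside `folder` so the archive does not include itself.
    """
    folder = Path(folder)
    base_name = Path(out_dir) / folder.name  # make_archive appends ".zip"
    return Path(shutil.make_archive(str(base_name), "zip", root_dir=folder))


def log_zipped_dataset(images_dir, labels_dir, out_dir):
    """Upload two archives instead of two folders of many small files."""
    import mlflow  # imported lazily; only needed for the upload step

    for folder in (images_dir, labels_dir):
        mlflow.log_artifact(str(zip_folder(folder, out_dir)))
```

A call such as `log_zipped_dataset("images", "labels", "archives")` would then produce two uploads in total, regardless of how many files each folder holds.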
@jannesgg thank you so much for making the changes to the code!
It seems the training is being tracked in WandB instead of MLflow (or maybe in both?). Is that the expected behavior? Sometimes, when running the training function, I encounter the following error:
That is, the training runs successfully sometimes, but other times, I receive this error either before the training starts or after some epochs. Do you know what might be causing this issue? When running the notebook for training locally, I ran into a different issue. The
I believe the environment variable MLFLOW_TRACKING_URI isn't set when running locally. I tried setting it to 127.0.0.1:5000, but the connection was refused. Could you clarify where I need to set this variable and what URI I should use to get it working locally?
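One possible way to handle this locally, sketched under assumptions: a "connection refused" error usually means nothing is listening on that port, so an MLflow server would need to be started first (for example with `mlflow server --host 127.0.0.1 --port 5000`). The helper names below are hypothetical; the sketch checks whether a server is reachable and otherwise falls back to a local `mlruns` folder, and it has to run before the training code reads `MLFLOW_TRACKING_URI`.

```python
import os
import socket
from pathlib import Path
from urllib.parse import urlparse


def with_http_scheme(uri):
    """Prefix a bare host:port with http:// (hypothetical helper)."""
    return uri if "://" in uri else f"http://{uri}"


def server_is_listening(uri, timeout=1.0):
    """Return True if a TCP connection to the URI's host and port succeeds."""
    parsed = urlparse(with_http_scheme(uri))
    try:
        with socket.create_connection((parsed.hostname, parsed.port or 80), timeout=timeout):
            return True
    except OSError:
        return False


def choose_tracking_uri(preferred="127.0.0.1:5000", fallback=None):
    """Use the local server if it is up, otherwise a file-based store."""
    if fallback is None:
        # Assumption: a local folder store is acceptable for local runs.
        fallback = Path("mlruns").resolve().as_uri()
    uri = with_http_scheme(preferred)
    return uri if server_is_listening(uri) else fallback


if __name__ == "__main__":
    # Set the variable before any MLflow or training code is imported or run.
    os.environ["MLFLOW_TRACKING_URI"] = choose_tracking_uri()
    print(os.environ["MLFLOW_TRACKING_URI"])
```

Whether the project's training function accepts a file-based store is not confirmed in this thread, so the fallback is only a suggestion to try.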
I attempted to run the Train_models.ipynb notebook for model training in Cloudina for the Spyfish Aotearoa project. I used data from Spyfish Aotearoa, which I downloaded using the Process_classifications.ipynb notebook. However, when I tried to download the model weights, I encountered an error:
Despite this error, I continued running the notebook. However, I think the training did not even start. The training cell kept executing indefinitely, with the output shown in the attached screenshot.
After approximately half an hour, the kernel died.