Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upload failure after node restart #179

Open
lcjohnso opened this issue Oct 18, 2024 · 1 comment
Open

Upload failure after node restart #179

lcjohnso opened this issue Oct 18, 2024 · 1 comment

Comments

@lcjohnso
Copy link
Member

Situation: the 17 Oct GZ retraining job encountered an issue -- the job preparation script failed leaving node in unusable state. Cliff rebooted the node, and that allowed the job (which was still in "active" state in training pool) to start up and run correctly. However, at end of job, there was a FileUploadAccessDenied error when the node tried to push the stdout & stderr log files up to kadeactivelearning blob storage account. See error details:
Screenshot 2024-10-18 at 10 22 14 AM

Hypotheses:

  1. The signature provided to the training job to use for upload expired before upload could occur. Solution: check signature and increase valid duration (to allow for delays due to human-triggered node reboots).
  2. ???
@lcjohnso
Copy link
Member Author

Another solution: prevent the job preparation task failure so everything runs as expected. Possible solution: add a sleep pause to allow node to establish network connectivity before trying to mount directories / drives.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant