Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wrap train-taskcluster.sh in train_taskcluster.py #546

Merged
merged 1 commit into from
May 8, 2024

Conversation

bhearsum
Copy link
Collaborator

In order to support automatic continuation of spot terminated training runs (#270) I'll need to do some non-trivial things before handing off to train.sh. Rather than try to do them in bash, I'd like to do them in Python. This PR should have no behavioru differences, but it lays a foundation for this upcoming work. It would also be a good basis for replacing train-taskcluster.sh with Python entirely in the future.

@eu9ene
Copy link
Collaborator

eu9ene commented Apr 30, 2024

I'm concerned about the number of wrappings in our training. We have: Marian wrapped by OpusTrainer wrapped by train.sh wrapped by train-taskcluster.sh that will be wrapped by train_taskcluster.py...

I strongly recommend rewriting train-taskcluster.sh in python instead of adding another wrapping level to reduce complexity.

@bhearsum
Copy link
Collaborator Author

I'm concerned about the number of wrappings in our training. We have: Marian wrapped by OpusTrainer wrapped by train.sh wrapped by train-taskcluster.sh that will be wrapped by train_taskcluster.py...

I strongly recommend rewriting train-taskcluster.sh in python instead of adding another wrapping level to reduce complexity.

I'm a bit hesistant to do take that on as something that blocks turning spot instances back on, since we're on a tight timeline, and it's difficult to test all of the training continuation scenarios, but I can if you'd like.

Alternatively, I can follow up with it later, after we've branched for the upcoming trainings.

@eu9ene
Copy link
Collaborator

eu9ene commented Apr 30, 2024

I'm concerned about the number of wrappings in our training. We have: Marian wrapped by OpusTrainer wrapped by train.sh wrapped by train-taskcluster.sh that will be wrapped by train_taskcluster.py...
I strongly recommend rewriting train-taskcluster.sh in python instead of adding another wrapping level to reduce complexity.

I'm a bit hesistant to do take that on as something that blocks turning spot instances back on, since we're on a tight timeline, and it's difficult to test all of the training continuation scenarios, but I can if you'd like.

Alternatively, I can follow up with it later, after we've branched for the upcoming trainings.

Sure, it's up to you, fixing this as a follow-up works for me.

@bhearsum bhearsum force-pushed the refactor branch 2 times, most recently from cb207f3 to 43bbc78 Compare May 7, 2024 00:52
@bhearsum bhearsum marked this pull request as ready for review May 7, 2024 00:58
@bhearsum bhearsum requested a review from a team as a code owner May 7, 2024 00:58
@bhearsum bhearsum requested review from jcristau, eu9ene and gregtatum and removed request for a team and jcristau May 7, 2024 00:58
Copy link
Collaborator

@eu9ene eu9ene left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ok with merging this to unblock spot instances but please add a TODO and an issue to rewrite this in Python to reduce wrapping and complexity.

@bhearsum bhearsum merged commit 99015e1 into mozilla:main May 8, 2024
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants