Wrap train-taskcluster.sh in train_taskcluster.py #546

bhearsum · 2024-04-30T00:21:33Z

In order to support automatic continuation of spot-terminated training runs (#270), I'll need to do some non-trivial things before handing off to train.sh. Rather than try to do them in bash, I'd like to do them in Python. This PR should have no behaviour differences, but it lays a foundation for this upcoming work. It would also be a good basis for replacing train-taskcluster.sh with Python entirely in the future.
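For readers unfamiliar with the pattern, a thin pass-through wrapper of this kind could look roughly like the sketch below. This is not the code in this PR: the script path, `build_command`, and `main` are hypothetical names chosen for illustration, and the sketch assumes the shell script takes its arguments positionally.

```python
import subprocess
from pathlib import Path

# Hypothetical path: assumes the shell script is reachable relative to the
# working directory. The real layout in the repository may differ.
SCRIPT = Path("train-taskcluster.sh")


def build_command(args, script=SCRIPT):
    """Return the argv list that forwards `args` unchanged to the shell script."""
    return [str(script), *args]


def main(args, run=subprocess.run):
    """Hand off to the shell script and return its exit code.

    Any Python-side preparation (for example, the continuation logic
    mentioned above) would go here, before the hand-off.
    `run` is injectable so the hand-off can be exercised without a real script.
    """
    result = run(build_command(args), check=False)
    return result.returncode
```

An entry point would then call something like `sys.exit(main(sys.argv[1:]))`, so the wrapper's exit status mirrors the shell script's and the task still fails when training fails.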

eu9ene · 2024-04-30T17:20:18Z

I'm concerned about the number of wrappings in our training. We have: Marian wrapped by OpusTrainer wrapped by train.sh wrapped by train-taskcluster.sh that will be wrapped by train_taskcluster.py...

I strongly recommend rewriting train-taskcluster.sh in Python instead of adding another wrapping level, to reduce complexity.

bhearsum · 2024-04-30T17:40:20Z

> I'm concerned about the number of wrappings in our training. We have: Marian wrapped by OpusTrainer wrapped by train.sh wrapped by train-taskcluster.sh that will be wrapped by train_taskcluster.py...
>
> I strongly recommend rewriting train-taskcluster.sh in python instead of adding another wrapping level to reduce complexity.

I'm a bit hesitant to take that on as something that blocks turning spot instances back on, since we're on a tight timeline and it's difficult to test all of the training continuation scenarios, but I can if you'd like.

Alternatively, I can follow up with it later, after we've branched for the upcoming trainings.

eu9ene · 2024-04-30T18:45:55Z

>> I'm concerned about the number of wrappings in our training. We have: Marian wrapped by OpusTrainer wrapped by train.sh wrapped by train-taskcluster.sh that will be wrapped by train_taskcluster.py...
>> I strongly recommend rewriting train-taskcluster.sh in python instead of adding another wrapping level to reduce complexity.
>
> I'm a bit hesitant to take that on as something that blocks turning spot instances back on, since we're on a tight timeline, and it's difficult to test all of the training continuation scenarios, but I can if you'd like.
>
> Alternatively, I can follow up with it later, after we've branched for the upcoming trainings.

Sure, it's up to you; fixing this as a follow-up works for me.

eu9ene

I'm OK with merging this to unblock spot instances, but please add a TODO and file an issue to rewrite this in Python to reduce wrapping and complexity.

bhearsum force-pushed the refactor branch from 2eaca54 to 7772e20 Compare April 30, 2024 13:56

gabrielBusta mentioned this pull request May 1, 2024

Add file overrides to the training continuation, and refactor the implementation #543

Closed

bhearsum force-pushed the refactor branch 2 times, most recently from cb207f3 to 43bbc78 Compare May 7, 2024 00:52

bhearsum marked this pull request as ready for review May 7, 2024 00:58

bhearsum requested a review from a team as a code owner May 7, 2024 00:58

bhearsum requested review from jcristau, eu9ene and gregtatum and removed request for a team and jcristau May 7, 2024 00:58

jcristau approved these changes May 7, 2024

eu9ene approved these changes May 8, 2024

Wrap train-taskcluster.sh in train_taskcluster.py

946501e

bhearsum force-pushed the refactor branch from 43bbc78 to 946501e Compare May 8, 2024 17:29

bhearsum merged commit 99015e1 into mozilla:main May 8, 2024
7 checks passed

bhearsum mentioned this pull request May 8, 2024

Get rid of train-taskcluster.sh #579

Closed
