Support training continuation for the failed or preempted tasks #270

Closed
Tracked by #443
eu9ene opened this issue Nov 20, 2023 · 7 comments
Labels
taskcluster Issues related to the Taskcluster implementation of the training pipeline

Comments

@eu9ene
Collaborator

eu9ene commented Nov 20, 2023

This is especially important for handling preemption of spot instances.

@eu9ene eu9ene added the taskcluster Issues related to the Taskcluster implementation of the training pipeline label Nov 20, 2023
@bhearsum
Collaborator

The work that @gabrielBusta is doing in #226 will be a good basis for this. The difference with spot terminations is that the tasks will automatically rerun, and we won't be able to adjust the parameters. We'll need some sort of enhancement to detect that case and automatically find the previous attempt's artifacts, e.g. a check at the start of the job to see whether this is run 1 or higher, and if it is, attempt to pull artifacts from the previous run. (We might want more sanity checking in there, too.)
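A minimal sketch of such a check, assuming the worker exposes TASK_ID and RUN_ID as environment variables and that checkpoints are published under public/build/ (both are assumptions for illustration, not taken from the repo):

```python
import os
import sys

import requests

QUEUE = "https://firefox-ci-tc.services.mozilla.com/api/queue/v1"


def maybe_resume(model_dir: str) -> None:
    """If this is not the first run, pull checkpoint artifacts from the previous run."""
    task_id = os.environ["TASK_ID"]
    run_id = int(os.environ["RUN_ID"])
    if run_id == 0:
        return  # first attempt, nothing to resume from

    prev_run = run_id - 1
    # Artifact names are illustrative; a real task would enumerate its own checkpoints.
    for name in ("model.npz", "model.npz.optimizer.npz", "model.npz.progress.yml"):
        url = f"{QUEUE}/task/{task_id}/runs/{prev_run}/artifacts/public/build/{name}"
        resp = requests.get(url, timeout=300)
        if resp.status_code != 200:
            # Sanity check: if any checkpoint is missing, don't attempt a partial resume.
            print(f"Missing {name} from run {prev_run}, starting from scratch", file=sys.stderr)
            return
        with open(os.path.join(model_dir, name), "wb") as f:
            f.write(resp.content)
```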

@gregtatum gregtatum changed the title Support training continuation for the failed tasks Support training continuation for the failed or preempted tasks Dec 18, 2023
@gregtatum
Member

(copying over my thoughts from #315).

For reference, this is the definition of a preemptible instance.

During the Catalan run, the teacher training would often take 2 or 3 times as long to run to completion since the task would get preempted. Here is an example profile of several preemptions happening: https://share.firefox.dev/3Rw5u5g

Also, I believe @bhearsum said that this was a blocker for this work: https://mozilla-hub.atlassian.net/browse/RELOPS-782

@marco-c
Collaborator

marco-c commented Dec 20, 2023

@bhearsum
Collaborator

As of today, all of the instances we use in Taskcluster should handle spot terminations gracefully, and publish whatever artifacts exist at the time of the shutdown.

@gregtatum
Member

The next step here is to load in the previous artifacts and restart the training.

@gabrielBusta
Member

gabrielBusta commented Jan 26, 2024

This should be possible now because the training artifacts are being published when an instance is preempted. AIUI, as long as the task produces the training artifacts, using pretrained models should work. I wish there were a way we could try it out, but how does one simulate a spot termination?

@bhearsum
Collaborator

> This should be possible now because the training artifacts are being published when an instance is preempted. AIUI, as long as the task produces the training artifacts, using pretrained models should work. I wish there were a way we could try it out, but how does one simulate a spot termination?

I learned recently that you can just press the "stop" button in the GCP console to do this :). (I believe you have the necessary access to do so.)
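For scripted testing, the equivalent of that "stop" button can be triggered from the google-cloud-compute client; a rough sketch, where the project, zone, and instance names are placeholders rather than real workers:

```python
from google.cloud import compute_v1


def simulate_spot_termination(project: str, zone: str, instance: str) -> None:
    """Stop a running GCP instance, approximating what a spot preemption does to its task."""
    client = compute_v1.InstancesClient()
    operation = client.stop(project=project, zone=zone, instance=instance)
    operation.result()  # wait for the instance to finish shutting down


# Placeholder values; substitute a real training worker picked from the GCP console.
simulate_spot_termination("example-project", "us-central1-a", "example-training-worker")
```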

eu9ene pushed a commit that referenced this issue May 21, 2024
Add support for automatically continuing training from earlier runs of a Task (fixes #270) (#580)

* Add support for automatically continuing training from earlier runs of a Task.

* Automatically retry training tasks on exit code 17

This is the exit code used when train_taskcluster.py fails to download an artifact while attempting to resume.
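A rough sketch of that convention, assuming the task definition marks 17 as a retryable exit status and using a hypothetical constant name:

```python
import sys

import requests

# Hypothetical name for the convention described above: exiting with 17 means
# "could not download a continuation artifact", and the task definition is
# assumed to list 17 as a retryable exit status so Taskcluster reruns the task.
DOWNLOAD_FAILED_EXIT_CODE = 17


def download_continuation_artifact(url: str, dest: str) -> None:
    """Download one artifact needed to resume training, or exit so the whole task is retried."""
    try:
        resp = requests.get(url, timeout=300)
        resp.raise_for_status()
    except requests.RequestException:
        sys.exit(DOWNLOAD_FAILED_EXIT_CODE)
    with open(dest, "wb") as f:
        f.write(resp.content)
```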
eu9ene added a commit that referenced this issue May 22, 2024
* Update bicleaner

* Lock requirements

* Use larger multilingual models

* Fix requirements

* Fix typo

* Add libs for hunspell

* Fix model downloading

* Fix model downloading

* Use a toolchain for cyhunspell

* Remove a hard github dependency

* Fix test

* Add COMET to the evaluation (#587)

* Custom cleaning (#547)

* Update default config

* Pre-download fast text model

* Add custom filters

* Add unit tests for config generation

* Make using custom filtering configs configurable

* Fix substitution

* Parse tasks with label finetune-student (#609)

* Add test case

* Update regex

* Add support for automatically continuing training from earlier runs of a Task (fixes #270) (#580)

* Add support for automatically continuing training from earlier runs of a Task.

* Automatically retry training tasks on exit code 17

This is the exit code used when train_taskcluster.py fails to download an artifact while attempting to resume.

* Support news-crawl importer (#608)

* Ensure all of the task labels can be parsed in the task graph (#612)

* Show the stack trace when there is an error in tracking

* Rename parse_tag to parse_task_label

* Document the regexes

* Turn the parsed label into a NamedTuple for better typing hints

* Get the tag parsing working with the full task graph

* Allow for 3 letter language codes

* Temporarily disable an evaluate task that is failing

* Update code docs a bit

* Fix tests for finetune-student

* Use shorter names for URLs (#611)

* Change the caching strategy for teachers (#606)

* Fix comment

---------

Co-authored-by: Greg Tatum <[email protected]>
Co-authored-by: Valentin Rigal <[email protected]>
Co-authored-by: Ben Hearsum (he/him) <[email protected]>