Support training continuation for the failed or preempted tasks #270

Closed
Tracked by #443
eu9ene opened this issue Nov 20, 2023 · 7 comments
Labels
taskcluster Issues related to the Taskcluster implementation of the training pipeline

Comments

@eu9ene
Collaborator

eu9ene commented Nov 20, 2023

This is especially important for handling preemption of spot instances.

@eu9ene eu9ene added the taskcluster Issues related to the Taskcluster implementation of the training pipeline label Nov 20, 2023
@bhearsum
Collaborator

The work that @gabrielBusta is doing in #226 will be a good basis for this. The difference with spot terminations is that the tasks will automatically rerun, and we won't be able to adjust the parameters. We'll need some sort of enhancement to detect that case and automatically find the previous attempt's artifacts, e.g. a check at the start of the job to see whether this is run 1 or higher, and if it is, attempt to pull artifacts from the previous run. (We might want more sanity checking in there, too.)
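A minimal sketch of such a check, assuming the worker exposes TASK_ID and RUN_ID as environment variables and that checkpoints are published under public/build/ (both are assumptions for illustration, not taken from the repo):

```python
import os
import sys

import requests

QUEUE = "https://firefox-ci-tc.services.mozilla.com/api/queue/v1"


def maybe_resume(model_dir: str) -> None:
    """If this is not the first run, pull checkpoint artifacts from the previous run."""
    task_id = os.environ["TASK_ID"]
    run_id = int(os.environ["RUN_ID"])
    if run_id == 0:
        return  # first attempt, nothing to resume from

    prev_run = run_id - 1
    # Artifact names are illustrative; a real task would enumerate its own checkpoints.
    for name in ("model.npz", "model.npz.optimizer.npz", "model.npz.progress.yml"):
        url = f"{QUEUE}/task/{task_id}/runs/{prev_run}/artifacts/public/build/{name}"
        resp = requests.get(url, timeout=300)
        if resp.status_code != 200:
            # Sanity check: if any checkpoint is missing, don't attempt a partial resume.
            print(f"Missing {name} from run {prev_run}, starting from scratch", file=sys.stderr)
            return
        with open(os.path.join(model_dir, name), "wb") as f:
            f.write(resp.content)
```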

@gregtatum gregtatum changed the title Support training continuation for the failed tasks Support training continuation for the failed or preempted tasks Dec 18, 2023
@gregtatum
Member

(copying over my thoughts from #315).

For reference, this is the definition of a preemptible instance.

During the Catalan run, the teacher training would often take 2 or 3 times as long to run to completion since the task would get preempted. Here is an example profile of several preemptions happening: https://share.firefox.dev/3Rw5u5g

Also, I believe @bhearsum said that this was a blocker for this work: https://mozilla-hub.atlassian.net/browse/RELOPS-782

@marco-c
Collaborator

marco-c commented Dec 20, 2023

@bhearsum
Collaborator

As of today, all of the instances we use in Taskcluster should handle spot terminations gracefully, and publish whatever artifacts exist at the time of the shutdown.

@gregtatum
Member

The next step here is to load in the previous artifacts and restart the training.

@gabrielBusta
Member

gabrielBusta commented Jan 26, 2024

This should be possible now because the training artifacts are being published when an instance is preempted. AIUI, as long as the task produces the training artifacts, using pretrained models should work. I wish there were a way we could try it out, but how does one simulate a spot termination?

@bhearsum
Collaborator

> This should be possible now because the training artifacts are being published when an instance is preempted. AIUI, as long as the task produces the training artifacts, using pretrained models should work. I wish there were a way we could try it out, but how does one simulate a spot termination?

I learned recently that you can just press the "stop" button in the GCP console to do this :). (I believe you have the necessary access to do so.)
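For scripted testing, the equivalent of that "stop" button can be triggered from the google-cloud-compute client; a rough sketch, where the project, zone, and instance names are placeholders rather than real workers:

```python
from google.cloud import compute_v1


def simulate_spot_termination(project: str, zone: str, instance: str) -> None:
    """Stop a running GCP instance, approximating what a spot preemption does to its task."""
    client = compute_v1.InstancesClient()
    operation = client.stop(project=project, zone=zone, instance=instance)
    operation.result()  # wait for the instance to finish shutting down


# Placeholder values; substitute a real training worker picked from the GCP console.
simulate_spot_termination("example-project", "us-central1-a", "example-training-worker")
```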

eu9ene pushed a commit that referenced this issue May 21, 2024
Add support for automatically continuing training from earlier runs of a Task (fixes #270) (#580)

* Add support for automatically continuing training from earlier runs of a Task.

* Automatically retry training tasks on exit code 17

This is the exit code used when train_taskcluster.py fails to download an artifact while attempting to resume.
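A rough sketch of that convention, assuming the task definition marks 17 as a retryable exit status and using a hypothetical constant name:

```python
import sys

import requests

# Hypothetical name for the convention described above: exiting with 17 means
# "could not download a continuation artifact", and the task definition is
# assumed to list 17 as a retryable exit status so Taskcluster reruns the task.
DOWNLOAD_FAILED_EXIT_CODE = 17


def download_continuation_artifact(url: str, dest: str) -> None:
    """Download one artifact needed to resume training, or exit so the whole task is retried."""
    try:
        resp = requests.get(url, timeout=300)
        resp.raise_for_status()
    except requests.RequestException:
        sys.exit(DOWNLOAD_FAILED_EXIT_CODE)
    with open(dest, "wb") as f:
        f.write(resp.content)
```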
eu9ene added a commit that referenced this issue May 22, 2024
* Update bicleaner

* Lock requirements

* Use larger multilingual models

* Fix requirements

* Fix typo

* Add libs for hunspell

* Fix model downloading

* Fix model downloading

* Use a toolchain for cyhunspell

* Remove a hard github dependency

* Fix test

* Add COMET to the evaluation (#587)

* Custom cleaning (#547)

* Update default config

* Pre-download fast text model

* Add custom filters

* Add unit tests for config generation

* Make using custom filtering configs configurable

* Fix substitution

* Parse tasks with label finetune-student (#609)

* Add test case

* Update regex

* Add support for automatically continuing training from earlier runs of a Task (fixes #270) (#580)

* Add support for automatically continuing training from earlier runs of a Task.

* Automatically retry training tasks on exit code 17

This is the exit code used when train_taskcluster.py fails to download an artifact while attempting to resume.

* Support news-crawl importer (#608)

* Ensure all of the task labels can be parsed in the task graph (#612)

* Show the stack trace when there is an error in tracking

* Rename parse_tag to parse_task_label

* Document the regexes

* Turn the parsed label into a NamedTuple for better typing hints

* Get the tag parsing working with the full task graph

* Allow for 3 letter language codes

* Temporarily disable an evaluate task that is failing

* Update code docs a bit

* Fix tests for finetune-student

* Use shorter names for URLs (#611)

* Change the caching strategy for teachers (#606)

* Fix comment

---------

Co-authored-by: Greg Tatum <[email protected]>
Co-authored-by: Valentin Rigal <[email protected]>
Co-authored-by: Ben Hearsum (he/him) <[email protected]>