Support training continuation for failed or preempted tasks #270
The work that @gabrielBusta is doing in #226 will be a good basis for this. The difference with spot terminations is that the tasks will automatically rerun, and we won't be able to adjust the parameters. We'll need some sort of enhancement to detect that case and automatically find the previous attempt's artifacts. E.g.: a check at the start of the job whether this is run #1 or higher, and if it is, attempt to pull artifacts from the previous run. (We might want more sanity checking in there, too.)
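A minimal sketch of what that start-of-job check could look like, assuming the standard Taskcluster TASK_ID and RUN_ID worker environment variables and the public queue artifact endpoint; the helper name and artifact path layout here are illustrative, not the project's actual code:

```python
# Check at the start of the job: if this is run 1 or higher, try to
# pull artifacts from the previous run of the same task.
import os

import requests

QUEUE = "https://firefox-ci-tc.services.mozilla.com/api/queue/v1"


def fetch_previous_run_artifact(artifact_name: str, dest: str) -> bool:
    """Download an artifact from this task's previous run, if any.

    Returns False on run 0 (nothing to resume from) or when the
    previous run died before publishing the artifact.
    """
    run_id = int(os.environ["RUN_ID"])
    if run_id == 0:
        return False
    task_id = os.environ["TASK_ID"]
    url = f"{QUEUE}/task/{task_id}/runs/{run_id - 1}/artifacts/{artifact_name}"
    response = requests.get(url, stream=True)
    if response.status_code != 200:
        return False
    with open(dest, "wb") as f:
        for chunk in response.iter_content(chunk_size=1 << 20):
            f.write(chunk)
    return True
```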
(Copying over my thoughts from #315.) For reference, this is the definition of a preemptible instance. During the Catalan run the teacher training would often take 2 or 3 times as long to run to completion, since the task would get preempted. Here is an example profile of several task preemptions happening: https://share.firefox.dev/3Rw5u5g Also, I believe @bhearsum said that this was a blocker for this work: https://mozilla-hub.atlassian.net/browse/RELOPS-782
As of today, all of the instances we use in Taskcluster should handle spot terminations gracefully, and publish whatever artifacts exist at the time of the shutdown.
The next step here is to load in the previous artifacts and restart the training. |
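A hedged sketch of that wiring, assuming Marian's training-continuation behavior (it resumes when the model weights, optimizer state, and progress files are already present under their default names in the model directory); fetch_previous_run_artifact is the hypothetical helper sketched earlier in the thread, and the artifact paths are placeholders:

```python
# Restore whatever checkpoints the previous run published, then re-run
# Marian with the same --model path so it continues rather than
# starting over. File names follow Marian's defaults.
import subprocess

RESUME_FILES = [
    "model.npz",                # weights
    "model.npz.optimizer.npz",  # optimizer state
    "model.npz.progress.yml",   # training progress (epochs, updates, ...)
    "model.npz.yml",            # saved model config
]


def restart_training(model_dir: str, marian_args: list) -> None:
    for name in RESUME_FILES:
        fetch_previous_run_artifact(f"public/build/{name}", f"{model_dir}/{name}")
    subprocess.run(
        ["marian", "--model", f"{model_dir}/model.npz", *marian_args],
        check=True,
    )
```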
This should be possible now because the training artifacts are being published when an instance is preempted. AIUI, as long as the task produces the training artifacts, resuming via pretrained models should work. I wish there were a way we could try it out, but how does one simulate a spot termination?
I learned recently that you can just press the "stop" button in the GCP console to do this :). (I believe you have the necessary access to do so.) |
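For a scripted alternative to the console button, a sketch using the google-cloud-compute Python client; the project, zone, and instance values are placeholders, and this assumes credentials with permission to stop the instance:

```python
# Stop a running GCE instance to simulate a spot termination.
# Requires `pip install google-cloud-compute`.
from google.cloud import compute_v1

client = compute_v1.InstancesClient()
operation = client.stop(
    project="my-gcp-project",    # placeholder
    zone="us-central1-a",        # placeholder
    instance="training-worker",  # placeholder
)
operation.result()  # block until the stop operation completes
```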
* Update bicleaner
* Lock requirements
* Use larger multilingual models
* Fix requirements
* Fix typo
* Add libs for hunspell
* Fix model downloading
* Fix model downloading
* Use a toolchain for cyhunspell
* Remove a hard github dependency
* Fix test
* Add COMET to the evaluation (#587)
* Custom cleaning (#547)
* Update default config
* Pre-download fast text model
* Add custom filters
* Add unit tests for config generation
* Make using custom filtering configs configurable
* Fix substitution
* Parse tasks with label finetune-student (#609)
* Add test case
* Update regex
* Add support for automatically continuing training from earlier runs of a Task (fixes #270) (#580)
* Add support for automatically continuing training from earlier runs of a Task.
* Automatically retry training tasks on exit code 17. This is the exit code when train_taskcluster.py fails to download an artifact when attempting to resume. (A sketch of this convention follows the list.)
* Support news-crawl importer (#608)
* Ensure all of the task labels can be parsed in the task graph (#612)
* Show the stack trace when there is an error in tracking
* Rename parse_tag to parse_task_label
* Document the regexes
* Turn the parsed label into a NamedTuple for better typing hints
* Get the tag parsing working with the full task graph
* Allow for 3 letter language codes
* Temporarily disable an evaluate task that is failing
* Update code docs a bit
* Fix tests for finetune-student
* Use shorter names for URLs (#611)
* Change the caching strategy for teachers (#606)
* Fix comment

Co-authored-by: Greg Tatum <[email protected]>
Co-authored-by: Valentin Rigal <[email protected]>
Co-authored-by: Ben Hearsum (he/him) <[email protected]>
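A minimal sketch of the exit code 17 convention: 17 is the code named in the commit message for a failed artifact download during resume, and the worker can be configured (e.g. via the task payload's onExitStatus.retry list) to rerun the task on it. The surrounding structure and helper are hypothetical:

```python
# On a rerun, a missing previous-run artifact is treated as a transient
# failure: exit with a dedicated code so the worker retries the task
# rather than silently retraining from scratch.
import os
import sys

ARTIFACT_DOWNLOAD_FAILED = 17  # exit code named in the commit message


def maybe_resume(artifact_name: str, dest: str) -> None:
    if int(os.environ["RUN_ID"]) == 0:
        return  # first run: nothing to resume, train from scratch
    if not fetch_previous_run_artifact(artifact_name, dest):
        sys.exit(ARTIFACT_DOWNLOAD_FAILED)
```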
It's especially important for pre-emption of spot instances.