Keep Last N Checkpoints #718
Conversation
- Introduced `keep_last_n_checkpoints` parameter in configuration and training scripts to manage the number of recent checkpoints retained.
- Updated `finetune_cli.py`, `finetune_gradio.py`, and `trainer.py` to support this new parameter.
- Implemented logic to remove older checkpoints beyond the specified limit during training.
- Adjusted settings loading and saving to include the new checkpoint management option.

This enhancement improves the training process by preventing excessive storage usage from old checkpoints.
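For context, a minimal sketch of what such pruning logic could look like; the checkpoint directory layout, the `model_<step>.pt` naming pattern, and the helper name are assumptions for illustration, not the PR's exact `trainer.py` code.

```python
import os
import re


def prune_old_checkpoints(checkpoint_dir: str, keep_last_n: int) -> None:
    """Hypothetical helper: delete all but the newest `keep_last_n` step checkpoints."""
    if keep_last_n <= 0:
        # In this sketch a non-positive value means "no pruning"; the PR's exact
        # semantics for 0 / -1 are refined in the later commits of this thread.
        return
    # Assume checkpoints are saved as model_<step>.pt in a single directory.
    step_pattern = re.compile(r"model_(\d+)\.pt$")
    checkpoints = []
    for name in os.listdir(checkpoint_dir):
        match = step_pattern.match(name)
        if match:
            checkpoints.append((int(match.group(1)), name))
    checkpoints.sort()  # oldest training step first
    for _, name in checkpoints[:-keep_last_n]:
        os.remove(os.path.join(checkpoint_dir, name))
```

Calling a helper like this right after each save keeps disk usage bounded without changing how often checkpoints are written.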
…and scripts

- Set `keep_last_n_checkpoints` to 0 in E2TTS and F5TTS training YAML files to disable checkpoint pruning (keep all checkpoints).
- Modify `trainer.py` to handle `keep_last_n_checkpoints` as None or 0 to keep all checkpoints.
- Update `finetune_cli.py` and `finetune_gradio.py` to reflect the new default value and provide user guidance.
- Ensure `train.py` retrieves the checkpoint setting correctly from the configuration.

These changes streamline checkpoint management and enhance user experience by clarifying retention options.
- Set `keep_last_n_checkpoints` to 0 in `finetune_gradio.py` and `E2TTS_Small_train.yaml` to disable pruning of recent checkpoints.
- Ensure consistency across settings to streamline checkpoint handling during training.

These changes enhance the clarity and functionality of checkpoint management.
- Updated `keep_last_n_checkpoints` parameter descriptions in `E2TTS` and `F5TTS` YAML files to clarify that setting it to 0 disables pruning of recent checkpoints.
- Modified `trainer.py` to validate `keep_last_n_checkpoints`, ensuring it is 0 or positive.
- Adjusted help text in `finetune_cli.py` to reflect the new validation rules.
- Enhanced the user interface in `finetune_gradio.py` to enforce a minimum value for checkpoint retention (see the sketch below).

These changes improve the usability and understanding of checkpoint management settings.
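As a hedged illustration of the last point, a Gradio number input can enforce a lower bound on the retention setting; the label, default value, and the `minimum`/`precision` arguments assume a recent Gradio version and are not taken from the PR itself.

```python
import gradio as gr

with gr.Blocks() as app:
    # Hypothetical control mirroring the "must be 0 or positive" rule described above.
    keep_last_n_checkpoints = gr.Number(
        label="keep_last_n_checkpoints",
        value=0,
        minimum=0,    # UI-level floor; trainer.py would still validate the value
        precision=0,  # integer input only
    )
```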
Hi @hcsolakoglu, thanks for the PR. Could refer to #392.
Hello Yushen, this PR adds a feature for those like me who want to manage the number of checkpoints without reducing the save frequency. I don't see any reason for it not to be included; it's backward compatible and works properly. If I may kindly ask, could you reconsider accepting it? I'd be happy to make any changes you suggest. I don't want to maintain a separate fork just for this feature. @SWivid
Hi @hcsolakoglu yes, for sure. Could do a 0.4.1 version with 83efc3f #711, though your previous PR is fine if it's more convenient than solving conflicts here. Thanks~
- Changed `keep_last_n_checkpoints` default value to -1 in YAML configuration files to keep all checkpoints by default.
- Enhanced validation in `trainer.py` to ensure `keep_last_n_checkpoints` is an integer and within acceptable limits.
- Updated help text in `finetune_cli.py` and the user interface in `finetune_gradio.py` to reflect the new default behavior and provide clearer guidance on checkpoint retention options.
- Ensured consistent handling of checkpoint settings across training scripts.

These changes improve usability and understanding of checkpoint management.
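A rough sketch of the kind of validation this commit describes, assuming -1 means "keep all" and any non-negative integer is a hard cap; the exact bounds and error messages in `trainer.py` may differ.

```python
def check_keep_last_n_checkpoints(value: int) -> int:
    """Illustrative validation: -1 keeps every checkpoint, values >= 0 cap retention."""
    if not isinstance(value, int) or isinstance(value, bool):
        raise ValueError("keep_last_n_checkpoints must be an integer")
    if value < -1:
        raise ValueError("keep_last_n_checkpoints must be -1 (keep all) or >= 0")
    return value


# The new default (-1) passes through unchanged and disables pruning entirely.
assert check_keep_last_n_checkpoints(-1) == -1
```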
Hi @SWivid, I resolved the merge conflicts, made the changes you requested, and took care of the formatting. I would appreciate it if you could review and merge it when you have time.
This pull request introduces a feature to retain only the last N checkpoints during training. This helps manage disk space efficiently by automatically deleting older checkpoints beyond the specified limit.