Keep Last N Checkpoints #718

Merged: 6 commits merged into SWivid:main on Jan 15, 2025
Conversation

hcsolakoglu
Contributor

This pull request introduces a feature to retain only the last N checkpoints during training. It helps manage disk space by automatically deleting older checkpoints beyond the specified limit.

- Introduced `keep_last_n_checkpoints` parameter in configuration and training scripts to manage the number of recent checkpoints retained.
- Updated `finetune_cli.py`, `finetune_gradio.py`, and `trainer.py` to support this new parameter.
- Implemented logic to remove older checkpoints beyond the specified limit during training.
- Adjusted settings loading and saving to include the new checkpoint management option.

This enhancement improves the training process by preventing excessive storage usage from old checkpoints.
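A minimal sketch of the pruning idea described above, not the PR's exact code: it assumes intermediate checkpoints are written as `model_<step>.pt` under a checkpoint directory, with `model_last.pt` excluded from pruning; the helper name and file pattern are illustrative.

```python
import os
import re

def prune_old_checkpoints(checkpoint_path: str, keep_last_n_checkpoints: int) -> None:
    """Delete all but the newest N intermediate checkpoints (sketch only)."""
    if not keep_last_n_checkpoints or keep_last_n_checkpoints <= 0:
        return  # None or 0: keep every checkpoint (the PR's initial semantics)
    # match model_<step>.pt; model_last.pt does not fit the pattern and is kept
    ckpts = [f for f in os.listdir(checkpoint_path) if re.fullmatch(r"model_\d+\.pt", f)]
    ckpts.sort(key=lambda f: int(re.search(r"\d+", f).group()))
    for old in ckpts[:-keep_last_n_checkpoints]:
        os.remove(os.path.join(checkpoint_path, old))
```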
…and scripts

- Set `keep_last_n_checkpoints` to 0 in the E2TTS and F5TTS training YAML files to disable the retention limit (keep all checkpoints).
- Modify `trainer.py` to handle `keep_last_n_checkpoints` as None or 0 to keep all checkpoints.
- Update `finetune_cli.py` and `finetune_gradio.py` to reflect the new default value and provide user guidance.
- Ensure `train.py` retrieves the checkpoint setting correctly from the configuration.

These changes streamline checkpoint management and enhance user experience by clarifying retention options.
- Set `keep_last_n_checkpoints` to 0 in `finetune_gradio.py` and `E2TTS_Small_train.yaml` to disable the retention limit on recent checkpoints.
- Ensure consistency across settings to streamline checkpoint handling during training.

These changes enhance the clarity and functionality of checkpoint management.
- Updated the `keep_last_n_checkpoints` parameter descriptions in the `E2TTS` and `F5TTS` YAML files to clarify that setting it to 0 disables the retention limit (all checkpoints are kept).
- Modified `trainer.py` to validate `keep_last_n_checkpoints`, ensuring it must be 0 or positive.
- Adjusted help text in `finetune_cli.py` to reflect the new validation rules.
- Enhanced user interface in `finetune_gradio.py` to enforce minimum value for checkpoint retention.

These changes improve the usability and understanding of checkpoint management settings.
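A hedged sketch of the 0-or-positive rule these commits describe; the actual checks in `trainer.py` and the flag wiring in `finetune_cli.py` may differ, and only the parameter name comes from the PR.

```python
import argparse

def validate_keep_last_n(keep_last_n_checkpoints: int) -> int:
    # at this stage of the PR: 0 means keep all, a positive N keeps the last N
    if not isinstance(keep_last_n_checkpoints, int) or keep_last_n_checkpoints < 0:
        raise ValueError("keep_last_n_checkpoints must be 0 (keep all) or a positive integer")
    return keep_last_n_checkpoints

parser = argparse.ArgumentParser()
parser.add_argument(
    "--keep_last_n_checkpoints",
    type=int,
    default=0,
    help="0 keeps all checkpoints; a positive N keeps only the last N",
)
```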
@SWivid
Owner

SWivid commented Jan 14, 2025

Hi @hcsolakoglu, thanks for the PR.

Could you refer to #392?

@SWivid SWivid closed this Jan 14, 2025
@hcsolakoglu
Contributor Author

Hello Yushen, this PR adds a feature for those who, like me, want to manage the number of checkpoints without reducing the save frequency. I don't see any reason for it not to be included: it's backward compatible and works properly. If I may kindly ask, could you reconsider accepting it? I'd be happy to make any changes you suggest; I don't want to maintain a separate fork just for this feature. @SWivid

@SWivid SWivid reopened this Jan 14, 2025
@SWivid
Owner

SWivid commented Jan 14, 2025

Hi @hcsolakoglu

Yes, for sure. We could do a 0.4.1 version with 83efc3f #711.

I think your previous PR is fine. Two small suggestions:

  1. Use -1 to turn the limit off (default), and 0 for literally not saving any intermediate ckpt.
     Change the comment from `# number of recent checkpoints to keep (excluding model_last.pt). Set to 0 to disable (keep all checkpoints). Positive numbers limit the number of checkpoints kept` to `# last n intermediate checkpoints saved, set -1 to disable`.
  2. Remember to run formatting, see https://github.com/SWivid/F5-TTS?tab=readme-ov-file#development; this is necessary to keep the code format consistent and to pass the GitHub workflow auto check.

If it's more convenient than resolving the conflicts here, you could open a new PR and we could review it together there.

thanks~

- Changed `keep_last_n_checkpoints` default value to -1 in YAML configuration files to keep all checkpoints by default.
- Enhanced validation in `trainer.py` to ensure `keep_last_n_checkpoints` is an integer no less than -1.
- Updated help text in `finetune_cli.py` and user interface in `finetune_gradio.py` to reflect the new default behavior and provide clearer guidance on checkpoint retention options.
- Ensured consistent handling of checkpoint settings across training scripts.

These changes improve usability and understanding of checkpoint management.
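A sketch of the final semantics after review, under stated assumptions (the helper names are illustrative, not the repo's actual API): -1, the default, keeps every checkpoint; 0 saves no intermediate checkpoints at all; a positive N prunes down to the last N, with `model_last.pt` always kept.

```python
from typing import Optional

def retention_limit(keep_last_n_checkpoints: int = -1) -> Optional[int]:
    """Map the config value to a pruning limit; None means never prune."""
    if not isinstance(keep_last_n_checkpoints, int) or keep_last_n_checkpoints < -1:
        raise ValueError("keep_last_n_checkpoints must be -1, 0, or a positive integer")
    return None if keep_last_n_checkpoints == -1 else keep_last_n_checkpoints

def should_save_intermediate(keep_last_n_checkpoints: int) -> bool:
    # 0 means only model_last.pt is ever written
    return keep_last_n_checkpoints != 0
```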
@hcsolakoglu
Contributor Author

Hi @SWivid, I resolved the merge conflicts, made the changes you requested, and took care of the formatting. I would appreciate it if you could review and merge it when you have time.

@SWivid SWivid merged commit 76b1b03 into SWivid:main Jan 15, 2025
1 check passed