Keep Last N Checkpoints #718

Merged: 6 commits merged into SWivid:main on Jan 15, 2025
Conversation

hcsolakoglu
Contributor

This pull request introduces a feature to retain only the last N checkpoints during training. It helps manage disk space by automatically deleting older checkpoints beyond the specified limit.

- Introduced `keep_last_n_checkpoints` parameter in configuration and training scripts to manage the number of recent checkpoints retained.
- Updated `finetune_cli.py`, `finetune_gradio.py`, and `trainer.py` to support this new parameter.
- Implemented logic to remove older checkpoints beyond the specified limit during training.
- Adjusted settings loading and saving to include the new checkpoint management option.

This enhancement improves the training process by preventing excessive storage usage from old checkpoints.
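A minimal sketch of the pruning idea described above, not the PR's exact code: it assumes intermediate checkpoints are written as `model_<step>.pt` under a checkpoint directory, with `model_last.pt` excluded from pruning; the helper name and file pattern are illustrative.

```python
import os
import re

def prune_old_checkpoints(checkpoint_path: str, keep_last_n_checkpoints: int) -> None:
    """Delete all but the newest N intermediate checkpoints (sketch only)."""
    if not keep_last_n_checkpoints or keep_last_n_checkpoints <= 0:
        return  # None or 0: keep every checkpoint (the PR's initial semantics)
    # match model_<step>.pt; model_last.pt does not fit the pattern and is kept
    ckpts = [f for f in os.listdir(checkpoint_path) if re.fullmatch(r"model_\d+\.pt", f)]
    ckpts.sort(key=lambda f: int(re.search(r"\d+", f).group()))
    for old in ckpts[:-keep_last_n_checkpoints]:
        os.remove(os.path.join(checkpoint_path, old))
```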
…and scripts

- Set `keep_last_n_checkpoints` to 0 in the E2TTS and F5TTS training YAML files to disable the retention limit (keep all checkpoints).
- Modify `trainer.py` to handle `keep_last_n_checkpoints` as None or 0 to keep all checkpoints.
- Update `finetune_cli.py` and `finetune_gradio.py` to reflect the new default value and provide user guidance.
- Ensure `train.py` retrieves the checkpoint setting correctly from the configuration.

These changes streamline checkpoint management and enhance user experience by clarifying retention options.
- Set `keep_last_n_checkpoints` to 0 in `finetune_gradio.py` and `E2TTS_Small_train.yaml` to disable the retention limit on recent checkpoints.
- Ensure consistency across settings to streamline checkpoint handling during training.

These changes enhance the clarity and functionality of checkpoint management.
- Updated the `keep_last_n_checkpoints` parameter descriptions in the `E2TTS` and `F5TTS` YAML files to clarify that setting it to 0 disables the retention limit (all checkpoints are kept).
- Modified `trainer.py` to validate `keep_last_n_checkpoints`, ensuring it must be 0 or positive.
- Adjusted help text in `finetune_cli.py` to reflect the new validation rules.
- Enhanced user interface in `finetune_gradio.py` to enforce minimum value for checkpoint retention.

These changes improve the usability and understanding of checkpoint management settings.
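A hedged sketch of the 0-or-positive rule these commits describe; the actual checks in `trainer.py` and the flag wiring in `finetune_cli.py` may differ, and only the parameter name comes from the PR.

```python
import argparse

def validate_keep_last_n(keep_last_n_checkpoints: int) -> int:
    # at this stage of the PR: 0 means keep all, a positive N keeps the last N
    if not isinstance(keep_last_n_checkpoints, int) or keep_last_n_checkpoints < 0:
        raise ValueError("keep_last_n_checkpoints must be 0 (keep all) or a positive integer")
    return keep_last_n_checkpoints

parser = argparse.ArgumentParser()
parser.add_argument(
    "--keep_last_n_checkpoints",
    type=int,
    default=0,
    help="0 keeps all checkpoints; a positive N keeps only the last N",
)
```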
@SWivid
Owner

SWivid commented Jan 14, 2025

Hi @hcsolakoglu, thanks for the PR.

Could you refer to #392?

@SWivid SWivid closed this Jan 14, 2025
@hcsolakoglu
Contributor Author

Hello Yushen, this PR adds a feature for those who, like me, want to manage the number of checkpoints without reducing the save frequency. I don't see any reason for it not to be included: it's backward compatible and works properly. If I may kindly ask, could you reconsider accepting it? I'd be happy to make any changes you suggest; I don't want to maintain a separate fork just for this feature. @SWivid

@SWivid SWivid reopened this Jan 14, 2025
@SWivid
Owner

SWivid commented Jan 14, 2025

Hi @hcsolakoglu

Yes, for sure. We could do a 0.4.1 version with 83efc3f #711.

I think your previous PR is fine. Two small suggestions:

  1. Use -1 to turn the limit off (default), and 0 for literally not saving any intermediate ckpt.
     Change the comment from `# number of recent checkpoints to keep (excluding model_last.pt). Set to 0 to disable (keep all checkpoints). Positive numbers limit the number of checkpoints kept` to `# last n intermediate checkpoints saved, set -1 to disable`.
  2. Remember to run formatting, see https://github.com/SWivid/F5-TTS?tab=readme-ov-file#development; this is necessary to keep the code format consistent and to pass the GitHub workflow auto check.

If it's more convenient than resolving the conflicts here, you could open a new PR and we could review it together there.

thanks~

- Changed `keep_last_n_checkpoints` default value to -1 in YAML configuration files to keep all checkpoints by default.
- Enhanced validation in `trainer.py` to ensure `keep_last_n_checkpoints` is an integer no less than -1.
- Updated help text in `finetune_cli.py` and user interface in `finetune_gradio.py` to reflect the new default behavior and provide clearer guidance on checkpoint retention options.
- Ensured consistent handling of checkpoint settings across training scripts.

These changes improve usability and understanding of checkpoint management.
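A sketch of the final semantics after review, under stated assumptions (the helper names are illustrative, not the repo's actual API): -1, the default, keeps every checkpoint; 0 saves no intermediate checkpoints at all; a positive N prunes down to the last N, with `model_last.pt` always kept.

```python
from typing import Optional

def retention_limit(keep_last_n_checkpoints: int = -1) -> Optional[int]:
    """Map the config value to a pruning limit; None means never prune."""
    if not isinstance(keep_last_n_checkpoints, int) or keep_last_n_checkpoints < -1:
        raise ValueError("keep_last_n_checkpoints must be -1, 0, or a positive integer")
    return None if keep_last_n_checkpoints == -1 else keep_last_n_checkpoints

def should_save_intermediate(keep_last_n_checkpoints: int) -> bool:
    # 0 means only model_last.pt is ever written
    return keep_last_n_checkpoints != 0
```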
@hcsolakoglu
Contributor Author

Hi @SWivid, I resolved the merge conflicts, made the changes you requested, and took care of the formatting. I would appreciate it if you could review and merge it when you have time.

@SWivid SWivid merged commit 76b1b03 into SWivid:main Jan 15, 2025
1 check passed