[Feature] Sort out the training process file storage structure #1585

Open
BayMaxBHL opened this issue Oct 20, 2024 · 0 comments

@BayMaxBHL
Contributor

What is the feature?

During training, files are typically saved to the following locations:
config dump -> workdir
log.txt -> _log_dir (workdir + timestamp)
checkpoint (best) -> workdir
checkpoint (iter, epoch) and txt -> workdir
vis (tensorboard) -> workdir

In addition, I use custom hooks to save the project code and the validation visualizations.
I have tried several ways to get every file from a training run saved into a single folder, but each one runs into problems.

Method 1: Before creating the runner, change workdir to workdir + experiment_name (timestamp). However, the timestamp may differ between processes when training on multiple cards, so dist_init has to be moved in front of the runner: initialize the distributed environment, modify workdir, and only then create the runner.
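A minimal sketch of one variant of Method 1 that avoids reordering dist_init. It assumes the launch script exports a single timestamp before spawning the processes, so every rank reads the same value. The function name `resolve_run_dir` and the environment variable `RUN_TIMESTAMP` are hypothetical and are not part of the library.

```python
import os
import os.path as osp
import time


def resolve_run_dir(work_dir, experiment_name, env_key="RUN_TIMESTAMP"):
    """Return work_dir/<experiment_name>_<timestamp>, identical on every rank.

    env_key names a hypothetical variable exported once by the launch script,
    so each process reads the same value instead of generating its own.
    """
    timestamp = os.environ.get(env_key)
    if timestamp is None:
        # Single-process fallback: generate the timestamp locally.
        timestamp = time.strftime("%Y%m%d_%H%M%S", time.localtime())
    return osp.join(work_dir, f"{experiment_name}_{timestamp}")
```

The returned path would then be assigned to workdir in the config before the runner is created, which is the step Method 1 describes.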

Method 2: Inherit from the runner and unify the save path to _log_dir (workdir + timestamp). Because _log_dir is hard-coded, the config dump and checkpoint saving have to be rewritten to achieve this with minimal changes. However, when saving the checkpoints (iter, epoch) and the txt, the txt is still stored in workdir. As a result, keeping only the last three checkpoints no longer works.

Both methods achieve the goal indirectly, but patching things together this way feels uncomfortable.
Judging from the save logic, workdir should be self.workdir + self._experiment_name. For example, if I run the XX experiment many times, the save paths would be XX/experiment A, XX/experiment B, and so on. Although the runner's init has _experiment_name, it has no effect on where files are saved.
I understand that the saved files may need to be spread across different paths, but some paths are hard-coded and can only be changed by inheriting from the runner, and some are set in the runner's init and cannot be changed even by adding hooks. For someone who likes a tidy layout, this is very frustrating.
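The requested layout can be sketched as a plain mapping from each artifact to a path under one per-experiment folder. This only illustrates the desired result and is not how the runner resolves paths today. The function name `run_layout` and the file and folder names below are hypothetical.

```python
import os.path as osp


def run_layout(work_dir, experiment_name):
    """Map each training artifact to a path under one per-experiment folder."""
    # Everything hangs off work_dir/experiment_name, as the issue proposes.
    run_dir = osp.join(work_dir, experiment_name)
    return {
        "config": osp.join(run_dir, "config.py"),            # config dump
        "log": osp.join(run_dir, "log.txt"),                 # text log
        "checkpoints": osp.join(run_dir, "checkpoints"),     # best / iter / epoch
        "checkpoint_txt": osp.join(run_dir, "checkpoints", "last_checkpoint.txt"),
        "vis": osp.join(run_dir, "vis"),                     # tensorboard output
    }
```

With such a mapping, running experiment A and experiment B under XX would produce two sibling folders, each containing its own config, log, checkpoints, and visualizations.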

Any other context?

No response
